dri-devel.lists.freedesktop.org archive mirror
* [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
@ 2021-04-19 10:47 Marek Olšák
  2021-04-19 15:48 ` Jason Ekstrand
                   ` (3 more replies)
  0 siblings, 4 replies; 105+ messages in thread
From: Marek Olšák @ 2021-04-19 10:47 UTC (permalink / raw)
  To: ML Mesa-dev, dri-devel


Hi,

This is our initial proposal for explicit fences everywhere and new memory
management that doesn't use BO fences. It's a redesign of how Linux
graphics drivers work, and it can coexist with what we have now.


*1. Introduction*
(skip this if you are already sold on explicit fences)

The current Linux graphics architecture was initially designed for GPUs
with only one graphics queue where everything was executed in the
submission order and per-BO fences were used for memory management and
CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
queues were added on top, which required the introduction of implicit
GPU-GPU synchronization between queues of different processes using per-BO
fences. Recently, even parallel execution within one queue was enabled
where a command buffer starts draws and compute shaders, but doesn't wait
for them, enabling parallelism between back-to-back command buffers.
Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
was created to enable all those use cases, and it's the only reason why the
scheduler exists.

The GPU scheduler, implicit synchronization, BO-fence-based memory
management, and the tracking of per-BO fences increase CPU overhead and
latency, and reduce parallelism. There is a desire to replace all of them
with something much simpler. Below is how we could do it.


*2. Explicit synchronization for window systems and modesetting*

The producer is an application and the consumer is a compositor or a
modesetting driver.

*2.1. The Present request*

As part of the Present request, the producer will pass 2 fences (sync
objects) to the consumer alongside the presented DMABUF BO:
- The submit fence: Initially unsignalled, it will be signalled when the
producer has finished drawing into the presented buffer.
- The return fence: Initially unsignalled, it will be signalled when the
consumer has finished using the presented buffer.

Deadlock mitigation to recover from segfaults:
- The kernel knows which process is obliged to signal which fence. This
information is part of the Present request and supplied by userspace.
- If the producer crashes, the kernel signals the submit fence, so that the
consumer can make forward progress.
- If the consumer crashes, the kernel signals the return fence, so that the
producer can reclaim the buffer.
- A GPU hang signals all fences. Other deadlocks will be handled like GPU
hangs.

Other window system requests can follow the same idea.

Merged fences where one fence object contains multiple fences will be
supported. A merged fence is signalled only when its fences are signalled.
The consumer will have the option to redefine the unsignalled return fence
to a merged fence.
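
To make this concrete, below is a rough sketch of what such a Present
request could carry. The struct and all field names are made up purely
for illustration; this is not an existing protocol or ioctl.

#include <stdint.h>
#include <sys/types.h>

/* Hypothetical payload of a Present request with explicit fences. */
struct present_request {
        int   dmabuf_fd;         /* the presented buffer */

        /* Signalled by the producer once rendering to the buffer is done. */
        int   submit_fence_fd;   /* exported sync object, initially unsignalled */

        /* Signalled by the consumer once it has stopped reading the buffer.
         * The consumer may later replace it with a merged fence. */
        int   return_fence_fd;   /* exported sync object, initially unsignalled */

        /* Who is obliged to signal which fence, so that the kernel can
         * signal on behalf of a crashed process (deadlock mitigation). */
        pid_t submit_signaler;   /* the producer */
        pid_t return_signaler;   /* the consumer */
};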

*2.2. Modesetting*

Since a modesetting driver can also be the consumer, the present ioctl will
contain a submit fence and a return fence too. One small problem with this
is that userspace can hang the modesetting driver, but in theory, any later
present ioctl can override the previous one, so the unsignalled
presentation is never used.
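
For reference, the atomic KMS API already exposes an analogous pair:
the IN_FENCE_FD plane property and the OUT_FENCE_PTR CRTC property. A
compositor could plumb the two fences through an atomic commit roughly
like this; lookup_prop_id() is an assumed helper and error handling is
omitted:

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Assumed helper: resolve a property name to its id for a KMS object. */
uint32_t lookup_prop_id(int fd, uint32_t object_id, const char *name);

int present_with_fences(int fd, uint32_t plane_id, uint32_t crtc_id,
                        uint32_t fb_id, int submit_fence_fd)
{
        int32_t out_fence_fd = -1;   /* the kernel returns the fence here */
        drmModeAtomicReqPtr req = drmModeAtomicAlloc();

        /* Submit fence: the flip waits for the producer's rendering. */
        drmModeAtomicAddProperty(req, plane_id,
                                 lookup_prop_id(fd, plane_id, "IN_FENCE_FD"),
                                 submit_fence_fd);
        drmModeAtomicAddProperty(req, plane_id,
                                 lookup_prop_id(fd, plane_id, "FB_ID"), fb_id);

        /* Return fence: signals when this commit reaches the screen, i.e.
         * when the previously presented buffer can be reused. */
        drmModeAtomicAddProperty(req, crtc_id,
                                 lookup_prop_id(fd, crtc_id, "OUT_FENCE_PTR"),
                                 (uint64_t)(uintptr_t)&out_fence_fd);

        int ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
        drmModeAtomicFree(req);
        return ret ? ret : out_fence_fd;   /* sync_file fd or negative error */
}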


*3. New memory management*

The per-BO fences will be removed and the kernel will not know which
buffers are busy. This will reduce CPU overhead and latency. The kernel
will not need per-BO fences with explicit synchronization, so we just need
to remove their last user: buffer evictions. It also resolves the current
OOM deadlock.

*3.1. Evictions*

If the kernel wants to move a buffer, it will have to wait for everything
to go idle, halt all userspace command submissions, move the buffer, and
resume everything. This is not expected to happen when memory is not
exhausted. Other more efficient ways of synchronization are also possible
(e.g. sync only one process), but are not discussed here.
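
Per driver, the flow could look roughly like the pseudocode below;
every helper name is hypothetical and only the ordering matters:

/* Sketch of the eviction flow described above. */
static int evict_buffer(struct driver_device *dev, struct driver_bo *bo,
                        uint32_t new_domain)
{
        int r;

        /* 1. Stop accepting new userspace command submissions. */
        driver_halt_submissions(dev);

        /* 2. Wait for all queues to drain, so nothing references the BO. */
        r = driver_wait_for_idle(dev);
        if (r)
                goto resume;

        /* 3. Move the buffer (e.g. VRAM -> GTT) and fix up its mappings. */
        r = driver_move_bo(dev, bo, new_domain);

resume:
        /* 4. Let userspace submit again. */
        driver_resume_submissions(dev);
        return r;
}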

*3.2. Per-process VRAM usage quota*

Each process can optionally and periodically query its VRAM usage quota and
change domains of its buffers to obey that quota. For example, a process
allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1
GB. The process can change the domains of the least important buffers to
GTT to get the best outcome for itself. If the process doesn't do it, the
kernel will choose which buffers to evict at random. (thanks to Christian
Koenig for this idea)
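
From the process side, this could be a periodic housekeeping pass along
the lines of the sketch below. The quota query and the domain change
are placeholders for driver-specific calls, and buffers[] is assumed to
be sorted from most to least important:

#include <stdint.h>

/* Sketch: userspace reacting to a shrinking VRAM quota. */
static void enforce_vram_quota(int fd, struct bo_info *buffers, int count)
{
        uint64_t quota = query_vram_quota(fd);   /* bytes we may keep in VRAM */
        uint64_t used = 0;

        for (int i = 0; i < count; i++) {
                if (used + buffers[i].size <= quota) {
                        set_bo_domain(fd, buffers[i].handle, DOMAIN_VRAM);
                        used += buffers[i].size;
                } else {
                        /* Over quota: demote the least important buffers to
                         * GTT before the kernel evicts something at random. */
                        set_bo_domain(fd, buffers[i].handle, DOMAIN_GTT);
                }
        }
}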

*3.3. Buffer destruction without per-BO fences*

When the buffer destroy ioctl is called, an optional fence list can be
passed to the kernel to indicate when it's safe to deallocate the buffer.
If the fence list is empty, the buffer will be deallocated immediately.
Shared buffers will be handled by merging fence lists from all processes
that destroy them. Mitigation of malicious behavior:
- If userspace destroys a busy buffer, it will get a GPU page fault.
- If userspace sends fences that never signal, the kernel will have a
timeout period and then will proceed to deallocate the buffer anyway.
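
A minimal sketch of what the ioctl arguments could look like, with all
names hypothetical:

#include <stdint.h>

/* Hypothetical "destroy the buffer once these fences signal" ioctl. */
struct drv_bo_destroy_args {
        uint32_t handle;       /* BO to destroy */
        uint32_t num_fences;   /* 0 = deallocate immediately */
        uint64_t fences_ptr;   /* user pointer to num_fences sync object
                                * handles; the kernel merges the lists from
                                * all processes sharing the BO and frees the
                                * memory once the merged fence signals or a
                                * timeout expires */
};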

*3.4. Other notes on MM*

Overcommitment of GPU-accessible memory will cause an allocation failure or
invoke the OOM killer. Evictions to GPU-inaccessible memory might not be
supported.

Kernel drivers could move to this new memory management today. Only buffer
residency and evictions would stop using per-BO fences.



*4. Deprecating implicit synchronization*

It can be phased out by introducing a new generation of hardware where the
driver doesn't add support for it (like a driver fork would do), assuming
userspace has all the changes for explicit synchronization. This could
potentially create an isolated part of the kernel DRM where all drivers
only support explicit synchronization.

Marek


* Re: [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-19 10:47 [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal Marek Olšák
@ 2021-04-19 15:48 ` Jason Ekstrand
  2021-04-20  2:25   ` Marek Olšák
  2021-04-20 10:15   ` [Mesa-dev] " Christian König
  2021-04-20 12:01 ` Daniel Vetter
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 105+ messages in thread
From: Jason Ekstrand @ 2021-04-19 15:48 UTC (permalink / raw)
  To: Marek Olšák, Daniel Vetter; +Cc: ML Mesa-dev, dri-devel

Not going to comment on everything on the first pass...

On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <maraeo@gmail.com> wrote:
>
> Hi,
>
> This is our initial proposal for explicit fences everywhere and new memory management that doesn't use BO fences. It's a redesign of how Linux graphics drivers work, and it can coexist with what we have now.
>
>
> 1. Introduction
> (skip this if you are already sold on explicit fences)
>
> The current Linux graphics architecture was initially designed for GPUs with only one graphics queue where everything was executed in the submission order and per-BO fences were used for memory management and CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple queues were added on top, which required the introduction of implicit GPU-GPU synchronization between queues of different processes using per-BO fences. Recently, even parallel execution within one queue was enabled where a command buffer starts draws and compute shaders, but doesn't wait for them, enabling parallelism between back-to-back command buffers. Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler was created to enable all those use cases, and it's the only reason why the scheduler exists.
>
> The GPU scheduler, implicit synchronization, BO-fence-based memory management, and the tracking of per-BO fences increase CPU overhead and latency, and reduce parallelism. There is a desire to replace all of them with something much simpler. Below is how we could do it.
>
>
> 2. Explicit synchronization for window systems and modesetting
>
> The producer is an application and the consumer is a compositor or a modesetting driver.
>
> 2.1. The Present request
>
> As part of the Present request, the producer will pass 2 fences (sync objects) to the consumer alongside the presented DMABUF BO:
> - The submit fence: Initially unsignalled, it will be signalled when the producer has finished drawing into the presented buffer.
> - The return fence: Initially unsignalled, it will be signalled when the consumer has finished using the presented buffer.

I'm not sure syncobj is what we want.  In the Intel world we're trying
to go even further to something we're calling "userspace fences" which
are a timeline implemented as a single 64-bit value in some
CPU-mappable BO.  The client writes a higher value into the BO to
signal the timeline.  The kernel then provides some helpers for
waiting on them reliably and without spinning.  I don't expect
everyone to support these right away, but if we're going to re-plumb
userspace for explicit synchronization, I'd like to make sure we take
this into account so we only have to do it once.
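
(In case it helps, the userspace side of such a fence is tiny; roughly
the sketch below. The interesting part is the wait, which is where the
kernel helpers come in - ufence_kernel_wait() here is purely
hypothetical, standing in for whatever interface we'd add.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A userspace timeline fence: one 64-bit counter in a CPU-mappable BO. */
struct ufence {
        _Atomic uint64_t *counter;   /* points into the shared, mapped BO */
};

/* Hypothetical kernel interface: sleep until *counter >= point or timeout. */
int ufence_kernel_wait(_Atomic uint64_t *counter, uint64_t point,
                       int64_t timeout_ns);

/* Signalling point N means "all work up to N has completed". */
static void ufence_signal(struct ufence *f, uint64_t point)
{
        atomic_store_explicit(f->counter, point, memory_order_release);
}

static bool ufence_is_signalled(struct ufence *f, uint64_t point)
{
        return atomic_load_explicit(f->counter, memory_order_acquire) >= point;
}

/* A real implementation must not spin here. */
static int ufence_wait(struct ufence *f, uint64_t point, int64_t timeout_ns)
{
        if (ufence_is_signalled(f, point))
                return 0;
        return ufence_kernel_wait(f->counter, point, timeout_ns);
}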


> Deadlock mitigation to recover from segfaults:
> - The kernel knows which process is obliged to signal which fence. This information is part of the Present request and supplied by userspace.

This isn't clear to me.  Yes, if we're using anything dma-fence based
like syncobj, this is true.  But it doesn't seem totally true as a
general statement.


> - If the producer crashes, the kernel signals the submit fence, so that the consumer can make forward progress.
> - If the consumer crashes, the kernel signals the return fence, so that the producer can reclaim the buffer.
> - A GPU hang signals all fences. Other deadlocks will be handled like GPU hangs.

What do you mean by "all"?  All fences that were supposed to be
signaled by the hung context?


>
> Other window system requests can follow the same idea.
>
> Merged fences where one fence object contains multiple fences will be supported. A merged fence is signalled only when its fences are signalled. The consumer will have the option to redefine the unsignalled return fence to a merged fence.
>
> 2.2. Modesetting
>
> Since a modesetting driver can also be the consumer, the present ioctl will contain a submit fence and a return fence too. One small problem with this is that userspace can hang the modesetting driver, but in theory, any later present ioctl can override the previous one, so the unsignalled presentation is never used.
>
>
> 3. New memory management
>
> The per-BO fences will be removed and the kernel will not know which buffers are busy. This will reduce CPU overhead and latency. The kernel will not need per-BO fences with explicit synchronization, so we just need to remove their last user: buffer evictions. It also resolves the current OOM deadlock.

Is this even really possible?  I'm no kernel MM expert (trying to
learn some) but my understanding is that the use of per-BO dma-fence
runs deep.  I would like to stop using it for implicit synchronization
to be sure, but I'm not sure I believe the claim that we can get rid
of it entirely.  Happy to see someone try, though.


> 3.1. Evictions
>
> If the kernel wants to move a buffer, it will have to wait for everything to go idle, halt all userspace command submissions, move the buffer, and resume everything. This is not expected to happen when memory is not exhausted. Other more efficient ways of synchronization are also possible (e.g. sync only one process), but are not discussed here.
>
> 3.2. Per-process VRAM usage quota
>
> Each process can optionally and periodically query its VRAM usage quota and change domains of its buffers to obey that quota. For example, a process allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1 GB. The process can change the domains of the least important buffers to GTT to get the best outcome for itself. If the process doesn't do it, the kernel will choose which buffers to evict at random. (thanks to Christian Koenig for this idea)

This is going to be difficult.  On Intel, we have some resources that
have to be pinned to VRAM and can't be dynamically swapped out by the
kernel.  In GL, we probably can deal with it somewhat dynamically.  In
Vulkan, we'll be entirely dependent on the application to use the
appropriate Vulkan memory budget APIs.

--Jason


> 3.3. Buffer destruction without per-BO fences
>
> When the buffer destroy ioctl is called, an optional fence list can be passed to the kernel to indicate when it's safe to deallocate the buffer. If the fence list is empty, the buffer will be deallocated immediately. Shared buffers will be handled by merging fence lists from all processes that destroy them. Mitigation of malicious behavior:
> - If userspace destroys a busy buffer, it will get a GPU page fault.
> - If userspace sends fences that never signal, the kernel will have a timeout period and then will proceed to deallocate the buffer anyway.
>
> 3.4. Other notes on MM
>
> Overcommitment of GPU-accessible memory will cause an allocation failure or invoke the OOM killer. Evictions to GPU-inaccessible memory might not be supported.
>
> Kernel drivers could move to this new memory management today. Only buffer residency and evictions would stop using per-BO fences.
>
>
> 4. Deprecating implicit synchronization
>
> It can be phased out by introducing a new generation of hardware where the driver doesn't add support for it (like a driver fork would do), assuming userspace has all the changes for explicit synchronization. This could potentially create an isolated part of the kernel DRM where all drivers only support explicit synchronization.
>
> Marek
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

* Re: [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-19 15:48 ` Jason Ekstrand
@ 2021-04-20  2:25   ` Marek Olšák
  2021-04-20 10:15   ` [Mesa-dev] " Christian König
  1 sibling, 0 replies; 105+ messages in thread
From: Marek Olšák @ 2021-04-20  2:25 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: ML Mesa-dev, dri-devel


We already don't have accurate BO fences in some cases. Instead, BOs can
have fences which are equal to the last seen command buffer for each queue.
It's practically the same as if the kernel had no visibility into command
submissions and just added a fence into all queues when it needed to wait
for idle. That's already one alternative to BO fences that would work
today. The only BOs that need accurate BO fences are shared buffers, and
those use cases can be converted to explicit fences.

Removing memory management from all command buffer submission logic would
be one of the more appealing benefits.

You don't need to depend on apps for budgeting and placement determination.
You can sort buffers according to driver usage, e.g. scratch/spill buffers,
shader IO rings, MSAA images, other images, and buffers. Alternatively, you
can have just internal buffers vs app buffers. Then you assign VRAM from
left to right until you reach the quota. This is optional, so this part can
be ignored.
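
Roughly like this; the usage classes and helpers below are only
illustrative:

#include <stdint.h>

/* Sketch: driver-internal VRAM budgeting without app involvement. */
enum buf_class {               /* ordered from most to least important */
        BUF_SCRATCH_SPILL,
        BUF_SHADER_IO_RING,
        BUF_MSAA_IMAGE,
        BUF_OTHER_IMAGE,
        BUF_BUFFER,
        BUF_CLASS_COUNT,
};

static void assign_vram(struct bo_list classes[BUF_CLASS_COUNT], uint64_t quota)
{
        uint64_t used = 0;

        /* Hand out VRAM from the most to the least important class until
         * the quota runs out; everything else goes to GTT. */
        for (int c = 0; c < BUF_CLASS_COUNT; c++) {
                for (struct bo *bo = classes[c].head; bo; bo = bo->next) {
                        if (used + bo->size <= quota) {
                                set_domain(bo, DOMAIN_VRAM);
                                used += bo->size;
                        } else {
                                set_domain(bo, DOMAIN_GTT);
                        }
                }
        }
}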

>> - A GPU hang signals all fences. Other deadlocks will be handled like
>> GPU hangs.
>
> What do you mean by "all"?  All fences that were supposed to be
> signaled by the hung context?

Yes, that's one of the possibilities. Any GPU hang followed by a GPU reset
can clear VRAM, so all processes should recreate their contexts and
reinitialize resources. A deadlock caused by userspace could be handled
similarly.

I don't know how timeline fences would work across processes and how
resilient they would be to segfaults.

Marek

On Mon, Apr 19, 2021 at 11:48 AM Jason Ekstrand <jason@jlekstrand.net>
wrote:

> Not going to comment on everything on the first pass...
>
> On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <maraeo@gmail.com> wrote:
> >
> > Hi,
> >
> > This is our initial proposal for explicit fences everywhere and new
> memory management that doesn't use BO fences. It's a redesign of how Linux
> graphics drivers work, and it can coexist with what we have now.
> >
> >
> > 1. Introduction
> > (skip this if you are already sold on explicit fences)
> >
> > The current Linux graphics architecture was initially designed for GPUs
> with only one graphics queue where everything was executed in the
> submission order and per-BO fences were used for memory management and
> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> queues were added on top, which required the introduction of implicit
> GPU-GPU synchronization between queues of different processes using per-BO
> fences. Recently, even parallel execution within one queue was enabled
> where a command buffer starts draws and compute shaders, but doesn't wait
> for them, enabling parallelism between back-to-back command buffers.
> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> was created to enable all those use cases, and it's the only reason why the
> scheduler exists.
> >
> > The GPU scheduler, implicit synchronization, BO-fence-based memory
> management, and the tracking of per-BO fences increase CPU overhead and
> latency, and reduce parallelism. There is a desire to replace all of them
> with something much simpler. Below is how we could do it.
> >
> >
> > 2. Explicit synchronization for window systems and modesetting
> >
> > The producer is an application and the consumer is a compositor or a
> modesetting driver.
> >
> > 2.1. The Present request
> >
> > As part of the Present request, the producer will pass 2 fences (sync
> objects) to the consumer alongside the presented DMABUF BO:
> > - The submit fence: Initially unsignalled, it will be signalled when the
> producer has finished drawing into the presented buffer.
> > - The return fence: Initially unsignalled, it will be signalled when the
> consumer has finished using the presented buffer.
>
> I'm not sure syncobj is what we want.  In the Intel world we're trying
> to go even further to something we're calling "userspace fences" which
> are a timeline implemented as a single 64-bit value in some
> CPU-mappable BO.  The client writes a higher value into the BO to
> signal the timeline.  The kernel then provides some helpers for
> waiting on them reliably and without spinning.  I don't expect
> everyone to support these right away but, If we're going to re-plumb
> userspace for explicit synchronization, I'd like to make sure we take
> this into account so we only have to do it once.
>
>
> > Deadlock mitigation to recover from segfaults:
> > - The kernel knows which process is obliged to signal which fence. This
> information is part of the Present request and supplied by userspace.
>
> This isn't clear to me.  Yes, if we're using anything dma-fence based
> like syncobj, this is true.  But it doesn't seem totally true as a
> general statement.
>
>
> > - If the producer crashes, the kernel signals the submit fence, so that
> the consumer can make forward progress.
> > - If the consumer crashes, the kernel signals the return fence, so that
> the producer can reclaim the buffer.
> > - A GPU hang signals all fences. Other deadlocks will be handled like
> GPU hangs.
>
> What do you mean by "all"?  All fences that were supposed to be
> signaled by the hung context?
>
>
> >
> > Other window system requests can follow the same idea.
> >
> > Merged fences where one fence object contains multiple fences will be
> supported. A merged fence is signalled only when its fences are signalled.
> The consumer will have the option to redefine the unsignalled return fence
> to a merged fence.
> >
> > 2.2. Modesetting
> >
> > Since a modesetting driver can also be the consumer, the present ioctl
> will contain a submit fence and a return fence too. One small problem with
> this is that userspace can hang the modesetting driver, but in theory, any
> later present ioctl can override the previous one, so the unsignalled
> presentation is never used.
> >
> >
> > 3. New memory management
> >
> > The per-BO fences will be removed and the kernel will not know which
> buffers are busy. This will reduce CPU overhead and latency. The kernel
> will not need per-BO fences with explicit synchronization, so we just need
> to remove their last user: buffer evictions. It also resolves the current
> OOM deadlock.
>
> Is this even really possible?  I'm no kernel MM expert (trying to
> learn some) but my understanding is that the use of per-BO dma-fence
> runs deep.  I would like to stop using it for implicit synchronization
> to be sure, but I'm not sure I believe the claim that we can get rid
> of it entirely.  Happy to see someone try, though.
>
>
> > 3.1. Evictions
> >
> > If the kernel wants to move a buffer, it will have to wait for
> everything to go idle, halt all userspace command submissions, move the
> buffer, and resume everything. This is not expected to happen when memory
> is not exhausted. Other more efficient ways of synchronization are also
> possible (e.g. sync only one process), but are not discussed here.
> >
> > 3.2. Per-process VRAM usage quota
> >
> > Each process can optionally and periodically query its VRAM usage quota
> and change domains of its buffers to obey that quota. For example, a
> process allocated 2 GB of buffers in VRAM, but the kernel decreased the
> quota to 1 GB. The process can change the domains of the least important
> buffers to GTT to get the best outcome for itself. If the process doesn't
> do it, the kernel will choose which buffers to evict at random. (thanks to
> Christian Koenig for this idea)
>
> This is going to be difficult.  On Intel, we have some resources that
> have to be pinned to VRAM and can't be dynamically swapped out by the
> kernel.  In GL, we probably can deal with it somewhat dynamically.  In
> Vulkan, we'll be entirely dependent on the application to use the
> appropriate Vulkan memory budget APIs.
>
> --Jason
>
>
> > 3.3. Buffer destruction without per-BO fences
> >
> > When the buffer destroy ioctl is called, an optional fence list can be
> passed to the kernel to indicate when it's safe to deallocate the buffer.
> If the fence list is empty, the buffer will be deallocated immediately.
> Shared buffers will be handled by merging fence lists from all processes
> that destroy them. Mitigation of malicious behavior:
> > - If userspace destroys a busy buffer, it will get a GPU page fault.
> > - If userspace sends fences that never signal, the kernel will have a
> timeout period and then will proceed to deallocate the buffer anyway.
> >
> > 3.4. Other notes on MM
> >
> > Overcommitment of GPU-accessible memory will cause an allocation failure
> or invoke the OOM killer. Evictions to GPU-inaccessible memory might not be
> supported.
> >
> > Kernel drivers could move to this new memory management today. Only
> buffer residency and evictions would stop using per-BO fences.
> >
> >
> > 4. Deprecating implicit synchronization
> >
> > It can be phased out by introducing a new generation of hardware where
> the driver doesn't add support for it (like a driver fork would do),
> assuming userspace has all the changes for explicit synchronization. This
> could potentially create an isolated part of the kernel DRM where all
> drivers only support explicit synchronization.
> >
> > Marek
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
>


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-19 15:48 ` Jason Ekstrand
  2021-04-20  2:25   ` Marek Olšák
@ 2021-04-20 10:15   ` Christian König
  2021-04-20 10:34     ` Daniel Vetter
  1 sibling, 1 reply; 105+ messages in thread
From: Christian König @ 2021-04-20 10:15 UTC (permalink / raw)
  To: Jason Ekstrand, Marek Olšák, Daniel Vetter
  Cc: ML Mesa-dev, dri-devel

Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> Not going to comment on everything on the first pass...
>
> On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <maraeo@gmail.com> wrote:
>> Hi,
>>
>> This is our initial proposal for explicit fences everywhere and new memory management that doesn't use BO fences. It's a redesign of how Linux graphics drivers work, and it can coexist with what we have now.
>>
>>
>> 1. Introduction
>> (skip this if you are already sold on explicit fences)
>>
>> The current Linux graphics architecture was initially designed for GPUs with only one graphics queue where everything was executed in the submission order and per-BO fences were used for memory management and CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple queues were added on top, which required the introduction of implicit GPU-GPU synchronization between queues of different processes using per-BO fences. Recently, even parallel execution within one queue was enabled where a command buffer starts draws and compute shaders, but doesn't wait for them, enabling parallelism between back-to-back command buffers. Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler was created to enable all those use cases, and it's the only reason why the scheduler exists.
>>
>> The GPU scheduler, implicit synchronization, BO-fence-based memory management, and the tracking of per-BO fences increase CPU overhead and latency, and reduce parallelism. There is a desire to replace all of them with something much simpler. Below is how we could do it.
>>
>>
>> 2. Explicit synchronization for window systems and modesetting
>>
>> The producer is an application and the consumer is a compositor or a modesetting driver.
>>
>> 2.1. The Present request
>>
>> As part of the Present request, the producer will pass 2 fences (sync objects) to the consumer alongside the presented DMABUF BO:
>> - The submit fence: Initially unsignalled, it will be signalled when the producer has finished drawing into the presented buffer.
>> - The return fence: Initially unsignalled, it will be signalled when the consumer has finished using the presented buffer.
> I'm not sure syncobj is what we want.  In the Intel world we're trying
> to go even further to something we're calling "userspace fences" which
> are a timeline implemented as a single 64-bit value in some
> CPU-mappable BO.  The client writes a higher value into the BO to
> signal the timeline.

Well, that is exactly what our Windows guys have suggested as well, but 
it strongly looks like this isn't sufficient.

First of all, you run into security problems when any application can 
just write any value to that memory location. Just imagine an 
application setting the counter to zero and X waiting forever for some 
rendering to finish.

In addition to that, in such a model you can't determine which queue is 
guilty in case of a hang, and you can't reset the synchronization 
primitives in case of an error.

Apart from that, this is rather inefficient, e.g. we don't have any way 
to prevent priority inversion when this is used as a synchronization 
mechanism between different GPU queues.

Christian.

>    The kernel then provides some helpers for
> waiting on them reliably and without spinning.  I don't expect
> everyone to support these right away but, If we're going to re-plumb
> userspace for explicit synchronization, I'd like to make sure we take
> this into account so we only have to do it once.
>
>
>> Deadlock mitigation to recover from segfaults:
>> - The kernel knows which process is obliged to signal which fence. This information is part of the Present request and supplied by userspace.
> This isn't clear to me.  Yes, if we're using anything dma-fence based
> like syncobj, this is true.  But it doesn't seem totally true as a
> general statement.
>
>
>> - If the producer crashes, the kernel signals the submit fence, so that the consumer can make forward progress.
>> - If the consumer crashes, the kernel signals the return fence, so that the producer can reclaim the buffer.
>> - A GPU hang signals all fences. Other deadlocks will be handled like GPU hangs.
> What do you mean by "all"?  All fences that were supposed to be
> signaled by the hung context?
>
>
>> Other window system requests can follow the same idea.
>>
>> Merged fences where one fence object contains multiple fences will be supported. A merged fence is signalled only when its fences are signalled. The consumer will have the option to redefine the unsignalled return fence to a merged fence.
>>
>> 2.2. Modesetting
>>
>> Since a modesetting driver can also be the consumer, the present ioctl will contain a submit fence and a return fence too. One small problem with this is that userspace can hang the modesetting driver, but in theory, any later present ioctl can override the previous one, so the unsignalled presentation is never used.
>>
>>
>> 3. New memory management
>>
>> The per-BO fences will be removed and the kernel will not know which buffers are busy. This will reduce CPU overhead and latency. The kernel will not need per-BO fences with explicit synchronization, so we just need to remove their last user: buffer evictions. It also resolves the current OOM deadlock.
> Is this even really possible?  I'm no kernel MM expert (trying to
> learn some) but my understanding is that the use of per-BO dma-fence
> runs deep.  I would like to stop using it for implicit synchronization
> to be sure, but I'm not sure I believe the claim that we can get rid
> of it entirely.  Happy to see someone try, though.
>
>
>> 3.1. Evictions
>>
>> If the kernel wants to move a buffer, it will have to wait for everything to go idle, halt all userspace command submissions, move the buffer, and resume everything. This is not expected to happen when memory is not exhausted. Other more efficient ways of synchronization are also possible (e.g. sync only one process), but are not discussed here.
>>
>> 3.2. Per-process VRAM usage quota
>>
>> Each process can optionally and periodically query its VRAM usage quota and change domains of its buffers to obey that quota. For example, a process allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1 GB. The process can change the domains of the least important buffers to GTT to get the best outcome for itself. If the process doesn't do it, the kernel will choose which buffers to evict at random. (thanks to Christian Koenig for this idea)
> This is going to be difficult.  On Intel, we have some resources that
> have to be pinned to VRAM and can't be dynamically swapped out by the
> kernel.  In GL, we probably can deal with it somewhat dynamically.  In
> Vulkan, we'll be entirely dependent on the application to use the
> appropriate Vulkan memory budget APIs.
>
> --Jason
>
>
>> 3.3. Buffer destruction without per-BO fences
>>
>> When the buffer destroy ioctl is called, an optional fence list can be passed to the kernel to indicate when it's safe to deallocate the buffer. If the fence list is empty, the buffer will be deallocated immediately. Shared buffers will be handled by merging fence lists from all processes that destroy them. Mitigation of malicious behavior:
>> - If userspace destroys a busy buffer, it will get a GPU page fault.
>> - If userspace sends fences that never signal, the kernel will have a timeout period and then will proceed to deallocate the buffer anyway.
>>
>> 3.4. Other notes on MM
>>
>> Overcommitment of GPU-accessible memory will cause an allocation failure or invoke the OOM killer. Evictions to GPU-inaccessible memory might not be supported.
>>
>> Kernel drivers could move to this new memory management today. Only buffer residency and evictions would stop using per-BO fences.
>>
>>
>> 4. Deprecating implicit synchronization
>>
>> It can be phased out by introducing a new generation of hardware where the driver doesn't add support for it (like a driver fork would do), assuming userspace has all the changes for explicit synchronization. This could potentially create an isolated part of the kernel DRM where all drivers only support explicit synchronization.
>>
>> Marek
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 10:15   ` [Mesa-dev] " Christian König
@ 2021-04-20 10:34     ` Daniel Vetter
  2021-04-20 11:03       ` Marek Olšák
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-04-20 10:34 UTC (permalink / raw)
  To: Christian König
  Cc: ML Mesa-dev, dri-devel, Marek Olšák, Jason Ekstrand

On Tue, Apr 20, 2021 at 12:15 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> > Not going to comment on everything on the first pass...
> >
> > On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <maraeo@gmail.com> wrote:
> >> Hi,
> >>
> >> This is our initial proposal for explicit fences everywhere and new memory management that doesn't use BO fences. It's a redesign of how Linux graphics drivers work, and it can coexist with what we have now.
> >>
> >>
> >> 1. Introduction
> >> (skip this if you are already sold on explicit fences)
> >>
> >> The current Linux graphics architecture was initially designed for GPUs with only one graphics queue where everything was executed in the submission order and per-BO fences were used for memory management and CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple queues were added on top, which required the introduction of implicit GPU-GPU synchronization between queues of different processes using per-BO fences. Recently, even parallel execution within one queue was enabled where a command buffer starts draws and compute shaders, but doesn't wait for them, enabling parallelism between back-to-back command buffers. Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler was created to enable all those use cases, and it's the only reason why the scheduler exists.
> >>
> >> The GPU scheduler, implicit synchronization, BO-fence-based memory management, and the tracking of per-BO fences increase CPU overhead and latency, and reduce parallelism. There is a desire to replace all of them with something much simpler. Below is how we could do it.
> >>
> >>
> >> 2. Explicit synchronization for window systems and modesetting
> >>
> >> The producer is an application and the consumer is a compositor or a modesetting driver.
> >>
> >> 2.1. The Present request
> >>
> >> As part of the Present request, the producer will pass 2 fences (sync objects) to the consumer alongside the presented DMABUF BO:
> >> - The submit fence: Initially unsignalled, it will be signalled when the producer has finished drawing into the presented buffer.
> >> - The return fence: Initially unsignalled, it will be signalled when the consumer has finished using the presented buffer.
> > I'm not sure syncobj is what we want.  In the Intel world we're trying
> > to go even further to something we're calling "userspace fences" which
> > are a timeline implemented as a single 64-bit value in some
> > CPU-mappable BO.  The client writes a higher value into the BO to
> > signal the timeline.
>
> Well that is exactly what our Windows guys have suggested as well, but
> it strongly looks like that this isn't sufficient.
>
> First of all you run into security problems when any application can
> just write any value to that memory location. Just imagine an
> application sets the counter to zero and X waits forever for some
> rendering to finish.

The thing is, with userspace fences the security boundary issue moves
into userspace entirely. And it really doesn't matter whether
the event you're waiting on doesn't complete because the other app
crashed or was stupid or intentionally gave you a wrong fence point:
You have to somehow handle that, e.g. perhaps with conditional
rendering and just using the old frame in compositing if the new one
doesn't show up in time. Or something like that. So trying to get the
kernel involved but also not so much involved sounds like a bad design
to me.
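
Concretely, the compositor side could be as simple as the sketch below
(the names are made up; ufence_wait() is whatever bounded, non-spinning
wait we end up with for userspace fences):

/* Composite with a deadline instead of trusting the client's fence. */
static void composite_surface(struct surface *s, int64_t deadline_ns)
{
        if (ufence_wait(&s->fence, s->pending_point, deadline_ns) == 0) {
                s->current = s->pending;   /* the new frame arrived in time */
        } else {
                /* Keep s->current: show the old frame this refresh and
                 * check the fence again on the next one. */
        }

        draw_texture(s->current);
}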

> Additional to that in such a model you can't determine who is the guilty
> queue in case of a hang and can't reset the synchronization primitives
> in case of an error.
>
> Apart from that this is rather inefficient, e.g. we don't have any way
> to prevent priority inversion when used as a synchronization mechanism
> between different GPU queues.

Yeah, but you can't have it both ways. Either all the scheduling and
fence handling in the kernel is a problem, or you actually want to
schedule in the kernel. Hw seems to be moving towards the more stupid
spinlock-in-hw model (and direct submit from userspace and all that),
priority inversions be damned. I'm really not sure we should fight
that - if it's really that inefficient then maybe hw will add support
for waiting on sync constructs in hardware, or at least be smarter
about scheduling other stuff. E.g. on Intel hw both the kernel
scheduler and the fw scheduler know when you're spinning on a hw fence
(whether in userspace or the kernel doesn't matter) and plug in
something else. Add in a bit of hw support to watch cachelines, and
you have something which can handle both directions efficiently.

Imo, given where hw is going, we shouldn't try to be too clever here.
The only thing we do need to provide for is being able to do cpu-side
waits without spinning. And that should probably still be done in a
fairly gpu-specific way.
-Daniel

> Christian.
>
> >    The kernel then provides some helpers for
> > waiting on them reliably and without spinning.  I don't expect
> > everyone to support these right away but, If we're going to re-plumb
> > userspace for explicit synchronization, I'd like to make sure we take
> > this into account so we only have to do it once.
> >
> >
> >> Deadlock mitigation to recover from segfaults:
> >> - The kernel knows which process is obliged to signal which fence. This information is part of the Present request and supplied by userspace.
> > This isn't clear to me.  Yes, if we're using anything dma-fence based
> > like syncobj, this is true.  But it doesn't seem totally true as a
> > general statement.
> >
> >
> >> - If the producer crashes, the kernel signals the submit fence, so that the consumer can make forward progress.
> >> - If the consumer crashes, the kernel signals the return fence, so that the producer can reclaim the buffer.
> >> - A GPU hang signals all fences. Other deadlocks will be handled like GPU hangs.
> > What do you mean by "all"?  All fences that were supposed to be
> > signaled by the hung context?
> >
> >
> >> Other window system requests can follow the same idea.
> >>
> >> Merged fences where one fence object contains multiple fences will be supported. A merged fence is signalled only when its fences are signalled. The consumer will have the option to redefine the unsignalled return fence to a merged fence.
> >>
> >> 2.2. Modesetting
> >>
> >> Since a modesetting driver can also be the consumer, the present ioctl will contain a submit fence and a return fence too. One small problem with this is that userspace can hang the modesetting driver, but in theory, any later present ioctl can override the previous one, so the unsignalled presentation is never used.
> >>
> >>
> >> 3. New memory management
> >>
> >> The per-BO fences will be removed and the kernel will not know which buffers are busy. This will reduce CPU overhead and latency. The kernel will not need per-BO fences with explicit synchronization, so we just need to remove their last user: buffer evictions. It also resolves the current OOM deadlock.
> > Is this even really possible?  I'm no kernel MM expert (trying to
> > learn some) but my understanding is that the use of per-BO dma-fence
> > runs deep.  I would like to stop using it for implicit synchronization
> > to be sure, but I'm not sure I believe the claim that we can get rid
> > of it entirely.  Happy to see someone try, though.
> >
> >
> >> 3.1. Evictions
> >>
> >> If the kernel wants to move a buffer, it will have to wait for everything to go idle, halt all userspace command submissions, move the buffer, and resume everything. This is not expected to happen when memory is not exhausted. Other more efficient ways of synchronization are also possible (e.g. sync only one process), but are not discussed here.
> >>
> >> 3.2. Per-process VRAM usage quota
> >>
> >> Each process can optionally and periodically query its VRAM usage quota and change domains of its buffers to obey that quota. For example, a process allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1 GB. The process can change the domains of the least important buffers to GTT to get the best outcome for itself. If the process doesn't do it, the kernel will choose which buffers to evict at random. (thanks to Christian Koenig for this idea)
> > This is going to be difficult.  On Intel, we have some resources that
> > have to be pinned to VRAM and can't be dynamically swapped out by the
> > kernel.  In GL, we probably can deal with it somewhat dynamically.  In
> > Vulkan, we'll be entirely dependent on the application to use the
> > appropriate Vulkan memory budget APIs.
> >
> > --Jason
> >
> >
> >> 3.3. Buffer destruction without per-BO fences
> >>
> >> When the buffer destroy ioctl is called, an optional fence list can be passed to the kernel to indicate when it's safe to deallocate the buffer. If the fence list is empty, the buffer will be deallocated immediately. Shared buffers will be handled by merging fence lists from all processes that destroy them. Mitigation of malicious behavior:
> >> - If userspace destroys a busy buffer, it will get a GPU page fault.
> >> - If userspace sends fences that never signal, the kernel will have a timeout period and then will proceed to deallocate the buffer anyway.
> >>
> >> 3.4. Other notes on MM
> >>
> >> Overcommitment of GPU-accessible memory will cause an allocation failure or invoke the OOM killer. Evictions to GPU-inaccessible memory might not be supported.
> >>
> >> Kernel drivers could move to this new memory management today. Only buffer residency and evictions would stop using per-BO fences.
> >>
> >>
> >> 4. Deprecating implicit synchronization
> >>
> >> It can be phased out by introducing a new generation of hardware where the driver doesn't add support for it (like a driver fork would do), assuming userspace has all the changes for explicit synchronization. This could potentially create an isolated part of the kernel DRM where all drivers only support explicit synchronization.
> >>
> >> Marek
> >> _______________________________________________
> >> dri-devel mailing list
> >> dri-devel@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 10:34     ` Daniel Vetter
@ 2021-04-20 11:03       ` Marek Olšák
  2021-04-20 11:16         ` Daniel Vetter
  0 siblings, 1 reply; 105+ messages in thread
From: Marek Olšák @ 2021-04-20 11:03 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, dri-devel, Jason Ekstrand, ML Mesa-dev


Daniel, are you suggesting that we should skip any deadlock prevention in
the kernel, and just let userspace wait for and signal any fence it has
access to?

Do you have any concern with the deprecation/removal of BO fences in the
kernel assuming userspace is only using explicit fences? Any concern with
the submit and return fences for modesetting and other producer<->consumer
scenarios?

Thanks,
Marek

On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter <daniel@ffwll.ch> wrote:

> On Tue, Apr 20, 2021 at 12:15 PM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
> >
> > Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> > > Not going to comment on everything on the first pass...
> > >
> > > On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <maraeo@gmail.com> wrote:
> > >> Hi,
> > >>
> > >> This is our initial proposal for explicit fences everywhere and new
> memory management that doesn't use BO fences. It's a redesign of how Linux
> graphics drivers work, and it can coexist with what we have now.
> > >>
> > >>
> > >> 1. Introduction
> > >> (skip this if you are already sold on explicit fences)
> > >>
> > >> The current Linux graphics architecture was initially designed for
> GPUs with only one graphics queue where everything was executed in the
> submission order and per-BO fences were used for memory management and
> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> queues were added on top, which required the introduction of implicit
> GPU-GPU synchronization between queues of different processes using per-BO
> fences. Recently, even parallel execution within one queue was enabled
> where a command buffer starts draws and compute shaders, but doesn't wait
> for them, enabling parallelism between back-to-back command buffers.
> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> was created to enable all those use cases, and it's the only reason why the
> scheduler exists.
> > >>
> > >> The GPU scheduler, implicit synchronization, BO-fence-based memory
> management, and the tracking of per-BO fences increase CPU overhead and
> latency, and reduce parallelism. There is a desire to replace all of them
> with something much simpler. Below is how we could do it.
> > >>
> > >>
> > >> 2. Explicit synchronization for window systems and modesetting
> > >>
> > >> The producer is an application and the consumer is a compositor or a
> modesetting driver.
> > >>
> > >> 2.1. The Present request
> > >>
> > >> As part of the Present request, the producer will pass 2 fences (sync
> objects) to the consumer alongside the presented DMABUF BO:
> > >> - The submit fence: Initially unsignalled, it will be signalled when
> the producer has finished drawing into the presented buffer.
> > >> - The return fence: Initially unsignalled, it will be signalled when
> the consumer has finished using the presented buffer.
> > > I'm not sure syncobj is what we want.  In the Intel world we're trying
> > > to go even further to something we're calling "userspace fences" which
> > > are a timeline implemented as a single 64-bit value in some
> > > CPU-mappable BO.  The client writes a higher value into the BO to
> > > signal the timeline.
> >
> > Well that is exactly what our Windows guys have suggested as well, but
> > it strongly looks like that this isn't sufficient.
> >
> > First of all you run into security problems when any application can
> > just write any value to that memory location. Just imagine an
> > application sets the counter to zero and X waits forever for some
> > rendering to finish.
>
> The thing is, with userspace fences security boundary issue prevent
> moves into userspace entirely. And it really doesn't matter whether
> the event you're waiting on doesn't complete because the other app
> crashed or was stupid or intentionally gave you a wrong fence point:
> You have to somehow handle that, e.g. perhaps with conditional
> rendering and just using the old frame in compositing if the new one
> doesn't show up in time. Or something like that. So trying to get the
> kernel involved but also not so much involved sounds like a bad design
> to me.
>
> > Additional to that in such a model you can't determine who is the guilty
> > queue in case of a hang and can't reset the synchronization primitives
> > in case of an error.
> >
> > Apart from that this is rather inefficient, e.g. we don't have any way
> > to prevent priority inversion when used as a synchronization mechanism
> > between different GPU queues.
>
> Yeah but you can't have it both ways. Either all the scheduling in the
> kernel and fence handling is a problem, or you actually want to
> schedule in the kernel. hw seems to definitely move towards the more
> stupid spinlock-in-hw model (and direct submit from userspace and all
> that), priority inversions be damned. I'm really not sure we should
> fight that - if it's really that inefficient then maybe hw will add
> support for waiting sync constructs in hardware, or at least be
> smarter about scheduling other stuff. E.g. on intel hw both the kernel
> scheduler and fw scheduler knows when you're spinning on a hw fence
> (whether userspace or kernel doesn't matter) and plugs in something
> else. Add in a bit of hw support to watch cachelines, and you have
> something which can handle both directions efficiently.
>
> Imo given where hw is going, we shouldn't try to be too clever here.
> The only thing we do need to provision is being able to do cpu side
> waits without spinning. And that should probably be done in a fairly
> gpu specific way still.
> -Daniel
>
> > Christian.
> >
> > >    The kernel then provides some helpers for
> > > waiting on them reliably and without spinning.  I don't expect
> > > everyone to support these right away but, If we're going to re-plumb
> > > userspace for explicit synchronization, I'd like to make sure we take
> > > this into account so we only have to do it once.
> > >
> > >
> > >> Deadlock mitigation to recover from segfaults:
> > >> - The kernel knows which process is obliged to signal which fence.
> This information is part of the Present request and supplied by userspace.
> > > This isn't clear to me.  Yes, if we're using anything dma-fence based
> > > like syncobj, this is true.  But it doesn't seem totally true as a
> > > general statement.
> > >
> > >
> > >> - If the producer crashes, the kernel signals the submit fence, so
> that the consumer can make forward progress.
> > >> - If the consumer crashes, the kernel signals the return fence, so
> that the producer can reclaim the buffer.
> > >> - A GPU hang signals all fences. Other deadlocks will be handled like
> GPU hangs.
> > > What do you mean by "all"?  All fences that were supposed to be
> > > signaled by the hung context?
> > >
> > >
> > >> Other window system requests can follow the same idea.
> > >>
> > >> Merged fences where one fence object contains multiple fences will be
> supported. A merged fence is signalled only when its fences are signalled.
> The consumer will have the option to redefine the unsignalled return fence
> to a merged fence.
> > >>
> > >> 2.2. Modesetting
> > >>
> > >> Since a modesetting driver can also be the consumer, the present
> ioctl will contain a submit fence and a return fence too. One small problem
> with this is that userspace can hang the modesetting driver, but in theory,
> any later present ioctl can override the previous one, so the unsignalled
> presentation is never used.
> > >>
> > >>
> > >> 3. New memory management
> > >>
> > >> The per-BO fences will be removed and the kernel will not know which
> buffers are busy. This will reduce CPU overhead and latency. The kernel
> will not need per-BO fences with explicit synchronization, so we just need
> to remove their last user: buffer evictions. It also resolves the current
> OOM deadlock.
> > > Is this even really possible?  I'm no kernel MM expert (trying to
> > > learn some) but my understanding is that the use of per-BO dma-fence
> > > runs deep.  I would like to stop using it for implicit synchronization
> > > to be sure, but I'm not sure I believe the claim that we can get rid
> > > of it entirely.  Happy to see someone try, though.
> > >
> > >
> > >> 3.1. Evictions
> > >>
> > >> If the kernel wants to move a buffer, it will have to wait for
> everything to go idle, halt all userspace command submissions, move the
> buffer, and resume everything. This is not expected to happen when memory
> is not exhausted. Other more efficient ways of synchronization are also
> possible (e.g. sync only one process), but are not discussed here.
> > >>
> > >> 3.2. Per-process VRAM usage quota
> > >>
> > >> Each process can optionally and periodically query its VRAM usage
> quota and change domains of its buffers to obey that quota. For example, a
> process allocated 2 GB of buffers in VRAM, but the kernel decreased the
> quota to 1 GB. The process can change the domains of the least important
> buffers to GTT to get the best outcome for itself. If the process doesn't
> do it, the kernel will choose which buffers to evict at random. (thanks to
> Christian Koenig for this idea)
> > > This is going to be difficult.  On Intel, we have some resources that
> > > have to be pinned to VRAM and can't be dynamically swapped out by the
> > > kernel.  In GL, we probably can deal with it somewhat dynamically.  In
> > > Vulkan, we'll be entirely dependent on the application to use the
> > > appropriate Vulkan memory budget APIs.
> > >
> > > --Jason
> > >
> > >
> > >> 3.3. Buffer destruction without per-BO fences
> > >>
> > >> When the buffer destroy ioctl is called, an optional fence list can
> be passed to the kernel to indicate when it's safe to deallocate the
> buffer. If the fence list is empty, the buffer will be deallocated
> immediately. Shared buffers will be handled by merging fence lists from all
> processes that destroy them. Mitigation of malicious behavior:
> > >> - If userspace destroys a busy buffer, it will get a GPU page fault.
> > >> - If userspace sends fences that never signal, the kernel will have a
> timeout period and then will proceed to deallocate the buffer anyway.
> > >>
> > >> 3.4. Other notes on MM
> > >>
> > >> Overcommitment of GPU-accessible memory will cause an allocation
> failure or invoke the OOM killer. Evictions to GPU-inaccessible memory
> might not be supported.
> > >>
> > >> Kernel drivers could move to this new memory management today. Only
> buffer residency and evictions would stop using per-BO fences.
> > >>
> > >>
> > >> 4. Deprecating implicit synchronization
> > >>
> > >> It can be phased out by introducing a new generation of hardware
> where the driver doesn't add support for it (like a driver fork would do),
> assuming userspace has all the changes for explicit synchronization. This
> could potentially create an isolated part of the kernel DRM where all
> drivers only support explicit synchronization.
> > >>
> > >> Marek
> > >> _______________________________________________
> > >> dri-devel mailing list
> > >> dri-devel@lists.freedesktop.org
> > >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > _______________________________________________
> > > mesa-dev mailing list
> > > mesa-dev@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>


[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 11:03       ` Marek Olšák
@ 2021-04-20 11:16         ` Daniel Vetter
  2021-04-20 11:59           ` Christian König
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-04-20 11:16 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Christian König, dri-devel, Jason Ekstrand, ML Mesa-dev

On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
> Daniel, are you suggesting that we should skip any deadlock prevention in
> the kernel, and just let userspace wait for and signal any fence it has
> access to?

Yeah. If we go with userspace fences, then userspace can hang itself. Not
the kernel's problem. The only criterion is that the kernel itself must
never rely on these userspace fences, except for stuff like implementing
optimized cpu waits. And in those we must always guarantee that the
userspace process remains interruptible.
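
To make "optimized cpu waits" a bit more concrete, here is a purely
illustrative userspace-side sketch (the fence layout, the mmap'ed timeline
pointer and the poll/timeout policy are all assumptions, not an existing
API); a real implementation would sleep via a kernel helper instead of
spinning:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Assumed layout: a 64-bit timeline value in a CPU-mappable BO. The GPU
 * (or another process) writes monotonically increasing values into it. */
struct user_fence {
    _Atomic uint64_t *timeline;   /* mmap'ed from the shared BO */
};

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Busy-wait with a deadline; the caller decides how to recover on timeout,
 * and a production version would fall back to a kernel-assisted sleep so
 * the process stays interruptible instead of spinning. */
static bool user_fence_wait(struct user_fence *f, uint64_t point,
                            uint64_t timeout_ns)
{
    uint64_t deadline = now_ns() + timeout_ns;

    while (atomic_load_explicit(f->timeline, memory_order_acquire) < point) {
        if (now_ns() >= deadline)
            return false;
    }
    return true;
}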

It's a completely different world from dma_fence based kernel fences,
whether those are implicit or explicit.

> Do you have any concern with the deprecation/removal of BO fences in the
> kernel assuming userspace is only using explicit fences? Any concern with
> the submit and return fences for modesetting and other producer<->consumer
> scenarios?

Let me work on the full reply to your RFC first, because there's a lot
of detail and nuance here.
-Daniel

> 
> Thanks,
> Marek
> 
> On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> 
> > On Tue, Apr 20, 2021 at 12:15 PM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> > >
> > > Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> > > > Not going to comment on everything on the first pass...
> > > >
> > > > On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <maraeo@gmail.com> wrote:
> > > >> Hi,
> > > >>
> > > >> This is our initial proposal for explicit fences everywhere and new
> > memory management that doesn't use BO fences. It's a redesign of how Linux
> > graphics drivers work, and it can coexist with what we have now.
> > > >>
> > > >>
> > > >> 1. Introduction
> > > >> (skip this if you are already sold on explicit fences)
> > > >>
> > > >> The current Linux graphics architecture was initially designed for
> > GPUs with only one graphics queue where everything was executed in the
> > submission order and per-BO fences were used for memory management and
> > CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> > queues were added on top, which required the introduction of implicit
> > GPU-GPU synchronization between queues of different processes using per-BO
> > fences. Recently, even parallel execution within one queue was enabled
> > where a command buffer starts draws and compute shaders, but doesn't wait
> > for them, enabling parallelism between back-to-back command buffers.
> > Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> > was created to enable all those use cases, and it's the only reason why the
> > scheduler exists.
> > > >>
> > > >> The GPU scheduler, implicit synchronization, BO-fence-based memory
> > management, and the tracking of per-BO fences increase CPU overhead and
> > latency, and reduce parallelism. There is a desire to replace all of them
> > with something much simpler. Below is how we could do it.
> > > >>
> > > >>
> > > >> 2. Explicit synchronization for window systems and modesetting
> > > >>
> > > >> The producer is an application and the consumer is a compositor or a
> > modesetting driver.
> > > >>
> > > >> 2.1. The Present request
> > > >>
> > > >> As part of the Present request, the producer will pass 2 fences (sync
> > objects) to the consumer alongside the presented DMABUF BO:
> > > >> - The submit fence: Initially unsignalled, it will be signalled when
> > the producer has finished drawing into the presented buffer.
> > > >> - The return fence: Initially unsignalled, it will be signalled when
> > the consumer has finished using the presented buffer.
> > > > I'm not sure syncobj is what we want.  In the Intel world we're trying
> > > > to go even further to something we're calling "userspace fences" which
> > > > are a timeline implemented as a single 64-bit value in some
> > > > CPU-mappable BO.  The client writes a higher value into the BO to
> > > > signal the timeline.
> > >
> > > Well that is exactly what our Windows guys have suggested as well, but
> > > it strongly looks like that this isn't sufficient.
> > >
> > > First of all you run into security problems when any application can
> > > just write any value to that memory location. Just imagine an
> > > application sets the counter to zero and X waits forever for some
> > > rendering to finish.
> >
> > The thing is, with userspace fences, security boundary issue prevention
> > moves into userspace entirely. And it really doesn't matter whether
> > the event you're waiting on doesn't complete because the other app
> > crashed or was stupid or intentionally gave you a wrong fence point:
> > You have to somehow handle that, e.g. perhaps with conditional
> > rendering and just using the old frame in compositing if the new one
> > doesn't show up in time. Or something like that. So trying to get the
> > kernel involved but also not so much involved sounds like a bad design
> > to me.
> >
> > > Additional to that in such a model you can't determine who is the guilty
> > > queue in case of a hang and can't reset the synchronization primitives
> > > in case of an error.
> > >
> > > Apart from that this is rather inefficient, e.g. we don't have any way
> > > to prevent priority inversion when used as a synchronization mechanism
> > > between different GPU queues.
> >
> > Yeah but you can't have it both ways. Either all the scheduling in the
> > kernel and fence handling is a problem, or you actually want to
> > schedule in the kernel. hw seems to definitely move towards the more
> > stupid spinlock-in-hw model (and direct submit from userspace and all
> > that), priority inversions be damned. I'm really not sure we should
> > fight that - if it's really that inefficient then maybe hw will add
> > support for waiting sync constructs in hardware, or at least be
> > smarter about scheduling other stuff. E.g. on intel hw both the kernel
> > scheduler and fw scheduler knows when you're spinning on a hw fence
> > (whether userspace or kernel doesn't matter) and plugs in something
> > else. Add in a bit of hw support to watch cachelines, and you have
> > something which can handle both directions efficiently.
> >
> > Imo given where hw is going, we shouldn't try to be too clever here.
> > The only thing we do need to provision is being able to do cpu side
> > waits without spinning. And that should probably be done in a fairly
> > gpu specific way still.
> > -Daniel
> >
> > > Christian.
> > >
> > > >    The kernel then provides some helpers for
> > > > waiting on them reliably and without spinning.  I don't expect
> > > > everyone to support these right away but, If we're going to re-plumb
> > > > userspace for explicit synchronization, I'd like to make sure we take
> > > > this into account so we only have to do it once.
> > > >
> > > >
> > > >> Deadlock mitigation to recover from segfaults:
> > > >> - The kernel knows which process is obliged to signal which fence.
> > This information is part of the Present request and supplied by userspace.
> > > > This isn't clear to me.  Yes, if we're using anything dma-fence based
> > > > like syncobj, this is true.  But it doesn't seem totally true as a
> > > > general statement.
> > > >
> > > >
> > > >> - If the producer crashes, the kernel signals the submit fence, so
> > that the consumer can make forward progress.
> > > >> - If the consumer crashes, the kernel signals the return fence, so
> > that the producer can reclaim the buffer.
> > > >> - A GPU hang signals all fences. Other deadlocks will be handled like
> > GPU hangs.
> > > > What do you mean by "all"?  All fences that were supposed to be
> > > > signaled by the hung context?
> > > >
> > > >
> > > >> Other window system requests can follow the same idea.
> > > >>
> > > >> Merged fences where one fence object contains multiple fences will be
> > supported. A merged fence is signalled only when its fences are signalled.
> > The consumer will have the option to redefine the unsignalled return fence
> > to a merged fence.
> > > >>
> > > >> 2.2. Modesetting
> > > >>
> > > >> Since a modesetting driver can also be the consumer, the present
> > ioctl will contain a submit fence and a return fence too. One small problem
> > with this is that userspace can hang the modesetting driver, but in theory,
> > any later present ioctl can override the previous one, so the unsignalled
> > presentation is never used.
> > > >>
> > > >>
> > > >> 3. New memory management
> > > >>
> > > >> The per-BO fences will be removed and the kernel will not know which
> > buffers are busy. This will reduce CPU overhead and latency. The kernel
> > will not need per-BO fences with explicit synchronization, so we just need
> > to remove their last user: buffer evictions. It also resolves the current
> > OOM deadlock.
> > > > Is this even really possible?  I'm no kernel MM expert (trying to
> > > > learn some) but my understanding is that the use of per-BO dma-fence
> > > > runs deep.  I would like to stop using it for implicit synchronization
> > > > to be sure, but I'm not sure I believe the claim that we can get rid
> > > > of it entirely.  Happy to see someone try, though.
> > > >
> > > >
> > > >> 3.1. Evictions
> > > >>
> > > >> If the kernel wants to move a buffer, it will have to wait for
> > everything to go idle, halt all userspace command submissions, move the
> > buffer, and resume everything. This is not expected to happen when memory
> > is not exhausted. Other more efficient ways of synchronization are also
> > possible (e.g. sync only one process), but are not discussed here.
> > > >>
> > > >> 3.2. Per-process VRAM usage quota
> > > >>
> > > >> Each process can optionally and periodically query its VRAM usage
> > quota and change domains of its buffers to obey that quota. For example, a
> > process allocated 2 GB of buffers in VRAM, but the kernel decreased the
> > quota to 1 GB. The process can change the domains of the least important
> > buffers to GTT to get the best outcome for itself. If the process doesn't
> > do it, the kernel will choose which buffers to evict at random. (thanks to
> > Christian Koenig for this idea)
> > > > This is going to be difficult.  On Intel, we have some resources that
> > > > have to be pinned to VRAM and can't be dynamically swapped out by the
> > > > kernel.  In GL, we probably can deal with it somewhat dynamically.  In
> > > > Vulkan, we'll be entirely dependent on the application to use the
> > > > appropriate Vulkan memory budget APIs.
> > > >
> > > > --Jason
> > > >
> > > >
> > > >> 3.3. Buffer destruction without per-BO fences
> > > >>
> > > >> When the buffer destroy ioctl is called, an optional fence list can
> > be passed to the kernel to indicate when it's safe to deallocate the
> > buffer. If the fence list is empty, the buffer will be deallocated
> > immediately. Shared buffers will be handled by merging fence lists from all
> > processes that destroy them. Mitigation of malicious behavior:
> > > >> - If userspace destroys a busy buffer, it will get a GPU page fault.
> > > >> - If userspace sends fences that never signal, the kernel will have a
> > timeout period and then will proceed to deallocate the buffer anyway.
> > > >>
> > > >> 3.4. Other notes on MM
> > > >>
> > > >> Overcommitment of GPU-accessible memory will cause an allocation
> > failure or invoke the OOM killer. Evictions to GPU-inaccessible memory
> > might not be supported.
> > > >>
> > > >> Kernel drivers could move to this new memory management today. Only
> > buffer residency and evictions would stop using per-BO fences.
> > > >>
> > > >>
> > > >> 4. Deprecating implicit synchronization
> > > >>
> > > >> It can be phased out by introducing a new generation of hardware
> > where the driver doesn't add support for it (like a driver fork would do),
> > assuming userspace has all the changes for explicit synchronization. This
> > could potentially create an isolated part of the kernel DRM where all
> > drivers only support explicit synchronization.
> > > >>
> > > >> Marek
> > > >> _______________________________________________
> > > >> dri-devel mailing list
> > > >> dri-devel@lists.freedesktop.org
> > > >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > > _______________________________________________
> > > > mesa-dev mailing list
> > > > mesa-dev@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> > >
> >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 11:16         ` Daniel Vetter
@ 2021-04-20 11:59           ` Christian König
  2021-04-20 14:09             ` Daniel Vetter
  2021-04-20 16:19             ` Jason Ekstrand
  0 siblings, 2 replies; 105+ messages in thread
From: Christian König @ 2021-04-20 11:59 UTC (permalink / raw)
  To: Daniel Vetter, Marek Olšák
  Cc: ML Mesa-dev, dri-devel, Jason Ekstrand

> Yeah. If we go with userspace fences, then userspace can hang itself. Not
> the kernel's problem.

Well, the path of inner peace begins with four words. “Not my fucking 
problem.”

But I'm not so much concerned about the kernel as about important
userspace processes like X, Wayland, SurfaceFlinger etc...

I mean attaching a page to a sync object and allowing it to be waited on
and signalled from both the CPU and GPU side is not so much of a problem.
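
Just to illustrate what I mean, roughly (field and function names made up,
this is not an existing kernel or Mesa structure):

#include <stdatomic.h>
#include <stdint.h>

/* One 4 KiB CPU-mappable page backing a user-fence sync object. Both the
 * CPU and the GPU can read/write the sequence number; signalling means
 * storing a value >= the awaited point. */
struct user_sync_page {
    _Atomic uint64_t seqno;     /* monotonically increasing timeline */
    uint64_t reserved[511];     /* pad to a full page */
};

/* CPU-side signal: publish all prior writes, then bump the timeline. */
static void user_sync_signal(struct user_sync_page *p, uint64_t point)
{
    atomic_store_explicit(&p->seqno, point, memory_order_release);
}

/* CPU-side check: non-blocking query, e.g. for a compositor. */
static int user_sync_is_signalled(struct user_sync_page *p, uint64_t point)
{
    return atomic_load_explicit(&p->seqno, memory_order_acquire) >= point;
}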

> You have to somehow handle that, e.g. perhaps with conditional
> rendering and just using the old frame in compositing if the new one
> doesn't show up in time.

Nice idea, but how would you handle that at the OpenGL/Glamor/Vulkan level?

Regards,
Christian.

Am 20.04.21 um 13:16 schrieb Daniel Vetter:
> On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
>> Daniel, are you suggesting that we should skip any deadlock prevention in
>> the kernel, and just let userspace wait for and signal any fence it has
>> access to?
> Yeah. If we go with userspace fences, then userspace can hang itself. Not
> the kernel's problem. The only criteria is that the kernel itself must
> never rely on these userspace fences, except for stuff like implementing
> optimized cpu waits. And in those we must always guarantee that the
> userspace process remains interruptible.
>
> It's a completely different world from dma_fence based kernel fences,
> whether those are implicit or explicit.
>
>> Do you have any concern with the deprecation/removal of BO fences in the
>> kernel assuming userspace is only using explicit fences? Any concern with
>> the submit and return fences for modesetting and other producer<->consumer
>> scenarios?
> Let me work on the full replay for your rfc first, because there's a lot
> of details here and nuance.
> -Daniel
>
>> Thanks,
>> Marek
>>
>> On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>>
>>> On Tue, Apr 20, 2021 at 12:15 PM Christian König
>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>> Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
>>>>> Not going to comment on everything on the first pass...
>>>>>
>>>>> On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <maraeo@gmail.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This is our initial proposal for explicit fences everywhere and new
>>> memory management that doesn't use BO fences. It's a redesign of how Linux
>>> graphics drivers work, and it can coexist with what we have now.
>>>>>>
>>>>>> 1. Introduction
>>>>>> (skip this if you are already sold on explicit fences)
>>>>>>
>>>>>> The current Linux graphics architecture was initially designed for
>>> GPUs with only one graphics queue where everything was executed in the
>>> submission order and per-BO fences were used for memory management and
>>> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
>>> queues were added on top, which required the introduction of implicit
>>> GPU-GPU synchronization between queues of different processes using per-BO
>>> fences. Recently, even parallel execution within one queue was enabled
>>> where a command buffer starts draws and compute shaders, but doesn't wait
>>> for them, enabling parallelism between back-to-back command buffers.
>>> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
>>> was created to enable all those use cases, and it's the only reason why the
>>> scheduler exists.
>>>>>> The GPU scheduler, implicit synchronization, BO-fence-based memory
>>> management, and the tracking of per-BO fences increase CPU overhead and
>>> latency, and reduce parallelism. There is a desire to replace all of them
>>> with something much simpler. Below is how we could do it.
>>>>>>
>>>>>> 2. Explicit synchronization for window systems and modesetting
>>>>>>
>>>>>> The producer is an application and the consumer is a compositor or a
>>> modesetting driver.
>>>>>> 2.1. The Present request
>>>>>>
>>>>>> As part of the Present request, the producer will pass 2 fences (sync
>>> objects) to the consumer alongside the presented DMABUF BO:
>>>>>> - The submit fence: Initially unsignalled, it will be signalled when
>>> the producer has finished drawing into the presented buffer.
>>>>>> - The return fence: Initially unsignalled, it will be signalled when
>>> the consumer has finished using the presented buffer.
>>>>> I'm not sure syncobj is what we want.  In the Intel world we're trying
>>>>> to go even further to something we're calling "userspace fences" which
>>>>> are a timeline implemented as a single 64-bit value in some
>>>>> CPU-mappable BO.  The client writes a higher value into the BO to
>>>>> signal the timeline.
>>>> Well that is exactly what our Windows guys have suggested as well, but
>>>> it strongly looks like that this isn't sufficient.
>>>>
>>>> First of all you run into security problems when any application can
>>>> just write any value to that memory location. Just imagine an
>>>> application sets the counter to zero and X waits forever for some
>>>> rendering to finish.
>>> The thing is, with userspace fences, security boundary issue prevention
>>> moves into userspace entirely. And it really doesn't matter whether
>>> the event you're waiting on doesn't complete because the other app
>>> crashed or was stupid or intentionally gave you a wrong fence point:
>>> You have to somehow handle that, e.g. perhaps with conditional
>>> rendering and just using the old frame in compositing if the new one
>>> doesn't show up in time. Or something like that. So trying to get the
>>> kernel involved but also not so much involved sounds like a bad design
>>> to me.
>>>
>>>> Additional to that in such a model you can't determine who is the guilty
>>>> queue in case of a hang and can't reset the synchronization primitives
>>>> in case of an error.
>>>>
>>>> Apart from that this is rather inefficient, e.g. we don't have any way
>>>> to prevent priority inversion when used as a synchronization mechanism
>>>> between different GPU queues.
>>> Yeah but you can't have it both ways. Either all the scheduling in the
>>> kernel and fence handling is a problem, or you actually want to
>>> schedule in the kernel. hw seems to definitely move towards the more
>>> stupid spinlock-in-hw model (and direct submit from userspace and all
>>> that), priority inversions be damned. I'm really not sure we should
>>> fight that - if it's really that inefficient then maybe hw will add
>>> support for waiting sync constructs in hardware, or at least be
>>> smarter about scheduling other stuff. E.g. on intel hw both the kernel
>>> scheduler and fw scheduler knows when you're spinning on a hw fence
>>> (whether userspace or kernel doesn't matter) and plugs in something
>>> else. Add in a bit of hw support to watch cachelines, and you have
>>> something which can handle both directions efficiently.
>>>
>>> Imo given where hw is going, we shouldn't try to be too clever here.
>>> The only thing we do need to provision is being able to do cpu side
>>> waits without spinning. And that should probably be done in a fairly
>>> gpu specific way still.
>>> -Daniel
>>>
>>>> Christian.
>>>>
>>>>>     The kernel then provides some helpers for
>>>>> waiting on them reliably and without spinning.  I don't expect
>>>>> everyone to support these right away but, If we're going to re-plumb
>>>>> userspace for explicit synchronization, I'd like to make sure we take
>>>>> this into account so we only have to do it once.
>>>>>
>>>>>
>>>>>> Deadlock mitigation to recover from segfaults:
>>>>>> - The kernel knows which process is obliged to signal which fence.
>>> This information is part of the Present request and supplied by userspace.
>>>>> This isn't clear to me.  Yes, if we're using anything dma-fence based
>>>>> like syncobj, this is true.  But it doesn't seem totally true as a
>>>>> general statement.
>>>>>
>>>>>
>>>>>> - If the producer crashes, the kernel signals the submit fence, so
>>> that the consumer can make forward progress.
>>>>>> - If the consumer crashes, the kernel signals the return fence, so
>>> that the producer can reclaim the buffer.
>>>>>> - A GPU hang signals all fences. Other deadlocks will be handled like
>>> GPU hangs.
>>>>> What do you mean by "all"?  All fences that were supposed to be
>>>>> signaled by the hung context?
>>>>>
>>>>>
>>>>>> Other window system requests can follow the same idea.
>>>>>>
>>>>>> Merged fences where one fence object contains multiple fences will be
>>> supported. A merged fence is signalled only when its fences are signalled.
>>> The consumer will have the option to redefine the unsignalled return fence
>>> to a merged fence.
>>>>>> 2.2. Modesetting
>>>>>>
>>>>>> Since a modesetting driver can also be the consumer, the present
>>> ioctl will contain a submit fence and a return fence too. One small problem
>>> with this is that userspace can hang the modesetting driver, but in theory,
>>> any later present ioctl can override the previous one, so the unsignalled
>>> presentation is never used.
>>>>>>
>>>>>> 3. New memory management
>>>>>>
>>>>>> The per-BO fences will be removed and the kernel will not know which
>>> buffers are busy. This will reduce CPU overhead and latency. The kernel
>>> will not need per-BO fences with explicit synchronization, so we just need
>>> to remove their last user: buffer evictions. It also resolves the current
>>> OOM deadlock.
>>>>> Is this even really possible?  I'm no kernel MM expert (trying to
>>>>> learn some) but my understanding is that the use of per-BO dma-fence
>>>>> runs deep.  I would like to stop using it for implicit synchronization
>>>>> to be sure, but I'm not sure I believe the claim that we can get rid
>>>>> of it entirely.  Happy to see someone try, though.
>>>>>
>>>>>
>>>>>> 3.1. Evictions
>>>>>>
>>>>>> If the kernel wants to move a buffer, it will have to wait for
>>> everything to go idle, halt all userspace command submissions, move the
>>> buffer, and resume everything. This is not expected to happen when memory
>>> is not exhausted. Other more efficient ways of synchronization are also
>>> possible (e.g. sync only one process), but are not discussed here.
>>>>>> 3.2. Per-process VRAM usage quota
>>>>>>
>>>>>> Each process can optionally and periodically query its VRAM usage
>>> quota and change domains of its buffers to obey that quota. For example, a
>>> process allocated 2 GB of buffers in VRAM, but the kernel decreased the
>>> quota to 1 GB. The process can change the domains of the least important
>>> buffers to GTT to get the best outcome for itself. If the process doesn't
>>> do it, the kernel will choose which buffers to evict at random. (thanks to
>>> Christian Koenig for this idea)
>>>>> This is going to be difficult.  On Intel, we have some resources that
>>>>> have to be pinned to VRAM and can't be dynamically swapped out by the
>>>>> kernel.  In GL, we probably can deal with it somewhat dynamically.  In
>>>>> Vulkan, we'll be entirely dependent on the application to use the
>>>>> appropriate Vulkan memory budget APIs.
>>>>>
>>>>> --Jason
>>>>>
>>>>>
>>>>>> 3.3. Buffer destruction without per-BO fences
>>>>>>
>>>>>> When the buffer destroy ioctl is called, an optional fence list can
>>> be passed to the kernel to indicate when it's safe to deallocate the
>>> buffer. If the fence list is empty, the buffer will be deallocated
>>> immediately. Shared buffers will be handled by merging fence lists from all
>>> processes that destroy them. Mitigation of malicious behavior:
>>>>>> - If userspace destroys a busy buffer, it will get a GPU page fault.
>>>>>> - If userspace sends fences that never signal, the kernel will have a
>>> timeout period and then will proceed to deallocate the buffer anyway.
>>>>>> 3.4. Other notes on MM
>>>>>>
>>>>>> Overcommitment of GPU-accessible memory will cause an allocation
>>> failure or invoke the OOM killer. Evictions to GPU-inaccessible memory
>>> might not be supported.
>>>>>> Kernel drivers could move to this new memory management today. Only
>>> buffer residency and evictions would stop using per-BO fences.
>>>>>>
>>>>>> 4. Deprecating implicit synchronization
>>>>>>
>>>>>> It can be phased out by introducing a new generation of hardware
>>> where the driver doesn't add support for it (like a driver fork would do),
>>> assuming userspace has all the changes for explicit synchronization. This
>>> could potentially create an isolated part of the kernel DRM where all
>>> drivers only support explicit synchronization.
>>>>>> Marek
>>>>>> _______________________________________________
>>>>>> dri-devel mailing list
>>>>>> dri-devel@lists.freedesktop.org
>>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>> _______________________________________________
>>>>> mesa-dev mailing list
>>>>> mesa-dev@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>
>>> --
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch
>>>

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-19 10:47 [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal Marek Olšák
  2021-04-19 15:48 ` Jason Ekstrand
@ 2021-04-20 12:01 ` Daniel Vetter
  2021-04-20 12:19   ` [Mesa-dev] " Christian König
  2021-04-20 13:03   ` Daniel Stone
  2021-04-20 12:42 ` Daniel Stone
  2021-04-20 14:53 ` Daniel Stone
  3 siblings, 2 replies; 105+ messages in thread
From: Daniel Vetter @ 2021-04-20 12:01 UTC (permalink / raw)
  To: Marek Olšák; +Cc: ML Mesa-dev, dri-devel

On Mon, Apr 19, 2021 at 06:47:48AM -0400, Marek Olšák wrote:
> Hi,
> 
> This is our initial proposal for explicit fences everywhere and new memory
> management that doesn't use BO fences. It's a redesign of how Linux
> graphics drivers work, and it can coexist with what we have now.
> 
> 
> *1. Introduction*
> (skip this if you are already sold on explicit fences)
> 
> The current Linux graphics architecture was initially designed for GPUs
> with only one graphics queue where everything was executed in the
> submission order and per-BO fences were used for memory management and
> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> queues were added on top, which required the introduction of implicit
> GPU-GPU synchronization between queues of different processes using per-BO
> fences. Recently, even parallel execution within one queue was enabled
> where a command buffer starts draws and compute shaders, but doesn't wait
> for them, enabling parallelism between back-to-back command buffers.
> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> was created to enable all those use cases, and it's the only reason why the
> scheduler exists.
> 
> The GPU scheduler, implicit synchronization, BO-fence-based memory
> management, and the tracking of per-BO fences increase CPU overhead and
> latency, and reduce parallelism. There is a desire to replace all of them
> with something much simpler. Below is how we could do it.

I get the feeling you're mixing up a lot of things here that have more
nuance, so first some lingo.

- There's kernel based synchronization, based on dma_fence. These come in
  two major variants: Implicit synchronization, where the kernel attaches
  the dma_fences to a dma-buf, and explicit synchronization, where the
  dma_fence gets passed around as a stand-alone object, either a sync_file
  or a drm_syncobj

- Then there's userspace fence synchronization, where userspace issues any
  fences directly and the kernel doesn't even know what's going on. This
  is the only model that allows you to ditch the kernel overhead, and it's
  also the model that vk uses.

  I concur with Jason that this one is the future, it's the model hw
  wants, compute wants and vk wants. Building an explicit fence world
  which doesn't aim at this is imo wasted effort.
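
As a small illustration of the kernel-based explicit flavours above, using
libdrm entry points that exist today (error handling and fd lifetime are
omitted, and the render node path is an assumption):

#include <fcntl.h>
#include <stdint.h>
#include <xf86drm.h>

/* Create a drm_syncobj and export the dma_fence it currently contains as a
 * sync_file fd, i.e. the two explicit kernel sync objects listed above. */
static int export_current_fence(void)
{
    int drm_fd = open("/dev/dri/renderD128", O_RDWR);  /* assumed node */
    uint32_t syncobj;
    int sync_file_fd = -1;

    drmSyncobjCreate(drm_fd, 0, &syncobj);
    /* ... submit work that attaches a dma_fence to 'syncobj' ... */
    drmSyncobjExportSyncFile(drm_fd, syncobj, &sync_file_fd);
    return sync_file_fd;  /* can be passed to another process/protocol */
}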

Now you smash them into one thing by also changing the memory model, but I
think that doesn't work:

- Relying on gpu page faults across the board won't happen. I think right
  now only amd's GFX10 or so has enough pagefault support to allow this,
  and not even there I'm really sure. Nothing else will anytime soon, at
  least not as far as I know. So we need to support slightly more hw in
  upstream than just that.  Any plan that's realistic needs to cope with
  dma_fence for a really long time.

- Pown^WPin All The Things! is probably not a general enough memory
  management approach. We've kinda tried for years to move away from it.
  Sure we can support it as an optimization in specific workloads, and it
  will make stuff faster, but it's not going to be the default I think.

- We live in a post xf86-video-$vendor world, and all these other
  compositors rely on implicit sync. You're not going to be able to get
  rid of them anytime soon. What's worse, all the various EGL/vk buffer
  sharing things also rely on implicit sync, so you get to fix up tons of
  applications on top. Any plan that's realistic needs to cope with
  implicit and explicit sync at the same time; anything else won't work.

- Absolutely infuriating, but you can't use page-faulting together with any
  dma_fence synchronization primitives, whether implicit or explicit. This
  means until the entire ecosystem moved forward (good luck with that) we
  have to support dma_fence. The only sync model that works together with
  page faults is userspace fence based sync.

Then there's the somewhat aside topic of how amdgpu/radeonsi does implicit
sync, at least last I checked. Currently this oversynchronizes badly
because it's left to the kernel to guess what should be synchronized, and
that gets things wrong. What you need there is explicit implicit
synchronization:

- on the cs side, userspace must state explicitly for which buffers the kernel
  should engage in implicit synchronization. That's how it works on all
  other drivers that support more explicit userspace like vk or gl drivers
  that are internally all explicit. So essentially you only set the
  implicit fence slot when you really want to, and only userspace knows
  this. Implementing this without breaking the current logic probably
  needs some flags.

- the other side isn't there yet upstream, but Jason has patches.
  Essentially you also need to sample your implicit sync points at the
  right spot, to avoid oversync on later rendering by the producer.
  Jason's patch solves this by adding an ioctl to dma-buf to get the
  current set.

- without any of this things for pure explicit fencing userspace the
  kernel will simply maintain a list of all current users of a buffer. For
  memory management, which means eviction handling roughly works like you
  describe below, we wait for everything before a buffer can be moved.

This should get rid of the oversync issues, and since implicit sync is
baked in everywhere right now, you'll have to deal with implicit sync for
a very long time.
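
Rough sketch of the shape the dma-buf side of this could take, loosely
modelled on Jason's patches; the struct, ioctl number and flag semantics
here are guesses for illustration, not an upstream interface:

#include <stdint.h>
#include <sys/ioctl.h>

/* HYPOTHETICAL interface: sample the dma-buf's current implicit fences as
 * a sync_file, so the producer's later rendering doesn't oversync. */
struct dma_buf_export_sync_file {
    uint32_t flags;   /* e.g. read and/or write fences */
    int32_t  fd;      /* out: sync_file with the sampled fences */
};

#define DMA_BUF_IOCTL_EXPORT_SYNC_FILE \
    _IOWR('b', 2, struct dma_buf_export_sync_file)  /* number is a guess */

static int sample_implicit_fences(int dmabuf_fd, uint32_t flags)
{
    struct dma_buf_export_sync_file args = { .flags = flags, .fd = -1 };

    if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
        return -1;
    return args.fd;   /* wait on this instead of oversynchronizing */
}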

Next up is reducing the memory manager overhead of all this, without
changing the ecosystem.

- hw option would be page faults, but until we have full explicit
  userspace sync we can't use those. Which currently means compute only.
  Note that for vulkan or maybe also gl this is quite nasty for userspace,
  since as soon as you need to switch to dma_fence sync or implicit sync
  (winsys buffer, or buffer sharing with any of the current set of
  extensions) you have to flip your internal driver state around all sync
  points over from userspace fencing to dma_fence kernel fencing. Can
  still be all explicit using drm_syncobj ofc.

- next up if your hw has preemption, you could use that, except preemption
  takes a while longer, so from memory pov really should be done with
  dma_fence. Plus it has all the same problems in that it requires
  userspace fences.

- now for making dma_fence O(1) in the fastpath you need the shared
  dma_resv trick and the lru bulk move. radv/amdvlk use that, but I think
  radeonsi not yet. But maybe I missed that. Either way we need to do some
  better kernel work so it can also be fast for shared buffers, if those
  become a problem. On the GL side doing this will use a lot of the tricks
  for residency/working set management you describe below, except the
  kernel can still throw out an entire gpu job. This is essentially what
  you describe with 3.1. Vulkan/compute already work like this.
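
For illustration, a conceptual kernel-side sketch of the shared dma_resv
trick (the sketch_* names are made up, fence-slot reservation and error
handling are omitted, and this is not the actual amdgpu/ttm code):

#include <drm/drm_gem.h>
#include <linux/dma-fence.h>
#include <linux/dma-resv.h>
#include <linux/list.h>

/* Per-VM BOs share a single reservation object, so adding a job's fence or
 * validating the working set touches one dma_resv instead of one per BO,
 * which is what makes the fastpath O(1). */
struct sketch_vm {
	struct dma_resv resv;		/* one reservation object for the VM */
	struct list_head bos;		/* all BOs private to this VM */
};

struct sketch_bo {
	struct drm_gem_object base;
	struct list_head vm_entry;
};

static void sketch_vm_adopt_bo(struct sketch_vm *vm, struct sketch_bo *bo)
{
	bo->base.resv = &vm->resv;	/* share instead of the BO's own resv */
	list_add_tail(&bo->vm_entry, &vm->bos);
}

static void sketch_vm_add_job_fence(struct sketch_vm *vm,
				    struct dma_fence *fence)
{
	dma_resv_lock(&vm->resv, NULL);
	dma_resv_add_shared_fence(&vm->resv, fence);	/* covers every BO */
	dma_resv_unlock(&vm->resv);
}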

Now this gets the performance up, but it doesn't give us any road towards
using page faults (outside of compute) and so retiring dma_fence for good.
For that we need a few pieces:

- Full new set of userspace winsys protocols and egl/vk extensions. Pray
  it actually gets adopted, because neither AMD nor Intel have the
  engineers to push these kinds of ecosystem/middleware issues forward on
  their payrolls. Good pick is probably using drm_syncobj as the kernel
  primitive for this. Still uses dma_fence underneath.

- Some clever kernel tricks so that we can substitute dma_fence for
  userspace fences within a drm_syncobj. drm_syncobj already has the
  notion of waiting for a dma_fence to materialize. We can abuse that to
  create an upgrade path from dma_fence based sync to userspace fence
  syncing. Ofc none of this will be on the table if userspace hasn't
  adopted explicit sync.

With these two things I think we can have a reasonable upgrade path. None
of this will be a break-the-world type of change, though.
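
For reference, the "wait for the fence to materialize" behaviour is already
exposed through the syncobj wait flags; a compat path could stall on
something like this (existing libdrm call, timeout policy up to the caller):

#include <stdint.h>
#include <xf86drm.h>

/* Block until a dma_fence shows up in the syncobj *and* signals, which is
 * the hook mentioned above for bridging to userspace fences later on. */
static int wait_for_materialized_fence(int drm_fd, uint32_t syncobj,
                                       int64_t timeout_abs_ns)
{
    uint32_t first = 0;

    return drmSyncobjWait(drm_fd, &syncobj, 1, timeout_abs_ns,
                          DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT |
                          DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL,
                          &first);
}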

Bunch of comments below.

> *2. Explicit synchronization for window systems and modesetting*
> 
> The producer is an application and the consumer is a compositor or a
> modesetting driver.
> 
> *2.1. The Present request*
> 
> As part of the Present request, the producer will pass 2 fences (sync
> objects) to the consumer alongside the presented DMABUF BO:
> - The submit fence: Initially unsignalled, it will be signalled when the
> producer has finished drawing into the presented buffer.
> - The return fence: Initially unsignalled, it will be signalled when the
> consumer has finished using the presented buffer.

Build this with syncobj timelines and it makes a lot more sense I think.
We'll need that for having a proper upgrade path, both on the hw/driver
side (being able to support stuff like preempt or gpu page faults) and the
ecosystem side (so that we don't have to rev protocols twice, once going
to explicit dma_fence sync and once more for userspace sync).
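
A minimal sketch of the submit/return pair expressed with timeline
syncobjs, using existing libdrm entry points (the handles and points are
placeholders, and a real producer would signal from its CS ioctl rather
than from the CPU):

#include <stdint.h>
#include <xf86drm.h>

/* Producer side: wait for the consumer's "return" point before reusing the
 * buffer, then signal the "submit" point once the frame is queued. */
static void present_with_timelines(int drm_fd,
                                   uint32_t submit_obj, uint64_t submit_point,
                                   uint32_t return_obj, uint64_t return_point,
                                   int64_t timeout_abs_ns)
{
    uint32_t first = 0;

    /* Reclaim: block until the consumer has released the previous use. */
    drmSyncobjTimelineWait(drm_fd, &return_obj, &return_point, 1,
                           timeout_abs_ns,
                           DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, &first);

    /* ... record rendering into the buffer ... */

    /* Tell the consumer the frame is ready (CPU-side signal for brevity). */
    drmSyncobjTimelineSignal(drm_fd, &submit_obj, &submit_point, 1);
}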

> Deadlock mitigation to recover from segfaults:
> - The kernel knows which process is obliged to signal which fence. This
> information is part of the Present request and supplied by userspace.
> - If the producer crashes, the kernel signals the submit fence, so that the
> consumer can make forward progress.
> - If the consumer crashes, the kernel signals the return fence, so that the
> producer can reclaim the buffer.

So for kernel based sync imo simplest is to just reuse dma_fence, same
rules apply.

For userspace fencing the kernel simply doesn't care how stupid userspace
is. Security checks at boundaries (e.g. client vs compositor) are also
userspace's problem and can be handled by e.g. timeouts + conditional
rendering on the compositor side. The timeout might be in the compat glue,
e.g. when we stall for a dma_fence to materialize from a drm_syncobj. I
think in vulkan this is de facto already up to applications to deal with
entirely if they deal with untrusted fences.
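
As a sketch of that compositor-side policy (the helper names are
hypothetical; the point is only the deadline plus the fallback to the
previous frame, not any specific API):

#include <stdbool.h>
#include <stdint.h>

struct buffer;                      /* opaque, hypothetical */
bool fence_wait_until(struct buffer *buf, uint64_t point,
                      uint64_t deadline_ns);
void draw(struct buffer *buf);      /* hypothetical compositor helpers */

/* If the client's fence for the newly attached buffer does not signal
 * before the composition deadline, keep showing the last good frame
 * instead of blocking the whole desktop on an untrusted fence. */
struct surface {
    struct buffer *current;         /* last known-good frame */
    struct buffer *pending;         /* attached, fence not yet signalled */
    uint64_t pending_point;
};

static void compose_surface(struct surface *s, uint64_t deadline_ns)
{
    if (s->pending &&
        fence_wait_until(s->pending, s->pending_point, deadline_ns)) {
        s->current = s->pending;    /* new frame made it in time */
        s->pending = NULL;
    }
    draw(s->current);               /* otherwise reuse the old frame */
}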

> - A GPU hang signals all fences. Other deadlocks will be handled like GPU
> hangs.

Nope, we can't just shrug off all deadlocks with "gpu reset rolls in". For
one, with userspace fencing the kernel isn't aware of any deadlocks, you
fundamentally can't tell "has deadlocked" from "is still doing useful
computations" because that amounts to solving the halting problem.

Any programming model we come up with where both kernel and userspace are
involved needs to come up with rules where at least non-evil userspace
never deadlocks. And if you just allow both then it's pretty easy to come
up with scenarios where both userspace and kernel alone are deadlock free,
but interactions result in hangs. That's why we've recently documented all
the corner cases around indefinite dma_fences, and also why you currently
can't use gpu page faults with anything that uses dma_fence for sync.

That's why I think with userspace fencing the kernel simply should not be
involved at all, aside from providing optimized/blocking cpu wait
functionality.

> Other window system requests can follow the same idea.
> 
> Merged fences where one fence object contains multiple fences will be
> supported. A merged fence is signalled only when its fences are signalled.
> The consumer will have the option to redefine the unsignalled return fence
> to a merged fence.
> 
> *2.2. Modesetting*
> 
> Since a modesetting driver can also be the consumer, the present ioctl will
> contain a submit fence and a return fence too. One small problem with this
> is that userspace can hang the modesetting driver, but in theory, any later
> present ioctl can override the previous one, so the unsignalled
> presentation is never used.
> 
> 
> *3. New memory management*
> 
> The per-BO fences will be removed and the kernel will not know which
> buffers are busy. This will reduce CPU overhead and latency. The kernel
> will not need per-BO fences with explicit synchronization, so we just need
> to remove their last user: buffer evictions. It also resolves the current
> OOM deadlock.

What's "the current OOM deadlock"?

> 
> *3.1. Evictions*
> 
> If the kernel wants to move a buffer, it will have to wait for everything
> to go idle, halt all userspace command submissions, move the buffer, and
> resume everything. This is not expected to happen when memory is not
> exhausted. Other more efficient ways of synchronization are also possible
> (e.g. sync only one process), but are not discussed here.
> 
> *3.2. Per-process VRAM usage quota*
> 
> Each process can optionally and periodically query its VRAM usage quota and
> change domains of its buffers to obey that quota. For example, a process
> allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1
> GB. The process can change the domains of the least important buffers to
> GTT to get the best outcome for itself. If the process doesn't do it, the
> kernel will choose which buffers to evict at random. (thanks to Christian
> Koenig for this idea)
> 
> *3.3. Buffer destruction without per-BO fences*
> 
> When the buffer destroy ioctl is called, an optional fence list can be
> passed to the kernel to indicate when it's safe to deallocate the buffer.
> If the fence list is empty, the buffer will be deallocated immediately.
> Shared buffers will be handled by merging fence lists from all processes
> that destroy them. Mitigation of malicious behavior:
> - If userspace destroys a busy buffer, it will get a GPU page fault.
> - If userspace sends fences that never signal, the kernel will have a
> timeout period and then will proceed to deallocate the buffer anyway.
> 
> *3.4. Other notes on MM*
> 
> Overcommitment of GPU-accessible memory will cause an allocation failure or
> invoke the OOM killer. Evictions to GPU-inaccessible memory might not be
> supported.
> 
> Kernel drivers could move to this new memory management today. Only buffer
> residency and evictions would stop using per-BO fences.
> 
> 
> 
> *4. Deprecating implicit synchronization*
> 
> It can be phased out by introducing a new generation of hardware where the
> driver doesn't add support for it (like a driver fork would do), assuming
> userspace has all the changes for explicit synchronization. This could
> potentially create an isolated part of the kernel DRM where all drivers
> only support explicit synchronization.

10-20 years I'd say before that's even an option.
-Daniel

> 
> Marek

> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 12:01 ` Daniel Vetter
@ 2021-04-20 12:19   ` Christian König
  2021-04-20 13:03   ` Daniel Stone
  1 sibling, 0 replies; 105+ messages in thread
From: Christian König @ 2021-04-20 12:19 UTC (permalink / raw)
  To: Daniel Vetter, Marek Olšák; +Cc: ML Mesa-dev, dri-devel

Hi Daniel,

Am 20.04.21 um 14:01 schrieb Daniel Vetter:
> On Mon, Apr 19, 2021 at 06:47:48AM -0400, Marek Olšák wrote:
>> Hi,
>>
>> This is our initial proposal for explicit fences everywhere and new memory
>> management that doesn't use BO fences. It's a redesign of how Linux
>> graphics drivers work, and it can coexist with what we have now.
>>
>>
>> *1. Introduction*
>> (skip this if you are already sold on explicit fences)
>>
>> The current Linux graphics architecture was initially designed for GPUs
>> with only one graphics queue where everything was executed in the
>> submission order and per-BO fences were used for memory management and
>> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
>> queues were added on top, which required the introduction of implicit
>> GPU-GPU synchronization between queues of different processes using per-BO
>> fences. Recently, even parallel execution within one queue was enabled
>> where a command buffer starts draws and compute shaders, but doesn't wait
>> for them, enabling parallelism between back-to-back command buffers.
>> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
>> was created to enable all those use cases, and it's the only reason why the
>> scheduler exists.
>>
>> The GPU scheduler, implicit synchronization, BO-fence-based memory
>> management, and the tracking of per-BO fences increase CPU overhead and
>> latency, and reduce parallelism. There is a desire to replace all of them
>> with something much simpler. Below is how we could do it.
> I get the feeling you're mixing up a lot of things here that have more
> nuance, so first some lingo.
>
> - There's kernel based synchronization, based on dma_fence. These come in
>    two major variants: Implicit synchronization, where the kernel attaches
>    the dma_fences to a dma-buf, and explicit synchronization, where the
>    dma_fence gets passed around as a stand-alone object, either a sync_file
>    or a drm_syncobj
>
> - Then there's userspace fence synchronization, where userspace issues any
>    fences directly and the kernel doesn't even know what's going on. This
>    is the only model that allows you to ditch the kernel overhead, and it's
>    also the model that vk uses.
>
>    I concur with Jason that this one is the future, it's the model hw
>    wants, compute wants and vk wants. Building an explicit fence world
>    which doesn't aim at this is imo wasted effort.
>
> Now you smash them into one thing by also changing the memory model, but I
> think that doesn't work:
>
> - Relying on gpu page faults across the board wont happen. I think right
>    now only amd's GFX10 or so has enough pagefault support to allow this,

It's even worse. GFX9 has enough support so that this can in theory work.

Because of this Felix and his team are working on HMM support based on 
this generation.

On GFX10 some aspects of it are improved while others are totally broken 
again.

>    and not even there I'm really sure. Nothing else will anytime soon, at
>    least not as far as I know. So we need to support slightly more hw in
>    upstream than just that.  Any plan that's realistic needs to cope with
>    dma_fence for a really long time.
>
> - Pown^WPin All The Things! is probably not a general enough memory
>    management approach. We've kinda tried for years to move away from it.
>    Sure we can support it as an optimization in specific workloads, and it
>    will make stuff faster, but it's not going to be the default I think.
>
> - We live in a post xf86-video-$vendor world, and all these other
>    compositors rely on implicit sync. You're not going to be able to get
>    rid of them anytime soon. What's worse, all the various EGL/vk buffer
>    sharing things also rely on implicit sync, so you get to fix up tons of
>    applications on top. Any plan that's realistic needs to cope with
>    implicit/explicit at the same time together won't work.
>
> - Absolute infuriating, but you can't use page-faulting together with any
>    dma_fence synchronization primitives, whether implicit or explicit. This
>    means until the entire ecosystem moved forward (good luck with that) we
>    have to support dma_fence. The only sync model that works together with
>    page faults is userspace fence based sync.
>
> Then there's the somewhat aside topic of how amdgpu/radeonsi does implicit
> sync, at least last I checked. Currently this oversynchronizes badly
> because it's left to the kernel to guess what should be synchronized, and
> that gets things wrong. What you need there is explicit implicit
> synchronization:
>
> - on the cs side, userspace must set explicit for which buffers the kernel
>    should engage in implicit synchronization. That's how it works on all
>    other drivers that support more explicit userspace like vk or gl drivers
>    that are internally all explicit. So essentially you only set the
>    implicit fence slot when you really want to, and only userspace knows
>    this. Implementing this without breaking the current logic probably
>    needs some flags.
>
> - the other side isn't there yet upstream, but Jason has patches.
>    Essentially you also need to sample your implicit sync points at the
>    right spot, to avoid oversync on later rendering by the producer.
>    Jason's patch solves this by adding an ioctl to dma-buf to get the
>    current set.
>
> - without any of this things for pure explicit fencing userspace the
>    kernel will simply maintain a list of all current users of a buffer. For
>    memory management, which means eviction handling roughly works like you
>    describe below, we wait for everything before a buffer can be moved.
>
> This should get rid of the oversync issues, and since implicit sync is
> backed in everywhere right now, you'll have to deal with implicit sync for
> a very long time.
>
> Next up is reducing the memory manager overhead of all this, without
> changing the ecosystem.
>
> - hw option would be page faults, but until we have full explicit
>    userspace sync we can't use those. Which currently means compute only.
>    Note that for vulkan or maybe also gl this is quite nasty for userspace,
>    since as soon as you need to switch to dma_fenc sync or implicit sync
>    (winsys buffer, or buffer sharing with any of the current set of
>    extensions) you have to flip your internal driver state around all sync
>    points over from userspace fencing to dma_fence kernel fencing. Can
>    still be all explicit using drm_syncobj ofc.
>
> - next up if your hw has preemption, you could use that, except preemption
>    takes a while longer, so from memory pov really should be done with
>    dma_fence. Plus it has all the same problems in that it requires
>    userspace fences.
>
> - now for making dma_fence O(1) in the fastpath you need the shared
>    dma_resv trick and the lru bulk move. radv/amdvlk use that, but I think
>    radeonsi not yet. But maybe I missed that. Either way we need to do some
>    better kernel work so it can also be fast for shared buffers, if those
>    become a problem. On the GL side doing this will use a lot of the tricks
>    for residency/working set management you describe below, except the
>    kernel can still throw out an entire gpu job. This is essentially what
>    you describe with 3.1. Vulkan/compute already work like this.
>
> Now this gets the performance up, but it doesn't give us any road towards
> using page faults (outside of compute) and so retiring dma_fence for good.
> For that we need a few pieces:
>
> - Full new set of userspace winsys protocols and egl/vk extensions. Pray
>    it actually gets adopted, because neither AMD nor Intel have the
>    engineers to push these kind of ecosystems/middleware issues forward on
>    their payrolls. Good pick is probably using drm_syncobj as the kernel
>    primitive for this. Still uses dma_fence underneath.
>
> - Some clever kernel tricks so that we can substitute dma_fence for
>    userspace fences within a drm_syncobj. drm_syncobj already has the
>    notion of waiting for a dma_fence to materialize. We can abuse that to
>    create an upgrade path from dma_fence based sync to userspace fence
>    syncing. Ofc none of this will be on the table if userspace hasn't
>    adopted explicit sync.
>
> With these two things I think we can have a reasonable upgrade path. None
> of this will be break the world type things though.

How about this:
1. We extend drm_syncobj so that it can either contain a classic dma_fence
or be used for user fence synchronization.

     We already discussed that briefly and I think we should have a 
rough plan for this in our heads.

2. We allow attaching a drm_syncobj to a dma_resv for implicit sync.

     This requires that both the consumer and the producer side support
user fence synchronization.

     We would still have quite a few limitations; in particular, we would
need to adjust all the kernel consumers of classic dma_resv objects. But I
think it should be doable.
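
Purely as a strawman for the direction of 1. and 2. (this is not an
existing kernel structure, just an illustration):

#include <linux/dma-fence.h>
#include <linux/types.h>

/* A syncobj slot that can hold either a classic dma_fence or a user-fence
 * tuple (GPU/CPU visible address plus the value to wait for). dma_resv
 * would then reference such a slot instead of a bare dma_fence. */
struct sketch_syncobj_slot {
	bool is_user_fence;
	union {
		struct dma_fence *fence;	/* classic kernel fence */
		struct {
			u64 gpu_addr;	/* where the GPU writes the seqno */
			u64 *cpu_addr;	/* CPU mapping of the same page */
			u64 wait_value;	/* done when *cpu_addr >= wait_value */
		} user;
	};
};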

Regards,
Christian.

>
> Bunch of comments below.
>
>> *2. Explicit synchronization for window systems and modesetting*
>>
>> The producer is an application and the consumer is a compositor or a
>> modesetting driver.
>>
>> *2.1. The Present request*
>>
>> As part of the Present request, the producer will pass 2 fences (sync
>> objects) to the consumer alongside the presented DMABUF BO:
>> - The submit fence: Initially unsignalled, it will be signalled when the
>> producer has finished drawing into the presented buffer.
>> - The return fence: Initially unsignalled, it will be signalled when the
>> consumer has finished using the presented buffer.
> Build this with syncobj timelines and it makes a lot more sense I think.
> We'll need that for having a proper upgrade path, both on the hw/driver
> side (being able to support stuff like preempt or gpu page faults) and the
> ecosystem side (so that we don't have to rev protocols twice, once going
> to explicit dma_fence sync and once more for userspace sync).
>
>> Deadlock mitigation to recover from segfaults:
>> - The kernel knows which process is obliged to signal which fence. This
>> information is part of the Present request and supplied by userspace.
>> - If the producer crashes, the kernel signals the submit fence, so that the
>> consumer can make forward progress.
>> - If the consumer crashes, the kernel signals the return fence, so that the
>> producer can reclaim the buffer.
> So for kernel based sync imo simplest is to just reuse dma_fence, same
> rules apply.
>
> For userspace fencing the kernel simply doesn't care how stupid userspace
> is. Security checks at boundaries (e.g. client vs compositor) is also
> usersepace's problem and can be handled by e.g.  timeouts + conditional
> rendering on the compositor side. The timeout might be in the compat glue,
> e.g. when we stall for a dma_fence to materialize from a drm_syncobj. I
> think in vulkan this is defacto already up to applications to deal with
> entirely if they deal with untrusted fences.
>
>> - A GPU hang signals all fences. Other deadlocks will be handled like GPU
>> hangs.
> Nope, we can't just shrug off all deadlocks with "gpu reset rolls in". For
> one, with userspace fencing the kernel isn't aware of any deadlocks, you
> fundamentally can't tell "has deadlocked" from "is still doing useful
> computations" because that amounts to solving the halting problem.
>
> Any programming model we come up with where both kernel and userspace are
> involved needs to come up with rules where at least non-evil userspace
> never deadlocks. And if you just allow both then it's pretty easy to come
> up with scenarios where both userspace and kernel along are deadlock free,
> but interactions result in hangs. That's why we've recently documented all
> the corner cases around indefinite dma_fences, and also why you can't use
> gpu page faults currently anything that uses dma_fence for sync.
>
> That's why I think with userspace fencing the kernel simply should not be
> involved at all, aside from providing optimized/blocking cpu wait
> functionality.
>
>> Other window system requests can follow the same idea.
>>
>> Merged fences where one fence object contains multiple fences will be
>> supported. A merged fence is signalled only when its fences are signalled.
>> The consumer will have the option to redefine the unsignalled return fence
>> to a merged fence.
>>
>> *2.2. Modesetting*
>>
>> Since a modesetting driver can also be the consumer, the present ioctl will
>> contain a submit fence and a return fence too. One small problem with this
>> is that userspace can hang the modesetting driver, but in theory, any later
>> present ioctl can override the previous one, so the unsignalled
>> presentation is never used.
>>
>>
>> *3. New memory management*
>>
>> The per-BO fences will be removed and the kernel will not know which
>> buffers are busy. This will reduce CPU overhead and latency. The kernel
>> will not need per-BO fences with explicit synchronization, so we just need
>> to remove their last user: buffer evictions. It also resolves the current
>> OOM deadlock.
> What's "the current OOM deadlock"?
>
>> *3.1. Evictions*
>>
>> If the kernel wants to move a buffer, it will have to wait for everything
>> to go idle, halt all userspace command submissions, move the buffer, and
>> resume everything. This is not expected to happen when memory is not
>> exhausted. Other more efficient ways of synchronization are also possible
>> (e.g. sync only one process), but are not discussed here.
>>
>> *3.2. Per-process VRAM usage quota*
>>
>> Each process can optionally and periodically query its VRAM usage quota and
>> change domains of its buffers to obey that quota. For example, a process
>> allocated 2 GB of buffers in VRAM, but the kernel decreased the quota to 1
>> GB. The process can change the domains of the least important buffers to
>> GTT to get the best outcome for itself. If the process doesn't do it, the
>> kernel will choose which buffers to evict at random. (thanks to Christian
>> Koenig for this idea)
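
To make the quota idea a bit more concrete: the quota query itself doesn't
exist yet, so get_vram_quota() and demote_least_important_bo() below are
hypothetical placeholders, and only the VRAM usage query is an existing
libdrm_amdgpu call. Rough sketch of what a process could do periodically:

/* Sketch: compare current VRAM usage against a kernel-provided quota and
 * demote low-priority buffers to GTT while over budget. */
#include <stdint.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

extern uint64_t get_vram_quota(amdgpu_device_handle dev);            /* hypothetical */
extern uint64_t demote_least_important_bo(amdgpu_device_handle dev); /* hypothetical */

static void obey_vram_quota(amdgpu_device_handle dev)
{
        uint64_t usage = 0;
        if (amdgpu_query_info(dev, AMDGPU_INFO_VRAM_USAGE,
                              sizeof(usage), &usage))
                return;

        uint64_t quota = get_vram_quota(dev);

        while (usage > quota) {
                /* hypothetical: change the preferred domain of the least
                 * important BO to GTT; returns how much VRAM that frees */
                uint64_t freed = demote_least_important_bo(dev);
                if (!freed)
                        break;
                usage -= freed;
        }
}
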
>>
>> *3.3. Buffer destruction without per-BO fences*
>>
>> When the buffer destroy ioctl is called, an optional fence list can be
>> passed to the kernel to indicate when it's safe to deallocate the buffer.
>> If the fence list is empty, the buffer will be deallocated immediately.
>> Shared buffers will be handled by merging fence lists from all processes
>> that destroy them. Mitigation of malicious behavior:
>> - If userspace destroys a busy buffer, it will get a GPU page fault.
>> - If userspace sends fences that never signal, the kernel will have a
>> timeout period and then will proceed to deallocate the buffer anyway.
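
Concretely, a destroy-with-fences interface could look something like the
sketch below; the struct and the ioctl number are entirely made up here, it
just follows the usual handle-array pattern of existing DRM uapi:

/* Hypothetical uapi sketch for "deallocate this BO once these fences
 * signal (or after a timeout if they never do)". Nothing here exists. */
#include <linux/types.h>

struct drm_gem_destroy_fenced {
        __u32 bo_handle;        /* GEM handle to destroy */
        __u32 num_syncobjs;     /* 0 = deallocate immediately */
        __u64 syncobjs_ptr;     /* array of drm_syncobj handles; the actual
                                 * deallocation is deferred until all of
                                 * them signal, or until the kernel's
                                 * timeout kicks in */
};

#define DRM_IOCTL_GEM_DESTROY_FENCED \
        DRM_IOWR(0x99, struct drm_gem_destroy_fenced)   /* number made up */
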
>>
>> *3.4. Other notes on MM*
>>
>> Overcommitment of GPU-accessible memory will cause an allocation failure or
>> invoke the OOM killer. Evictions to GPU-inaccessible memory might not be
>> supported.
>>
>> Kernel drivers could move to this new memory management today. Only buffer
>> residency and evictions would stop using per-BO fences.
>>
>>
>>
>> *4. Deprecating implicit synchronization*
>>
>> It can be phased out by introducing a new generation of hardware where the
>> driver doesn't add support for it (like a driver fork would do), assuming
>> userspace has all the changes for explicit synchronization. This could
>> potentially create an isolated part of the kernel DRM where all drivers
>> only support explicit synchronization.
> 10-20 years I'd say before that's even an option.
> -Daniel
>
>> Marek


* Re: [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
From: Daniel Stone @ 2021-04-20 12:42 UTC (permalink / raw)
  To: Marek Olšák; +Cc: ML Mesa-dev, dri-devel



Hi Marek,

On Mon, 19 Apr 2021 at 11:48, Marek Olšák <maraeo@gmail.com> wrote:

> *2. Explicit synchronization for window systems and modesetting*
>
> The producer is an application and the consumer is a compositor or a
> modesetting driver.
>
> *2.1. The Present request*
>

So the 'present request' is an ioctl, right? Not a userspace construct like
it is today? If so, how do we correlate the two?

The terminology is pretty X11-centric so I'll assume that's what you've
designed against, but Wayland and even X11 carry much more auxiliary
information attached to a present request than just 'this buffer, this
swapchain'. Wayland latches a lot of data on presentation, including
non-graphics data such as surface geometry (so we can have resizes which
don't suck), window state (e.g. fullscreen or not, also so we can have
resizes which don't suck), and these requests can also cascade through a
tree of subsurfaces (so we can have embeds which don't suck). X11 mostly
just carries timestamps, which is more tractable.

Given we don't want to move the entirety of Wayland into kernel-visible
objects, how do we synchronise the two streams so they aren't incoherent?
Taking a rough stab at it whilst assuming we do have
DRM_IOCTL_NONMODE_PRESENT, this would create a present object somewhere in
kernel space, which the producer would create and ?? export a FD from, that
the compositor would ?? import.
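
Spelling that rough stab out as (entirely hypothetical) code, just so there's
something concrete to poke holes in - none of these ioctls or structs exist,
the names only mirror the guess above:

/* Hypothetical 'present object' flow. */
#include <linux/types.h>
#include <xf86drm.h>

struct drm_nonmode_present {
        __u32 dmabuf_fd;        /* presented buffer */
        __u32 submit_syncobj;   /* producer signals when rendering is done */
        __u32 return_syncobj;   /* consumer signals when it stops reading */
        __u32 present_fd;       /* out: fd naming the present object */
};

#define DRM_IOCTL_NONMODE_PRESENT \
        DRM_IOWR(0x98, struct drm_nonmode_present)      /* number made up */

static void producer_present(int drm_fd, int buf_fd,
                             uint32_t submit_handle, uint32_t return_handle)
{
        struct drm_nonmode_present req = {
                .dmabuf_fd      = buf_fd,
                .submit_syncobj = submit_handle,
                .return_syncobj = return_handle,
        };
        drmIoctl(drm_fd, DRM_IOCTL_NONMODE_PRESENT, &req);
        /* ?? export: send req.present_fd to the compositor over the display
         * protocol; the compositor ?? imports it and then has to correlate
         * it with the protocol-level request that carried geometry,
         * subsurface state, timestamps, ... */
}
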

> As part of the Present request, the producer will pass 2 fences (sync
> objects) to the consumer alongside the presented DMABUF BO:
> - The submit fence: Initially unsignalled, it will be signalled when the
> producer has finished drawing into the presented buffer.
>

We already have this in Wayland through dma_fence. I'm relaxed about
this becoming drm_syncobj or drm_newmappedysncobjthing, it's just a matter
of typing. X11 has patches to DRI3 to support dma_fence, but they never got
merged because it was far too invasive to a server which is no longer
maintained.


> - The return fence: Initially unsignalled, it will be signalled when the
> consumer has finished using the presented buffer.
>

Currently in Wayland the return fence (again a dma_fence) is generated by
the compositor and sent as an event when it's done, because we can't have
speculative/empty/future fences. drm_syncobj would make this possible, but
so far I've been hesitant because I don't see the benefit to it (more
below).
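
For reference, the 'matter of typing' on the client side with the existing
zwp_linux_explicit_synchronization_v1 protocol is roughly the following
(sketch; object setup, error handling and the fd plumbing into the rendering
API are omitted, and the generated header/function names are as in the
unstable v1 protocol XML):

/* Sketch: attach the submit fence and register for the return fence. */
#include <stddef.h>
#include <wayland-client.h>
#include "linux-explicit-synchronization-unstable-v1-client-protocol.h"

static void handle_fenced_release(void *data,
                                  struct zwp_linux_buffer_release_v1 *release,
                                  int32_t fence_fd)
{
        /* compositor is done with the buffer once fence_fd signals */
}

static void handle_immediate_release(void *data,
                                     struct zwp_linux_buffer_release_v1 *release)
{
        /* buffer was never accessed, or all access already retired */
}

static const struct zwp_linux_buffer_release_v1_listener release_listener = {
        .fenced_release    = handle_fenced_release,
        .immediate_release = handle_immediate_release,
};

static void attach_frame_fences(
                struct zwp_linux_surface_synchronization_v1 *surface_sync,
                int render_done_fd)
{
        /* submit fence: compositor won't sample before this signals */
        zwp_linux_surface_synchronization_v1_set_acquire_fence(surface_sync,
                                                               render_done_fd);
        /* return fence: delivered later through one of the events above */
        struct zwp_linux_buffer_release_v1 *release =
                zwp_linux_surface_synchronization_v1_get_release(surface_sync);
        zwp_linux_buffer_release_v1_add_listener(release, &release_listener,
                                                 NULL);
}
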


> Deadlock mitigation to recover from segfaults:
> - The kernel knows which process is obliged to signal which fence. This
> information is part of the Present request and supplied by userspace.
>

Same as today with dma_fence. Less true with drm_syncobj if we're using
timelines.


> - If the producer crashes, the kernel signals the submit fence, so that
> the consumer can make forward progress.
>

This is only a change if the producer is now allowed to submit a fence
before it's flushed the work which would eventually fulfill that fence.
Using dma_fence has so far isolated us from this.


> - If the consumer crashes, the kernel signals the return fence, so that
> the producer can reclaim the buffer.
>

'The consumer' is problematic, per below. I think the wording you want is
'if no references are held to the submitted present object'.


> - A GPU hang signals all fences. Other deadlocks will be handled like GPU
> hangs.
>
> Other window system requests can follow the same idea.
>

Which other window system requests did you have in mind? Again, moving the
entirety of Wayland's signaling into the kernel is a total non-starter.
Partly because it means our entire protocol would be subject to the
kernel's ABI rules, partly because the rules and interdependencies between
the requests are extremely complex, but mostly because the kernel is just a
useless proxy: it would be forced to do significant work to reason about
what those requests do and when they should happen, but wouldn't be able to
make those decisions itself so would have to just punt everything to
userspace. Unless we have eBPF compositors.


> Merged fences where one fence object contains multiple fences will be
> supported. A merged fence is signalled only when its fences are signalled.
> The consumer will have the option to redefine the unsignalled return fence
> to a merged fence.
>

An elaboration of how this differed from drm_syncobj would be really
helpful here. I can make some guesses based on the rest of the mail, but
I'm not sure how accurate they are.


> *2.2. Modesetting*
>
> Since a modesetting driver can also be the consumer, the present ioctl
> will contain a submit fence and a return fence too. One small problem with
> this is that userspace can hang the modesetting driver, but in theory, any
> later present ioctl can override the previous one, so the unsignalled
> presentation is never used.
>

This is also problematic. It's not just KMS, but media codecs too - V4L
doesn't yet have explicit fencing, but given the programming model of
codecs and how deeply they interoperate, it will.

Rather than client (GPU) -> compositor (GPU) -> compositor (KMS), imagine
you're playing a Steam game on your Chromebook which you're streaming via
Twitch or whatever. The full chain looks like:
* Steam game renders with GPU
* Xwayland in container receives dmabuf, forwards dmabuf to Wayland server
(does not directly consume)
* Wayland server (which is actually Chromium) receives dmabuf, forwards
dmabuf to Chromium UI process
* Chromium UI process forwards client dmabuf to KMS for direct scanout
* Chromium UI process _also_ forwards client dmabuf to GPU process
* Chromium GPU process composites Chromium UI + client dmabuf + webcam
frame from V4L to GPU composition job
* Chromium GPU process forwards GPU composition dmabuf (not client dmabuf)
to media codec for streaming

So, we don't have a 1:1 producer:consumer relationship. Even if we accept
it's 1:n, your Chromebook is about to burst into flames and we're dropping
frames to try to keep up. Some of the consumers are FIFO (the codec wants
to push things through in order), and some of them are mailbox (the display
wants to get the latest content, not from half a second ago before the
other player started jumping around and now you're dead). You can't reason
about any of these dependencies ahead of time from a producer PoV, because
userspace will be making these decisions frame by frame. Also someone's
started using the Vulkan present-timing extension because life wasn't
confusing enough already.

As Christian and Daniel were getting at, there are also two 'levels' of
explicit synchronisation.

The first (let's call it 'blind') is plumbing a dma_fence through to be
passed with the dmabuf. When the client submits a buffer for presentation,
it submits a dma_fence as well. When the compositor is finished with it
(i.e. has flushed the last work which will source from that buffer), it
passes a dma_fence back to the client, or no fence if none is required (buffer was
never accessed, or all accesses are known to be fully retired e.g. the last
fence accessing it has already signaled). This is just a matter of typing,
and is supported by at least Weston. It implies no scheduling change over
implicit fencing in that the compositor can be held hostage by abusive
clients with a really long compute shader in their dependency chain: all
that's happening is that we're plumbing those synchronisation tokens
through userspace instead of having the kernel dig them up from dma_resv.
But we at least have a no-deadlock guarantee, because a dma_fence will
complete in bounded time.
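
(The client half of that typing, assuming EGL_ANDROID_native_fence_sync is
available and the extension entry points have already been loaded with
eglGetProcAddress(), is roughly:)

/* Sketch: export a sync_file fd covering the client's rendering, to be
 * handed to the compositor as the submit fence. */
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>

extern PFNEGLCREATESYNCKHRPROC eglCreateSyncKHR_p;
extern PFNEGLDUPNATIVEFENCEFDANDROIDPROC eglDupNativeFenceFDANDROID_p;
extern PFNEGLDESTROYSYNCKHRPROC eglDestroySyncKHR_p;

static int export_render_done_fence(EGLDisplay dpy)
{
        EGLSyncKHR sync = eglCreateSyncKHR_p(dpy,
                                             EGL_SYNC_NATIVE_FENCE_ANDROID,
                                             NULL);
        /* the native fence fd only materializes once the work is flushed */
        glFlush();
        int fd = eglDupNativeFenceFDANDROID_p(dpy, sync);
        eglDestroySyncKHR_p(dpy, sync);
        return fd;      /* a sync_file, i.e. a dma_fence: bounded time */
}
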

The second (let's call it 'smart') is ... much more than that. Not only
does the compositor accept and generate explicit synchronisation points for
the client, but those synchronisation points aren't dma_fences, but may be
wait-before-signal, or may be wait-never-signal. So in order to avoid a
terminal deadlock, the compositor has to sit on every synchronisation point
and check before it flushes any dependent work that it has signaled, or
will at least signal in bounded time. If that guarantee isn't there, you
have to punt and see if anything happens at your next repaint point. We
don't currently have this support in any compositor, and it's a lot more
work than blind.
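
In drm_syncobj terms, the per-point check the compositor ends up doing before
flushing dependent work is roughly this (a sketch; the hard part is the
punt-to-next-repaint policy around it, not the query itself):

/* Sketch: non-blocking check on a timeline point before flushing work
 * that depends on it. With WAIT_AVAILABLE and a zero timeout this only
 * tells us a fence has materialized for the point; for dma_fence-backed
 * syncobjs that already implies bounded-time completion, for userspace
 * fences it implies nothing and you'd have to check actual signaling. */
#include <stdint.h>
#include <xf86drm.h>

static int point_has_materialized(int drm_fd, uint32_t syncobj, uint64_t point)
{
        uint32_t first;
        int ret = drmSyncobjTimelineWait(drm_fd, &syncobj, &point, 1,
                                         0 /* poll, don't block */,
                                         DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE,
                                         &first);
        return ret == 0;
}
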

Given the interdependencies I've described above for Wayland - say a resize
case, or when a surface commit triggers a cascade of subsurface commits -
GPU-side conditional rendering is not always possible. In those cases, you
_must_ do CPU-side waits and keep both sets of state around. Pain.

Typing all that out has convinced me that the current proposal is a net
loss in every case.

Complex rendering uses (game engine with a billion draw calls, a billion
BOs, complex sync dependencies, wait-before-signal and/or conditional
rendering/descriptor indexing) don't need the complexity of a present ioctl
and checking whether other processes have crashed or whatever. They already
have everything plumbed through for this themselves, and need to implement
so much infrastructure around it that they don't need much/any help from
the kernel. Just give them a sync primitive with almost zero guarantees
that they can map into CPU & GPU address space, let them go wild with it.
drm_syncobj_plus_footgun. Good luck.

Simple presentation uses (desktop, browser, game) don't need the
hyperoptimisation of sync primitives. Frame times are relatively long, and
you can only have so many surfaces which aren't occluded. Either you have a
complex scene to composite, in which case the CPU overhead of something
like dma_fence is lower than the CPU overhead required to walk through a
single compositor repaint cycle anyway, or you have a completely trivial
scene to composite and you can absolutely eat the overhead of exporting and
scheduling like two fences in 10ms.

Complex presentation uses (out-streaming, media sources, deeper
presentation chains) make the trivial present ioctl so complex that its
benefits evaporate. Wait-before-signal pushes so much complexity into the
compositor that you have to eat a lot of CPU overhead there and lose your
ability to do pipelined draws because you have to hang around and see if
they'll ever complete. Cross-device usage means everyone just ends up
spinning on the CPU instead.

So, can we take a step back? What are the problems we're trying to solve?
If it's about optimising the game engine's internal rendering, how would
that benefit from a present ioctl instead of current synchronisation?

If it's about composition, how do we balance the complexity between the
kernel and userspace? What's the global benefit from throwing our hands in
the air and saying 'you deal with it' to all of userspace, given that
existing mailbox systems making frame-by-frame decisions already preclude
deep/speculative pipelining on the client side?

Given that userspace then loses all ability to reason about presentation if
wait-before-signal becomes a thing, do we end up with a global performance
loss by replacing the overhead of kernel dma_fence handling with userspace
spinning on a page? Even if we micro-optimise that by allowing userspace to
be notified on access, is the overhead of pagefault -> kernel signal
handler -> queue signalfd notification -> userspace event loop -> read page
& compare to expected value, actually better than dma_fence?
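
(For concreteness, the 'read page & compare to expected value' step being
weighed here is literally something like the below; how the waiter gets woken
up - spinning, the pagefault trick, signalfd - is exactly the open question:)

/* Sketch of a userspace fence: a 64-bit seqno in a CPU-mapped page,
 * signalled by storing a larger value. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct userspace_fence {
        _Atomic uint64_t *seqno;        /* mapped from a shared BO */
};

static bool fence_signaled(const struct userspace_fence *f, uint64_t wait_value)
{
        return atomic_load_explicit(f->seqno, memory_order_acquire)
               >= wait_value;
}

static void fence_signal(struct userspace_fence *f, uint64_t value)
{
        atomic_store_explicit(f->seqno, value, memory_order_release);
}
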

Cheers,
Daniel


* Re: [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
From: Daniel Stone @ 2021-04-20 13:03 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: ML Mesa-dev, dri-devel, Marek Olšák



Hi,

On Tue, 20 Apr 2021 at 13:01, Daniel Vetter <daniel@ffwll.ch> wrote:

> - We live in a post xf86-video-$vendor world, and all these other
>   compositors rely on implicit sync. You're not going to be able to get
>   rid of them anytime soon. What's worse, all the various EGL/vk buffer
>   sharing things also rely on implicit sync, so you get to fix up tons of
>   applications on top. Any plan that's realistic needs to cope with
>   implicit/explicit sync at the same time; requiring everything to
>   switch over together won't work.
>
> - Absolutely infuriating, but you can't use page-faulting together with any
>   dma_fence synchronization primitives, whether implicit or explicit. This
>   means until the entire ecosystem moves forward (good luck with that) we
>   have to support dma_fence. The only sync model that works together with
>   page faults is userspace fence based sync.
>
> This should get rid of the oversync issues, and since implicit sync is
> baked in everywhere right now, you'll have to deal with implicit sync for
> a very long time.
>

Depends what you mean by 'implicit sync'. ;)

Getting userspace (Vulkan WSI, EGL, Wayland compositors, browsers, media
clients) over to explicit sync is easy, _provided_ that the explicit sync
gives us the same guarantees as implicit sync, i.e. completes in bounded
time, GPU/display work can be flushed to the kernel predicated on fence
completion with the kernel handling synchronisation and scheduling. It's
just a matter of typing, and until now we haven't had a great reason to do
that typing. Now we do have that reason, so we are implementing it. Whether
it's dma_fence or drm_syncobj is mostly immaterial; we can encode in
protocol requirements that you can't try to use wait-before-signal with
drm_syncobj and you'll get killed if you try.

Getting that userspace over to fully userspace-based sync
(wait-before-signal or wait-never-signal, no kernel assistance but you just
have to roll your own polling or signal handling on either CPU or GPU side)
is not easy. It might never happen, because it's an extraordinary amount of
work, introduces a huge amount of fragility into a super-critical path, and
so far it's not clear that it's a global performance improvement for
the whole system, just shifting performance problems from kernel to
userspace, and probably (AFAICT) making them worse in addition to the other
problems it brings.

What am I missing?

Cheers,
Daniel


* Re: [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
From: Daniel Vetter @ 2021-04-20 14:04 UTC (permalink / raw)
  To: Daniel Stone; +Cc: ML Mesa-dev, dri-devel, Marek Olšák

On Tue, Apr 20, 2021 at 3:04 PM Daniel Stone <daniel@fooishbar.org> wrote:
>
> Hi,
>
> On Tue, 20 Apr 2021 at 13:01, Daniel Vetter <daniel@ffwll.ch> wrote:
>>
>> - We live in a post xf86-video-$vendor world, and all these other
>>   compositors rely on implicit sync. You're not going to be able to get
>>   rid of them anytime soon. What's worse, all the various EGL/vk buffer
>>   sharing things also rely on implicit sync, so you get to fix up tons of
>>   applications on top. Any plan that's realistic needs to cope with
>>   implicit/explicit sync at the same time; requiring everything to
>>   switch over together won't work.
>>
>> - Absolutely infuriating, but you can't use page-faulting together with any
>>   dma_fence synchronization primitives, whether implicit or explicit. This
>>   means until the entire ecosystem moves forward (good luck with that) we
>>   have to support dma_fence. The only sync model that works together with
>>   page faults is userspace fence based sync.
>>
>> This should get rid of the oversync issues, and since implicit sync is
>> baked in everywhere right now, you'll have to deal with implicit sync for
>> a very long time.
>
>
> Depends what you mean by 'implicit sync'. ;)
>
> Getting userspace (Vulkan WSI, EGL, Wayland compositors, browsers, media clients) over to explicit sync is easy, _provided_ that the explicit sync gives us the same guarantees as implicit sync, i.e. completes in bounded time, GPU/display work can be flushed to the kernel predicated on fence completion with the kernel handling synchronisation and scheduling. It's just a matter of typing, and until now we haven't had a great reason to do that typing. Now we do have that reason, so we are implementing it. Whether it's dma_fence or drm_syncobj is mostly immaterial; we can encode in protocol requirements that you can't try to use wait-before-signal with drm_syncobj and you'll get killed if you try.
>
> Getting that userspace over to fully userspace-based sync (wait-before-signal or wait-never-signal, no kernel assistance but you just have to roll your own polling or signal handling on either CPU or GPU side) is not easy. It might never happen, because it's an extraordinary amount of work, introduces a huge amount of fragility into a super-critical path, and and so far it's not clear that it's a global performance improvement for the whole system, just shifting performance problems from kernel to userspace, and probably (AFAICT) making them worse in addition to the other problems it brings.
>
> What am I missing?

Nothing I think.

Which is why I'm arguing that kernel based sync with all the current
dma_fence guarantees is probably going to stick around for something
close to forever, and we need to assume so.

Only in specific cases does full userspace sync make sense imo:
- anything compute, excluding using compute/shaders to create
displayable buffers, but compute as in your final target is writing
some stuff to files and never interacting with any winsys. Those
really care because "run a compute kernel for a few hours" isn't
supported without userspace sync, and I don't think it ever will be.
- maybe vulkan direct display, once/if we have the extensions for
atomic kms wired up
- maybe someone wants to write a vulkan based compositor and deal with
all this themselves. That model I think would also imply that they
deal with all the timeouts and fallbacks, irrespective of whether
underneath we actually run on dma_fence timeline syncobjs or userspace
fence timeline syncobjs.

From about 2 years of screaming at this stuff it feels like this will
be a pretty exhaustive list for the next 10 years. Definitely doesn't
include your random linux desktop wayland compositor stack. But
there are definitely some specific areas where people care enough
for all the pain. For everyone else it's all the other pieces I laid
out.

This also means that I don't think we now have that impetus to start
typing all the explicit sync protocol/compositor bits, since:
- the main driver is compute stuff, that needs mesa work (well vk/ocl
plus all the various repainted copies of cuda)
- with the tricks to make implicit sync work more like explicit sync
the oversyncing can be largely solved without protocol work
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
From: Daniel Vetter @ 2021-04-20 14:09 UTC (permalink / raw)
  To: Christian König
  Cc: ML Mesa-dev, dri-devel, Jason Ekstrand, Marek Olšák

On Tue, Apr 20, 2021 at 1:59 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > the kernel's problem.
>
> Well, the path of inner peace begins with four words. “Not my fucking
> problem.”
>
> But I'm not that much concerned about the kernel, but rather about
> important userspace processes like X, Wayland, SurfaceFlinger etc...
>
> I mean attaching a page to a sync object and allowing waits/signals
> from both the CPU and the GPU side is not so much of a problem.
>
> > You have to somehow handle that, e.g. perhaps with conditional
> > rendering and just using the old frame in compositing if the new one
> > doesn't show up in time.
>
> Nice idea, but how would you handle that on the OpenGL/Glamor/Vulkan level?

For opengl we give all the same guarantees, so if you get one of these
you just block until the fence is signalled. Doing that properly means a
submit thread to support drm_syncobj, like for vulkan.

For vulkan we probably want to represent these as proper vk timeline
objects, and the vulkan way is to just let the application (well
compositor) here deal with it. If they import timelines from untrusted
other parties, they need to handle the potential fallback of being
lied at. How is "not vulkan's fucking problem", because that entire
"with great power (well performance) comes great responsibility" is
the entire vk design paradigm.

Glamor will just rely on GL providing a nice packaging of the harsh
reality of gpus, like usual.

So I guess step 1 here for GL would be to provide some kind of
import/export of timeline syncobj, including properly handling this
"future/indefinite fences" aspect of them with submit thread and
everything.
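
On the vulkan side that would look roughly like the below (a sketch using
core Vulkan 1.2 timeline semaphores; the import of the untrusted payload
itself isn't shown, and the 16 ms budget is arbitrary):

/* Sketch: bounded wait on an imported timeline point; if the other side
 * lied about ever signaling it, we hit VK_TIMEOUT and fall back to the
 * previous content instead of blocking the repaint loop. */
#include <stdbool.h>
#include <stdint.h>
#include <vulkan/vulkan.h>

static bool wait_for_imported_point(VkDevice dev, VkSemaphore timeline,
                                    uint64_t point)
{
        const VkSemaphoreWaitInfo wait_info = {
                .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
                .semaphoreCount = 1,
                .pSemaphores = &timeline,
                .pValues = &point,
        };
        VkResult res = vkWaitSemaphores(dev, &wait_info,
                                        16ull * 1000 * 1000 /* ns */);
        return res == VK_SUCCESS;
}
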
-Daniel

>
> Regards,
> Christian.
>
> Am 20.04.21 um 13:16 schrieb Daniel Vetter:
> > On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
> >> Daniel, are you suggesting that we should skip any deadlock prevention in
> >> the kernel, and just let userspace wait for and signal any fence it has
> >> access to?
> > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > the kernel's problem. The only criteria is that the kernel itself must
> > never rely on these userspace fences, except for stuff like implementing
> > optimized cpu waits. And in those we must always guarantee that the
> > userspace process remains interruptible.
> >
> > It's a completely different world from dma_fence based kernel fences,
> > whether those are implicit or explicit.
> >
> >> Do you have any concern with the deprecation/removal of BO fences in the
> >> kernel assuming userspace is only using explicit fences? Any concern with
> >> the submit and return fences for modesetting and other producer<->consumer
> >> scenarios?
> > Let me work on the full reply for your rfc first, because there are a
> > lot of details and nuance here.
> > -Daniel
> >
> >> Thanks,
> >> Marek
> >>
> >> On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >>
> >>> On Tue, Apr 20, 2021 at 12:15 PM Christian König
> >>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>> Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> >>>>> Not going to comment on everything on the first pass...
> >>>>>
> >>>>> On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <maraeo@gmail.com> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> This is our initial proposal for explicit fences everywhere and new
> >>> memory management that doesn't use BO fences. It's a redesign of how Linux
> >>> graphics drivers work, and it can coexist with what we have now.
> >>>>>>
> >>>>>> 1. Introduction
> >>>>>> (skip this if you are already sold on explicit fences)
> >>>>>>
> >>>>>> The current Linux graphics architecture was initially designed for
> >>> GPUs with only one graphics queue where everything was executed in the
> >>> submission order and per-BO fences were used for memory management and
> >>> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> >>> queues were added on top, which required the introduction of implicit
> >>> GPU-GPU synchronization between queues of different processes using per-BO
> >>> fences. Recently, even parallel execution within one queue was enabled
> >>> where a command buffer starts draws and compute shaders, but doesn't wait
> >>> for them, enabling parallelism between back-to-back command buffers.
> >>> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> >>> was created to enable all those use cases, and it's the only reason why the
> >>> scheduler exists.
> >>>>>> The GPU scheduler, implicit synchronization, BO-fence-based memory
> >>> management, and the tracking of per-BO fences increase CPU overhead and
> >>> latency, and reduce parallelism. There is a desire to replace all of them
> >>> with something much simpler. Below is how we could do it.
> >>>>>>
> >>>>>> 2. Explicit synchronization for window systems and modesetting
> >>>>>>
> >>>>>> The producer is an application and the consumer is a compositor or a
> >>> modesetting driver.
> >>>>>> 2.1. The Present request
> >>>>>>
> >>>>>> As part of the Present request, the producer will pass 2 fences (sync
> >>> objects) to the consumer alongside the presented DMABUF BO:
> >>>>>> - The submit fence: Initially unsignalled, it will be signalled when
> >>> the producer has finished drawing into the presented buffer.
> >>>>>> - The return fence: Initially unsignalled, it will be signalled when
> >>> the consumer has finished using the presented buffer.
> >>>>> I'm not sure syncobj is what we want.  In the Intel world we're trying
> >>>>> to go even further to something we're calling "userspace fences" which
> >>>>> are a timeline implemented as a single 64-bit value in some
> >>>>> CPU-mappable BO.  The client writes a higher value into the BO to
> >>>>> signal the timeline.
> >>>> Well that is exactly what our Windows guys have suggested as well, but
> >>>> it strongly looks like this isn't sufficient.
> >>>>
> >>>> First of all you run into security problems when any application can
> >>>> just write any value to that memory location. Just imagine an
> >>>> application sets the counter to zero and X waits forever for some
> >>>> rendering to finish.
> >>> The thing is, with userspace fences security boundary issue prevention
> >>> moves into userspace entirely. And it really doesn't matter whether
> >>> the event you're waiting on doesn't complete because the other app
> >>> crashed or was stupid or intentionally gave you a wrong fence point:
> >>> You have to somehow handle that, e.g. perhaps with conditional
> >>> rendering and just using the old frame in compositing if the new one
> >>> doesn't show up in time. Or something like that. So trying to get the
> >>> kernel involved but also not so much involved sounds like a bad design
> >>> to me.
> >>>
> >>>> In addition to that, in such a model you can't determine which queue is
> >>>> the guilty one in case of a hang, and you can't reset the synchronization
> >>>> primitives in case of an error.
> >>>>
> >>>> Apart from that this is rather inefficient, e.g. we don't have any way
> >>>> to prevent priority inversion when used as a synchronization mechanism
> >>>> between different GPU queues.
> >>> Yeah but you can't have it both ways. Either all the scheduling and
> >>> fence handling in the kernel is a problem, or you actually want to
> >>> schedule in the kernel. hw seems to definitely move towards the more
> >>> stupid spinlock-in-hw model (and direct submit from userspace and all
> >>> that), priority inversions be damned. I'm really not sure we should
> >>> fight that - if it's really that inefficient then maybe hw will add
> >>> support for waiting sync constructs in hardware, or at least be
> >>> smarter about scheduling other stuff. E.g. on intel hw both the kernel
> >>> scheduler and fw scheduler knows when you're spinning on a hw fence
> >>> (whether userspace or kernel doesn't matter) and plugs in something
> >>> else. Add in a bit of hw support to watch cachelines, and you have
> >>> something which can handle both directions efficiently.
> >>>
> >>> Imo given where hw is going, we shouldn't try to be too clever here.
> >>> The only thing we do need to provision is being able to do cpu side
> >>> waits without spinning. And that should probably be done in a fairly
> >>> gpu specific way still.
> >>> -Daniel
> >>>
> >>>> Christian.
> >>>>
> >>>>>     The kernel then provides some helpers for
> >>>>> waiting on them reliably and without spinning.  I don't expect
> >>>>> everyone to support these right away but, if we're going to re-plumb
> >>>>> userspace for explicit synchronization, I'd like to make sure we take
> >>>>> this into account so we only have to do it once.
> >>>>>
> >>>>>
> >>>>>> Deadlock mitigation to recover from segfaults:
> >>>>>> - The kernel knows which process is obliged to signal which fence.
> >>> This information is part of the Present request and supplied by userspace.
> >>>>> This isn't clear to me.  Yes, if we're using anything dma-fence based
> >>>>> like syncobj, this is true.  But it doesn't seem totally true as a
> >>>>> general statement.
> >>>>>
> >>>>>
> >>>>>> - If the producer crashes, the kernel signals the submit fence, so
> >>> that the consumer can make forward progress.
> >>>>>> - If the consumer crashes, the kernel signals the return fence, so
> >>> that the producer can reclaim the buffer.
> >>>>>> - A GPU hang signals all fences. Other deadlocks will be handled like
> >>> GPU hangs.
> >>>>> What do you mean by "all"?  All fences that were supposed to be
> >>>>> signaled by the hung context?
> >>>>>
> >>>>>
> >>>>>> Other window system requests can follow the same idea.
> >>>>>>
> >>>>>> Merged fences where one fence object contains multiple fences will be
> >>> supported. A merged fence is signalled only when its fences are signalled.
> >>> The consumer will have the option to redefine the unsignalled return fence
> >>> to a merged fence.
> >>>>>> 2.2. Modesetting
> >>>>>>
> >>>>>> Since a modesetting driver can also be the consumer, the present
> >>> ioctl will contain a submit fence and a return fence too. One small problem
> >>> with this is that userspace can hang the modesetting driver, but in theory,
> >>> any later present ioctl can override the previous one, so the unsignalled
> >>> presentation is never used.
> >>>>>>
> >>>>>> 3. New memory management
> >>>>>>
> >>>>>> The per-BO fences will be removed and the kernel will not know which
> >>> buffers are busy. This will reduce CPU overhead and latency. The kernel
> >>> will not need per-BO fences with explicit synchronization, so we just need
> >>> to remove their last user: buffer evictions. It also resolves the current
> >>> OOM deadlock.
> >>>>> Is this even really possible?  I'm no kernel MM expert (trying to
> >>>>> learn some) but my understanding is that the use of per-BO dma-fence
> >>>>> runs deep.  I would like to stop using it for implicit synchronization
> >>>>> to be sure, but I'm not sure I believe the claim that we can get rid
> >>>>> of it entirely.  Happy to see someone try, though.
> >>>>>
> >>>>>
> >>>>>> 3.1. Evictions
> >>>>>>
> >>>>>> If the kernel wants to move a buffer, it will have to wait for
> >>> everything to go idle, halt all userspace command submissions, move the
> >>> buffer, and resume everything. This is not expected to happen when memory
> >>> is not exhausted. Other more efficient ways of synchronization are also
> >>> possible (e.g. sync only one process), but are not discussed here.
> >>>>>> 3.2. Per-process VRAM usage quota
> >>>>>>
> >>>>>> Each process can optionally and periodically query its VRAM usage
> >>> quota and change domains of its buffers to obey that quota. For example, a
> >>> process allocated 2 GB of buffers in VRAM, but the kernel decreased the
> >>> quota to 1 GB. The process can change the domains of the least important
> >>> buffers to GTT to get the best outcome for itself. If the process doesn't
> >>> do it, the kernel will choose which buffers to evict at random. (thanks to
> >>> Christian Koenig for this idea)
> >>>>> This is going to be difficult.  On Intel, we have some resources that
> >>>>> have to be pinned to VRAM and can't be dynamically swapped out by the
> >>>>> kernel.  In GL, we probably can deal with it somewhat dynamically.  In
> >>>>> Vulkan, we'll be entirely dependent on the application to use the
> >>>>> appropriate Vulkan memory budget APIs.
> >>>>>
> >>>>> --Jason
> >>>>>
> >>>>>
> >>>>>> 3.3. Buffer destruction without per-BO fences
> >>>>>>
> >>>>>> When the buffer destroy ioctl is called, an optional fence list can
> >>> be passed to the kernel to indicate when it's safe to deallocate the
> >>> buffer. If the fence list is empty, the buffer will be deallocated
> >>> immediately. Shared buffers will be handled by merging fence lists from all
> >>> processes that destroy them. Mitigation of malicious behavior:
> >>>>>> - If userspace destroys a busy buffer, it will get a GPU page fault.
> >>>>>> - If userspace sends fences that never signal, the kernel will have a
> >>> timeout period and then will proceed to deallocate the buffer anyway.
> >>>>>> 3.4. Other notes on MM
> >>>>>>
> >>>>>> Overcommitment of GPU-accessible memory will cause an allocation
> >>> failure or invoke the OOM killer. Evictions to GPU-inaccessible memory
> >>> might not be supported.
> >>>>>> Kernel drivers could move to this new memory management today. Only
> >>> buffer residency and evictions would stop using per-BO fences.
> >>>>>>
> >>>>>> 4. Deprecating implicit synchronization
> >>>>>>
> >>>>>> It can be phased out by introducing a new generation of hardware
> >>> where the driver doesn't add support for it (like a driver fork would do),
> >>> assuming userspace has all the changes for explicit synchronization. This
> >>> could potentially create an isolated part of the kernel DRM where all
> >>> drivers only support explicit synchronization.
> >>>>>> Marek
> >>>
> >>> --
> >>> Daniel Vetter
> >>> Software Engineer, Intel Corporation
> >>> http://blog.ffwll.ch
> >>>
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
From: Daniel Stone @ 2021-04-20 14:53 UTC (permalink / raw)
  To: Marek Olšák; +Cc: ML Mesa-dev, dri-devel



Hi,

On Mon, 19 Apr 2021 at 11:48, Marek Olšák <maraeo@gmail.com> wrote:

> Deadlock mitigation to recover from segfaults:
> - The kernel knows which process is obliged to signal which fence. This
> information is part of the Present request and supplied by userspace.
> - If the producer crashes, the kernel signals the submit fence, so that
> the consumer can make forward progress.
> - If the consumer crashes, the kernel signals the return fence, so that
> the producer can reclaim the buffer.
> - A GPU hang signals all fences. Other deadlocks will be handled like GPU
> hangs.
>

Another thought: with completely arbitrary userspace fencing, none of this
is helpful either. If the compositor can't guarantee that a hostile client
hasn't submitted a fence which will never be signaled, then it won't be
blindly waiting on it, so it already needs infrastructure to handle something like
this. That already handles the crashed-client case, because if the client
crashes, then its connection will be dropped, which will trigger the
compositor to destroy all its resources anyway, including any pending waits.

GPU hangs also look pretty similar; it's an infinite wait, until the client
resubmits a new buffer which would replace (& discard) the old.

So signal-fence-on-process-exit isn't helpful and doesn't provide any extra
reliability; it in fact probably just complicates things.

Cheers,
Daniel


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
From: Christian König @ 2021-04-20 14:58 UTC (permalink / raw)
  To: Daniel Stone, Marek Olšák; +Cc: ML Mesa-dev, dri-devel





Am 20.04.21 um 16:53 schrieb Daniel Stone:
> Hi,
>
> On Mon, 19 Apr 2021 at 11:48, Marek Olšák <maraeo@gmail.com 
> <mailto:maraeo@gmail.com>> wrote:
>
>     Deadlock mitigation to recover from segfaults:
>     - The kernel knows which process is obliged to signal which fence.
>     This information is part of the Present request and supplied by
>     userspace.
>     - If the producer crashes, the kernel signals the submit fence, so
>     that the consumer can make forward progress.
>     - If the consumer crashes, the kernel signals the return fence, so
>     that the producer can reclaim the buffer.
>     - A GPU hang signals all fences. Other deadlocks will be handled
>     like GPU hangs.
>
>
> Another thought: with completely arbitrary userspace fencing, none of 
> this is helpful either. If the compositor can't guarantee that a 
> hostile client has submitted a fence which will never be signaled, 
> then it won't be waiting on it, so it already needs infrastructure to 
> handle something like this.

> That already handles the crashed-client case, because if the client 
> crashes, then its connection will be dropped, which will trigger the 
> compositor to destroy all its resources anyway, including any pending 
> waits.

That's exactly the problem. A compositor isn't immediately informed that
the client crashed; instead it is still referencing the buffer and
trying to use it for compositing.

>
> GPU hangs also look pretty similar; it's an infinite wait, until the 
> client resubmits a new buffer which would replace (& discard) the old.

Correct. You just need to assume that all queues get destroyed and 
re-initialized when a GPU reset happens.

>
> So signal-fence-on-process-exit isn't helpful and doesn't provide any 
> extra reliability; it in fact probably just complicates things.

Well it is when you go for partial GPU resets.

Regards,
Christian.

>
> Cheers,
> Daniel
>



* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
From: Daniel Stone @ 2021-04-20 15:07 UTC (permalink / raw)
  To: Christian König; +Cc: ML Mesa-dev, dri-devel, Marek Olšák



On Tue, 20 Apr 2021 at 15:58, Christian König <
ckoenig.leichtzumerken@gmail.com> wrote:

> Am 20.04.21 um 16:53 schrieb Daniel Stone:
>
> On Mon, 19 Apr 2021 at 11:48, Marek Olšák <maraeo@gmail.com> wrote:
>
>> Deadlock mitigation to recover from segfaults:
>> - The kernel knows which process is obliged to signal which fence. This
>> information is part of the Present request and supplied by userspace.
>> - If the producer crashes, the kernel signals the submit fence, so that
>> the consumer can make forward progress.
>> - If the consumer crashes, the kernel signals the return fence, so that
>> the producer can reclaim the buffer.
>> - A GPU hang signals all fences. Other deadlocks will be handled like GPU
>> hangs.
>>
>
> Another thought: with completely arbitrary userspace fencing, none of this
> is helpful either. If the compositor can't guarantee that a hostile client
> has submitted a fence which will never be signaled, then it won't be
> waiting on it, so it already needs infrastructure to handle something like
> this.
>
>
> That already handles the crashed-client case, because if the client
> crashes, then its connection will be dropped, which will trigger the
> compositor to destroy all its resources anyway, including any pending waits.
>
>
> Exactly that's the problem. A compositor isn't immediately informed that
> the client crashed, instead it is still referencing the buffer and trying
> to use it for compositing.
>

If the compositor no longer has a guarantee that the buffer will be ready
for composition in a reasonable amount of time (which dma_fence gives us,
and this proposal does not appear to give us), then the compositor isn't
trying to use the buffer for compositing, it's waiting asynchronously on a
notification that the fence has signaled before it attempts to use the
buffer.
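
(Concretely, 'waiting asynchronously' today just means dropping the sync_file
fd into the compositor's event loop - dma_fence fds are pollable - roughly:)

/* Sketch: the fence fd becomes readable once the fence signals; only the
 * resulting event makes the buffer eligible for the next repaint. */
#include <sys/epoll.h>

static int watch_fence(int epoll_fd, int fence_fd, void *buffer)
{
        struct epoll_event ev = {
                .events = EPOLLIN,
                .data.ptr = buffer,     /* repaint handler maps this back */
        };
        return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fence_fd, &ev);
}
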

Marek's initial suggestion is that the kernel signal the fence, which would
unblock composition (and presumably show garbage on screen, or at best jump
back to old content).

My position is that the compositor will know the process has crashed anyway
- because its socket has been closed - at which point we destroy all the
client's resources including its windows and buffers regardless. Signaling
the fence doesn't give us any value here, _unless_ the compositor is just
blindly waiting for the fence to signal ... which it can't do because
there's no guarantee the fence will ever signal.

Cheers,
Daniel


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
From: Christian König @ 2021-04-20 15:16 UTC (permalink / raw)
  To: Daniel Stone; +Cc: ML Mesa-dev, dri-devel, Marek Olšák





Am 20.04.21 um 17:07 schrieb Daniel Stone:
> On Tue, 20 Apr 2021 at 15:58, Christian König 
> <ckoenig.leichtzumerken@gmail.com 
> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>
>     Am 20.04.21 um 16:53 schrieb Daniel Stone:
>>     On Mon, 19 Apr 2021 at 11:48, Marek Olšák <maraeo@gmail.com
>>     <mailto:maraeo@gmail.com>> wrote:
>>
>>         Deadlock mitigation to recover from segfaults:
>>         - The kernel knows which process is obliged to signal which
>>         fence. This information is part of the Present request and
>>         supplied by userspace.
>>         - If the producer crashes, the kernel signals the submit
>>         fence, so that the consumer can make forward progress.
>>         - If the consumer crashes, the kernel signals the return
>>         fence, so that the producer can reclaim the buffer.
>>         - A GPU hang signals all fences. Other deadlocks will be
>>         handled like GPU hangs.
>>
>>
>>     Another thought: with completely arbitrary userspace fencing,
>>     none of this is helpful either. If the compositor can't guarantee
>>     that a hostile client has submitted a fence which will never be
>>     signaled, then it won't be waiting on it, so it already needs
>>     infrastructure to handle something like this.
>
>>     That already handles the crashed-client case, because if the
>>     client crashes, then its connection will be dropped, which will
>>     trigger the compositor to destroy all its resources anyway,
>>     including any pending waits.
>
>     Exactly that's the problem. A compositor isn't immediately
>     informed that the client crashed, instead it is still referencing
>     the buffer and trying to use it for compositing.
>
>
> If the compositor no longer has a guarantee that the buffer will be 
> ready for composition in a reasonable amount of time (which dma_fence 
> gives us, and this proposal does not appear to give us), then the 
> compositor isn't trying to use the buffer for compositing, it's 
> waiting asynchronously on a notification that the fence has signaled 
> before it attempts to use the buffer.
>
> Marek's initial suggestion is that the kernel signal the fence, which 
> would unblock composition (and presumably show garbage on screen, or 
> at best jump back to old content).
>
> My position is that the compositor will know the process has crashed 
> anyway - because its socket has been closed - at which point we 
> destroy all the client's resources including its windows and buffers 
> regardless. Signaling the fence doesn't give us any value here, 
> _unless_ the compositor is just blindly waiting for the fence to 
> signal ... which it can't do because there's no guarantee the fence 
> will ever signal.

Yeah, but that assumes that the compositor has changed to not blindly
wait for the client to finish rendering, and as Daniel explained that is
rather unrealistic.

What we need is a fallback mechanism which signals the fence after a 
timeout and gives a penalty to the one causing the timeout.

That gives us the same functionality we have today with the software
scheduler inside the kernel.
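
Sketched in userspace terms (nothing like this exists in the kernel today,
and the names/policy below are made up), such a fallback could be as simple
as racing a timer against each pending fence:

/* Hypothetical compositor-side fallback: a timerfd races every pending
 * userspace fence; if the timer fires first, treat the fence as signalled
 * and apply a penalty (deprioritize/disconnect) to the offending client. */
#include <stdint.h>
#include <sys/timerfd.h>
#include <time.h>

static int arm_fence_watchdog(uint64_t timeout_ns)
{
        int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
        struct itimerspec its = {
                .it_value.tv_sec  = timeout_ns / 1000000000ull,
                .it_value.tv_nsec = timeout_ns % 1000000000ull,
        };
        timerfd_settime(tfd, 0, &its, NULL);
        return tfd;     /* poll alongside the fence; whichever fires first
                         * decides whether we composite or penalize */
}
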

Regards,
Christian.

> Cheers,
> Daniel



* Re: [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
From: Jason Ekstrand @ 2021-04-20 15:45 UTC (permalink / raw)
  To: Daniel Stone; +Cc: ML Mesa-dev, dri-devel, Marek Olšák

It's still early in the morning here and I'm not awake yet so sorry if
this comes out in bits and pieces...

On Tue, Apr 20, 2021 at 7:43 AM Daniel Stone <daniel@fooishbar.org> wrote:
>
> Hi Marek,
>
> On Mon, 19 Apr 2021 at 11:48, Marek Olšák <maraeo@gmail.com> wrote:
>>
>> 2. Explicit synchronization for window systems and modesetting
>>
>> The producer is an application and the consumer is a compositor or a modesetting driver.
>>
>> 2.1. The Present request
>
>
> So the 'present request' is an ioctl, right? Not a userspace construct like it is today? If so, how do we correlate the two?
>
> The terminology is pretty X11-centric so I'll assume that's what you've designed against, but Wayland and even X11 carry much more auxiliary information attached to a present request than just 'this buffer, this swapchain'. Wayland latches a lot of data on presentation, including non-graphics data such as surface geometry (so we can have resizes which don't suck), window state (e.g. fullscreen or not, also so we can have resizes which don't suck), and these requests can also cascade through a tree of subsurfaces (so we can have embeds which don't suck). X11 mostly just carries timestamps, which is more tractable.
>
> Given we don't want to move the entirety of Wayland into kernel-visible objects, how do we synchronise the two streams so they aren't incoherent? Taking a rough stab at it whilst assuming we do have DRM_IOCTL_NONMODE_PRESENT, this would create a present object somewhere in kernel space, which the producer would create and ?? export a FD from, that the compositor would ?? import.
>
>> As part of the Present request, the producer will pass 2 fences (sync objects) to the consumer alongside the presented DMABUF BO:
>> - The submit fence: Initially unsignalled, it will be signalled when the producer has finished drawing into the presented buffer.
>
>
> We have already have this in Wayland through dma_fence. I'm relaxed about this becoming drm_syncobj or drm_newmappedysncobjthing, it's just a matter of typing. X11 has patches to DRI3 to support dma_fence, but they never got merged because it was far too invasive to a server which is no longer maintained.
>
>>
>> - The return fence: Initially unsignalled, it will be signalled when the consumer has finished using the presented buffer.
>
>
> Currently in Wayland the return fence (again a dma_fence) is generated by the compositor and sent as an event when it's done, because we can't have speculative/empty/future fences. drm_syncobj would make this possible, but so far I've been hesitant because I don't see the benefit to it (more below).
>
>>
>> Deadlock mitigation to recover from segfaults:
>> - The kernel knows which process is obliged to signal which fence. This information is part of the Present request and supplied by userspace.
>
>
> Same as today with dma_fence. Less true with drm_syncobj if we're using timelines.
>
>>
>> - If the producer crashes, the kernel signals the submit fence, so that the consumer can make forward progress.
>
>
> This is only a change if the producer is now allowed to submit a fence before it's flushed the work which would eventually fulfill that fence. Using dma_fence has so far isolated us from this.
>
>>
>> - If the consumer crashes, the kernel signals the return fence, so that the producer can reclaim the buffer.
>
>
> 'The consumer' is problematic, per below. I think the wording you want is 'if no references are held to the submitted present object'.
>
>>
>> - A GPU hang signals all fences. Other deadlocks will be handled like GPU hangs.
>>
>> Other window system requests can follow the same idea.
>
>
> Which other window system requests did you have in mind? Again, moving the entirety of Wayland's signaling into the kernel is a total non-starter. Partly because it means our entire protocol would be subject to the kernel's ABI rules, partly because the rules and interdependencies between the requests are extremely complex, but mostly because the kernel is just a useless proxy: it would be forced to do significant work to reason about what those requests do and when they should happen, but wouldn't be able to make those decisions itself so would have to just punt everything to userspace. Unless we have eBPF compositors.
>
>>
>> Merged fences where one fence object contains multiple fences will be supported. A merged fence is signalled only when its fences are signalled. The consumer will have the option to redefine the unsignalled return fence to a merged fence.
>
>
> An elaboration of how this differed from drm_syncobj would be really helpful here. I can make some guesses based on the rest of the mail, but I'm not sure how accurate they are.
>
>>
>> 2.2. Modesetting
>>
>> Since a modesetting driver can also be the consumer, the present ioctl will contain a submit fence and a return fence too. One small problem with this is that userspace can hang the modesetting driver, but in theory, any later present ioctl can override the previous one, so the unsignalled presentation is never used.
>
>
> This is also problematic. It's not just KMS, but media codecs too - V4L doesn't yet have explicit fencing, but given the programming model of codecs and how deeply they interoperate, but it will.
>
> Rather than client (GPU) -> compositor (GPU) -> compositor (KMS), imagine you're playing a Steam game on your Chromebook which you're streaming via Twitch or whatever. The full chain looks like:
> * Steam game renders with GPU
> * Xwayland in container receives dmabuf, forwards dmabuf to Wayland server (does not directly consume)
> * Wayland server (which is actually Chromium) receives dmabuf, forwards dmabuf to Chromium UI process
> * Chromium UI process forwards client dmabuf to KMS for direct scanout
> * Chromium UI process _also_ forwards client dmabuf to GPU process
> * Chromium GPU process composites Chromium UI + client dmabuf + webcam frame from V4L to GPU composition job
> * Chromium GPU process forwards GPU composition dmabuf (not client dmabuf) to media codec for streaming
>
> So, we don't have a 1:1 producer:consumer relationship. Even if we accept it's 1:n, your Chromebook is about to burst into flames and we're dropping frames to try to keep up. Some of the consumers are FIFO (the codec wants to push things through in order), and some of them are mailbox (the display wants to get the latest content, not from half a second ago before the other player started jumping around and now you're dead). You can't reason about any of these dependencies ahead of time from a producer PoV, because userspace will be making these decisions frame by frame. Also someone's started using the Vulkan present-timing extension because life wasn't confusing enough already.
>
> As Christian and Daniel were getting at, there are also two 'levels' of explicit synchronisation.
>
> The first (let's call it 'blind') is plumbing a dma_fence through to be passed with the dmabuf. When the client submits a buffer for presentation, it submits a dma_fence as well. When the compositor is finished with it (i.e. has flushed the last work which will source from that buffer), it passes a dma_fence back to the client, or no fence if none is required (buffer was never accessed, or all accesses are known to be fully retired e.g. the last fence accessing it has already signaled). This is just a matter of typing, and is supported by at least Weston. It implies no scheduling change over implicit fencing in that the compositor can be held hostage by abusive clients with a really long compute shader in their dependency chain: all that's happening is that we're plumbing those synchronisation tokens through userspace instead of having the kernel dig them up from dma_resv. But we at least have a no-deadlock guarantee, because a dma_fence will complete in bounded time.
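
For illustration, a rough sketch of the client side of that 'blind' plumbing.
The sync_file export is real API (VK_KHR_external_semaphore_fd; in real code
the KHR entrypoint is loaded via vkGetDeviceProcAddr), while the
present_with_fences() protocol helper is a stand-in for whatever the window
system actually provides:

#include <poll.h>
#include <vulkan/vulkan.h>

/* Hypothetical protocol helper: hand the buffer plus submit fence to the
 * compositor and receive the release (return) fence back. */
extern int present_with_fences(int dmabuf_fd, int submit_fence_fd);

/* 'dev' is a valid VkDevice, 'render_done' a semaphore signalled by the
 * submit that finished drawing into the dmabuf being presented. */
static void present_blind(VkDevice dev, VkSemaphore render_done, int dmabuf_fd)
{
    VkSemaphoreGetFdInfoKHR get = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
        .semaphore = render_done,
        .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
    };
    int submit_fd = -1;
    vkGetSemaphoreFdKHR(dev, &get, &submit_fd);

    int return_fd = present_with_fences(dmabuf_fd, submit_fd);

    /* A sync_file fd polls readable once its dma_fence has signalled, so
     * this wait is bounded -- which is the whole point of 'blind'. */
    struct pollfd pfd = { .fd = return_fd, .events = POLLIN };
    poll(&pfd, 1, -1);
}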
>
> The second (let's call it 'smart') is ... much more than that. Not only does the compositor accept and generate explicit synchronisation points for the client, but those synchronisation points aren't dma_fences, but may be wait-before-signal, or may be wait-never-signal. So in order to avoid a terminal deadlock, the compositor has to sit on every synchronisation point and check before it flushes any dependent work that it has signaled, or will at least signal in bounded time. If that guarantee isn't there, you have to punt and see if anything happens at your next repaint point. We don't currently have this support in any compositor, and it's a lot more work than blind.
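
Purely as a sketch of the extra per-fence book-keeping that 'smart' implies
(every name below is made up; only the decision itself matters):

#include <stdbool.h>
#include <stdint.h>

struct client_sync_point {
    bool is_dma_fence;   /* carries the bounded-time guarantee */
    bool signalled;      /* e.g. observed payload >= wait value */
};

enum verdict { FLUSH_NOW, KEEP_POLLING, PUNT_TO_NEXT_REPAINT };

static enum verdict classify(const struct client_sync_point *p,
                             int64_t now_ns, int64_t repaint_deadline_ns)
{
    if (p->is_dma_fence || p->signalled)
        return FLUSH_NOW;            /* safe to flush dependent work */
    if (now_ns < repaint_deadline_ns)
        return KEEP_POLLING;         /* sit on it and re-check */
    return PUNT_TO_NEXT_REPAINT;     /* keep the old state around */
}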
>
> Given the interdependencies I've described above for Wayland - say a resize case, or when a surface commit triggers a cascade of subsurface commits - GPU-side conditional rendering is not always possible. In those cases, you _must_ do CPU-side waits and keep both sets of state around. Pain.
>
> Typing all that out has convinced me that the current proposal is a net loss in every case.
>
> Complex rendering uses (game engine with a billion draw calls, a billion BOs, complex sync dependencies, wait-before-signal and/or conditional rendering/descriptor indexing) don't need the complexity of a present ioctl and checking whether other processes have crashed or whatever. They already have everything plumbed through for this themselves, and need to implement so much infrastructure around it that they don't need much/any help from the kernel. Just give them a sync primitive with almost zero guarantees that they can map into CPU & GPU address space, let them go wild with it. drm_syncobj_plus_footgun. Good luck.
>
> Simple presentation uses (desktop, browser, game) don't need the hyperoptimisation of sync primitives. Frame times are relatively long, and you can only have so many surfaces which aren't occluded. Either you have a complex scene to composite, in which case the CPU overhead of something like dma_fence is lower than the CPU overhead required to walk through a single compositor repaint cycle anyway, or you have a completely trivial scene to composite and you can absolutely eat the overhead of exporting and scheduling like two fences in 10ms.
>
> Complex presentation uses (out-streaming, media sources, deeper presentation chains) make the trivial present ioctl so complex that its benefits evaporate. Wait-before-signal pushes so much complexity into the compositor that you have to eat a lot of CPU overhead there and lose your ability to do pipelined draws because you have to hang around and see if they'll ever complete. Cross-device usage means everyone just ends up spinning on the CPU instead.
>
> So, can we take a step back? What are the problems we're trying to solve? If it's about optimising the game engine's internal rendering, how would that benefit from a present ioctl instead of current synchronisation?

IMO, there are two problems being solved here which are related in
very subtle and tricky ways.  They're also, admittedly, driver
problems, not really winsys problems.  Unfortunately, they may have
winsys implications.

First, is better/real timelines for Vulkan and compute.  With
VK_KHR_timeline_semaphore, we introduced the timeline programming
model to Vulkan.  This is a massively better programming model for
complex rendering apps which want to be doing all sorts of crazy.  It
comes with all the fun toys including wait-before-signal and no
timeouts on any particular time points (a single command buffer may
still time out).  Unfortunately, the current implementation involves a
lot of driver complexity, both in user space and kernel space.  The
"ideal" implementation for timelines (which is what Win10 does) is to
have a trivial implementation where each timeline is a 64-bit integer
living somewhere, clients signal whatever value they want, and you
just throw the whole mess at the wall and hope the scheduler sorts it
out.  I'm going to call these "memory fences" rather than "userspace
fences" because they could, in theory, be hidden entirely inside the
kernel.
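
For reference, this is what that model already looks like at the API level
with VK_KHR_timeline_semaphore (core in Vulkan 1.2); 'dev' is assumed to be a
valid VkDevice:

#include <vulkan/vulkan.h>

static void timeline_example(VkDevice dev)
{
    /* The timeline is conceptually just a monotonically increasing u64. */
    VkSemaphoreTypeCreateInfo type_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO,
        .semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE,
        .initialValue = 0,
    };
    VkSemaphoreCreateInfo create_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
        .pNext = &type_info,
    };
    VkSemaphore timeline;
    vkCreateSemaphore(dev, &create_info, NULL, &timeline);

    /* Signal point 42 from the CPU (a queue submit can signal it too). */
    VkSemaphoreSignalInfo signal = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_SIGNAL_INFO,
        .semaphore = timeline,
        .value = 42,
    };
    vkSignalSemaphore(dev, &signal);

    /* Waiting on a point nobody has queued a signal for yet is legal: this
     * is the wait-before-signal / no-timeout behaviour in question. */
    uint64_t wait_value = 42;
    VkSemaphoreWaitInfo wait = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO,
        .semaphoreCount = 1,
        .pSemaphores = &timeline,
        .pValues = &wait_value,
    };
    vkWaitSemaphores(dev, &wait, UINT64_MAX);
}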

We also want something like this for compute workloads.  Not only
because Vulkan and Level Zero provide this as part of their core API but
because compute very much doesn't want dma-fence guarantees.  You can,
in theory, have a compute kernel sitting there running for hours and
it should be ok assuming your scheduler can preempt and time-slice it
with other stuff.  This means that we can't ever have a long-running
compute batch which triggers a dma-fence.  We have to be able to
trigger SOMETHING at the ends of those batches.  What do we use?  TBD
but memory fences are the current proposal.
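
A minimal sketch of what such a memory fence degenerates to, ignoring the GPU
side and the "how do we wait without spinning" question entirely:

#include <stdatomic.h>
#include <stdint.h>

/* One timeline == one 64-bit value visible to both CPU and GPU. */
typedef _Atomic uint64_t memory_fence_t;

/* Signalling is just storing a higher value. */
static void memory_fence_signal(memory_fence_t *f, uint64_t value)
{
    atomic_store_explicit(f, value, memory_order_release);
}

static int memory_fence_reached(memory_fence_t *f, uint64_t wait_value)
{
    return atomic_load_explicit(f, memory_order_acquire) >= wait_value;
}

/* Illustration only: a real CPU-side wait must not spin unbounded -- that is
 * exactly the helper the kernel would need to provide. */
static void memory_fence_wait(memory_fence_t *f, uint64_t wait_value)
{
    while (!memory_fence_reached(f, wait_value))
        ;
}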

The second biting issue is that, in the current kernel implementation
of dma-fence and dma_resv, we've lumped internal synchronization for
memory management together with execution synchronization for
userspace dependency tracking.  And we have no way to tell the
difference between the two internally.  Even if user space is passing
around sync_files and trying to do explicit sync, once you get inside
the kernel, they're all dma-fences and it can't tell the difference.
If we move to a more userspace-controlled synchronization model with
wait-before-signal and no timeouts unless requested, regardless of the
implementation, it plays really badly with dma-fence.  And, by "badly" I
mean the two are nearly incompatible.  From a user space PoV, it means
it's tricky to provide the finite time dma-fence guarantee.  From a
kernel PoV, it's way worse.  Currently, the way dma-fence is
constructed, it's impossible to deadlock assuming everyone follows the
rules.  The moment we allow user space to deadlock itself and allow
those deadlocks to leak into the kernel, we have a problem.  Even if
we throw in some timeouts, we still have a scenario where user space
has one linearizable dependency graph for execution synchronization
and the kernel has a different linearizable dependency graph for
memory management and, when you smash them together, you may have
cycles in your graph.

So how do we sort this all out?  Good question.  It's a hard problem.
Probably the hardest problem here is the second one: the intermixing
of synchronization types.  Solving that one is likely going to require
some user space re-plumbing because all the user space APIs we have
for explicit sync are built on dma-fence.

--Jason


> If it's about composition, how do we balance the complexity between the kernel and userspace? What's the global benefit from throwing our hands in the air and saying 'you deal with it' to all of userspace, given that existing mailbox systems making frame-by-frame decisions already preclude deep/speculative pipelining on the client side?
>
> Given that userspace then loses all ability to reason about presentation if wait-before-signal becomes a thing, do we end up with a global performance loss by replacing the overhead of kernel dma_fence handling with userspace spinning on a page? Even if we micro-optimise that by allowing userspace to be notified on access, is the overhead of pagefault -> kernel signal handler -> queue signalfd notification -> userspace event loop -> read page & compare to expected value, actually better than dma_fence?
>
> Cheers,
> Daniel
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 15:16       ` Christian König
@ 2021-04-20 15:49         ` Daniel Stone
  2021-04-20 16:25           ` Marek Olšák
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Stone @ 2021-04-20 15:49 UTC (permalink / raw)
  To: Christian König; +Cc: ML Mesa-dev, dri-devel, Marek Olšák


[-- Attachment #1.1: Type: text/plain, Size: 3802 bytes --]

Hi,

On Tue, 20 Apr 2021 at 16:16, Christian König <
ckoenig.leichtzumerken@gmail.com> wrote:

> Am 20.04.21 um 17:07 schrieb Daniel Stone:
>
> If the compositor no longer has a guarantee that the buffer will be ready
> for composition in a reasonable amount of time (which dma_fence gives us,
> and this proposal does not appear to give us), then the compositor isn't
> trying to use the buffer for compositing, it's waiting asynchronously on a
> notification that the fence has signaled before it attempts to use the
> buffer.
>
> Marek's initial suggestion is that the kernel signal the fence, which
> would unblock composition (and presumably show garbage on screen, or at
> best jump back to old content).
>
> My position is that the compositor will know the process has crashed
> anyway - because its socket has been closed - at which point we destroy all
> the client's resources including its windows and buffers regardless.
> Signaling the fence doesn't give us any value here, _unless_ the compositor
> is just blindly waiting for the fence to signal ... which it can't do
> because there's no guarantee the fence will ever signal.
>
>
> Yeah, but that assumes that the compositor has changed to not blindly wait
> for the client to finish rendering, and as Daniel explained that is rather
> unrealistic.
>
> What we need is a fallback mechanism which signals the fence after a
> timeout and gives a penalty to the one causing the timeout.
>
> That gives us the same functionality we have today with the in software
> scheduler inside the kernel.
>

OK, if that's the case then I think I'm really missing something which
isn't explained in this thread, because I don't understand what the
additional complexity and API change gains us (see my first reply in this
thread).

By way of example - say I have a blind-but-explicit compositor that takes a
drm_syncobj along with a dmabuf with each client presentation request, but
doesn't check syncobj completion, it just imports that into a VkSemaphore +
VkImage and schedules work for the next frame.
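
The import half of that is already just a couple of calls with
VK_KHR_external_semaphore_fd, assuming the client's syncobj was exported to
an opaque fd (e.g. via DRM_IOCTL_SYNCOBJ_HANDLE_TO_FD); error handling and
the vkGetDeviceProcAddr lookup of the KHR entrypoint are omitted:

#include <vulkan/vulkan.h>

static VkSemaphore import_syncobj_fd(VkDevice dev, int syncobj_fd)
{
    VkSemaphoreCreateInfo create_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
    };
    VkSemaphore sem;
    vkCreateSemaphore(dev, &create_info, NULL, &sem);

    VkImportSemaphoreFdInfoKHR import = {
        .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
        .semaphore = sem,
        .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT,
        .fd = syncobj_fd,   /* ownership of the fd transfers to the driver */
    };
    vkImportSemaphoreFdKHR(dev, &import);
    return sem;
}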

Currently, that generates an execbuf ioctl for the composition (ignore KMS
for now) with a sync point to wait on, and the kernel+GPU scheduling
guarantees that the composition work will not begin until the client
rendering work has retired. We have a further guarantee that this work will
complete in reasonable time, for some value of 'reasonable'.

My understanding of this current proposal is that:
* userspace creates a 'present fence' with this new ioctl
* the fence becomes signaled when a value is written to a location in
memory, which is visible through both CPU and GPU mappings of that page
* this 'present fence' is imported as a VkSemaphore (?) and the userspace
Vulkan driver will somehow wait on this value  either before submitting
work or as a possibly-hardware-assisted GPU-side wait (?)
* the kernel's scheduler is thus eliminated from the equation, and every
execbuf is submitted directly to hardware, because either userspace knows
that the fence has already been signaled, or it will issue a GPU-side wait
(?)
* but the kernel is still required to monitor completion of every fence
itself, so it can forcibly complete, or penalise the client (?)

Lastly, let's say we stop ignoring KMS: what happens for the
render-with-GPU-display-on-KMS case? Do we need to do the equivalent of
glFinish() in userspace and only submit the KMS atomic request when the GPU
work has fully retired?
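
For comparison, this is the current dma_fence answer on the KMS side: the
atomic commit carries a sync_file through the plane's IN_FENCE_FD property,
so no glFinish()-equivalent is needed today. With a userspace memory fence
there is nothing comparable to put there, hence the question. The property-id
lookup helper below is hypothetical:

#include <stdint.h>
#include <xf86drmMode.h>

/* Hypothetical helper: resolve a named property on a KMS object. */
extern uint32_t lookup_prop_id(int fd, uint32_t obj_id, const char *name);

static int commit_with_in_fence(int fd, uint32_t plane_id, uint32_t fb_id,
                                int sync_file_fd)
{
    drmModeAtomicReq *req = drmModeAtomicAlloc();
    int ret;

    drmModeAtomicAddProperty(req, plane_id,
                             lookup_prop_id(fd, plane_id, "FB_ID"), fb_id);
    /* The kernel will not scan out this FB before the fence has signalled. */
    drmModeAtomicAddProperty(req, plane_id,
                             lookup_prop_id(fd, plane_id, "IN_FENCE_FD"),
                             (uint64_t)sync_file_fd);

    ret = drmModeAtomicCommit(fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
    drmModeAtomicFree(req);
    return ret;
}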

Clarifying those points would be really helpful so this is less of a
strawman. I have some further opinions, but I'm going to wait until I
understand what I'm actually arguing against before I go too far. :) The
last point is very salient though.

Cheers,
Daniel

[-- Attachment #1.2: Type: text/html, Size: 4883 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 11:59           ` Christian König
  2021-04-20 14:09             ` Daniel Vetter
@ 2021-04-20 16:19             ` Jason Ekstrand
  1 sibling, 0 replies; 105+ messages in thread
From: Jason Ekstrand @ 2021-04-20 16:19 UTC (permalink / raw)
  To: Christian König; +Cc: ML Mesa-dev, dri-devel, Marek Olšák

Sorry for the mega-reply but timezones...

On Tue, Apr 20, 2021 at 6:59 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > the kernel's problem.
>
> Well, the path of inner peace begins with four words. “Not my fucking
> problem.”

🧘

> But I'm not that much concerned about the kernel, but rather about
> important userspace processes like X, Wayland, SurfaceFlinger etc...
>
> I mean attaching a page to a sync object and allowing to wait/signal
> from both CPU as well as GPU side is not so much of a problem.

Yup... Sorting out these issues is what makes this a hard problem.


> > You have to somehow handle that, e.g. perhaps with conditional
> > rendering and just using the old frame in compositing if the new one
> > doesn't show up in time.
>
> Nice idea, but how would you handle that on the OpenGL/Glamor/Vulkan level.

"Just handle it with conditional rendering" is a pretty trite answer.
If we have memory fences, we could expose a Vulkan extension to allow
them to be read by conditional rendering or by a shader.  However, as
Daniel has pointed out multiple times, composition pipelines are long
and complex and cheap tricks like that aren't something we can rely on
for solving the problem.  If we're going to solve the problem, we need
to make driver-internal stuff nice while still providing something
that looks very much like a sync_file with finite time semantics to
the composition pipeline.  How?  That's the question.
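
To make the conditional-rendering idea slightly more concrete: with today's
VK_EXT_conditional_rendering the predicate is a 32-bit value in a buffer, so
*if* a memory-fence payload could be mirrored into such a buffer (that
mirroring is the hypothetical part), a compositor could record both paths and
let the GPU pick:

#include <vulkan/vulkan.h>

/* 'predicate_buf' holds a 32-bit value at offset 0 that becomes non-zero
 * once the client's fence has signalled; the buffer must have been created
 * with VK_BUFFER_USAGE_CONDITIONAL_RENDERING_BIT_EXT. */
static void record_composite(VkCommandBuffer cmd, VkBuffer predicate_buf)
{
    VkConditionalRenderingBeginInfoEXT cond = {
        .sType = VK_STRUCTURE_TYPE_CONDITIONAL_RENDERING_BEGIN_INFO_EXT,
        .buffer = predicate_buf,
        .offset = 0,
    };

    /* Predicate non-zero: composite the client's new buffer. */
    vkCmdBeginConditionalRenderingEXT(cmd, &cond);
    /* ... vkCmdDraw*() sampling the new client buffer ... */
    vkCmdEndConditionalRenderingEXT(cmd);

    /* Predicate still zero: fall back to the previously presented frame. */
    cond.flags = VK_CONDITIONAL_RENDERING_INVERTED_BIT_EXT;
    vkCmdBeginConditionalRenderingEXT(cmd, &cond);
    /* ... vkCmdDraw*() sampling the old frame ... */
    vkCmdEndConditionalRenderingEXT(cmd);
}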


> Regards,
> Christian.
>
> Am 20.04.21 um 13:16 schrieb Daniel Vetter:
> > On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
> >> Daniel, are you suggesting that we should skip any deadlock prevention in
> >> the kernel, and just let userspace wait for and signal any fence it has
> >> access to?
> > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > the kernel's problem. The only criteria is that the kernel itself must
> > never rely on these userspace fences, except for stuff like implementing
> > optimized cpu waits. And in those we must always guarantee that the
> > userspace process remains interruptible.
> >
> > It's a completely different world from dma_fence based kernel fences,
> > whether those are implicit or explicit.
> >
> >> Do you have any concern with the deprecation/removal of BO fences in the
> >> kernel assuming userspace is only using explicit fences? Any concern with
> >> the submit and return fences for modesetting and other producer<->consumer
> >> scenarios?
> > Let me work on the full replay for your rfc first, because there's a lot
> > of details here and nuance.
> > -Daniel
> >
> >> Thanks,
> >> Marek
> >>
> >> On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >>
> >>> On Tue, Apr 20, 2021 at 12:15 PM Christian König
> >>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>> Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> >>>>> Not going to comment on everything on the first pass...
> >>>>>
> >>>>> On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <maraeo@gmail.com> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> This is our initial proposal for explicit fences everywhere and new
> >>> memory management that doesn't use BO fences. It's a redesign of how Linux
> >>> graphics drivers work, and it can coexist with what we have now.
> >>>>>>
> >>>>>> 1. Introduction
> >>>>>> (skip this if you are already sold on explicit fences)
> >>>>>>
> >>>>>> The current Linux graphics architecture was initially designed for
> >>> GPUs with only one graphics queue where everything was executed in the
> >>> submission order and per-BO fences were used for memory management and
> >>> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> >>> queues were added on top, which required the introduction of implicit
> >>> GPU-GPU synchronization between queues of different processes using per-BO
> >>> fences. Recently, even parallel execution within one queue was enabled
> >>> where a command buffer starts draws and compute shaders, but doesn't wait
> >>> for them, enabling parallelism between back-to-back command buffers.
> >>> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> >>> was created to enable all those use cases, and it's the only reason why the
> >>> scheduler exists.
> >>>>>> The GPU scheduler, implicit synchronization, BO-fence-based memory
> >>> management, and the tracking of per-BO fences increase CPU overhead and
> >>> latency, and reduce parallelism. There is a desire to replace all of them
> >>> with something much simpler. Below is how we could do it.
> >>>>>>
> >>>>>> 2. Explicit synchronization for window systems and modesetting
> >>>>>>
> >>>>>> The producer is an application and the consumer is a compositor or a
> >>> modesetting driver.
> >>>>>> 2.1. The Present request
> >>>>>>
> >>>>>> As part of the Present request, the producer will pass 2 fences (sync
> >>> objects) to the consumer alongside the presented DMABUF BO:
> >>>>>> - The submit fence: Initially unsignalled, it will be signalled when
> >>> the producer has finished drawing into the presented buffer.
> >>>>>> - The return fence: Initially unsignalled, it will be signalled when
> >>> the consumer has finished using the presented buffer.
> >>>>> I'm not sure syncobj is what we want.  In the Intel world we're trying
> >>>>> to go even further to something we're calling "userspace fences" which
> >>>>> are a timeline implemented as a single 64-bit value in some
> >>>>> CPU-mappable BO.  The client writes a higher value into the BO to
> >>>>> signal the timeline.
> >>>> Well that is exactly what our Windows guys have suggested as well, but
> >>>> it strongly looks like that this isn't sufficient.
> >>>>
> >>>> First of all you run into security problems when any application can
> >>>> just write any value to that memory location. Just imagine an
> >>>> application sets the counter to zero and X waits forever for some
> >>>> rendering to finish.
> >>> The thing is, with userspace fences, preventing security boundary issues
> >>> moves into userspace entirely. And it really doesn't matter whether
> >>> the event you're waiting on doesn't complete because the other app
> >>> crashed or was stupid or intentionally gave you a wrong fence point:
> >>> You have to somehow handle that, e.g. perhaps with conditional
> >>> rendering and just using the old frame in compositing if the new one
> >>> doesn't show up in time. Or something like that. So trying to get the
> >>> kernel involved but also not so much involved sounds like a bad design
> >>> to me.

Sorry that my initial reply was so terse.  I'm not claiming (nor will
I ever) that memory fences are an easy solution.  They're certainly
fraught with potential issues.  I do, however, think that they are the
basis of the future of synchronization.  The fact that Windows 10 and
the consoles have been using them to great effect for 5+ years
indicates that they do, in fact, work.  However, Microsoft has never
supported both a memory fence and dma-fence-like model in the same
Windows version so there's no prior art for smashing the two models
together.


> >>>> Additional to that in such a model you can't determine who is the guilty
> >>>> queue in case of a hang and can't reset the synchronization primitives
> >>>> in case of an error.

For this, Windows has two solutions.  One is that everyone is
hang-aware in some sense.  It means extra complexity in the window
system but some amount of that is necessary if you don't have easy
error propagation.

Second is that the fences aren't actually signaled from user space as
Daniel suggests but are signaled from the kernel.  This means that the
kernel is aware of all the fences which are supposed to be signaled
from a given context/engine.  When a hang occurs, it has a mode where
it smashes all timelines which the context is supposed to be signaling
to UINT64_MAX, unblocking anything which depends on them.  There are a
lot of details here which are unclear to me such as what happens if
some other operation smashes it back to a lower value.  Does it keep
smashing to UINT64_MAX until it all clears?  I'm not sure.
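
Very roughly, and with every structure and helper below invented purely for
illustration, that kernel-side recovery amounts to something like:

#include <linux/io.h>
#include <linux/list.h>
#include <linux/wait.h>

/* Hypothetical sketch only -- none of this exists as-is. */
struct mem_timeline {
    u64 __iomem      *payload;   /* 64-bit value mapped by CPU and GPU */
    struct list_head  link;      /* on the owning context's list */
};

struct hung_ctx {
    struct list_head   timelines;
    wait_queue_head_t  waiters;
};

static void ctx_smash_timelines(struct hung_ctx *ctx)
{
    struct mem_timeline *tl;

    /* Unblock everyone waiting on a point this context was supposed to
     * signal by jumping its timelines to the maximum value. */
    list_for_each_entry(tl, &ctx->timelines, link)
        writeq(U64_MAX, tl->payload);

    wake_up_all(&ctx->waiters);
}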


> >>>> Apart from that this is rather inefficient, e.g. we don't have any way
> >>>> to prevent priority inversion when used as a synchronization mechanism
> >>>> between different GPU queues.
> >>> Yeah but you can't have it both ways. Either all the scheduling in the
> >>> kernel and fence handling is a problem, or you actually want to
> >>> schedule in the kernel. hw seems to definitely move towards the more
> >>> stupid spinlock-in-hw model (and direct submit from userspace and all
> >>> that), priority inversions be damned. I'm really not sure we should
> >>> fight that - if it's really that inefficient then maybe hw will add
> >>> support for waiting sync constructs in hardware, or at least be
> >>> smarter about scheduling other stuff. E.g. on intel hw both the kernel
> >>> scheduler and fw scheduler knows when you're spinning on a hw fence
> >>> (whether userspace or kernel doesn't matter) and plugs in something
> >>> else. Add in a bit of hw support to watch cachelines, and you have
> >>> something which can handle both directions efficiently.
> >>>
> >>> Imo given where hw is going, we shouldn't try to be too clever here.
> >>> The only thing we do need to provision is being able to do cpu side
> >>> waits without spinning. And that should probably be done in a fairly
> >>> gpu specific way still.

Yup.  The synchronization model that Windows, consoles, and hardware
are moving towards is a model where you have memory fences for
execution synchronization and explicit memory binding and residency
management and then the ask from userspace to the kernel is "put
everything in place and then run as fast as you can".  I think that
last bit is roughly what Marek is asking for here.  The difference is
in the details on how it all works internally.

Also, just to be clear, my comments here weren't so much "please solve
all the problems" as asking that, as we improve explicit
synchronization plumbing, we do it with memory fences in mind.  I do
think they're the future, even if it's a difficult future to get to,
and I'm trying to find a path.

--Jason


> >>> -Daniel
> >>>
> >>>> Christian.
> >>>>
> >>>>>     The kernel then provides some helpers for
> >>>>> waiting on them reliably and without spinning.  I don't expect
> >>>>> everyone to support these right away but, If we're going to re-plumb
> >>>>> userspace for explicit synchronization, I'd like to make sure we take
> >>>>> this into account so we only have to do it once.
> >>>>>
> >>>>>
> >>>>>> Deadlock mitigation to recover from segfaults:
> >>>>>> - The kernel knows which process is obliged to signal which fence.
> >>> This information is part of the Present request and supplied by userspace.
> >>>>> This isn't clear to me.  Yes, if we're using anything dma-fence based
> >>>>> like syncobj, this is true.  But it doesn't seem totally true as a
> >>>>> general statement.
> >>>>>
> >>>>>
> >>>>>> - If the producer crashes, the kernel signals the submit fence, so
> >>> that the consumer can make forward progress.
> >>>>>> - If the consumer crashes, the kernel signals the return fence, so
> >>> that the producer can reclaim the buffer.
> >>>>>> - A GPU hang signals all fences. Other deadlocks will be handled like
> >>> GPU hangs.
> >>>>> What do you mean by "all"?  All fences that were supposed to be
> >>>>> signaled by the hung context?
> >>>>>
> >>>>>
> >>>>>> Other window system requests can follow the same idea.
> >>>>>>
> >>>>>> Merged fences where one fence object contains multiple fences will be
> >>> supported. A merged fence is signalled only when its fences are signalled.
> >>> The consumer will have the option to redefine the unsignalled return fence
> >>> to a merged fence.
> >>>>>> 2.2. Modesetting
> >>>>>>
> >>>>>> Since a modesetting driver can also be the consumer, the present
> >>> ioctl will contain a submit fence and a return fence too. One small problem
> >>> with this is that userspace can hang the modesetting driver, but in theory,
> >>> any later present ioctl can override the previous one, so the unsignalled
> >>> presentation is never used.
> >>>>>>
> >>>>>> 3. New memory management
> >>>>>>
> >>>>>> The per-BO fences will be removed and the kernel will not know which
> >>> buffers are busy. This will reduce CPU overhead and latency. The kernel
> >>> will not need per-BO fences with explicit synchronization, so we just need
> >>> to remove their last user: buffer evictions. It also resolves the current
> >>> OOM deadlock.
> >>>>> Is this even really possible?  I'm no kernel MM expert (trying to
> >>>>> learn some) but my understanding is that the use of per-BO dma-fence
> >>>>> runs deep.  I would like to stop using it for implicit synchronization
> >>>>> to be sure, but I'm not sure I believe the claim that we can get rid
> >>>>> of it entirely.  Happy to see someone try, though.
> >>>>>
> >>>>>
> >>>>>> 3.1. Evictions
> >>>>>>
> >>>>>> If the kernel wants to move a buffer, it will have to wait for
> >>> everything to go idle, halt all userspace command submissions, move the
> >>> buffer, and resume everything. This is not expected to happen when memory
> >>> is not exhausted. Other more efficient ways of synchronization are also
> >>> possible (e.g. sync only one process), but are not discussed here.
> >>>>>> 3.2. Per-process VRAM usage quota
> >>>>>>
> >>>>>> Each process can optionally and periodically query its VRAM usage
> >>> quota and change domains of its buffers to obey that quota. For example, a
> >>> process allocated 2 GB of buffers in VRAM, but the kernel decreased the
> >>> quota to 1 GB. The process can change the domains of the least important
> >>> buffers to GTT to get the best outcome for itself. If the process doesn't
> >>> do it, the kernel will choose which buffers to evict at random. (thanks to
> >>> Christian Koenig for this idea)
> >>>>> This is going to be difficult.  On Intel, we have some resources that
> >>>>> have to be pinned to VRAM and can't be dynamically swapped out by the
> >>>>> kernel.  In GL, we probably can deal with it somewhat dynamically.  In
> >>>>> Vulkan, we'll be entirely dependent on the application to use the
> >>>>> appropriate Vulkan memory budget APIs.
> >>>>>
> >>>>> --Jason
> >>>>>
> >>>>>
> >>>>>> 3.3. Buffer destruction without per-BO fences
> >>>>>>
> >>>>>> When the buffer destroy ioctl is called, an optional fence list can
> >>> be passed to the kernel to indicate when it's safe to deallocate the
> >>> buffer. If the fence list is empty, the buffer will be deallocated
> >>> immediately. Shared buffers will be handled by merging fence lists from all
> >>> processes that destroy them. Mitigation of malicious behavior:
> >>>>>> - If userspace destroys a busy buffer, it will get a GPU page fault.
> >>>>>> - If userspace sends fences that never signal, the kernel will have a
> >>> timeout period and then will proceed to deallocate the buffer anyway.
> >>>>>> 3.4. Other notes on MM
> >>>>>>
> >>>>>> Overcommitment of GPU-accessible memory will cause an allocation
> >>> failure or invoke the OOM killer. Evictions to GPU-inaccessible memory
> >>> might not be supported.
> >>>>>> Kernel drivers could move to this new memory management today. Only
> >>> buffer residency and evictions would stop using per-BO fences.
> >>>>>>
> >>>>>> 4. Deprecating implicit synchronization
> >>>>>>
> >>>>>> It can be phased out by introducing a new generation of hardware
> >>> where the driver doesn't add support for it (like a driver fork would do),
> >>> assuming userspace has all the changes for explicit synchronization. This
> >>> could potentially create an isolated part of the kernel DRM where all
> >>> drivers only support explicit synchronization.
> >>>>>> Marek
> >>>>>> _______________________________________________
> >>>>>> dri-devel mailing list
> >>>>>> dri-devel@lists.freedesktop.org
> >>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >>>>> _______________________________________________
> >>>>> mesa-dev mailing list
> >>>>> mesa-dev@lists.freedesktop.org
> >>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >>>
> >>> --
> >>> Daniel Vetter
> >>> Software Engineer, Intel Corporation
> >>> http://blog.ffwll.ch
> >>>
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 14:09             ` Daniel Vetter
@ 2021-04-20 16:24               ` Jason Ekstrand
  0 siblings, 0 replies; 105+ messages in thread
From: Jason Ekstrand @ 2021-04-20 16:24 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Christian König, dri-devel, Marek Olšák, ML Mesa-dev

On Tue, Apr 20, 2021 at 9:10 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, Apr 20, 2021 at 1:59 PM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
> >
> > > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > > the kernel's problem.
> >
> > Well, the path of inner peace begins with four words. “Not my fucking
> > problem.”
> >
> > But I'm not that much concerned about the kernel, but rather about
> > important userspace processes like X, Wayland, SurfaceFlinger etc...
> >
> > I mean attaching a page to a sync object and allowing to wait/signal
> > from both CPU as well as GPU side is not so much of a problem.
> >
> > > You have to somehow handle that, e.g. perhaps with conditional
> > > rendering and just using the old frame in compositing if the new one
> > > doesn't show up in time.
> >
> > Nice idea, but how would you handle that on the OpenGL/Glamor/Vulkan level.
>
> For opengl we provide all the same guarantees, so if you get one of these
> you just block until the fence is signalled. Doing that properly means a
> submit thread to support drm_syncobj, like for vulkan.
>
> For vulkan we probably want to represent these as proper vk timeline
> objects, and the vulkan way is to just let the application (well
> compositor) here deal with it. If they import timelines from untrusted
> other parties, they need to handle the potential fallback of being
> lied to. The how is "not vulkan's fucking problem", because that entire
> "with great power (well performance) comes great responsibility" is
> the entire vk design paradigm.

The security aspects are currently an unsolved problem in Vulkan.  The
assumption is that everyone trusts everyone else to be careful with
the scissors.  It's a great model!

I think we can do something in Vulkan to allow apps to protect
themselves a bit but it's tricky and non-obvious.

--Jason


> Glamour will just rely on GL providing nice package of the harsh
> reality of gpus, like usual.
>
> So I guess step 1 here for GL would be to provide some kind of
> import/export of timeline syncobj, including properly handling this
> "future/indefinite fences" aspect of them with submit thread and
> everything.
>
> -Daniel
>
> >
> > Regards,
> > Christian.
> >
> > Am 20.04.21 um 13:16 schrieb Daniel Vetter:
> > > On Tue, Apr 20, 2021 at 07:03:19AM -0400, Marek Olšák wrote:
> > >> Daniel, are you suggesting that we should skip any deadlock prevention in
> > >> the kernel, and just let userspace wait for and signal any fence it has
> > >> access to?
> > > Yeah. If we go with userspace fences, then userspace can hang itself. Not
> > > the kernel's problem. The only criteria is that the kernel itself must
> > > never rely on these userspace fences, except for stuff like implementing
> > > optimized cpu waits. And in those we must always guarantee that the
> > > userspace process remains interruptible.
> > >
> > > It's a completely different world from dma_fence based kernel fences,
> > > whether those are implicit or explicit.
> > >
> > >> Do you have any concern with the deprecation/removal of BO fences in the
> > >> kernel assuming userspace is only using explicit fences? Any concern with
> > >> the submit and return fences for modesetting and other producer<->consumer
> > >> scenarios?
> > > Let me work on the full replay for your rfc first, because there's a lot
> > > of details here and nuance.
> > > -Daniel
> > >
> > >> Thanks,
> > >> Marek
> > >>
> > >> On Tue, Apr 20, 2021 at 6:34 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >>
> > >>> On Tue, Apr 20, 2021 at 12:15 PM Christian König
> > >>> <ckoenig.leichtzumerken@gmail.com> wrote:
> > >>>> Am 19.04.21 um 17:48 schrieb Jason Ekstrand:
> > >>>>> Not going to comment on everything on the first pass...
> > >>>>>
> > >>>>> On Mon, Apr 19, 2021 at 5:48 AM Marek Olšák <maraeo@gmail.com> wrote:
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> This is our initial proposal for explicit fences everywhere and new
> > >>> memory management that doesn't use BO fences. It's a redesign of how Linux
> > >>> graphics drivers work, and it can coexist with what we have now.
> > >>>>>>
> > >>>>>> 1. Introduction
> > >>>>>> (skip this if you are already sold on explicit fences)
> > >>>>>>
> > >>>>>> The current Linux graphics architecture was initially designed for
> > >>> GPUs with only one graphics queue where everything was executed in the
> > >>> submission order and per-BO fences were used for memory management and
> > >>> CPU-GPU synchronization, not GPU-GPU synchronization. Later, multiple
> > >>> queues were added on top, which required the introduction of implicit
> > >>> GPU-GPU synchronization between queues of different processes using per-BO
> > >>> fences. Recently, even parallel execution within one queue was enabled
> > >>> where a command buffer starts draws and compute shaders, but doesn't wait
> > >>> for them, enabling parallelism between back-to-back command buffers.
> > >>> Modesetting also uses per-BO fences for scheduling flips. Our GPU scheduler
> > >>> was created to enable all those use cases, and it's the only reason why the
> > >>> scheduler exists.
> > >>>>>> The GPU scheduler, implicit synchronization, BO-fence-based memory
> > >>> management, and the tracking of per-BO fences increase CPU overhead and
> > >>> latency, and reduce parallelism. There is a desire to replace all of them
> > >>> with something much simpler. Below is how we could do it.
> > >>>>>>
> > >>>>>> 2. Explicit synchronization for window systems and modesetting
> > >>>>>>
> > >>>>>> The producer is an application and the consumer is a compositor or a
> > >>> modesetting driver.
> > >>>>>> 2.1. The Present request
> > >>>>>>
> > >>>>>> As part of the Present request, the producer will pass 2 fences (sync
> > >>> objects) to the consumer alongside the presented DMABUF BO:
> > >>>>>> - The submit fence: Initially unsignalled, it will be signalled when
> > >>> the producer has finished drawing into the presented buffer.
> > >>>>>> - The return fence: Initially unsignalled, it will be signalled when
> > >>> the consumer has finished using the presented buffer.
> > >>>>> I'm not sure syncobj is what we want.  In the Intel world we're trying
> > >>>>> to go even further to something we're calling "userspace fences" which
> > >>>>> are a timeline implemented as a single 64-bit value in some
> > >>>>> CPU-mappable BO.  The client writes a higher value into the BO to
> > >>>>> signal the timeline.
> > >>>> Well that is exactly what our Windows guys have suggested as well, but
> > >>>> it strongly looks like that this isn't sufficient.
> > >>>>
> > >>>> First of all you run into security problems when any application can
> > >>>> just write any value to that memory location. Just imagine an
> > >>>> application sets the counter to zero and X waits forever for some
> > >>>> rendering to finish.
> > >>> The thing is, with userspace fences, preventing security boundary issues
> > >>> moves into userspace entirely. And it really doesn't matter whether
> > >>> the event you're waiting on doesn't complete because the other app
> > >>> crashed or was stupid or intentionally gave you a wrong fence point:
> > >>> You have to somehow handle that, e.g. perhaps with conditional
> > >>> rendering and just using the old frame in compositing if the new one
> > >>> doesn't show up in time. Or something like that. So trying to get the
> > >>> kernel involved but also not so much involved sounds like a bad design
> > >>> to me.
> > >>>
> > >>>> Additional to that in such a model you can't determine who is the guilty
> > >>>> queue in case of a hang and can't reset the synchronization primitives
> > >>>> in case of an error.
> > >>>>
> > >>>> Apart from that this is rather inefficient, e.g. we don't have any way
> > >>>> to prevent priority inversion when used as a synchronization mechanism
> > >>>> between different GPU queues.
> > >>> Yeah but you can't have it both ways. Either all the scheduling in the
> > >>> kernel and fence handling is a problem, or you actually want to
> > >>> schedule in the kernel. hw seems to definitely move towards the more
> > >>> stupid spinlock-in-hw model (and direct submit from userspace and all
> > >>> that), priority inversions be damned. I'm really not sure we should
> > >>> fight that - if it's really that inefficient then maybe hw will add
> > >>> support for waiting sync constructs in hardware, or at least be
> > >>> smarter about scheduling other stuff. E.g. on intel hw both the kernel
> > >>> scheduler and fw scheduler knows when you're spinning on a hw fence
> > >>> (whether userspace or kernel doesn't matter) and plugs in something
> > >>> else. Add in a bit of hw support to watch cachelines, and you have
> > >>> something which can handle both directions efficiently.
> > >>>
> > >>> Imo given where hw is going, we shouldn't try to be too clever here.
> > >>> The only thing we do need to provision is being able to do cpu side
> > >>> waits without spinning. And that should probably be done in a fairly
> > >>> gpu specific way still.
> > >>> -Daniel
> > >>>
> > >>>> Christian.
> > >>>>
> > >>>>>     The kernel then provides some helpers for
> > >>>>> waiting on them reliably and without spinning.  I don't expect
> > >>>>> everyone to support these right away but, If we're going to re-plumb
> > >>>>> userspace for explicit synchronization, I'd like to make sure we take
> > >>>>> this into account so we only have to do it once.
> > >>>>>
> > >>>>>
> > >>>>>> Deadlock mitigation to recover from segfaults:
> > >>>>>> - The kernel knows which process is obliged to signal which fence.
> > >>> This information is part of the Present request and supplied by userspace.
> > >>>>> This isn't clear to me.  Yes, if we're using anything dma-fence based
> > >>>>> like syncobj, this is true.  But it doesn't seem totally true as a
> > >>>>> general statement.
> > >>>>>
> > >>>>>
> > >>>>>> - If the producer crashes, the kernel signals the submit fence, so
> > >>> that the consumer can make forward progress.
> > >>>>>> - If the consumer crashes, the kernel signals the return fence, so
> > >>> that the producer can reclaim the buffer.
> > >>>>>> - A GPU hang signals all fences. Other deadlocks will be handled like
> > >>> GPU hangs.
> > >>>>> What do you mean by "all"?  All fences that were supposed to be
> > >>>>> signaled by the hung context?
> > >>>>>
> > >>>>>
> > >>>>>> Other window system requests can follow the same idea.
> > >>>>>>
> > >>>>>> Merged fences where one fence object contains multiple fences will be
> > >>> supported. A merged fence is signalled only when its fences are signalled.
> > >>> The consumer will have the option to redefine the unsignalled return fence
> > >>> to a merged fence.
> > >>>>>> 2.2. Modesetting
> > >>>>>>
> > >>>>>> Since a modesetting driver can also be the consumer, the present
> > >>> ioctl will contain a submit fence and a return fence too. One small problem
> > >>> with this is that userspace can hang the modesetting driver, but in theory,
> > >>> any later present ioctl can override the previous one, so the unsignalled
> > >>> presentation is never used.
> > >>>>>>
> > >>>>>> 3. New memory management
> > >>>>>>
> > >>>>>> The per-BO fences will be removed and the kernel will not know which
> > >>> buffers are busy. This will reduce CPU overhead and latency. The kernel
> > >>> will not need per-BO fences with explicit synchronization, so we just need
> > >>> to remove their last user: buffer evictions. It also resolves the current
> > >>> OOM deadlock.
> > >>>>> Is this even really possible?  I'm no kernel MM expert (trying to
> > >>>>> learn some) but my understanding is that the use of per-BO dma-fence
> > >>>>> runs deep.  I would like to stop using it for implicit synchronization
> > >>>>> to be sure, but I'm not sure I believe the claim that we can get rid
> > >>>>> of it entirely.  Happy to see someone try, though.
> > >>>>>
> > >>>>>
> > >>>>>> 3.1. Evictions
> > >>>>>>
> > >>>>>> If the kernel wants to move a buffer, it will have to wait for
> > >>> everything to go idle, halt all userspace command submissions, move the
> > >>> buffer, and resume everything. This is not expected to happen when memory
> > >>> is not exhausted. Other more efficient ways of synchronization are also
> > >>> possible (e.g. sync only one process), but are not discussed here.
> > >>>>>> 3.2. Per-process VRAM usage quota
> > >>>>>>
> > >>>>>> Each process can optionally and periodically query its VRAM usage
> > >>> quota and change domains of its buffers to obey that quota. For example, a
> > >>> process allocated 2 GB of buffers in VRAM, but the kernel decreased the
> > >>> quota to 1 GB. The process can change the domains of the least important
> > >>> buffers to GTT to get the best outcome for itself. If the process doesn't
> > >>> do it, the kernel will choose which buffers to evict at random. (thanks to
> > >>> Christian Koenig for this idea)
> > >>>>> This is going to be difficult.  On Intel, we have some resources that
> > >>>>> have to be pinned to VRAM and can't be dynamically swapped out by the
> > >>>>> kernel.  In GL, we probably can deal with it somewhat dynamically.  In
> > >>>>> Vulkan, we'll be entirely dependent on the application to use the
> > >>>>> appropriate Vulkan memory budget APIs.
> > >>>>>
> > >>>>> --Jason
> > >>>>>
> > >>>>>
> > >>>>>> 3.3. Buffer destruction without per-BO fences
> > >>>>>>
> > >>>>>> When the buffer destroy ioctl is called, an optional fence list can
> > >>> be passed to the kernel to indicate when it's safe to deallocate the
> > >>> buffer. If the fence list is empty, the buffer will be deallocated
> > >>> immediately. Shared buffers will be handled by merging fence lists from all
> > >>> processes that destroy them. Mitigation of malicious behavior:
> > >>>>>> - If userspace destroys a busy buffer, it will get a GPU page fault.
> > >>>>>> - If userspace sends fences that never signal, the kernel will have a
> > >>> timeout period and then will proceed to deallocate the buffer anyway.
> > >>>>>> 3.4. Other notes on MM
> > >>>>>>
> > >>>>>> Overcommitment of GPU-accessible memory will cause an allocation
> > >>> failure or invoke the OOM killer. Evictions to GPU-inaccessible memory
> > >>> might not be supported.
> > >>>>>> Kernel drivers could move to this new memory management today. Only
> > >>> buffer residency and evictions would stop using per-BO fences.
> > >>>>>>
> > >>>>>> 4. Deprecating implicit synchronization
> > >>>>>>
> > >>>>>> It can be phased out by introducing a new generation of hardware
> > >>> where the driver doesn't add support for it (like a driver fork would do),
> > >>> assuming userspace has all the changes for explicit synchronization. This
> > >>> could potentially create an isolated part of the kernel DRM where all
> > >>> drivers only support explicit synchronization.
> > >>>>>> Marek
> > >>>>>> _______________________________________________
> > >>>>>> dri-devel mailing list
> > >>>>>> dri-devel@lists.freedesktop.org
> > >>>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > >>>>> _______________________________________________
> > >>>>> mesa-dev mailing list
> > >>>>> mesa-dev@lists.freedesktop.org
> > >>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> > >>>
> > >>> --
> > >>> Daniel Vetter
> > >>> Software Engineer, Intel Corporation
> > >>> http://blog.ffwll.ch
> > >>>
> >
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 15:49         ` Daniel Stone
@ 2021-04-20 16:25           ` Marek Olšák
  2021-04-20 16:42             ` Jacob Lifshay
                               ` (2 more replies)
  0 siblings, 3 replies; 105+ messages in thread
From: Marek Olšák @ 2021-04-20 16:25 UTC (permalink / raw)
  To: Daniel Stone; +Cc: Christian König, dri-devel, ML Mesa-dev


[-- Attachment #1.1: Type: text/plain, Size: 4384 bytes --]

Daniel, imagine hardware that can only do what Windows does: future fences
signalled by userspace whenever userspace wants, and no kernel queues like
we have today.

The only reason why current AMD GPUs work is because they have a ring
buffer per queue with pointers to userspace command buffers followed by
fences. What will we do if that ring buffer is removed?
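
Very roughly (and not the real packet format), the thing that goes away looks
like this -- the kernel sees a fence after every userspace IB only because it
owns the ring that points at them:

#include <stdint.h>

struct ring_entry {
    uint64_t ib_gpu_va;     /* userspace-built command buffer */
    uint32_t ib_size_dw;
    uint64_t fence_gpu_va;  /* where the GPU writes the fence value */
    uint64_t fence_value;   /* what the kernel waits for / tracks */
};

/* One kernel-owned ring per queue. Remove it, and the kernel no longer has a
 * natural place to observe completion of userspace work. */
struct queue_ring {
    struct ring_entry entries[256];
    uint32_t          wptr, rptr;
};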

Marek

On Tue, Apr 20, 2021 at 11:50 AM Daniel Stone <daniel@fooishbar.org> wrote:

> Hi,
>
> On Tue, 20 Apr 2021 at 16:16, Christian König <
> ckoenig.leichtzumerken@gmail.com> wrote:
>
>> Am 20.04.21 um 17:07 schrieb Daniel Stone:
>>
>> If the compositor no longer has a guarantee that the buffer will be ready
>> for composition in a reasonable amount of time (which dma_fence gives us,
>> and this proposal does not appear to give us), then the compositor isn't
>> trying to use the buffer for compositing, it's waiting asynchronously on a
>> notification that the fence has signaled before it attempts to use the
>> buffer.
>>
>> Marek's initial suggestion is that the kernel signal the fence, which
>> would unblock composition (and presumably show garbage on screen, or at
>> best jump back to old content).
>>
>> My position is that the compositor will know the process has crashed
>> anyway - because its socket has been closed - at which point we destroy all
>> the client's resources including its windows and buffers regardless.
>> Signaling the fence doesn't give us any value here, _unless_ the compositor
>> is just blindly waiting for the fence to signal ... which it can't do
>> because there's no guarantee the fence will ever signal.
>>
>>
>> Yeah, but that assumes that the compositor has changed to not blindly wait
>> for the client to finish rendering, and as Daniel explained that is rather
>> unrealistic.
>>
>> What we need is a fallback mechanism which signals the fence after a
>> timeout and gives a penalty to the one causing the timeout.
>>
>> That gives us the same functionality we have today with the in software
>> scheduler inside the kernel.
>>
>
> OK, if that's the case then I think I'm really missing something which
> isn't explained in this thread, because I don't understand what the
> additional complexity and API change gains us (see my first reply in this
> thread).
>
> By way of example - say I have a blind-but-explicit compositor that takes
> a drm_syncobj along with a dmabuf with each client presentation request,
> but doesn't check syncobj completion, it just imports that into a
> VkSemaphore + VkImage and schedules work for the next frame.
>
> Currently, that generates an execbuf ioctl for the composition (ignore KMS
> for now) with a sync point to wait on, and the kernel+GPU scheduling
> guarantees that the composition work will not begin until the client
> rendering work has retired. We have a further guarantee that this work will
> complete in reasonable time, for some value of 'reasonable'.
>
> My understanding of this current proposal is that:
> * userspace creates a 'present fence' with this new ioctl
> * the fence becomes signaled when a value is written to a location in
> memory, which is visible through both CPU and GPU mappings of that page
> * this 'present fence' is imported as a VkSemaphore (?) and the userspace
> Vulkan driver will somehow wait on this value  either before submitting
> work or as a possibly-hardware-assisted GPU-side wait (?)
> * the kernel's scheduler is thus eliminated from the equation, and every
> execbuf is submitted directly to hardware, because either userspace knows
> that the fence has already been signaled, or it will issue a GPU-side wait
> (?)
> * but the kernel is still required to monitor completion of every fence
> itself, so it can forcibly complete, or penalise the client (?)
>
> Lastly, let's say we stop ignoring KMS: what happens for the
> render-with-GPU-display-on-KMS case? Do we need to do the equivalent of
> glFinish() in userspace and only submit the KMS atomic request when the GPU
> work has fully retired?
>
> Clarifying those points would be really helpful so this is less of a
> strawman. I have some further opinions, but I'm going to wait until I
> understand what I'm actually arguing against before I go too far. :) The
> last point is very salient though.
>
> Cheers,
> Daniel
>

[-- Attachment #1.2: Type: text/html, Size: 5691 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 16:25           ` Marek Olšák
@ 2021-04-20 16:42             ` Jacob Lifshay
  2021-04-20 18:03             ` Daniel Stone
  2021-04-20 18:39             ` Daniel Vetter
  2 siblings, 0 replies; 105+ messages in thread
From: Jacob Lifshay @ 2021-04-20 16:42 UTC (permalink / raw)
  To: Marek Olšák; +Cc: Christian König, dri-devel, ML Mesa-dev


[-- Attachment #1.1: Type: text/plain, Size: 660 bytes --]

On Tue, Apr 20, 2021, 09:25 Marek Olšák <maraeo@gmail.com> wrote:

> Daniel, imagine hardware that can only do what Windows does: future fences
> signalled by userspace whenever userspace wants, and no kernel queues like
> we have today.
>

Hmm, that sounds kinda like what we're trying to do for Libre-SOC's gpu
which is basically where the cpu (exactly the same cores as the gpu) runs a
user-space software renderer with extra instructions to make it go fast, so
the kernel only gets involved for futex-wait or for video scan-out. This
causes problems when figuring out how to interact with dma-fences for
interoperability...

Jacob Lifshay

[-- Attachment #1.2: Type: text/html, Size: 1087 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 15:45   ` Jason Ekstrand
@ 2021-04-20 17:44     ` Daniel Stone
  2021-04-20 18:00       ` [Mesa-dev] " Christian König
  2021-04-20 18:53       ` Daniel Vetter
  0 siblings, 2 replies; 105+ messages in thread
From: Daniel Stone @ 2021-04-20 17:44 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: ML Mesa-dev, dri-devel, Marek Olšák


[-- Attachment #1.1: Type: text/plain, Size: 11147 bytes --]

Hi,

On Tue, 20 Apr 2021 at 16:46, Jason Ekstrand <jason@jlekstrand.net> wrote:

> It's still early in the morning here and I'm not awake yet so sorry if
> this comes out in bits and pieces...
>

No problem, it's helpful. If I weren't on this thread I'd be attempting to
put together a 73-piece chest of drawers whose instructions are about as
clear as this so far, so I'm in the right head space anyway.


> IMO, there are two problems being solved here which are related in
> very subtle and tricky ways.  They're also, admittedly, driver
> problems, not really winsys problems.  Unfortunately, they may have
> winsys implications.
>

Yeah ... bingo.


> First, is better/real timelines for Vulkan and compute.  [...]
>
> We also want something like this for compute workloads.  [...]
>

Totally understand and agree with all of this. Memory fences seem like a
good and useful primitive here.


> The second biting issue is that, in the current kernel implementation
> of dma-fence and dma_resv, we've lumped internal synchronization for
> memory management together with execution synchronization for
> userspace dependency tracking.  And we have no way to tell the
> difference between the two internally.  Even if user space is passing
> around sync_files and trying to do explicit sync, once you get inside
> the kernel, they're all dma-fences and it can't tell the difference.
>

Funny, because 'lumped [the two] together' is exactly the crux of my issues
...


> If we move


Stop here, because ...


> to a more userspace-controlled synchronization model with
> wait-before-signal and no timeouts unless requested, regardless of the
> implementation, it plays really badly with dma-fence.  And, by "badly" I
> mean the two are nearly incompatible.


I would go further than that, and say completely, fundamentally,
conceptually, incompatible.


> From a user space PoV, it means
> it's tricky to provide the finite time dma-fence guarantee.  From a
> kernel PoV, it's way worse.  Currently, the way dma-fence is
> constructed, it's impossible to deadlock assuming everyone follows the
> rules.  The moment we allow user space to deadlock itself and allow
> those deadlocks to leak into the kernel, we have a problem.  Even if
> we throw in some timeouts, we still have a scenario where user space
> has one linearizable dependency graph for execution synchronization
> and the kernel has a different linearizable dependency graph for
> memory management and, when you smash them together, you may have
> cycles in your graph.
>
> So how do we sort this all out?  Good question.  It's a hard problem.
> Probably the hardest problem here is the second one: the intermixing
> of synchronization types.  Solving that one is likely going to require
> some user space re-plumbing because all the user space APIs we have
> for explicit sync are built on dma-fence.
>

Gotcha.

Firstly, let's stop, as you say, lumping things together. Timeline
semaphores and compute's GPU-side spinlocks etc, are one thing. I accept
those now have a hard requirement on something like memory fences, where
any responsibility is totally abrogated. So let's run with that in our
strawman: Vulkan compute & graphics & transfer queues all degenerate to
something spinning (hopefully GPU-assisted gentle spin) on a uint64
somewhere. The kernel has (in the general case) no visibility or
responsibility into these things. Fine - that's one side of the story.

But winsys is something _completely_ different. Yes, you're using the GPU
to do things with buffers A, B, and C to produce buffer Z. Yes, you're
using vkQueuePresentKHR to schedule that work. Yes, Mutter's composition
job might depend on a Chromium composition job which depends on GTA's
render job which depends on GTA's compute job which might take a year to
complete. Mutter's composition job needs to complete in 'reasonable'
(again, FSVO) time, no matter what. The two are compatible.

How? Don't lump them together. Isolate them aggressively, and _predictably_
in a way that you can reason about.

What clients do in their own process space is their own business. Games can
deadlock themselves if they get wait-before-signal wrong. Compute jobs can
run for a year. Their problem. Winsys is not that, because you're crossing
every isolation boundary possible. Process, user, container, VM - every
kind of privilege boundary. Thus far, dma_fence has protected us from the
most egregious abuses by guaranteeing bounded-time completion; it also acts
as a sequencing primitive, but from the perspective of a winsys person
that's of secondary importance, which is probably one of the bigger
disconnects between winsys people and GPU driver people.

Anyway, one of the great things about winsys (there are some! trust me) is
we don't need to be as hopelessly general as for game engines, nor as
hyperoptimised. We place strict demands on our clients, and we literally
kill them every single time they get something wrong in a way that's
visible to us. Our demands on the GPU are so embarrassingly simple that you
can run every modern desktop environment on GPUs which don't have unified
shaders. And on certain platforms who don't share tiling formats between
texture/render-target/scanout ... and it all still runs fast enough that
people don't complain.

We're happy to bear the pain of being the ones setting strict and
unreasonable expectations. To me, this 'present ioctl' falls into the
uncanny valley of the kernel trying to bear too much of the weight to be
tractable, whilst not bearing enough of the weight to be useful for winsys.

So here's my principles for a counter-strawman:

Remove the 'return fence'. Burn it with fire, do not look back. Modern
presentation pipelines are not necessarily 1:1, they are not necessarily
FIFO (as opposed to mailbox), and they are not necessarily round-robin
either. The current proposal provides no tangible benefits to modern
userspace, and fixing that requires either hobbling userspace to remove
capability and flexibility (ironic given that the motivation for this is
all about userspace flexibility?), or pushing so much complexity into the
kernel that we break it forever (you can't compile Mutter's per-frame
decision tree into eBPF).

Give us a primitive representing work completion, so we can keep
optimistically pipelining operations. We're happy to pass around
explicit-synchronisation tokens (dma_fence, drm_syncobj, drm_newthing,
whatever it is): plumbing through a sync token to synchronise compositor
operations against client operations in both directions is just a matter of
boring typing.

Make that primitive something that is every bit as usable across subsystems
as it is across processes. It should be a lowest common denominator for
middleware that ultimately provokes GPU execbuf, KMS commit, and media
codec ops; currently that would be both wait and signal for all
of VkSemaphore, EGLSyncKHR, KMS fence, V4L (D)QBUF, and VA-API {en,de}code
ops. It must be exportable to and importable from an FD, which can be
poll()ed on and read(). GPU-side visibility for late binding is nice, but
not at all essential.
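
To make the 'poll()able FD' point concrete, here's roughly what I mean,
as a sketch only - sync_file already behaves like this today (the FD
becomes readable once the fence has signalled), and whatever the new
primitive is should be no harder for middleware to consume:

#include <poll.h>
#include <stdbool.h>

/* Returns true if the fence behind fence_fd signalled within
 * timeout_ms, false on timeout or error. */
static bool wait_fence_fd(int fence_fd, int timeout_ms)
{
        struct pollfd pfd = {
                .fd = fence_fd,
                .events = POLLIN,
        };

        return poll(&pfd, 1, timeout_ms) == 1 && (pfd.revents & POLLIN);
}

That's the entire integration burden we're asking the rest of the stack
to carry.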

Make that primitive complete in 'reasonable' time, no matter what. There
will always be failures in extremis, no matter what the design: absent
hard-realtime principles from hardware all the way up to userspace,
something will always be able to fail somewhere: non-terminating GPU work,
actual GPU hang/reset, GPU queue DoSed, CPU scheduler, I/O DoSed. As long
as the general case is bounded-time completion, each of these can be
mitigated separately as long as userspace has enough visibility into the
underlying mechanics, and cares enough to take meaningful action on it.

And something more concrete:

dma_fence.

This already has all of the properties described above. Kernel-wise, it
already devolves to CPU-side signaling when it crosses device boundaries.
We need to support it roughly forever since it's been plumbed so far and so
wide. Any primitive which is acceptable for winsys-like usage which crosses
so many device/subsystem/process/security boundaries has to meet the same
requirements. So why reinvent something which looks so similar, and has the
same requirements of the kernel babysitting completion, providing little to
no benefit for that difference?

It's not usable for complex usecases, as we've established, but winsys is
not that usecase. We can draw a hard boundary between the two worlds. For
example, a client could submit an infinitely deep CS -> VS/FS/etc job chain
with potentially-infinite completion, with the FS output being passed to
the winsys for composition. Draw the line post-FS: export a dma_fence
against FS completion. But instead of this being based on monitoring the
_fence_ per se, base it on monitoring the job; if the final job doesn't
retire in reasonable time, signal the fence and signal (like, SIGKILL, or
just tear down the context and permanently -EIO, whatever) the client.
Maybe for future hardware that would be the same thing - the kernel setting
a timeout and comparing a read on a particular address against a particular
value - but the 'present fence' proposal seems like it requires exactly
this anyway.
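
Hand-waving what that job-side watchdog could look like on the driver
side - everything prefixed 'hypothetical' below is made up, only
dma_fence_signal()/dma_fence_is_signaled() and the delayed-work API are
real kernel interfaces, so treat it as a sketch of the shape rather than
a patch:

struct winsys_job {
        struct dma_fence *out_fence;    /* exported towards the winsys */
        struct delayed_work watchdog;
        struct hypothetical_ctx *ctx;
};

static void winsys_job_watchdog(struct work_struct *work)
{
        struct winsys_job *job =
                container_of(work, struct winsys_job, watchdog.work);

        if (dma_fence_is_signaled(job->out_fence))
                return;                   /* retired in time, all good */

        dma_fence_signal(job->out_fence); /* unblock the winsys */
        hypothetical_ctx_kill(job->ctx);  /* tear down, -EIO forever */
}

/* armed at submit time:
 *   INIT_DELAYED_WORK(&job->watchdog, winsys_job_watchdog);
 *   schedule_delayed_work(&job->watchdog, msecs_to_jiffies(timeout_ms));
 */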

That to me is the best compromise. We allow clients complete arbitrary
flexibility, but as soon as they vkQueuePresentKHR, they're crossing a
boundary out of happy fun GPU land and into strange hostile winsys land.
We've got a lot of practice at being the bad guys who hate users and are
always trying to ruin their dreams, so we'll happily wear the impact of
continuing to do that. In doing so, we collectively don't have to invent a
third new synchronisation primitive (to add to dma_fence and drm_syncobj)
and a third new synchronisation model (implicit sync, explicit-but-bounded
sync, explicit-and-maybe-unbounded sync) to support this, and we don't have
to do an NT4 where GDI was shoved into the kernel.

It doesn't help with the goal of ridding dma_fence from the kernel, but it
does very clearly segregate the two worlds. Drawing that hard boundary
would allow drivers to hyperoptimise for clients which want to be extremely
clever and agile and quick because they're sailing so close to the wind
that they cannot bear the overhead of dma_fence, whilst also providing the
guarantees we need when crossing isolation boundaries. In the latter case,
the overhead of bouncing into a less-optimised primitive is totally
acceptable because it's not even measurable: vkQueuePresentKHR requires
client CPU activity -> kernel IPC -> compositor CPU activity -> wait for
repaint cycle -> prepare scene -> composition, against which dma_fence
overhead isn't and will never be measurable (even if it doesn't cross
device/subsystem boundaries, which it probably does). And the converse for
vkAcquireNextImageKHR.

tl;dr: we don't need to move winsys into the kernel, winsys and compute
don't need to share sync primitives, the client/winsys boundary does need
to have a sync primitive with strong and onerous guarantees, and that
transition can be several orders of magnitude less efficient than
intra-client sync primitives

Shoot me down. :)

Cheers,
Daniel

[-- Attachment #1.2: Type: text/html, Size: 13382 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 17:44     ` Daniel Stone
@ 2021-04-20 18:00       ` Christian König
  2021-04-20 18:15         ` Daniel Stone
  2021-04-20 18:53       ` Daniel Vetter
  1 sibling, 1 reply; 105+ messages in thread
From: Christian König @ 2021-04-20 18:00 UTC (permalink / raw)
  To: Daniel Stone, Jason Ekstrand; +Cc: ML Mesa-dev, dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 12846 bytes --]



On 20.04.21 at 19:44, Daniel Stone wrote:
> Hi,
>
> On Tue, 20 Apr 2021 at 16:46, Jason Ekstrand <jason@jlekstrand.net> wrote:
>
>     It's still early in the morning here and I'm not awake yet so sorry if
>     this comes out in bits and pieces...
>
>
> No problem, it's helpful. If I weren't on this thread I'd be 
> attempting to put together a 73-piece chest of drawers whose 
> instructions are about as clear as this so far, so I'm in the right 
> head space anyway.
>
>     IMO, there are two problems being solved here which are related in
>     very subtle and tricky ways.  They're also, admittedly, driver
>     problems, not really winsys problems.  Unfortunately, they may have
>     winsys implications.
>
>
> Yeah ... bingo.
>
>     First, is better/real timelines for Vulkan and compute. [...]
>
>     We also want something like this for compute workloads. [...]
>
>
> Totally understand and agree with all of this. Memory fences seem like 
> a good and useful primitive here.

Completely agree.

>     The second biting issue is that, in the current kernel implementation
>     of dma-fence and dma_resv, we've lumped internal synchronization for
>     memory management together with execution synchronization for
>     userspace dependency tracking.  And we have no way to tell the
>     difference between the two internally.  Even if user space is passing
>     around sync_files and trying to do explicit sync, once you get inside
>     the kernel, they're all dma-fences and it can't tell the difference.
>
>
> Funny, because 'lumped [the two] together' is exactly the crux of my 
> issues ...
>
>     If we move
>
>
> Stop here, because ...
>
>     to a more userspace-controlled synchronization model with
>     wait-before-signal and no timeouts unless requested, regardless of the
>     implementation, it plays really badly with dma-fence.  And, by "badly" I
>     mean the two are nearly incompatible.
>
>
> I would go further than that, and say completely, fundamentally, 
> conceptually, incompatible.

+1

>     From a user space PoV, it means
>     it's tricky to provide the finite time dma-fence guarantee. From a
>     kernel PoV, it's way worse.  Currently, the way dma-fence is
>     constructed, it's impossible to deadlock assuming everyone follows the
>     rules.  The moment we allow user space to deadlock itself and allow
>     those deadlocks to leak into the kernel, we have a problem. Even if
>     we throw in some timeouts, we still have a scenario where user space
>     has one linearizable dependency graph for execution synchronization
>     and the kernel has a different linearizable dependency graph for
>     memory management and, when you smash them together, you may have
>     cycles in your graph.
>
>     So how do we sort this all out?  Good question.  It's a hard problem.
>     Probably the hardest problem here is the second one: the intermixing
>     of synchronization types.  Solving that one is likely going to require
>     some user space re-plumbing because all the user space APIs we have
>     for explicit sync are built on dma-fence.
>
>
> Gotcha.
>
> Firstly, let's stop, as you say, lumping things together. Timeline 
> semaphores and compute's GPU-side spinlocks etc, are one thing. I 
> accept those now have a hard requirement on something like memory 
> fences, where any responsibility is totally abrogated. So let's run 
> with that in our strawman: Vulkan compute & graphics & transfer queues 
> all degenerate to something spinning (hopefully GPU-assisted gentle 
> spin) on a uint64 somewhere. The kernel has (in the general case) no 
> visibility or responsibility into these things. Fine - that's one side 
> of the story.

Exactly, yes.

>
> But winsys is something _completely_ different. Yes, you're using the 
> GPU to do things with buffers A, B, and C to produce buffer Z. Yes, 
> you're using vkQueuePresentKHR to schedule that work. Yes, Mutter's 
> composition job might depend on a Chromium composition job which 
> depends on GTA's render job which depends on GTA's compute job which 
> might take a year to complete. Mutter's composition job needs to 
> complete in 'reasonable' (again, FSVO) time, no matter what. The two 
> are compatible.
>
> How? Don't lump them together. Isolate them aggressively, and 
> _predictably_ in a way that you can reason about.
>
> What clients do in their own process space is their own 
> business. Games can deadlock themselves if they get wait-before-signal 
> wrong. Compute jobs can run for a year. Their problem. Winsys is not 
> that, because you're crossing every isolation boundary possible. 
> Process, user, container, VM - every kind of privilege boundary. Thus 
> far, dma_fence has protected us from the most egregious abuses by 
> guaranteeing bounded-time completion; it also acts as a sequencing 
> primitive, but from the perspective of a winsys person that's of 
> secondary importance, which is probably one of the bigger disconnects 
> between winsys people and GPU driver people.

Finally somebody who understands me :)

Well the question then is how do we get winsys and your own process 
space together?

>
> Anyway, one of the great things about winsys (there are some! trust 
> me) is we don't need to be as hopelessly general as for game engines, 
> nor as hyperoptimised. We place strict demands on our clients, and we 
> literally kill them every single time they get something wrong in a 
> way that's visible to us. Our demands on the GPU are so embarrassingly 
> simple that you can run every modern desktop environment on GPUs which 
> don't have unified shaders. And on certain platforms who don't share 
> tiling formats between texture/render-target/scanout ... and it all 
> still runs fast enough that people don't complain.

Ignoring everything below since that is the display pipeline I'm not 
really interested in. My concern is how to get the buffer from the 
client to the server without allowing the client to get the server into 
trouble?

My thinking is still to use timeouts to acquire texture locks. E.g. when 
the compositor needs to access texture it grabs a lock and if that lock 
isn't available in less than 20ms whoever is holding it is killed hard 
and the lock given to the compositor.

It's perfectly fine if a process has a hung queue, but if it tries to 
send buffers which should be filled by that queue to the compositor it 
just gets a corrupted window content.
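
Just to make that concrete, the compositor side could look something
like this - all of the functions here are completely made up, it's only
meant to illustrate the idea:

int compositor_acquire_texture(struct texture *tex)
{
        /* Give whoever currently holds the lock at most 20ms. */
        int r = texture_lock_timeout(tex, 20 /* ms */);

        if (r == -ETIMEDOUT) {
                /* Holder loses: kill it hard and hand the lock
                 * over to the compositor. */
                texture_kill_lock_holder(tex);
                r = texture_lock_steal(tex);
        }
        return r;
}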

Regards,
Christian.

>
> We're happy to bear the pain of being the ones setting strict and 
> unreasonable expectations. To me, this 'present ioctl' falls into the 
> uncanny valley of the kernel trying to bear too much of the weight to 
> be tractable, whilst not bearing enough of the weight to be useful for 
> winsys.
>
> So here's my principles for a counter-strawman:
>
> Remove the 'return fence'. Burn it with fire, do not look back. Modern 
> presentation pipelines are not necessarily 1:1, they are not 
> necessarily FIFO (as opposed to mailbox), and they are not necessarily 
> round-robin either. The current proposal provides no tangible benefits 
> to modern userspace, and fixing that requires either hobbling 
> userspace to remove capability and flexibility (ironic given that the 
> motivation for this is all about userspace flexibility?), or pushing 
> so much complexity into the kernel that we break it forever (you can't 
> compile Mutter's per-frame decision tree into eBPF).
>
> Give us a primitive representing work completion, so we can keep 
> optimistically pipelining operations. We're happy to pass around 
> explicit-synchronisation tokens (dma_fence, drm_syncobj, drm_newthing, 
> whatever it is): plumbing through a sync token to synchronise 
> compositor operations against client operations in both directions is 
> just a matter of boring typing.
>
> Make that primitive something that is every bit as usable across 
> subsystems as it is across processes. It should be a lowest common 
> denominator for middleware that ultimately provokes GPU execbuf, KMS 
> commit, and media codec ops; currently that would be both wait and 
> signal for all of VkSemaphore, EGLSyncKHR, KMS fence, V4L (D)QBUF, and 
> VA-API {en,de}code ops. It must be exportable to and importable from 
> an FD, which can be poll()ed on and read(). GPU-side visibility for 
> late binding is nice, but not at all essential.
>
> Make that primitive complete in 'reasonable' time, no matter what. 
> There will always be failures in extremis, no matter what the design: 
> absent hard-realtime principles from hardware all the way up to 
> userspace, something will always be able to fail somewhere: 
> non-terminating GPU work, actual GPU hang/reset, GPU queue DoSed, CPU 
> scheduler, I/O DoSed. As long as the general case is bounded-time 
> completion, each of these can be mitigated separately as long as 
> userspace has enough visibility into the underlying mechanics, and 
> cares enough to take meaningful action on it.
>
> And something more concrete:
>
> dma_fence.
>
> This already has all of the properties described above. Kernel-wise, 
> it already devolves to CPU-side signaling when it crosses device 
> boundaries. We need to support it roughly forever since it's been 
> plumbed so far and so wide. Any primitive which is acceptable for 
> winsys-like usage which crosses so many 
> device/subsystem/process/security boundaries has to meet the same 
> requirements. So why reinvent something which looks so similar, and 
> has the same requirements of the kernel babysitting completion, 
> providing little to no benefit for that difference?
>
> It's not usable for complex usecases, as we've established, but winsys 
> is not that usecase. We can draw a hard boundary between the two 
> worlds. For example, a client could submit an infinitely deep CS -> 
> VS/FS/etc job chain with potentially-infinite completion, with the FS 
> output being passed to the winsys for composition. Draw the line 
> post-FS: export a dma_fence against FS completion. But instead of this 
> being based on monitoring the _fence_ per se, base it on monitoring 
> the job; if the final job doesn't retire in reasonable time, signal 
> the fence and signal (like, SIGKILL, or just tear down the context and 
> permanently -EIO, whatever) the client. Maybe for future hardware that 
> would be the same thing - the kernel setting a timeout and comparing a 
> read on a particular address against a particular value - but the 
> 'present fence' proposal seems like it requires exactly this anyway.
>
> That to me is the best compromise. We allow clients complete arbitrary 
> flexibility, but as soon as they vkQueuePresentKHR, they're crossing a 
> boundary out of happy fun GPU land and into strange hostile winsys 
> land. We've got a lot of practice at being the bad guys who hate users 
> and are always trying to ruin their dreams, so we'll happily wear the 
> impact of continuing to do that. In doing so, we collectively don't 
> have to invent a third new synchronisation primitive (to add to 
> dma_fence and drm_syncobj) and a third new synchronisation model 
> (implicit sync, explicit-but-bounded sync, 
> explicit-and-maybe-unbounded sync) to support this, and we don't have 
> to do an NT4 where GDI was shoved into the kernel.
>
> It doesn't help with the goal of ridding dma_fence from the kernel, 
> but it does very clearly segregate the two worlds. Drawing that hard 
> boundary would allow drivers to hyperoptimise for clients which want 
> to be extremely clever and agile and quick because they're sailing so 
> close to the wind that they cannot bear the overhead of dma_fence, 
> whilst also providing the guarantees we need when crossing isolation 
> boundaries. In the latter case, the overhead of bouncing into a 
> less-optimised primitive is totally acceptable because it's not even 
> measurable: vkQueuePresentKHR requires client CPU activity -> kernel 
> IPC -> compositor CPU activity -> wait for repaint cycle -> prepare 
> scene -> composition, against which dma_fence overhead isn't and will 
> never be measurable (even if it doesn't cross device/subsystem 
> boundaries, which it probably does). And the converse for 
> vkAcquireNextImageKHR.
>
> tl;dr: we don't need to move winsys into the kernel, winsys and 
> compute don't need to share sync primitives, the client/winsys 
> boundary does need to have a sync primitive with strong and 
> onerous guarantees, and that transition can be several orders of 
> magnitude less efficient than intra-client sync primitives
>
> Shoot me down. :)
>
> Cheers,
> Daniel
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev


[-- Attachment #1.2: Type: text/html, Size: 19964 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 16:25           ` Marek Olšák
  2021-04-20 16:42             ` Jacob Lifshay
@ 2021-04-20 18:03             ` Daniel Stone
  2021-04-20 18:39             ` Daniel Vetter
  2 siblings, 0 replies; 105+ messages in thread
From: Daniel Stone @ 2021-04-20 18:03 UTC (permalink / raw)
  To: Marek Olšák; +Cc: Christian König, dri-devel, ML Mesa-dev


[-- Attachment #1.1: Type: text/plain, Size: 2332 bytes --]

On Tue, 20 Apr 2021 at 17:25, Marek Olšák <maraeo@gmail.com> wrote:

> Daniel, imagine hardware that can only do what Windows does: future fences
> signalled by userspace whenever userspace wants, and no kernel queues like
> we have today.
>
> The only reason why current AMD GPUs work is because they have a ring
> buffer per queue with pointers to userspace command buffers followed by
> fences. What will we do if that ring buffer is removed?
>

I can totally imagine that; memory fences are clearly a reality and we need
to make them work for functionality as well as performance. Let's imagine
that winsys joins that flying-car future of totally arbitrary sync, that we
work only on memory fences and nothing else, and that this all happens by
the time we're all vaccinated and can go cram into a room with 8000
other people at FOSDEM instead of trying to do this over email.

But the first couple of sentences of your proposal has the kernel
monitoring those synchronisation points to ensure that they complete in
bounded time. That already _completely_ destroys the purity of the simple
picture you paint. Either there are no guarantees and userspace has to
figure it out, or there are guarantees and we have to compromise that
purity.

I understand how you arrived at your proposal from your perspective as an
extremely skilled driver developer who has delivered gigantic performance
improvements to real-world clients. As a winsys person with a very
different perspective, I disagree with you on where you are drawing the
boundaries, to the point that I think your initial proposal is worse than
useless; doing glFinish() or the VkFence equivalent in clients would be
better in most cases than the first mail.

I don't want to do glFinish (which I'm right about), and you don't want to
do dma_fence (which you're right about). So let's work together to find a
middle ground which we're both happy with. That middle ground does exist,
and we as winsys people are happy to eat a significant amount of pain to
arrive at that middle ground. Your current proposal is at once too gentle
on the winsys, and far too harsh on it. I only want to move where and how
those lines are drawn, not to pretend that all the world is still a
single-context FIFO execution engine.

Cheers,
Daniel

[-- Attachment #1.2: Type: text/html, Size: 2905 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 18:00       ` [Mesa-dev] " Christian König
@ 2021-04-20 18:15         ` Daniel Stone
  2021-04-20 19:03           ` Bas Nieuwenhuizen
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Stone @ 2021-04-20 18:15 UTC (permalink / raw)
  To: Christian König; +Cc: ML Mesa-dev, dri-devel, Jason Ekstrand


[-- Attachment #1.1: Type: text/plain, Size: 4999 bytes --]

On Tue, 20 Apr 2021 at 19:00, Christian König <
ckoenig.leichtzumerken@gmail.com> wrote:

> On 20.04.21 at 19:44, Daniel Stone wrote:
>
> But winsys is something _completely_ different. Yes, you're using the GPU
> to do things with buffers A, B, and C to produce buffer Z. Yes, you're
> using vkQueuePresentKHR to schedule that work. Yes, Mutter's composition
> job might depend on a Chromium composition job which depends on GTA's
> render job which depends on GTA's compute job which might take a year to
> complete. Mutter's composition job needs to complete in 'reasonable'
> (again, FSVO) time, no matter what. The two are compatible.
>
> How? Don't lump them together. Isolate them aggressively, and
> _predictably_ in a way that you can reason about.
>
> What clients do in their own process space is their own business. Games
> can deadlock themselves if they get wait-before-signal wrong. Compute jobs
> can run for a year. Their problem. Winsys is not that, because you're
> crossing every isolation boundary possible. Process, user, container, VM -
> every kind of privilege boundary. Thus far, dma_fence has protected us from
> the most egregious abuses by guaranteeing bounded-time completion; it also
> acts as a sequencing primitive, but from the perspective of a winsys person
> that's of secondary importance, which is probably one of the bigger
> disconnects between winsys people and GPU driver people.
>
>
> Finally somebody who understands me :)
>
> Well the question then is how do we get winsys and your own process space
> together?
>

It's a jarring transition. If you take a very narrow view and say 'it's all
GPU work in shared buffers so it should all work the same', then
client<->winsys looks the same as client<->client gbuffer. But this is a
trap.

Just because you can mmap() a file on an NFS server in New Zealand doesn't
mean that you should have the same expectations of memory access to that
file as you do to of a pointer from alloca(). Even if the primitives look
the same, you are crossing significant boundaries, and those do not come
without a compromise and a penalty.


> Anyway, one of the great things about winsys (there are some! trust me) is
> we don't need to be as hopelessly general as for game engines, nor as
> hyperoptimised. We place strict demands on our clients, and we literally
> kill them every single time they get something wrong in a way that's
> visible to us. Our demands on the GPU are so embarrassingly simple that you
> can run every modern desktop environment on GPUs which don't have unified
> shaders. And on certain platforms who don't share tiling formats between
> texture/render-target/scanout ... and it all still runs fast enough that
> people don't complain.
>
>
> Ignoring everything below since that is the display pipeline I'm not
> really interested in. My concern is how to get the buffer from the client
> to the server without allowing the client to get the server into trouble?
>
> My thinking is still to use timeouts to acquire texture locks. E.g. when
> the compositor needs to access texture it grabs a lock and if that lock
> isn't available in less than 20ms whoever is holding it is killed hard and
> the lock given to the compositor.
>
> It's perfectly fine if a process has a hung queue, but if it tries to send
> buffers which should be filled by that queue to the compositor it just gets
> a corrupted window content.
>

Kill the client hard. If the compositor has speculatively queued sampling
against rendering which never completed, let it access garbage. You'll have
one frame of garbage (outdated content, all black, random pattern; the
failure mode is equally imperfect, because there is no perfect answer),
then the compositor will notice the client has disappeared and remove all
its resources.

It's not possible to completely prevent this situation if the compositor
wants to speculatively pipeline work, only ameliorate it. From a
system-global point of view, just expose the situation and let it bubble
up. Watch the number of fences which failed to retire in time, and destroy
the context if there are enough of them (maybe 1, maybe 100). Watch the
number of contexts in the file description which get forcibly destroyed, and
destroy the file description if there are enough of them. Watch the number of
the file description if there are enough of them. Watch the number of
descriptions which get forcibly destroyed, and destroy the process if there
are enough of them. Watch the number of processes in a cgroup/pidns which
get forcibly destroyed, and destroy the ... etc. Whether it's the DRM
driver or an external monitor such as systemd/Flatpak/podman/Docker doing
that is pretty immaterial, as long as the concept of failure bubbling up
remains.
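
In code-shaped pseudo-policy (none of these structures, helpers or
thresholds exist anywhere, it's purely the 'failure bubbles up' shape):

struct blame {
        unsigned int fence_timeouts;   /* per context */
        unsigned int dead_contexts;    /* per file description */
        unsigned int dead_files;       /* per process */
};

static void fence_failed_to_retire(struct blame *b)
{
        if (++b->fence_timeouts >= FENCE_TIMEOUT_LIMIT)  /* 1? 100? */
                destroy_context(b);
}

static void context_force_destroyed(struct blame *b)
{
        if (++b->dead_contexts >= CONTEXT_LIMIT)
                destroy_file_description(b);
}

/* ... and so on up through process and cgroup/pidns, whether that
 * lives in the DRM driver or in an external monitor. */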

(20ms is objectively the wrong answer FWIW, because we're not a hard RTOS.
But if our biggest point of disagreement is 20 vs. 200 vs. 2000 vs. 20000
ms, then this thread has been a huge success!)

Cheers,
Daniel

[-- Attachment #1.2: Type: text/html, Size: 6539 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 16:25           ` Marek Olšák
  2021-04-20 16:42             ` Jacob Lifshay
  2021-04-20 18:03             ` Daniel Stone
@ 2021-04-20 18:39             ` Daniel Vetter
  2021-04-20 19:20               ` Marek Olšák
  2 siblings, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-04-20 18:39 UTC (permalink / raw)
  To: Marek Olšák; +Cc: Christian König, dri-devel, ML Mesa-dev

On Tue, Apr 20, 2021 at 6:25 PM Marek Olšák <maraeo@gmail.com> wrote:
>
> Daniel, imagine hardware that can only do what Windows does: future fences signalled by userspace whenever userspace wants, and no kernel queues like we have today.
>
> The only reason why current AMD GPUs work is because they have a ring buffer per queue with pointers to userspace command buffers followed by fences. What will we do if that ring buffer is removed?

Well this is an entirely different problem than what you set out to
describe. This is essentially the problem where hw does not have any
support for priviledged commands and separate priviledges command
buffer, and direct userspace submit is the only thing that is
available.

I think if this is your problem, then you get to implement some very
interesting compat shim. But that's an entirely different problem from
what you've described in your mail. This pretty much assumes at the hw
level the only thing that works is ATS/pasid, and vram is managed with
HMM exclusively. Once you have that pure driver stack you get to fake
it in the kernel for compat with everything that exists already. How
exactly that will look and how exactly you best construct your
dma_fences for compat will depend highly upon how much is still there
in this hw (e.g. wrt interrupt generation). A lot of the
infrastructure was also done as part of drm_syncobj. I mean we have
entirely fake kernel drivers like vgem/vkms that create dma_fence, so
a hw ringbuffer is really not required.
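
Rough sketch of what such a purely software fence looks like (vgem
style) - the ops/struct names here are made up, only the dma_fence core
calls are the real API:

static const char *sw_fence_name(struct dma_fence *f)
{
        return "sw-fence";
}

static const struct dma_fence_ops sw_fence_ops = {
        .get_driver_name = sw_fence_name,
        .get_timeline_name = sw_fence_name,
};

struct sw_fence {
        struct dma_fence base;
        spinlock_t lock;
};

static struct dma_fence *sw_fence_create(u64 context, u64 seqno)
{
        struct sw_fence *f = kzalloc(sizeof(*f), GFP_KERNEL);

        if (!f)
                return NULL;
        spin_lock_init(&f->lock);
        dma_fence_init(&f->base, &sw_fence_ops, &f->lock, context, seqno);
        return &f->base;
}

/* whoever completes the work - irq, timer, compat shim - then just
 * calls dma_fence_signal() on it */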

So ... is this your problem underneath it all, or was that more a wild
strawman for the discussion?
-Daniel


> Marek
>
> On Tue, Apr 20, 2021 at 11:50 AM Daniel Stone <daniel@fooishbar.org> wrote:
>>
>> Hi,
>>
>> On Tue, 20 Apr 2021 at 16:16, Christian König <ckoenig.leichtzumerken@gmail.com> wrote:
>>>
>>> On 20.04.21 at 17:07, Daniel Stone wrote:
>>>
>>> If the compositor no longer has a guarantee that the buffer will be ready for composition in a reasonable amount of time (which dma_fence gives us, and this proposal does not appear to give us), then the compositor isn't trying to use the buffer for compositing, it's waiting asynchronously on a notification that the fence has signaled before it attempts to use the buffer.
>>>
>>> Marek's initial suggestion is that the kernel signal the fence, which would unblock composition (and presumably show garbage on screen, or at best jump back to old content).
>>>
>>> My position is that the compositor will know the process has crashed anyway - because its socket has been closed - at which point we destroy all the client's resources including its windows and buffers regardless. Signaling the fence doesn't give us any value here, _unless_ the compositor is just blindly waiting for the fence to signal ... which it can't do because there's no guarantee the fence will ever signal.
>>>
>>>
>>> Yeah, but that assumes that the compositor has changed to not blindly wait for the client to finish rendering, and as Daniel explained that is rather unrealistic.
>>>
>>> What we need is a fallback mechanism which signals the fence after a timeout and gives a penalty to the one causing the timeout.
>>>
>>> That gives us the same functionality we have today with the in software scheduler inside the kernel.
>>
>>
>> OK, if that's the case then I think I'm really missing something which isn't explained in this thread, because I don't understand what the additional complexity and API change gains us (see my first reply in this thread).
>>
>> By way of example - say I have a blind-but-explicit compositor that takes a drm_syncobj along with a dmabuf with each client presentation request, but doesn't check syncobj completion, it just imports that into a VkSemaphore + VkImage and schedules work for the next frame.
>>
>> Currently, that generates an execbuf ioctl for the composition (ignore KMS for now) with a sync point to wait on, and the kernel+GPU scheduling guarantees that the composition work will not begin until the client rendering work has retired. We have a further guarantee that this work will complete in reasonable time, for some value of 'reasonable'.
>>
>> My understanding of this current proposal is that:
>> * userspace creates a 'present fence' with this new ioctl
>> * the fence becomes signaled when a value is written to a location in memory, which is visible through both CPU and GPU mappings of that page
>> * this 'present fence' is imported as a VkSemaphore (?) and the userspace Vulkan driver will somehow wait on this value  either before submitting work or as a possibly-hardware-assisted GPU-side wait (?)
>> * the kernel's scheduler is thus eliminated from the equation, and every execbuf is submitted directly to hardware, because either userspace knows that the fence has already been signaled, or it will issue a GPU-side wait (?)
>> * but the kernel is still required to monitor completion of every fence itself, so it can forcibly complete, or penalise the client (?)
>>
>> Lastly, let's say we stop ignoring KMS: what happens for the render-with-GPU-display-on-KMS case? Do we need to do the equivalent of glFinish() in userspace and only submit the KMS atomic request when the GPU work has fully retired?
>>
>> Clarifying those points would be really helpful so this is less of a strawman. I have some further opinions, but I'm going to wait until I understand what I'm actually arguing against before I go too far. :) The last point is very salient though.
>>
>> Cheers,
>> Daniel
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 17:44     ` Daniel Stone
  2021-04-20 18:00       ` [Mesa-dev] " Christian König
@ 2021-04-20 18:53       ` Daniel Vetter
  2021-04-20 19:14         ` Daniel Stone
  2021-04-20 19:16         ` Jason Ekstrand
  1 sibling, 2 replies; 105+ messages in thread
From: Daniel Vetter @ 2021-04-20 18:53 UTC (permalink / raw)
  To: Daniel Stone; +Cc: ML Mesa-dev, dri-devel, Jason Ekstrand

On Tue, Apr 20, 2021 at 7:45 PM Daniel Stone <daniel@fooishbar.org> wrote:
>
> Hi,
>
> On Tue, 20 Apr 2021 at 16:46, Jason Ekstrand <jason@jlekstrand.net> wrote:
>>
>> It's still early in the morning here and I'm not awake yet so sorry if
>> this comes out in bits and pieces...
>
>
> No problem, it's helpful. If I weren't on this thread I'd be attempting to put together a 73-piece chest of drawers whose instructions are about as clear as this so far, so I'm in the right head space anyway.
>
>>
>> IMO, there are two problems being solved here which are related in
>> very subtle and tricky ways.  They're also, admittedly, driver
>> problems, not really winsys problems.  Unfortunately, they may have
>> winsys implications.
>
>
> Yeah ... bingo.
>
>>
>> First, is better/real timelines for Vulkan and compute.  [...]
>>
>> We also want something like this for compute workloads.  [...]
>
>
> Totally understand and agree with all of this. Memory fences seem like a good and useful primitive here.
>
>>
>> The second biting issue is that, in the current kernel implementation
>> of dma-fence and dma_resv, we've lumped internal synchronization for
>> memory management together with execution synchronization for
>> userspace dependency tracking.  And we have no way to tell the
>> difference between the two internally.  Even if user space is passing
>> around sync_files and trying to do explicit sync, once you get inside
>> the kernel, they're all dma-fences and it can't tell the difference.
>
>
> Funny, because 'lumped [the two] together' is exactly the crux of my issues ...
>
>>
>> If we move
>
>
> Stop here, because ...
>
>>
>> to a more userspace-controlled synchronization model with
>> wait-before-signal and no timeouts unless requested, regardless of the
>> implementation, it plays really badly with dma-fence.  And, by "badly" I
>> mean the two are nearly incompatible.
>
>
> I would go further than that, and say completely, fundamentally, conceptually, incompatible.
>
>>
>> From a user space PoV, it means
>> it's tricky to provide the finite time dma-fence guarantee.  From a
>> kernel PoV, it's way worse.  Currently, the way dma-fence is
>> constructed, it's impossible to deadlock assuming everyone follows the
>> rules.  The moment we allow user space to deadlock itself and allow
>> those deadlocks to leak into the kernel, we have a problem.  Even if
>> we throw in some timeouts, we still have a scenario where user space
>> has one linearizable dependency graph for execution synchronization
>> and the kernel has a different linearizable dependency graph for
>> memory management and, when you smash them together, you may have
>> cycles in your graph.
>>
>> So how do we sort this all out?  Good question.  It's a hard problem.
>> Probably the hardest problem here is the second one: the intermixing
>> of synchronization types.  Solving that one is likely going to require
>> some user space re-plumbing because all the user space APIs we have
>> for explicit sync are built on dma-fence.
>
>
> Gotcha.
>
> Firstly, let's stop, as you say, lumping things together. Timeline semaphores and compute's GPU-side spinlocks etc, are one thing. I accept those now have a hard requirement on something like memory fences, where any responsibility is totally abrogated. So let's run with that in our strawman: Vulkan compute & graphics & transfer queues all degenerate to something spinning (hopefully GPU-assisted gentle spin) on a uint64 somewhere. The kernel has (in the general case) no visibility or responsibility into these things. Fine - that's one side of the story.
>
> But winsys is something _completely_ different. Yes, you're using the GPU to do things with buffers A, B, and C to produce buffer Z. Yes, you're using vkQueuePresentKHR to schedule that work. Yes, Mutter's composition job might depend on a Chromium composition job which depends on GTA's render job which depends on GTA's compute job which might take a year to complete. Mutter's composition job needs to complete in 'reasonable' (again, FSVO) time, no matter what. The two are compatible.
>
> How? Don't lump them together. Isolate them aggressively, and _predictably_ in a way that you can reason about.
>
> What clients do in their own process space is their own business. Games can deadlock themselves if they get wait-before-signal wrong. Compute jobs can run for a year. Their problem. Winsys is not that, because you're crossing every isolation boundary possible. Process, user, container, VM - every kind of privilege boundary. Thus far, dma_fence has protected us from the most egregious abuses by guaranteeing bounded-time completion; it also acts as a sequencing primitive, but from the perspective of a winsys person that's of secondary importance, which is probably one of the bigger disconnects between winsys people and GPU driver people.
>
> Anyway, one of the great things about winsys (there are some! trust me) is we don't need to be as hopelessly general as for game engines, nor as hyperoptimised. We place strict demands on our clients, and we literally kill them every single time they get something wrong in a way that's visible to us. Our demands on the GPU are so embarrassingly simple that you can run every modern desktop environment on GPUs which don't have unified shaders. And on certain platforms who don't share tiling formats between texture/render-target/scanout ... and it all still runs fast enough that people don't complain.
>
> We're happy to bear the pain of being the ones setting strict and unreasonable expectations. To me, this 'present ioctl' falls into the uncanny valley of the kernel trying to bear too much of the weight to be tractable, whilst not bearing enough of the weight to be useful for winsys.
>
> So here's my principles for a counter-strawman:
>
> Remove the 'return fence'. Burn it with fire, do not look back. Modern presentation pipelines are not necessarily 1:1, they are not necessarily FIFO (as opposed to mailbox), and they are not necessarily round-robin either. The current proposal provides no tangible benefits to modern userspace, and fixing that requires either hobbling userspace to remove capability and flexibility (ironic given that the motivation for this is all about userspace flexibility?), or pushing so much complexity into the kernel that we break it forever (you can't compile Mutter's per-frame decision tree into eBPF).
>
> Give us a primitive representing work completion, so we can keep optimistically pipelining operations. We're happy to pass around explicit-synchronisation tokens (dma_fence, drm_syncobj, drm_newthing, whatever it is): plumbing through a sync token to synchronise compositor operations against client operations in both directions is just a matter of boring typing.
>
> Make that primitive something that is every bit as usable across subsystems as it is across processes. It should be a lowest common denominator for middleware that ultimately provokes GPU execbuf, KMS commit, and media codec ops; currently that would be both wait and signal for all of VkSemaphore, EGLSyncKHR, KMS fence, V4L (D)QBUF, and VA-API {en,de}code ops. It must be exportable to and importable from an FD, which can be poll()ed on and read(). GPU-side visibility for late binding is nice, but not at all essential.
>
> Make that primitive complete in 'reasonable' time, no matter what. There will always be failures in extremis, no matter what the design: absent hard-realtime principles from hardware all the way up to userspace, something will always be able to fail somewhere: non-terminating GPU work, actual GPU hang/reset, GPU queue DoSed, CPU scheduler, I/O DoSed. As long as the general case is bounded-time completion, each of these can be mitigated separately as long as userspace has enough visibility into the underlying mechanics, and cares enough to take meaningful action on it.
>
> And something more concrete:
>
> dma_fence.
>
> This already has all of the properties described above. Kernel-wise, it already devolves to CPU-side signaling when it crosses device boundaries. We need to support it roughly forever since it's been plumbed so far and so wide. Any primitive which is acceptable for winsys-like usage which crosses so many device/subsystem/process/security boundaries has to meet the same requirements. So why reinvent something which looks so similar, and has the same requirements of the kernel babysitting completion, providing little to no benefit for that difference?
>
> It's not usable for complex usecases, as we've established, but winsys is not that usecase. We can draw a hard boundary between the two worlds. For example, a client could submit an infinitely deep CS -> VS/FS/etc job chain with potentially-infinite completion, with the FS output being passed to the winsys for composition. Draw the line post-FS: export a dma_fence against FS completion. But instead of this being based on monitoring the _fence_ per se, base it on monitoring the job; if the final job doesn't retire in reasonable time, signal the fence and signal (like, SIGKILL, or just tear down the context and permanently -EIO, whatever) the client. Maybe for future hardware that would be the same thing - the kernel setting a timeout and comparing a read on a particular address against a particular value - but the 'present fence' proposal seems like it requires exactly this anyway.

Yeah return fence for flips/presents sounds unappealing. Android tried
it, we convinced them it's not great and they changed that.

> That to me is the best compromise. We allow clients complete arbitrary flexibility, but as soon as they vkQueuePresentKHR, they're crossing a boundary out of happy fun GPU land and into strange hostile winsys land. We've got a lot of practice at being the bad guys who hate users and are always trying to ruin their dreams, so we'll happily wear the impact of continuing to do that. In doing so, we collectively don't have to invent a third new synchronisation primitive (to add to dma_fence and drm_syncobj) and a third new synchronisation model (implicit sync, explicit-but-bounded sync, explicit-and-maybe-unbounded sync) to support this, and we don't have to do an NT4 where GDI was shoved into the kernel.
>
> It doesn't help with the goal of ridding dma_fence from the kernel, but it does very clearly segregate the two worlds. Drawing that hard boundary would allow drivers to hyperoptimise for clients which want to be extremely clever and agile and quick because they're sailing so close to the wind that they cannot bear the overhead of dma_fence, whilst also providing the guarantees we need when crossing isolation boundaries. In the latter case, the overhead of bouncing into a less-optimised primitive is totally acceptable because it's not even measurable: vkQueuePresentKHR requires client CPU activity -> kernel IPC -> compositor CPU activity -> wait for repaint cycle -> prepare scene -> composition, against which dma_fence overhead isn't and will never be measurable (even if it doesn't cross device/subsystem boundaries, which it probably does). And the converse for vkAcquireNextImageKHR.
>
> tl;dr: we don't need to move winsys into the kernel, winsys and compute don't need to share sync primitives, the client/winsys boundary does need to have a sync primitive with strong and onerous guarantees, and that transition can be several orders of magnitude less efficient than intra-client sync primitives
>
> Shoot me down. :)

So I can mostly get behind this, except it's _not_ going to be
dma_fence. That thing has horrendous internal ordering constraints
within the kernel, and the one thing it doesn't allow you to do is make
a dma_fence depend upon a userspace fence.

But what we can do is use the same currently existing container
objects like drm_syncobj or sync_file (timeline syncobj would fit best
tbh), and stuff a userspace fence behind it. The only trouble is that
currently timeline syncobj implement vulkan's spec, which means if you
build a wait-before-signal deadlock, you'll wait forever. Well until
the user ragequits and kills your process.

So for winsys we'd need to be able to specify the wait timeout
somewhere for waiting for that dma_fence to materialize (plus the
submit thread, but userspace needs that anyway to support timeline
syncobj) if you're importing an untrusted timeline syncobj. And I
think that's roughly it.
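
Compositor-side this doesn't even need new uapi to sketch out,
something like the below (the timeout budget is pure policy and made
up here, but the libdrm call and the WAIT_FOR_SUBMIT flag already
exist today):

#include <time.h>
#include <stdbool.h>
#include <stdint.h>
#include <xf86drm.h>

static bool wait_untrusted_point(int drm_fd, uint32_t syncobj,
                                 uint64_t point, int64_t budget_ns)
{
        struct timespec now;
        int64_t deadline;
        uint32_t first;

        /* syncobj waits take an absolute CLOCK_MONOTONIC deadline */
        clock_gettime(CLOCK_MONOTONIC, &now);
        deadline = (int64_t)now.tv_sec * 1000000000ll +
                   now.tv_nsec + budget_ns;

        /* WAIT_FOR_SUBMIT also waits for the fence to materialize,
         * which is exactly where an untrusted client can stall us. */
        return drmSyncobjTimelineWait(drm_fd, &syncobj, &point, 1,
                                      deadline,
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                                      &first) == 0;
}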

The fancy version would allow you to access the underlying memory
fence from the cmd streamer and do fancy conditional rendering and fun
stuff like that (pick old/new frame depending which one is ready), but
that's the fancy advanced compositor on top here. The "give me the
same thing as I get with dma_fence implicit sync today" would just
need the timeout for importing untrusted timeline syncobj.

So a vk extension, and also probably a gl extension for timeline
syncobj (not sure that exists already), which probably wants to
specify the reasonable timeout limit by default. Because that's more
the gl way of doing things.

Oh also I really don't want to support this for implicit sync, but
heck we could even do that. It would stall pretty bad because there's
no submit thread in userspace. But we could then optimize that with
some new dma-buf ioctl to get out the syncobj, kinda like what Jason
has already proposed for sync_file or so. And then userspace which has
a submit thread could handle it correctly.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 18:15         ` Daniel Stone
@ 2021-04-20 19:03           ` Bas Nieuwenhuizen
  2021-04-20 19:18             ` Daniel Stone
  0 siblings, 1 reply; 105+ messages in thread
From: Bas Nieuwenhuizen @ 2021-04-20 19:03 UTC (permalink / raw)
  To: Daniel Stone; +Cc: Christian König, dri-devel, ML Mesa-dev


[-- Attachment #1.1: Type: text/plain, Size: 5896 bytes --]

On Tue, Apr 20, 2021 at 8:16 PM Daniel Stone <daniel@fooishbar.org> wrote:

> On Tue, 20 Apr 2021 at 19:00, Christian König <
> ckoenig.leichtzumerken@gmail.com> wrote:
>
>> On 20.04.21 at 19:44, Daniel Stone wrote:
>>
>> But winsys is something _completely_ different. Yes, you're using the GPU
>> to do things with buffers A, B, and C to produce buffer Z. Yes, you're
>> using vkQueuePresentKHR to schedule that work. Yes, Mutter's composition
>> job might depend on a Chromium composition job which depends on GTA's
>> render job which depends on GTA's compute job which might take a year to
>> complete. Mutter's composition job needs to complete in 'reasonable'
>> (again, FSVO) time, no matter what. The two are compatible.
>>
>> How? Don't lump them together. Isolate them aggressively, and
>> _predictably_ in a way that you can reason about.
>>
>> What clients do in their own process space is their own business. Games
>> can deadlock themselves if they get wait-before-signal wrong. Compute jobs
>> can run for a year. Their problem. Winsys is not that, because you're
>> crossing every isolation boundary possible. Process, user, container, VM -
>> every kind of privilege boundary. Thus far, dma_fence has protected us from
>> the most egregious abuses by guaranteeing bounded-time completion; it also
>> acts as a sequencing primitive, but from the perspective of a winsys person
>> that's of secondary importance, which is probably one of the bigger
>> disconnects between winsys people and GPU driver people.
>>
>>
>> Finally somebody who understands me :)
>>
>> Well the question then is how do we get winsys and your own process space
>> together?
>>
>
> It's a jarring transition. If you take a very narrow view and say 'it's
> all GPU work in shared buffers so it should all work the same', then
> client<->winsys looks the same as client<->client gbuffer. But this is a
> trap.
>

I think this is where we have a serious gap in what a winsys
or a compositor is. Like if you have only a single wayland server running
on a physical machine this is easy. But add a VR compositor, an
intermediate compositor (say gamescope), Xwayland and some containers/VM,
some video capture  (or, gasp, a browser that doubles as compositor) and
this story gets seriously complicated. Like who are you protecting from
who? at what point is something client<->winsys vs. client<->client?



> Just because you can mmap() a file on an NFS server in New Zealand doesn't
> mean that you should have the same expectations of memory access to that
> file as you do to of a pointer from alloca(). Even if the primitives look
> the same, you are crossing significant boundaries, and those do not come
> without a compromise and a penalty.
>
>
>> Anyway, one of the great things about winsys (there are some! trust me)
>> is we don't need to be as hopelessly general as for game engines, nor as
>> hyperoptimised. We place strict demands on our clients, and we literally
>> kill them every single time they get something wrong in a way that's
>> visible to us. Our demands on the GPU are so embarrassingly simple that you
>> can run every modern desktop environment on GPUs which don't have unified
>> shaders. And on certain platforms who don't share tiling formats between
>> texture/render-target/scanout ... and it all still runs fast enough that
>> people don't complain.
>>
>>
>> Ignoring everything below since that is the display pipeline I'm not
>> really interested in. My concern is how to get the buffer from the client
>> to the server without allowing the client to get the server into trouble?
>>
>> My thinking is still to use timeouts to acquire texture locks. E.g. when
>> the compositor needs to access texture it grabs a lock and if that lock
>> isn't available in less than 20ms whoever is holding it is killed hard and
>> the lock given to the compositor.
>>
>> It's perfectly fine if a process has a hung queue, but if it tries to
>> send buffers which should be filled by that queue to the compositor it just
>> gets a corrupted window content.
>>
>
> Kill the client hard. If the compositor has speculatively queued sampling
> against rendering which never completed, let it access garbage. You'll have
> one frame of garbage (outdated content, all black, random pattern; the
> failure mode is equally imperfect, because there is no perfect answer),
> then the compositor will notice the client has disappeared and remove all
> its resources.
>
> It's not possible to completely prevent this situation if the compositor
> wants to speculatively pipeline work, only ameliorate it. From a
> system-global point of view, just expose the situation and let it bubble
> up. Watch the number of fences which failed to retire in time, and destroy
> the context if there are enough of them (maybe 1, maybe 100). Watch the
> number of contexts on the file description which get forcibly destroyed, and destroy
> the file description if there are enough of them. Watch the number of
> descriptions which get forcibly destroyed, and destroy the process if there
> are enough of them. Watch the number of processes in a cgroup/pidns which
> get forcibly destroyed, and destroy the ... etc. Whether it's the DRM
> driver or an external monitor such as systemd/Flatpak/podman/Docker doing
> that is pretty immaterial, as long as the concept of failure bubbling up
> remains.
>
> (20ms is objectively the wrong answer FWIW, because we're not a hard RTOS.
> But if our biggest point of disagreement is 20 vs. 200 vs. 2000 vs. 20000
> ms, then this thread has been a huge success!)
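
To make that concrete, a rough C sketch of the "bubbling up" bookkeeping could
look like this (purely illustrative; the names and thresholds are made up and
not an existing DRM interface):

#include <stdbool.h>

/* Hypothetical escalation bookkeeping: each level counts forced
 * destructions below it and takes itself out once a threshold is hit,
 * so the failure bubbles up exactly one level at a time. */
struct ctx_stats {
	unsigned timed_out_fences;    /* fences that missed their deadline */
	unsigned max_timed_out;       /* maybe 1, maybe 100: policy choice */
};

struct file_stats {
	unsigned killed_contexts;
	unsigned max_killed_contexts;
};

/* Called whenever a fence fails to retire in time. */
static void on_fence_timeout(struct ctx_stats *ctx, struct file_stats *file)
{
	if (++ctx->timed_out_fences < ctx->max_timed_out)
		return;                           /* tolerate it for now */

	/* destroy_context(ctx); -- kill just this context first ...      */
	if (++file->killed_contexts < file->max_killed_contexts)
		return;

	/* ... and only if that keeps happening, take out the whole file
	 * description, then the process, the cgroup/pidns, and so on.    */
}
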
>
> Cheers,
> Daniel
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 18:53       ` Daniel Vetter
@ 2021-04-20 19:14         ` Daniel Stone
  2021-04-20 19:29           ` Daniel Vetter
  2021-04-20 19:16         ` Jason Ekstrand
  1 sibling, 1 reply; 105+ messages in thread
From: Daniel Stone @ 2021-04-20 19:14 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: ML Mesa-dev, dri-devel, Jason Ekstrand



Hi,

On Tue, 20 Apr 2021 at 19:54, Daniel Vetter <daniel@ffwll.ch> wrote:

> So I can mostly get behind this, except it's _not_ going to be
> dma_fence. That thing has horrendous internal ordering constraints
> within the kernel, and the one thing that doesn't allow you is to make
> a dma_fence depend upon a userspace fence.
>
> But what we can do is use the same currently existing container
> objects like drm_syncobj or sync_file (timeline syncobj would fit best
> tbh), and stuff a userspace fence behind it. The only trouble is that
> currently timeline syncobj implement vulkan's spec, which means if you
> build a wait-before-signal deadlock, you'll wait forever. Well until
> the user ragequits and kills your process.
>
> So for winsys we'd need to be able to specify the wait timeout
> somewhere for waiting for that dma_fence to materialize (plus the
> submit thread, but userspace needs that anyway to support timeline
> syncobj) if you're importing an untrusted timeline syncobj. And I
> think that's roughly it.
>

Right. The only way you get to materialise a dma_fence from an execbuf is
that you take a hard timeout, with a penalty for not meeting that timeout.
When I say dma_fence I mean dma_fence, because there is no extant winsys
support for drm_syncobj, so this is greenfield: the winsys gets to specify
its terms of engagement, and again, we've been the orange/green-site
enemies of users for quite some time already, so we're happy to continue
doing so. If the actual underlying primitive is not a dma_fence, and
compositors/protocol/clients need to eat a bunch of typing to deal with a
different primitive which offers the same guarantees, then that's fine, as
long as there is some tangible whole-of-system benefit.

How that timeout is actually realised is an implementation detail. Whether
it's a property of the last GPU job itself that the CPU-side driver can
observe, or that the kernel driver guarantees that there is a GPU job
launched in parallel which monitors the memory-fence status and reports
back through a mailbox/doorbell, or the CPU-side driver enqueues kqueue
work for $n milliseconds' time to check the value in memory and kill the
context if it doesn't meet expectations - whatever. I don't believe any of
those choices meaningfully impact on kernel driver complexity relative to
the initial proposal, but they do allow us to continue to provide the
guarantees we do today when buffers cross security boundaries.
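
As a purely illustrative sketch of that last option, in plain userspace C (a
real driver would presumably use kernel delayed work rather than a thread;
every name here is made up):

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical monitor for one memory fence: after a grace period, check
 * whether the expected value has landed; if not, kill the context. */
struct fence_watch {
	_Atomic uint64_t *addr;       /* location the GPU is expected to write */
	uint64_t expected;            /* value that means "done"               */
	unsigned grace_ms;            /* how long we are willing to wait       */
	void (*kill_context)(void *ctx);
	void *ctx;
};

static void *fence_watchdog(void *arg)
{
	struct fence_watch *w = arg;

	usleep((useconds_t)w->grace_ms * 1000);

	if (atomic_load(w->addr) < w->expected)
		w->kill_context(w->ctx);   /* missed the deadline: escalate */

	return NULL;
}

/* e.g. pthread_create(&tid, NULL, fence_watchdog, &watch); */
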

There might well be an argument for significantly weakening those security
boundaries and shifting the complexity from the DRM scheduler into
userspace compositors. So far though, I have yet to see that argument made
coherently.

Cheers,
Daniel


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 18:53       ` Daniel Vetter
  2021-04-20 19:14         ` Daniel Stone
@ 2021-04-20 19:16         ` Jason Ekstrand
  2021-04-20 19:27           ` Daniel Vetter
  1 sibling, 1 reply; 105+ messages in thread
From: Jason Ekstrand @ 2021-04-20 19:16 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: ML Mesa-dev, dri-devel

On Tue, Apr 20, 2021 at 1:54 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, Apr 20, 2021 at 7:45 PM Daniel Stone <daniel@fooishbar.org> wrote:
>
> > And something more concrete:
> >
> > dma_fence.
> >
> > This already has all of the properties described above. Kernel-wise, it already devolves to CPU-side signaling when it crosses device boundaries. We need to support it roughly forever since it's been plumbed so far and so wide. Any primitive which is acceptable for winsys-like usage which crosses so many device/subsystem/process/security boundaries has to meet the same requirements. So why reinvent something which looks so similar, and has the same requirements of the kernel babysitting completion, providing little to no benefit for that difference?
>
> So I can mostly get behind this, except it's _not_ going to be
> dma_fence. That thing has horrendous internal ordering constraints
> within the kernel, and the one thing that doesn't allow you is to make
> a dma_fence depend upon a userspace fence.

Let me elaborate on this a bit.  One of the problems I mentioned
earlier is the conflation of fence types inside the kernel.  dma_fence
is used for solving two semi-related but distinct problems:
client command synchronization and memory residency synchronization.
In the old implicit GL world, we conflated these two and thought we
were providing ourselves a service.  Not so much....

It's all well and good to say that we should turn the memory fence
into a dma_fence and throw a timeout on it.  However, these
window-system sync primitives, as you said, have to be able to be
shared across everything.  In particular, we have to be able to share
them with drivers that don't make a good separation between command
and memory synchronization.

Let's say we're rendering on ANV with memory fences and presenting on
some USB display adapter whose kernel driver is a bit old-school.
When we pass that fence to the other driver via a sync_file or
similar, that driver may shove that dma_fence into the dma_resv on
some buffer somewhere.  Then our client, completely unaware of
internal kernel dependencies, binds that buffer into its address space
and kicks off another command buffer.  So i915 throws in a dependency
on that dma_resv which contains the previously created dma_fence and
refuses to execute any more command buffers until it signals.
Unfortunately, unbeknownst to i915, that command buffer which the
client kicked off after doing that bind was required for signaling the
memory fence on which our first dma_fence depends.  Deadlock.

Sure, we put a timeout on the dma_fence and it will eventually fire
and unblock everything.  However, there's one very important point
that's easy to miss here:  Neither i915 nor the client did anything
wrong in the above scenario.  The Vulkan footgun approach works
because there is a set of rules and, if you follow those rules,
you're guaranteed everything works.  In the above scenario, however,
the client followed all of the rules and got a deadlock anyway.  We
can't have that.


> But what we can do is use the same currently existing container
> objects like drm_syncobj or sync_file (timeline syncobj would fit best
> tbh), and stuff a userspace fence behind it. The only trouble is that
> currently timeline syncobj implement vulkan's spec, which means if you
> build a wait-before-signal deadlock, you'll wait forever. Well until
> the user ragequits and kills your process.

Yeah, it may be that this approach can be made to work.  Instead of
reusing dma_fence, maybe we can reuse syncobj and have another form of
syncobj which is a memory fence, a value to wait on, and a timeout.

--Jason

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 19:03           ` Bas Nieuwenhuizen
@ 2021-04-20 19:18             ` Daniel Stone
  0 siblings, 0 replies; 105+ messages in thread
From: Daniel Stone @ 2021-04-20 19:18 UTC (permalink / raw)
  To: Bas Nieuwenhuizen; +Cc: Christian König, dri-devel, ML Mesa-dev



Hi,

On Tue, 20 Apr 2021 at 20:03, Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
wrote:

> On Tue, Apr 20, 2021 at 8:16 PM Daniel Stone <daniel@fooishbar.org> wrote:
>
>> It's a jarring transition. If you take a very narrow view and say 'it's
>> all GPU work in shared buffers so it should all work the same', then
>> client<->winsys looks the same as client<->client gbuffer. But this is a
>> trap.
>>
>
> I think this is where we have a serious gap in what a winsys
> or a compositor is. Like if you have only a single wayland server running
> on a physical machine this is easy. But add a VR compositor, an
> intermediate compositor (say gamescope), Xwayland and some containers/VMs,
> some video capture (or, gasp, a browser that doubles as a compositor) and
> this story gets seriously complicated. Like who are you protecting from
> whom? At what point is something client<->winsys vs. client<->client?
>

As I've said upthread, the line is _seriously_ blurred, and is only getting
less clear. Right now, DRI3 cannot even accept a dma_fence, let alone a
drm_syncobj, let alone a memory fence.

Crossing those boundaries is hard, and requires as much thinking as typing.
That's a good thing.

Conflating every synchronisation desire into a single
userspace-visible primitive makes this harder, because it treats game
threads the same as other game threads the same as VR compositors the same
as embedding browsers the same as compositors etc. Drawing very clear lines
between game threads and the external world, with explicit weakening as
necessary, makes those jarring transitions of privilege and expectation
clear and explicit. Which is a good thing, since we're trying to move away
from magic and implicit.

Cheers,
Daniel


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 18:39             ` Daniel Vetter
@ 2021-04-20 19:20               ` Marek Olšák
  0 siblings, 0 replies; 105+ messages in thread
From: Marek Olšák @ 2021-04-20 19:20 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Christian König, dri-devel, ML Mesa-dev



On Tue, Apr 20, 2021 at 2:39 PM Daniel Vetter <daniel@ffwll.ch> wrote:

> On Tue, Apr 20, 2021 at 6:25 PM Marek Olšák <maraeo@gmail.com> wrote:
> >
> > Daniel, imagine hardware that can only do what Windows does: future
> fences signalled by userspace whenever userspace wants, and no kernel
> queues like we have today.
> >
> > The only reason why current AMD GPUs work is because they have a ring
> buffer per queue with pointers to userspace command buffers followed by
> fences. What will we do if that ring buffer is removed?
>
> Well this is an entirely different problem than what you set out to
> describe. This is essentially the problem where hw does not have any
> support for priviledged commands and separate priviledges command
> buffer, and direct userspace submit is the only thing that is
> available.
>
> I think if this is your problem, then you get to implement some very
> interesting compat shim. But that's an entirely different problem from
> what you've described in your mail. This pretty much assumes at the hw
> level the only thing that works is ATS/pasid, and vram is managed with
> HMM exclusively. Once you have that pure driver stack you get to fake
> it in the kernel for compat with everything that exists already. How
> exactly that will look and how exactly you best construct your
> dma_fences for compat will depend highly upon how much is still there
> in this hw (e.g. wrt interrupt generation). A lot of the
> infrastructure was also done as part of drm_syncobj. I mean we have
> entirely fake kernel drivers like vgem/vkms that create dma_fence, so
> a hw ringbuffer is really not required.
>
> So ... is this your problem underneath it all, or was that more a wild
> strawman for the discussion?
>

Yes, that's the problem.

Marek


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 19:16         ` Jason Ekstrand
@ 2021-04-20 19:27           ` Daniel Vetter
  0 siblings, 0 replies; 105+ messages in thread
From: Daniel Vetter @ 2021-04-20 19:27 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: ML Mesa-dev, dri-devel

On Tue, Apr 20, 2021 at 9:17 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> On Tue, Apr 20, 2021 at 1:54 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Tue, Apr 20, 2021 at 7:45 PM Daniel Stone <daniel@fooishbar.org> wrote:
> >
> > > And something more concrete:
> > >
> > > dma_fence.
> > >
> > > This already has all of the properties described above. Kernel-wise, it already devolves to CPU-side signaling when it crosses device boundaries. We need to support it roughly forever since it's been plumbed so far and so wide. Any primitive which is acceptable for winsys-like usage which crosses so many device/subsystem/process/security boundaries has to meet the same requirements. So why reinvent something which looks so similar, and has the same requirements of the kernel babysitting completion, providing little to no benefit for that difference?
> >
> > So I can mostly get behind this, except it's _not_ going to be
> > dma_fence. That thing has horrendous internal ordering constraints
> > within the kernel, and the one thing that doesn't allow you is to make
> > a dma_fence depend upon a userspace fence.
>
> Let me elaborate on this a bit.  One of the problems I mentioned
> earlier is the conflation of fence types inside the kernel.  dma_fence
> is used for solving two semi-related but distinct problems:
> client command synchronization and memory residency synchronization.
> In the old implicit GL world, we conflated these two and thought we
> were providing ourselves a service.  Not so much....
>
> It's all well and good to say that we should turn the memory fence
> into a dma_fence and throw a timeout on it.  However, these
> window-system sync primitives, as you said, have to be able to be
> shared across everything.  In particular, we have to be able to share
> them with drivers that don't make a good separation between command
> and memory synchronization.
>
> Let's say we're rendering on ANV with memory fences and presenting on
> some USB display adapter whose kernel driver is a bit old-school.
> When we pass that fence to the other driver via a sync_file or
> similar, that driver may shove that dma_fence into the dma_resv on
> some buffer somewhere.  Then our client, completely unaware of
> internal kernel dependencies, binds that buffer into its address space
> and kicks off another command buffer.  So i915 throws in a dependency
> on that dma_resv which contains the previously created dma_fence and
> refuses to execute any more command buffers until it signals.
> Unfortunately, unbeknownst to i915, that command buffer which the
> client kicked off after doing that bind was required for signaling the
> memory fence on which our first dma_fence depends.  Deadlock.

Nope. Because waiting for this future fence will only happen in two places:
- the driver submit thread, which is just userspace without holding
anything. From the kernel pov this can be preempted, memory
temporarily taken away, all these things. Until that's done you will
_not_ get a real dma_fence, but just another future fence.
- but what about the USB display, you're asking? Well, for that we'll
need a new atomic extension, which takes a timeline syncobj and gives
you back a timeline syncobj. And the rule is that if one of them is a
future fence/userspace fence, so is the other (even if it's created
by the kernel).

Either way you get a timeline syncobj back which anv can then again
handle properly with its submit thread. Not a dma_fence with a funny
timeout, because there are deadlock issues with those.

So no, you won't be able to get a dma_fence out of your sleight of hand here.
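
To sketch what that submit-thread model looks like in userspace (illustrative
C only; the types and kernel_submit() are stand-ins, not a real driver
interface):

#include <stdint.h>

/* Hypothetical stand-ins for a userspace driver's internals. Only the
 * control flow matters here. */
struct submit {
	struct submit *next;
	volatile uint64_t *wait_addr;   /* userspace/future fence to wait on  */
	uint64_t wait_value;
	void *cs;                       /* work for the kernel, once unblocked */
};

static void kernel_submit(void *cs)
{
	(void)cs;   /* stand-in for the real execbuf/CS ioctl */
}

/* The submit thread: block in userspace until the future fence has
 * materialized, and only then create kernel-visible work. The kernel
 * therefore never sees a dma_fence that depends on a userspace fence. */
static void submit_thread(struct submit *queue)
{
	for (struct submit *s = queue; s; s = s->next) {
		while (*s->wait_addr < s->wait_value)
			;   /* or futex/poll; preemptible, holds no kernel locks */
		kernel_submit(s->cs);
	}
}
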

> Sure, we put a timeout on the dma_fence and it will eventually fire
> and unblock everything.  However, there's one very important point
> that's easy to miss here:  Neither i915 nor the client did anything
> wrong in the above scenario.  The Vulkan footgun approach works
> because there are a set of rules and, if you follow those rules,
> you're guaranteed everything works.  In the above scenario, however,
> the client followed all of the rules and got a deadlock anyway.  We
> can't have that.
>
>
> > But what we can do is use the same currently existing container
> > objects like drm_syncobj or sync_file (timeline syncobj would fit best
> > tbh), and stuff a userspace fence behind it. The only trouble is that
> > currently timeline syncobj implement vulkan's spec, which means if you
> > build a wait-before-signal deadlock, you'll wait forever. Well until
> > the user ragequits and kills your process.
>
> Yeah, it may be that this approach can be made to work.  Instead of
> reusing dma_fence, maybe we can reuse syncobj and have another form of
> syncobj which is a memory fence, a value to wait on, and a timeout.

It's going to be the same container. But very much not a dma_fence.

Note the other approach is if you split the kernel's notion of what a
dma_fence is into two parts: memory fence and synchronization
primitive. The trouble is that there's tons of hw for which these are
by necessity the same things (because they can't preempt or don't have
a scheduler), so the value of this for the overall ecosystem is slim.
And the work to make it happen (plumb future fences through the
drm/scheduler and everything) is gigantic. drm/i915-gem tried, the
result is not pretty and we're now backing it largely all out, not
least because it's not where hw/vulkan/compute are actually going I think.

So that's an approach which I think does exist in theory, but really
not something I think we should attempt.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 19:14         ` Daniel Stone
@ 2021-04-20 19:29           ` Daniel Vetter
  2021-04-20 20:32             ` Daniel Stone
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-04-20 19:29 UTC (permalink / raw)
  To: Daniel Stone; +Cc: ML Mesa-dev, dri-devel, Jason Ekstrand

On Tue, Apr 20, 2021 at 9:14 PM Daniel Stone <daniel@fooishbar.org> wrote:
>
> Hi,
>
> On Tue, 20 Apr 2021 at 19:54, Daniel Vetter <daniel@ffwll.ch> wrote:
>>
>> So I can mostly get behind this, except it's _not_ going to be
>> dma_fence. That thing has horrendous internal ordering constraints
>> within the kernel, and the one thing that doesn't allow you is to make
>> a dma_fence depend upon a userspace fence.
>>
>> But what we can do is use the same currently existing container
>> objects like drm_syncobj or sync_file (timeline syncobj would fit best
>> tbh), and stuff a userspace fence behind it. The only trouble is that
>> currently timeline syncobj implement vulkan's spec, which means if you
>> build a wait-before-signal deadlock, you'll wait forever. Well until
>> the user ragequits and kills your process.
>>
>> So for winsys we'd need to be able to specify the wait timeout
>> somewhere for waiting for that dma_fence to materialize (plus the
>> submit thread, but userspace needs that anyway to support timeline
>> syncobj) if you're importing an untrusted timeline syncobj. And I
>> think that's roughly it.
>
>
> Right. The only way you get to materialise a dma_fence from an execbuf is that you take a hard timeout, with a penalty for not meeting that timeout. When I say dma_fence I mean dma_fence, because there is no extant winsys support for drm_syncobj, so this is greenfield: the winsys gets to specify its terms of engagement, and again, we've been the orange/green-site enemies of users for quite some time already, so we're happy to continue doing so. If the actual underlying primitive is not a dma_fence, and compositors/protocol/clients need to eat a bunch of typing to deal with a different primitive which offers the same guarantees, then that's fine, as long as there is some tangible whole-of-system benefit.

So atm sync_file doesn't support future fences, but we could add
support for them there. And since vulkan doesn't really say anything
about those, we could make the wait time out by default.

> How that timeout is actually realised is an implementation detail. Whether it's a property of the last GPU job itself that the CPU-side driver can observe, or that the kernel driver guarantees that there is a GPU job launched in parallel which monitors the memory-fence status and reports back through a mailbox/doorbell, or the CPU-side driver enqueues kqueue work for $n milliseconds' time to check the value in memory and kill the context if it doesn't meet expectations - whatever. I don't believe any of those choices meaningfully impact on kernel driver complexity relative to the initial proposal, but they do allow us to continue to provide the guarantees we do today when buffers cross security boundaries.

The thing is, you can't do this in drm/scheduler. At least not without
splitting up the dma_fence in the kernel into separate memory fences
and sync fences, and the work to get there is imo just not worth it.
We've bikeshedded this ad nauseam for vk timeline syncobj, and the
solution was to have the submit thread in the userspace driver.

It won't really change anything wrt what applications can observe from
the egl/gl side of things though.

> There might well be an argument for significantly weakening those security boundaries and shifting the complexity from the DRM scheduler into userspace compositors. So far though, I have yet to see that argument made coherently.

Ah we've had that argument. We have moved that into userspace as part
of vk submit threads. It ain't pretty, but it's better than the other
option :-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 19:29           ` Daniel Vetter
@ 2021-04-20 20:32             ` Daniel Stone
  2021-04-26 20:59               ` Marek Olšák
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Stone @ 2021-04-20 20:32 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: ML Mesa-dev, dri-devel, Jason Ekstrand



Hi,

On Tue, 20 Apr 2021 at 20:30, Daniel Vetter <daniel@ffwll.ch> wrote:

> The thing is, you can't do this in drm/scheduler. At least not without
> splitting up the dma_fence in the kernel into separate memory fences
> and sync fences


I'm starting to think this thread needs its own glossary ...

I propose we use 'residency fence' for execution fences which enact
memory-residency operations, e.g. faulting in a page ultimately depending
on GPU work retiring.

And 'value fence' for the pure-userspace model suggested by timeline
semaphores, i.e. fences being (*addr == val) rather than being able to look
at ctx seqno.
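
Or, as a tiny illustrative sketch (it is written as == above, but timeline
semaphores usually treat the fence as signalled once the payload reaches or
passes the value):

#include <stdbool.h>
#include <stdint.h>

/* A 'value fence' in the sense above: signalled purely by the payload at
 * an address, with no kernel context/seqno to inspect. Illustrative only. */
struct value_fence {
	const volatile uint64_t *addr;
	uint64_t value;
};

static bool value_fence_signalled(const struct value_fence *f)
{
	return *f->addr >= f->value;
}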

Cheers,
Daniel


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-20 20:32             ` Daniel Stone
@ 2021-04-26 20:59               ` Marek Olšák
  2021-04-27  8:02                 ` Daniel Vetter
  0 siblings, 1 reply; 105+ messages in thread
From: Marek Olšák @ 2021-04-26 20:59 UTC (permalink / raw)
  To: Daniel Stone; +Cc: ML Mesa-dev, dri-devel



Thanks everybody. The initial proposal is dead. Here are some thoughts on
how to do it differently.

I think we can have direct command submission from userspace via
memory-mapped queues ("user queues") without changing window systems.

The memory management doesn't have to use GPU page faults like HMM.
Instead, it can wait for user queues of a specific process to go idle and
then unmap the queues, so that userspace can't submit anything. Buffer
evictions, pinning, etc. can be executed when all queues are unmapped
(suspended). Thus, no BO fences or page faults are needed.

Inter-process synchronization can use timeline semaphores. Userspace will
query the wait and signal value for a shared buffer from the kernel. The
kernel will keep a history of those queries to know which process is
responsible for signalling which buffer. There is only the wait-timeout
issue and how to identify the culprit. One of the solutions is to have the
GPU send all GPU signal commands and all timed out wait commands via an
interrupt to the kernel driver to monitor and validate userspace behavior.
With that, it can be identified whether the culprit is the waiting process
or the signalling process and which one. Invalid signal/wait parameters can
also be detected. The kernel can force-signal only the semaphores that time
out, and punish the processes which caused the timeout or used invalid
signal/wait parameters.

The question is whether this synchronization solution is robust enough for
dma_fence and whatever the kernel and window systems need.
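
As a sketch of the kernel-side bookkeeping this implies (hypothetical
structures only, nothing here is existing uAPI):

#include <stdint.h>
#include <sys/types.h>

/* Remember which process promised to signal which value on which shared
 * buffer, so that a timeout can be pinned on the right culprit. */
struct signal_obligation {
	pid_t signaller;            /* process that queried the signal value */
	uint64_t buffer_id;         /* shared buffer / semaphore identity    */
	uint64_t signal_value;      /* value it promised to write            */
	struct signal_obligation *next;
};

/* On a wait timeout: force-signal just this semaphore, then decide whom to
 * punish. If someone promised the value but never delivered, blame them;
 * if nobody ever promised it, the waiter used invalid wait parameters. */
static pid_t find_culprit(const struct signal_obligation *history,
			  uint64_t buffer_id, uint64_t waited_value,
			  pid_t waiter)
{
	for (; history; history = history->next)
		if (history->buffer_id == buffer_id &&
		    history->signal_value >= waited_value)
			return history->signaller;

	return waiter;
}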

Marek

On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone <daniel@fooishbar.org> wrote:

> Hi,
>
> On Tue, 20 Apr 2021 at 20:30, Daniel Vetter <daniel@ffwll.ch> wrote:
>
>> The thing is, you can't do this in drm/scheduler. At least not without
>> splitting up the dma_fence in the kernel into separate memory fences
>> and sync fences
>
>
> I'm starting to think this thread needs its own glossary ...
>
> I propose we use 'residency fence' for execution fences which enact
> memory-residency operations, e.g. faulting in a page ultimately depending
> on GPU work retiring.
>
> And 'value fence' for the pure-userspace model suggested by timeline
> semaphores, i.e. fences being (*addr == val) rather than being able to look
> at ctx seqno.
>
> Cheers,
> Daniel
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-26 20:59               ` Marek Olšák
@ 2021-04-27  8:02                 ` Daniel Vetter
  2021-04-27 11:49                   ` Marek Olšák
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-04-27  8:02 UTC (permalink / raw)
  To: Marek Olšák; +Cc: ML Mesa-dev, dri-devel

On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> Thanks everybody. The initial proposal is dead. Here are some thoughts on
> how to do it differently.
> 
> I think we can have direct command submission from userspace via
> memory-mapped queues ("user queues") without changing window systems.
> 
> The memory management doesn't have to use GPU page faults like HMM.
> Instead, it can wait for user queues of a specific process to go idle and
> then unmap the queues, so that userspace can't submit anything. Buffer
> evictions, pinning, etc. can be executed when all queues are unmapped
> (suspended). Thus, no BO fences and page faults are needed.
> 
> Inter-process synchronization can use timeline semaphores. Userspace will
> query the wait and signal value for a shared buffer from the kernel. The
> kernel will keep a history of those queries to know which process is
> responsible for signalling which buffer. There is only the wait-timeout
> issue and how to identify the culprit. One of the solutions is to have the
> GPU send all GPU signal commands and all timed out wait commands via an
> interrupt to the kernel driver to monitor and validate userspace behavior.
> With that, it can be identified whether the culprit is the waiting process
> or the signalling process and which one. Invalid signal/wait parameters can
> also be detected. The kernel can force-signal only the semaphores that time
> out, and punish the processes which caused the timeout or used invalid
> signal/wait parameters.
> 
> The question is whether this synchronization solution is robust enough for
> dma_fence and whatever the kernel and window systems need.

The proper model here is the preempt-ctx dma_fence that amdkfd uses
(without page faults). That means dma_fence for synchronization is doa, at
least as-is, and we're back to figuring out the winsys problem.

"We'll solve it with timeouts" is very tempting, but doesn't work. It's
akin to saying that we're solving deadlock issues in a locking design by
doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
avoids having to reach the reset button, but that's about it.

And the fundamental problem is that once you throw in userspace command
submission (and syncing, at least within the userspace driver, otherwise
there's kinda no point if you still need the kernel for cross-engine sync),
you get deadlocks if you still use dma_fence for sync under perfectly
legit use-cases. We've discussed that one ad nauseam last summer:

https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences

See the silly diagram at the bottom.

Now I think all isn't lost, because imo the first step to getting to this
brave new world is rebuilding the driver on top of userspace fences, and
with the adjusted cmd submit model. You probably don't want to use amdkfd,
but port that as a context flag or similar to render nodes for gl/vk. Of
course that means you can only use this mode in headless, without
glx/wayland winsys support, but it's a start.
-Daniel

> 
> Marek
> 
> On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone <daniel@fooishbar.org> wrote:
> 
> > Hi,
> >
> > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> >> The thing is, you can't do this in drm/scheduler. At least not without
> >> splitting up the dma_fence in the kernel into separate memory fences
> >> and sync fences
> >
> >
> > I'm starting to think this thread needs its own glossary ...
> >
> > I propose we use 'residency fence' for execution fences which enact
> > memory-residency operations, e.g. faulting in a page ultimately depending
> > on GPU work retiring.
> >
> > And 'value fence' for the pure-userspace model suggested by timeline
> > semaphores, i.e. fences being (*addr == val) rather than being able to look
> > at ctx seqno.
> >
> > Cheers,
> > Daniel
> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27  8:02                 ` Daniel Vetter
@ 2021-04-27 11:49                   ` Marek Olšák
  2021-04-27 12:06                     ` Christian König
  2021-04-27 12:12                     ` Daniel Vetter
  0 siblings, 2 replies; 105+ messages in thread
From: Marek Olšák @ 2021-04-27 11:49 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: ML Mesa-dev, dri-devel



If we don't use future fences for DMA fences at all, e.g. we don't use them
for memory management, it can work, right? Memory management can suspend
user queues anytime. It doesn't need to use DMA fences. There might be
something that I'm missing here.

What would we lose without DMA fences? Just inter-device synchronization? I
think that might be acceptable.

The only case when the kernel will wait on a future fence is before a page
flip. Everything today already depends on userspace not hanging the gpu,
which makes everything a future fence.

Marek

On Tue., Apr. 27, 2021, 04:02 Daniel Vetter, <daniel@ffwll.ch> wrote:

> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
> > Thanks everybody. The initial proposal is dead. Here are some thoughts on
> > how to do it differently.
> >
> > I think we can have direct command submission from userspace via
> > memory-mapped queues ("user queues") without changing window systems.
> >
> > The memory management doesn't have to use GPU page faults like HMM.
> > Instead, it can wait for user queues of a specific process to go idle and
> > then unmap the queues, so that userspace can't submit anything. Buffer
> > evictions, pinning, etc. can be executed when all queues are unmapped
> > (suspended). Thus, no BO fences and page faults are needed.
> >
> > Inter-process synchronization can use timeline semaphores. Userspace will
> > query the wait and signal value for a shared buffer from the kernel. The
> > kernel will keep a history of those queries to know which process is
> > responsible for signalling which buffer. There is only the wait-timeout
> > issue and how to identify the culprit. One of the solutions is to have
> the
> > GPU send all GPU signal commands and all timed out wait commands via an
> > interrupt to the kernel driver to monitor and validate userspace
> behavior.
> > With that, it can be identified whether the culprit is the waiting
> process
> > or the signalling process and which one. Invalid signal/wait parameters
> can
> > also be detected. The kernel can force-signal only the semaphores that
> time
> > out, and punish the processes which caused the timeout or used invalid
> > signal/wait parameters.
> >
> > The question is whether this synchronization solution is robust enough
> for
> > dma_fence and whatever the kernel and window systems need.
>
> The proper model here is the preempt-ctx dma_fence that amdkfd uses
> (without page faults). That means dma_fence for synchronization is doa, at
> least as-is, and we're back to figuring out the winsys problem.
>
> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
> akin to saying that we're solving deadlock issues in a locking design by
> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
> avoids having to reach the reset button, but that's about it.
>
> And the fundamental problem is that once you throw in userspace command
> submission (and syncing, at least within the userspace driver, otherwise
> there's kinda no point if you still need the kernel for cross-engine sync)
> means you get deadlocks if you still use dma_fence for sync under
> perfectly legit use-case. We've discussed that one ad nauseam last summer:
>
>
> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>
> See silly diagramm at the bottom.
>
> Now I think all isn't lost, because imo the first step to getting to this
> brave new world is rebuilding the driver on top of userspace fences, and
> with the adjusted cmd submit model. You probably don't want to use amdkfd,
> but port that as a context flag or similar to render nodes for gl/vk. Of
> course that means you can only use this mode in headless, without
> glx/wayland winsys support, but it's a start.
> -Daniel
>
> >
> > Marek
> >
> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone <daniel@fooishbar.org>
> wrote:
> >
> > > Hi,
> > >
> > > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > >> The thing is, you can't do this in drm/scheduler. At least not without
> > >> splitting up the dma_fence in the kernel into separate memory fences
> > >> and sync fences
> > >
> > >
> > > I'm starting to think this thread needs its own glossary ...
> > >
> > > I propose we use 'residency fence' for execution fences which enact
> > > memory-residency operations, e.g. faulting in a page ultimately
> depending
> > > on GPU work retiring.
> > >
> > > And 'value fence' for the pure-userspace model suggested by timeline
> > > semaphores, i.e. fences being (*addr == val) rather than being able to
> look
> > > at ctx seqno.
> > >
> > > Cheers,
> > > Daniel
> > > _______________________________________________
> > > mesa-dev mailing list
> > > mesa-dev@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> > >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 11:49                   ` Marek Olšák
@ 2021-04-27 12:06                     ` Christian König
  2021-04-27 12:11                       ` Marek Olšák
  2021-04-27 18:38                       ` Dave Airlie
  2021-04-27 12:12                     ` Daniel Vetter
  1 sibling, 2 replies; 105+ messages in thread
From: Christian König @ 2021-04-27 12:06 UTC (permalink / raw)
  To: Marek Olšák, Daniel Vetter; +Cc: ML Mesa-dev, dri-devel



Correct, we wouldn't have synchronization between devices with and
without user queues any more.

That could only be a problem for A+I laptops.

Memory management will just work with preemption fences which pause the
user queues of a process before evicting something. That will be a
dma_fence, but also a well-known approach.
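
A rough sketch of that flow, purely for illustration (hypothetical names,
not a real driver interface):

#include <stddef.h>

/* Suspend every user queue of a process, move memory, resume.
 * No per-BO fences involved. */
struct user_queue {
	int doorbell_mapped;
	volatile int idle;
};

struct process_ctx {
	struct user_queue *queues;
	size_t nr_queues;
};

static void queue_suspend(struct user_queue *q)
{
	q->doorbell_mapped = 0;     /* userspace can no longer submit     */
	while (!q->idle)
		;                   /* i.e. wait for the preemption fence */
}

static void queue_resume(struct user_queue *q)
{
	q->doorbell_mapped = 1;
}

static void evict_process_memory(struct process_ctx *p,
				 void (*do_eviction)(struct process_ctx *))
{
	for (size_t i = 0; i < p->nr_queues; i++)
		queue_suspend(&p->queues[i]);

	/* Every queue is unmapped and idle: buffers can be evicted or
	 * pinned without tracking any BO fences. */
	do_eviction(p);

	for (size_t i = 0; i < p->nr_queues; i++)
		queue_resume(&p->queues[i]);
}
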

Christian.

On 27.04.21 at 13:49, Marek Olšák wrote:
> If we don't use future fences for DMA fences at all, e.g. we don't use 
> them for memory management, it can work, right? Memory management can 
> suspend user queues anytime. It doesn't need to use DMA fences. There 
> might be something that I'm missing here.
>
> What would we lose without DMA fences? Just inter-device 
> synchronization? I think that might be acceptable.
>
> The only case when the kernel will wait on a future fence is before a 
> page flip. Everything today already depends on userspace not hanging 
> the gpu, which makes everything a future fence.
>
> Marek
>
> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter, <daniel@ffwll.ch> wrote:
>
>     On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>     > Thanks everybody. The initial proposal is dead. Here are some
>     thoughts on
>     > how to do it differently.
>     >
>     > I think we can have direct command submission from userspace via
>     > memory-mapped queues ("user queues") without changing window
>     systems.
>     >
>     > The memory management doesn't have to use GPU page faults like HMM.
>     > Instead, it can wait for user queues of a specific process to go
>     idle and
>     > then unmap the queues, so that userspace can't submit anything.
>     Buffer
>     > evictions, pinning, etc. can be executed when all queues are
>     unmapped
>     > (suspended). Thus, no BO fences and page faults are needed.
>     >
>     > Inter-process synchronization can use timeline semaphores.
>     Userspace will
>     > query the wait and signal value for a shared buffer from the
>     kernel. The
>     > kernel will keep a history of those queries to know which process is
>     > responsible for signalling which buffer. There is only the
>     wait-timeout
>     > issue and how to identify the culprit. One of the solutions is
>     to have the
>     > GPU send all GPU signal commands and all timed out wait commands
>     via an
>     > interrupt to the kernel driver to monitor and validate userspace
>     behavior.
>     > With that, it can be identified whether the culprit is the
>     waiting process
>     > or the signalling process and which one. Invalid signal/wait
>     parameters can
>     > also be detected. The kernel can force-signal only the
>     semaphores that time
>     > out, and punish the processes which caused the timeout or used
>     invalid
>     > signal/wait parameters.
>     >
>     > The question is whether this synchronization solution is robust
>     enough for
>     > dma_fence and whatever the kernel and window systems need.
>
>     The proper model here is the preempt-ctx dma_fence that amdkfd uses
>     (without page faults). That means dma_fence for synchronization is
>     doa, at
>     least as-is, and we're back to figuring out the winsys problem.
>
>     "We'll solve it with timeouts" is very tempting, but doesn't work.
>     It's
>     akin to saying that we're solving deadlock issues in a locking
>     design by
>     doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
>     avoids having to reach the reset button, but that's about it.
>
>     And the fundamental problem is that once you throw in userspace
>     command
>     submission (and syncing, at least within the userspace driver,
>     otherwise
>     there's kinda no point if you still need the kernel for
>     cross-engine sync)
>     means you get deadlocks if you still use dma_fence for sync under
>     perfectly legit use-case. We've discussed that one ad nauseam last
>     summer:
>
>     https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>
>     See silly diagramm at the bottom.
>
>     Now I think all isn't lost, because imo the first step to getting
>     to this
>     brave new world is rebuilding the driver on top of userspace
>     fences, and
>     with the adjusted cmd submit model. You probably don't want to use
>     amdkfd,
>     but port that as a context flag or similar to render nodes for
>     gl/vk. Of
>     course that means you can only use this mode in headless, without
>     glx/wayland winsys support, but it's a start.
>     -Daniel
>
>     >
>     > Marek
>     >
>     > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone <daniel@fooishbar.org> wrote:
>     >
>     > > Hi,
>     > >
>     > > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter <daniel@ffwll.ch> wrote:
>     > >
>     > >> The thing is, you can't do this in drm/scheduler. At least
>     not without
>     > >> splitting up the dma_fence in the kernel into separate memory
>     fences
>     > >> and sync fences
>     > >
>     > >
>     > > I'm starting to think this thread needs its own glossary ...
>     > >
>     > > I propose we use 'residency fence' for execution fences which
>     enact
>     > > memory-residency operations, e.g. faulting in a page
>     ultimately depending
>     > > on GPU work retiring.
>     > >
>     > > And 'value fence' for the pure-userspace model suggested by
>     timeline
>     > > semaphores, i.e. fences being (*addr == val) rather than being
>     able to look
>     > > at ctx seqno.
>     > >
>     > > Cheers,
>     > > Daniel
>     > > _______________________________________________
>     > > mesa-dev mailing list
>     > > mesa-dev@lists.freedesktop.org
>     > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>     > >
>
>     -- 
>     Daniel Vetter
>     Software Engineer, Intel Corporation
>     http://blog.ffwll.ch
>
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev



* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 12:06                     ` Christian König
@ 2021-04-27 12:11                       ` Marek Olšák
  2021-04-27 12:15                         ` Daniel Vetter
  2021-04-27 18:38                       ` Dave Airlie
  1 sibling, 1 reply; 105+ messages in thread
From: Marek Olšák @ 2021-04-27 12:11 UTC (permalink / raw)
  To: Christian König; +Cc: ML Mesa-dev, dri-devel



Ok. I'll interpret this as "yes, it will work, let's do it".

Marek

On Tue., Apr. 27, 2021, 08:06 Christian König, <
ckoenig.leichtzumerken@gmail.com> wrote:

> Correct, we wouldn't have synchronization between device with and without
> user queues any more.
>
> That could only be a problem for A+I Laptops.
>
> Memory management will just work with preemption fences which pause the
> user queues of a process before evicting something. That will be a
> dma_fence, but also a well known approach.
>
> Christian.
>
> On 27.04.21 at 13:49, Marek Olšák wrote:
>
> If we don't use future fences for DMA fences at all, e.g. we don't use
> them for memory management, it can work, right? Memory management can
> suspend user queues anytime. It doesn't need to use DMA fences. There might
> be something that I'm missing here.
>
> What would we lose without DMA fences? Just inter-device synchronization?
> I think that might be acceptable.
>
> The only case when the kernel will wait on a future fence is before a page
> flip. Everything today already depends on userspace not hanging the gpu,
> which makes everything a future fence.
>
> Marek
>
> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter, <daniel@ffwll.ch> wrote:
>
>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>> > Thanks everybody. The initial proposal is dead. Here are some thoughts
>> on
>> > how to do it differently.
>> >
>> > I think we can have direct command submission from userspace via
>> > memory-mapped queues ("user queues") without changing window systems.
>> >
>> > The memory management doesn't have to use GPU page faults like HMM.
>> > Instead, it can wait for user queues of a specific process to go idle
>> and
>> > then unmap the queues, so that userspace can't submit anything. Buffer
>> > evictions, pinning, etc. can be executed when all queues are unmapped
>> > (suspended). Thus, no BO fences and page faults are needed.
>> >
>> > Inter-process synchronization can use timeline semaphores. Userspace
>> will
>> > query the wait and signal value for a shared buffer from the kernel. The
>> > kernel will keep a history of those queries to know which process is
>> > responsible for signalling which buffer. There is only the wait-timeout
>> > issue and how to identify the culprit. One of the solutions is to have
>> the
>> > GPU send all GPU signal commands and all timed out wait commands via an
>> > interrupt to the kernel driver to monitor and validate userspace
>> behavior.
>> > With that, it can be identified whether the culprit is the waiting
>> process
>> > or the signalling process and which one. Invalid signal/wait parameters
>> can
>> > also be detected. The kernel can force-signal only the semaphores that
>> time
>> > out, and punish the processes which caused the timeout or used invalid
>> > signal/wait parameters.
>> >
>> > The question is whether this synchronization solution is robust enough
>> for
>> > dma_fence and whatever the kernel and window systems need.
>>
>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
>> (without page faults). That means dma_fence for synchronization is doa, at
>> least as-is, and we're back to figuring out the winsys problem.
>>
>> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
>> akin to saying that we're solving deadlock issues in a locking design by
>> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
>> avoids having to reach the reset button, but that's about it.
>>
>> And the fundamental problem is that once you throw in userspace command
>> submission (and syncing, at least within the userspace driver, otherwise
>> there's kinda no point if you still need the kernel for cross-engine sync)
>> means you get deadlocks if you still use dma_fence for sync under
>> perfectly legit use-case. We've discussed that one ad nauseam last summer:
>>
>>
>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>>
>> See silly diagramm at the bottom.
>>
>> Now I think all isn't lost, because imo the first step to getting to this
>> brave new world is rebuilding the driver on top of userspace fences, and
>> with the adjusted cmd submit model. You probably don't want to use amdkfd,
>> but port that as a context flag or similar to render nodes for gl/vk. Of
>> course that means you can only use this mode in headless, without
>> glx/wayland winsys support, but it's a start.
>> -Daniel
>>
>> >
>> > Marek
>> >
>> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone <daniel@fooishbar.org>
>> wrote:
>> >
>> > > Hi,
>> > >
>> > > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter <daniel@ffwll.ch> wrote:
>> > >
>> > >> The thing is, you can't do this in drm/scheduler. At least not
>> without
>> > >> splitting up the dma_fence in the kernel into separate memory fences
>> > >> and sync fences
>> > >
>> > >
>> > > I'm starting to think this thread needs its own glossary ...
>> > >
>> > > I propose we use 'residency fence' for execution fences which enact
>> > > memory-residency operations, e.g. faulting in a page ultimately
>> depending
>> > > on GPU work retiring.
>> > >
>> > > And 'value fence' for the pure-userspace model suggested by timeline
>> > > semaphores, i.e. fences being (*addr == val) rather than being able
>> to look
>> > > at ctx seqno.
>> > >
>> > > Cheers,
>> > > Daniel
>> > > _______________________________________________
>> > > mesa-dev mailing list
>> > > mesa-dev@lists.freedesktop.org
>> > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>> > >
>>
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
>>
>
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
>
>


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 11:49                   ` Marek Olšák
  2021-04-27 12:06                     ` Christian König
@ 2021-04-27 12:12                     ` Daniel Vetter
  1 sibling, 0 replies; 105+ messages in thread
From: Daniel Vetter @ 2021-04-27 12:12 UTC (permalink / raw)
  To: Marek Olšák; +Cc: ML Mesa-dev, dri-devel

On Tue, Apr 27, 2021 at 1:49 PM Marek Olšák <maraeo@gmail.com> wrote:
>
> If we don't use future fences for DMA fences at all, e.g. we don't use them for memory management, it can work, right? Memory management can suspend user queues anytime. It doesn't need to use DMA fences. There might be something that I'm missing here.

Other drivers use dma_fence for their memory management. So unless
you've converted them all over to the dma_fence/memory fence split,
dma_fences stay memory fences. In theory this is possible, but
maybe not if you want to complete the job this decade :-)

> What would we lose without DMA fences? Just inter-device synchronization? I think that might be acceptable.
>
> The only case when the kernel will wait on a future fence is before a page flip. Everything today already depends on userspace not hanging the gpu, which makes everything a future fence.

That's not quite what we defined as future fences, because tdr
guarantees those complete, even if userspace hangs. It's when you put
userspace fence waits into the cs buffer you've submitted to the
kernel (or directly to hw) that the "real" future fences kick in.
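
Schematically (not a real packet format, just to illustrate why such a
submission can only ever back a future/userspace fence):

#include <stdint.h>

/* Schematic only, not a real packet format. */
struct cs_wait_mem {
	uint64_t gpu_va;    /* address the GPU polls                          */
	uint64_t value;     /* nothing after this runs until *gpu_va >= value */
};

struct user_cs {
	struct cs_wait_mem wait;   /* satisfied by some other userspace-
	                            * controlled queue, at some unknown time  */
	/* ... draw/dispatch packets follow ... */
};

/* tdr can bound a hang in the packets themselves, but nothing bounds when
 * 'wait' is satisfied: that is entirely up to userspace, which is what
 * makes any fence covering this cs a future/userspace fence rather than a
 * normal dma_fence. */
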
-Daniel

>
> Marek
>
> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter, <daniel@ffwll.ch> wrote:
>>
>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák wrote:
>> > Thanks everybody. The initial proposal is dead. Here are some thoughts on
>> > how to do it differently.
>> >
>> > I think we can have direct command submission from userspace via
>> > memory-mapped queues ("user queues") without changing window systems.
>> >
>> > The memory management doesn't have to use GPU page faults like HMM.
>> > Instead, it can wait for user queues of a specific process to go idle and
>> > then unmap the queues, so that userspace can't submit anything. Buffer
>> > evictions, pinning, etc. can be executed when all queues are unmapped
>> > (suspended). Thus, no BO fences and page faults are needed.
>> >
>> > Inter-process synchronization can use timeline semaphores. Userspace will
>> > query the wait and signal value for a shared buffer from the kernel. The
>> > kernel will keep a history of those queries to know which process is
>> > responsible for signalling which buffer. There is only the wait-timeout
>> > issue and how to identify the culprit. One of the solutions is to have the
>> > GPU send all GPU signal commands and all timed out wait commands via an
>> > interrupt to the kernel driver to monitor and validate userspace behavior.
>> > With that, it can be identified whether the culprit is the waiting process
>> > or the signalling process and which one. Invalid signal/wait parameters can
>> > also be detected. The kernel can force-signal only the semaphores that time
>> > out, and punish the processes which caused the timeout or used invalid
>> > signal/wait parameters.
>> >
>> > The question is whether this synchronization solution is robust enough for
>> > dma_fence and whatever the kernel and window systems need.
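
A rough sketch of what the scheme quoted above could look like from
userspace; every name below is invented for illustration and corresponds to
no existing kernel interface.

#include <stdint.h>

/* Hypothetical per-buffer sync info handed out by the kernel. The kernel
 * remembers which process asked for which values, so a timed-out wait can
 * be blamed on the process that promised, but never produced, the signal. */
struct shared_buf_sync {
        uint64_t sem_gpu_va;    /* GPU address of the timeline semaphore */
        uint64_t wait_value;    /* consumer waits until *sem >= wait_value */
        uint64_t signal_value;  /* producer writes this value when done */
};

/* Producer side: the kernel has logged that this process owes signal_value. */
static void producer_finish(const struct shared_buf_sync *s,
                            void (*emit_gpu_signal)(uint64_t va, uint64_t val))
{
        emit_gpu_signal(s->sem_gpu_va, s->signal_value);
}

/* Consumer side: if this wait times out, the kernel force-signals only this
 * semaphore and punishes the producer it recorded earlier. */
static void consumer_wait(const struct shared_buf_sync *s,
                          void (*emit_gpu_wait)(uint64_t va, uint64_t val))
{
        emit_gpu_wait(s->sem_gpu_va, s->wait_value);
}
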
>>
>> The proper model here is the preempt-ctx dma_fence that amdkfd uses
>> (without page faults). That means dma_fence for synchronization is DOA, at
>> least as-is, and we're back to figuring out the winsys problem.
>>
>> "We'll solve it with timeouts" is very tempting, but doesn't work. It's
>> akin to saying that we're solving deadlock issues in a locking design by
>> doing a global s/mutex_lock/mutex_lock_timeout/ in the kernel. Sure it
>> avoids having to reach the reset button, but that's about it.
>>
>> And the fundamental problem is that once you throw in userspace command
>> submission (and syncing, at least within the userspace driver, otherwise
>> there's kinda no point if you still need the kernel for cross-engine sync),
>> you get deadlocks if you still use dma_fence for sync under perfectly
>> legit use-cases. We've discussed that one ad nauseam last summer:
>>
>> https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
>>
>> See the silly diagram at the bottom.
>>
>> Now I think all isn't lost, because imo the first step to getting to this
>> brave new world is rebuilding the driver on top of userspace fences, and
>> with the adjusted cmd submit model. You probably don't want to use amdkfd,
>> but port that as a context flag or similar to render nodes for gl/vk. Of
>> course that means you can only use this mode in headless, without
>> glx/wayland winsys support, but it's a start.
>> -Daniel
>>
>> >
>> > Marek
>> >
>> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone <daniel@fooishbar.org> wrote:
>> >
>> > > Hi,
>> > >
>> > > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter <daniel@ffwll.ch> wrote:
>> > >
>> > >> The thing is, you can't do this in drm/scheduler. At least not without
>> > >> splitting up the dma_fence in the kernel into separate memory fences
>> > >> and sync fences
>> > >
>> > >
>> > > I'm starting to think this thread needs its own glossary ...
>> > >
>> > > I propose we use 'residency fence' for execution fences which enact
>> > > memory-residency operations, e.g. faulting in a page ultimately depending
>> > > on GPU work retiring.
>> > >
>> > > And 'value fence' for the pure-userspace model suggested by timeline
>> > > semaphores, i.e. fences being (*addr == val) rather than being able to look
>> > > at ctx seqno.
>> > >
>> > > Cheers,
>> > > Daniel
>>
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 12:11                       ` Marek Olšák
@ 2021-04-27 12:15                         ` Daniel Vetter
  2021-04-27 12:27                           ` Christian König
  2021-04-27 12:46                           ` Marek Olšák
  0 siblings, 2 replies; 105+ messages in thread
From: Daniel Vetter @ 2021-04-27 12:15 UTC (permalink / raw)
  To: Marek Olšák; +Cc: Christian König, dri-devel, ML Mesa-dev

On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák <maraeo@gmail.com> wrote:
> Ok. I'll interpret this as "yes, it will work, let's do it".

It works if all you care about is drm/amdgpu. I'm not sure that's a
reasonable approach for upstream, but it definitely is an approach :-)

We've already gone somewhat through the pain of drm/amdgpu redefining
how implicit sync works without sufficiently talking with other
people; maybe we should avoid a repeat of that ...
-Daniel

>
> Marek
>
> On Tue., Apr. 27, 2021, 08:06 Christian König, <ckoenig.leichtzumerken@gmail.com> wrote:
>>
>> Correct, we wouldn't have synchronization between devices with and without user queues any more.
>>
>> That could only be a problem for A+I Laptops.
>>
>> Memory management will just work with preemption fences which pause the user queues of a process before evicting something. That will be a dma_fence, but also a well known approach.
>>
>> Christian.
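
A minimal sketch of that eviction flow, with invented names throughout
(nothing here is a real driver function). The only dma_fence involved would
be the preemption fence that completes once the queues below are unmapped,
so it never depends on userspace finishing its own work.

#include <stdbool.h>

struct user_queue {
        bool mapped;    /* ring + doorbell currently visible to the GPU */
};

struct process_ctx {
        struct user_queue *queues;
        unsigned int num_queues;
};

/* Body of a hypothetical preemption fence: suspend every user queue of the
 * process, after which buffers can be moved without any per-BO fences,
 * because nothing belonging to this process can execute anymore. */
static void suspend_and_evict(struct process_ctx *ctx,
                              void (*unmap_queue)(struct user_queue *),
                              void (*move_buffers)(struct process_ctx *))
{
        for (unsigned int i = 0; i < ctx->num_queues; i++) {
                unmap_queue(&ctx->queues[i]);   /* waits for the queue to idle */
                ctx->queues[i].mapped = false;
        }
        move_buffers(ctx);      /* eviction, pinning, etc. happen here */
        /* queues are mapped again later, when the process gets to run again */
}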


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 12:15                         ` Daniel Vetter
@ 2021-04-27 12:27                           ` Christian König
  2021-04-27 12:46                           ` Marek Olšák
  1 sibling, 0 replies; 105+ messages in thread
From: Christian König @ 2021-04-27 12:27 UTC (permalink / raw)
  To: Daniel Vetter, Marek Olšák; +Cc: ML Mesa-dev, dri-devel

Am 27.04.21 um 14:15 schrieb Daniel Vetter:
> On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák <maraeo@gmail.com> wrote:
>> Ok. I'll interpret this as "yes, it will work, let's do it".
> It works if all you care about is drm/amdgpu. I'm not sure that's a
> reasonable approach for upstream, but it definitely is an approach :-)
>
> We've already gone somewhat through the pain of drm/amdgpu redefining
> how implicit sync works without sufficiently talking with other
> people, maybe we should avoid a repeat of this ...

BTW: This is coming up again for the plan here.

We once more need to think about the "other" fences which don't 
participate in the implicit sync here.

Christian.



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 12:15                         ` Daniel Vetter
  2021-04-27 12:27                           ` Christian König
@ 2021-04-27 12:46                           ` Marek Olšák
  2021-04-27 12:50                             ` Christian König
  1 sibling, 1 reply; 105+ messages in thread
From: Marek Olšák @ 2021-04-27 12:46 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Christian König, dri-devel, ML Mesa-dev



I'll defer to Christian and Alex to decide whether dropping sync with
non-amd devices (GPUs, cameras etc.) is acceptable.

Rewriting those drivers to this new sync model could be done on a
case-by-case basis.

For now, would we only lose the "amd -> external" dependency? Or the
"external -> amd" dependency too?

Marek



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 12:46                           ` Marek Olšák
@ 2021-04-27 12:50                             ` Christian König
  2021-04-27 13:26                               ` Marek Olšák
  0 siblings, 1 reply; 105+ messages in thread
From: Christian König @ 2021-04-27 12:50 UTC (permalink / raw)
  To: Marek Olšák, Daniel Vetter; +Cc: ML Mesa-dev, dri-devel



Only amd -> external.

We can easily install something in a user queue which waits for a
dma_fence in the kernel.

But we can't easily wait for a user queue as a dependency of a dma_fence.

The good thing is we have this wait-before-signal case on Vulkan
timeline semaphores, which have the same problem in the kernel.

The good news is I think we can relatively easily convert i915 and older
amdgpu devices to something which is compatible with user fences.

So yes, getting that fixed case by case should work.

Christian
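
A sketch of the easy direction (kernel dma_fence -> user queue), with
invented packet names; no real hardware packet format is implied. The kernel
arranges a memory write when the fence signals, and the user queue simply
carries a wait-for-value packet in front of the dependent work.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-queue packets; real hardware has its own encoding. */
enum pkt_op { PKT_WAIT_MEM_GE, PKT_DISPATCH };

struct pkt {
        enum pkt_op op;
        uint64_t addr;          /* address to poll (wait packets only) */
        uint64_t value;         /* compare value or dispatch payload */
};

/* Userspace emits: [wait until *fence_va >= seqno][dispatch work]. The
 * kernel side (not shown) writes seqno to fence_va from a dma_fence
 * callback, which unblocks the queue. The reverse direction - turning a
 * user-queue completion into a dma_fence - is the hard part discussed
 * above, because nothing bounds how long that wait could take. */
static size_t emit_wait_then_dispatch(struct pkt *ring, size_t head,
                                      uint64_t fence_va, uint64_t seqno,
                                      uint64_t work_id)
{
        ring[head++] = (struct pkt){ .op = PKT_WAIT_MEM_GE,
                                     .addr = fence_va, .value = seqno };
        ring[head++] = (struct pkt){ .op = PKT_DISPATCH, .value = work_id };
        return head;
}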




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 12:50                             ` Christian König
@ 2021-04-27 13:26                               ` Marek Olšák
  2021-04-27 15:13                                 ` Christian König
  2021-04-27 17:31                                 ` Lucas Stach
  0 siblings, 2 replies; 105+ messages in thread
From: Marek Olšák @ 2021-04-27 13:26 UTC (permalink / raw)
  To: Christian König; +Cc: ML Mesa-dev, dri-devel



Ok. So that would only make the following use cases broken for now:
- amd render -> external gpu
- amd video encode -> network device

What about the case when we get a buffer from an external device and we're
supposed to make it "busy" when we are using it, and the external device
wants to wait until we stop using it? Is it something that can happen, thus
turning "external -> amd" into "external <-> amd"?

Marek



^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 13:26                               ` Marek Olšák
@ 2021-04-27 15:13                                 ` Christian König
  2021-04-27 17:31                                 ` Lucas Stach
  1 sibling, 0 replies; 105+ messages in thread
From: Christian König @ 2021-04-27 15:13 UTC (permalink / raw)
  To: Marek Olšák; +Cc: ML Mesa-dev, dri-devel



Uff, good question. DMA-buf certainly supports that use case, but I have
no idea if it is actually used anywhere.

Daniel, do you know of any such case?

Christian.



[-- Attachment #1.2: Type: text/html, Size: 20325 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 13:26                               ` Marek Olšák
  2021-04-27 15:13                                 ` Christian König
@ 2021-04-27 17:31                                 ` Lucas Stach
  2021-04-27 17:35                                   ` Simon Ser
  2021-04-27 19:41                                   ` Jason Ekstrand
  1 sibling, 2 replies; 105+ messages in thread
From: Lucas Stach @ 2021-04-27 17:31 UTC (permalink / raw)
  To: Marek Olšák, Christian König; +Cc: ML Mesa-dev, dri-devel

Hi,

Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
> Ok. So that would only make the following use cases broken for now:
> - amd render -> external gpu
> - amd video encode -> network device

FWIW, "only" breaking amd render -> external gpu will make us pretty
unhappy, as we have some cases where we are combining an AMD APU with an
FPGA-based graphics card. I can't go into the specifics of this use-case
too much, but basically the AMD graphics is rendering content that gets
composited on top of a live video pipeline running through the FPGA.

> What about the case when we get a buffer from an external device and
> we're supposed to make it "busy" when we are using it, and the
> external device wants to wait until we stop using it? Is it something
> that can happen, thus turning "external -> amd" into "external <->
> amd"?

Zero-copy texture sampling from a video input certainly appreciates
this very much. Trying to pass the render fence through the various
layers of userspace to be able to tell when the video input can reuse a
buffer is a great experience in yak shaving. Allowing the video input
to reuse the buffer as soon as the read dma_fence from the GPU is
signaled is much more straightforward.
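
A minimal illustration of that simpler path: with implicit sync, the
importer can just ask the kernel when the fences attached to the dma-buf
have signalled, for example by polling the dma-buf fd. This is only a
sketch, and it assumes POLLOUT on a dma-buf fd means "all attached fences
have signalled"; check the dma-buf poll documentation before relying on
that detail.

    /* Sketch: block until the dma-buf is idle so the video input can
     * safely reuse the buffer. Assumes POLLOUT waits for all fences. */
    #include <poll.h>

    static int wait_for_dmabuf_idle(int dmabuf_fd, int timeout_ms)
    {
        struct pollfd pfd = { .fd = dmabuf_fd, .events = POLLOUT };
        int ret = poll(&pfd, 1, timeout_ms);

        if (ret < 0)
            return -1;              /* poll() error */
        return ret > 0 ? 1 : 0;     /* 1 = idle, 0 = timed out */
    }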

Regards,
Lucas

> Marek
> 
> On Tue., Apr. 27, 2021, 08:50 Christian König, < 
> ckoenig.leichtzumerken@gmail.com> wrote:
> >  Only amd -> external.
> >  
> >  We can easily install something in a user queue which waits for a
> > dma_fence in the kernel.
> >  
> >  But we can't easily wait for a user queue as a dependency of a
> > dma_fence.
> >  
> >  The good thing is we have this wait before signal case on Vulkan
> > timeline semaphores which have the same problem in the kernel.
> >  
> >  The good news is I think we can relatively easily convert i915 and
> > older amdgpu devices to something which is compatible with user
> > fences.
> >  
> >  So yes, getting that fixed case by case should work.
> >  
> >  Christian
> >  
> > Am 27.04.21 um 14:46 schrieb Marek Olšák:
> >  
> > > I'll defer to Christian and Alex to decide whether dropping sync
> > > with non-amd devices (GPUs, cameras etc.) is acceptable.
> > > 
> > > Rewriting those drivers to this new sync model could be done on a
> > > case by case basis.
> > > 
> > > For now, would we only lose the "amd -> external" dependency? Or
> > > the "external -> amd" dependency too?
> > > 
> > > Marek
> > > 
> > > On Tue., Apr. 27, 2021, 08:15 Daniel Vetter, <daniel@ffwll.ch>
> > > wrote:
> > >  
> > > > On Tue, Apr 27, 2021 at 2:11 PM Marek Olšák <maraeo@gmail.com>
> > > > wrote:
> > > >  > Ok. I'll interpret this as "yes, it will work, let's do it".
> > > >  
> > > >  It works if all you care about is drm/amdgpu. I'm not sure
> > > > that's a
> > > >  reasonable approach for upstream, but it definitely is an
> > > > approach :-)
> > > >  
> > > >  We've already gone somewhat through the pain of drm/amdgpu
> > > > redefining
> > > >  how implicit sync works without sufficiently talking with
> > > > other
> > > >  people, maybe we should avoid a repeat of this ...
> > > >  -Daniel
> > > >  
> > > >  >
> > > >  > Marek
> > > >  >
> > > >  > On Tue., Apr. 27, 2021, 08:06 Christian König,
> > > > <ckoenig.leichtzumerken@gmail.com> wrote:
> > > >  >>
> > > >  >> Correct, we wouldn't have synchronization between device
> > > > with
> > > > and without user queues any more.
> > > >  >>
> > > >  >> That could only be a problem for A+I Laptops.
> > > >  >>
> > > >  >> Memory management will just work with preemption fences
> > > > which
> > > > pause the user queues of a process before evicting something.
> > > > That will be a dma_fence, but also a well known approach.
> > > >  >>
> > > >  >> Christian.
> > > >  >>
> > > >  >> Am 27.04.21 um 13:49 schrieb Marek Olšák:
> > > >  >>
> > > >  >> If we don't use future fences for DMA fences at all, e.g.
> > > > we
> > > > don't use them for memory management, it can work, right?
> > > > Memory
> > > > management can suspend user queues anytime. It doesn't need to
> > > > use DMA fences. There might be something that I'm missing here.
> > > >  >>
> > > >  >> What would we lose without DMA fences? Just inter-device
> > > > synchronization? I think that might be acceptable.
> > > >  >>
> > > >  >> The only case when the kernel will wait on a future fence
> > > > is
> > > > before a page flip. Everything today already depends on
> > > > userspace
> > > > not hanging the gpu, which makes everything a future fence.
> > > >  >>
> > > >  >> Marek
> > > >  >>
> > > >  >> On Tue., Apr. 27, 2021, 04:02 Daniel Vetter,
> > > > <daniel@ffwll.ch> wrote:
> > > >  >>>
> > > >  >>> On Mon, Apr 26, 2021 at 04:59:28PM -0400, Marek Olšák
> > > > wrote:
> > > >  >>> > Thanks everybody. The initial proposal is dead. Here are
> > > > some thoughts on
> > > >  >>> > how to do it differently.
> > > >  >>> >
> > > >  >>> > I think we can have direct command submission from
> > > > userspace via
> > > >  >>> > memory-mapped queues ("user queues") without changing
> > > > window systems.
> > > >  >>> >
> > > >  >>> > The memory management doesn't have to use GPU page
> > > > faults
> > > > like HMM.
> > > >  >>> > Instead, it can wait for user queues of a specific
> > > > process
> > > > to go idle and
> > > >  >>> > then unmap the queues, so that userspace can't submit
> > > > anything. Buffer
> > > >  >>> > evictions, pinning, etc. can be executed when all queues
> > > > are unmapped
> > > >  >>> > (suspended). Thus, no BO fences and page faults are
> > > > needed.
> > > >  >>> >
> > > >  >>> > Inter-process synchronization can use timeline
> > > > semaphores.
> > > > Userspace will
> > > >  >>> > query the wait and signal value for a shared buffer from
> > > > the kernel. The
> > > >  >>> > kernel will keep a history of those queries to know
> > > > which
> > > > process is
> > > >  >>> > responsible for signalling which buffer. There is only
> > > > the
> > > > wait-timeout
> > > >  >>> > issue and how to identify the culprit. One of the
> > > > solutions is to have the
> > > >  >>> > GPU send all GPU signal commands and all timed out wait
> > > > commands via an
> > > >  >>> > interrupt to the kernel driver to monitor and validate
> > > > userspace behavior.
> > > >  >>> > With that, it can be identified whether the culprit is
> > > > the
> > > > waiting process
> > > >  >>> > or the signalling process and which one. Invalid
> > > > signal/wait parameters can
> > > >  >>> > also be detected. The kernel can force-signal only the
> > > > semaphores that time
> > > >  >>> > out, and punish the processes which caused the timeout
> > > > or
> > > > used invalid
> > > >  >>> > signal/wait parameters.
> > > >  >>> >
> > > >  >>> > The question is whether this synchronization solution is
> > > > robust enough for
> > > >  >>> > dma_fence and whatever the kernel and window systems
> > > > need.
> > > >  >>>
> > > >  >>> The proper model here is the preempt-ctx dma_fence that
> > > > amdkfd uses
> > > >  >>> (without page faults). That means dma_fence for
> > > > synchronization is doa, at
> > > >  >>> least as-is, and we're back to figuring out the winsys
> > > > problem.
> > > >  >>>
> > > >  >>> "We'll solve it with timeouts" is very tempting, but
> > > > doesn't
> > > > work. It's
> > > >  >>> akin to saying that we're solving deadlock issues in a
> > > > locking design by
> > > >  >>> doing a global s/mutex_lock/mutex_lock_timeout/ in the
> > > > kernel. Sure it
> > > >  >>> avoids having to reach the reset button, but that's about
> > > > it.
> > > >  >>>
> > > >  >>> And the fundamental problem is that once you throw in
> > > > userspace command
> > > >  >>> submission (and syncing, at least within the userspace
> > > > driver, otherwise
> > > >  >>> there's kinda no point if you still need the kernel for
> > > > cross-engine sync)
> > > >  >>> means you get deadlocks if you still use dma_fence for
> > > > sync
> > > > under
> > > >  >>> perfectly legit use-case. We've discussed that one ad
> > > > nauseam last summer:
> > > >  >>>
> > > >  >>>
> > > >  
> > > > https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_fence#indefinite-dma-fences
> > > >  >>>
> > > >  >>> See silly diagramm at the bottom.
> > > >  >>>
> > > >  >>> Now I think all isn't lost, because imo the first step to
> > > > getting to this
> > > >  >>> brave new world is rebuilding the driver on top of
> > > > userspace
> > > > fences, and
> > > >  >>> with the adjusted cmd submit model. You probably don't
> > > > want
> > > > to use amdkfd,
> > > >  >>> but port that as a context flag or similar to render nodes
> > > > for gl/vk. Of
> > > >  >>> course that means you can only use this mode in headless,
> > > > without
> > > >  >>> glx/wayland winsys support, but it's a start.
> > > >  >>> -Daniel
> > > >  >>>
> > > >  >>> >
> > > >  >>> > Marek
> > > >  >>> >
> > > >  >>> > On Tue, Apr 20, 2021 at 4:34 PM Daniel Stone
> > > > <daniel@fooishbar.org> wrote:
> > > >  >>> >
> > > >  >>> > > Hi,
> > > >  >>> > >
> > > >  >>> > > On Tue, 20 Apr 2021 at 20:30, Daniel Vetter
> > > > <daniel@ffwll.ch> wrote:
> > > >  >>> > >
> > > >  >>> > >> The thing is, you can't do this in drm/scheduler. At
> > > > least not without
> > > >  >>> > >> splitting up the dma_fence in the kernel into
> > > > separate
> > > > memory fences
> > > >  >>> > >> and sync fences
> > > >  >>> > >
> > > >  >>> > >
> > > >  >>> > > I'm starting to think this thread needs its own
> > > > glossary
> > > > ...
> > > >  >>> > >
> > > >  >>> > > I propose we use 'residency fence' for execution
> > > > fences
> > > > which enact
> > > >  >>> > > memory-residency operations, e.g. faulting in a page
> > > > ultimately depending
> > > >  >>> > > on GPU work retiring.
> > > >  >>> > >
> > > >  >>> > > And 'value fence' for the pure-userspace model
> > > > suggested
> > > > by timeline
> > > >  >>> > > semaphores, i.e. fences being (*addr == val) rather
> > > > than
> > > > being able to look
> > > >  >>> > > at ctx seqno.
> > > >  >>> > >
> > > >  >>> > > Cheers,
> > > >  >>> > > Daniel
> > > >  >>> > > _______________________________________________
> > > >  >>> > > mesa-dev mailing list
> > > >  >>> > > mesa-dev@lists.freedesktop.org
> > > >  >>> > >  
> > > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> > > >  >>> > >
> > > >  >>>
> > > >  >>> --
> > > >  >>> Daniel Vetter
> > > >  >>> Software Engineer, Intel Corporation
> > > >  >>> http://blog.ffwll.ch
> > > >  >>
> > > >  >>
> > > >  >> _______________________________________________
> > > >  >> mesa-dev mailing list
> > > >  >> mesa-dev@lists.freedesktop.org
> > > >  >> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> > > >  >>
> > > >  >>
> > > >  
> > > >  
> > > >  -- 
> > > >  Daniel Vetter
> > > >  Software Engineer, Intel Corporation
> > > >  http://blog.ffwll.ch
> > > > 
> >  
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 17:31                                 ` Lucas Stach
@ 2021-04-27 17:35                                   ` Simon Ser
  2021-04-27 18:01                                     ` Alex Deucher
  2021-04-27 19:41                                   ` Jason Ekstrand
  1 sibling, 1 reply; 105+ messages in thread
From: Simon Ser @ 2021-04-27 17:35 UTC (permalink / raw)
  To: Lucas Stach
  Cc: Christian König, dri-devel, Marek Olšák, ML Mesa-dev

On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:

> > Ok. So that would only make the following use cases broken for now:
> >
> > - amd render -> external gpu
> > - amd video encode -> network device
>
> FWIW, "only" breaking amd render -> external gpu will make us pretty
> unhappy

I concur. I have quite a few users with a multi-GPU setup involving
AMD hardware.

Note, if this brokenness can't be avoided, I'd prefer to get a clear
error, and not bad results on screen because nothing is synchronized
anymore.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 17:35                                   ` Simon Ser
@ 2021-04-27 18:01                                     ` Alex Deucher
  2021-04-27 18:27                                       ` Simon Ser
  2021-04-28 10:05                                       ` Daniel Vetter
  0 siblings, 2 replies; 105+ messages in thread
From: Alex Deucher @ 2021-04-27 18:01 UTC (permalink / raw)
  To: Simon Ser
  Cc: Christian König, ML Mesa-dev, Marek Olšák, dri-devel

On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
>
> On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
>
> > > Ok. So that would only make the following use cases broken for now:
> > >
> > > - amd render -> external gpu
> > > - amd video encode -> network device
> >
> > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > unhappy
>
> I concur. I have quite a few users with a multi-GPU setup involving
> AMD hardware.
>
> Note, if this brokenness can't be avoided, I'd prefer to get a clear
> error, and not bad results on screen because nothing is synchronized
> anymore.

It's an upcoming requirement for Windows[1], so you are likely to
start seeing this across all GPU vendors that support Windows.  I
think the timing depends on how long the legacy hardware support
sticks around for each vendor.

Alex


[1] - https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/


> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 18:01                                     ` Alex Deucher
@ 2021-04-27 18:27                                       ` Simon Ser
  2021-04-28 10:01                                         ` Daniel Vetter
  2021-04-28 10:05                                       ` Daniel Vetter
  1 sibling, 1 reply; 105+ messages in thread
From: Simon Ser @ 2021-04-27 18:27 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Christian König, ML Mesa-dev, Marek Olšák, dri-devel

On Tuesday, April 27th, 2021 at 8:01 PM, Alex Deucher <alexdeucher@gmail.com> wrote:

> It's an upcoming requirement for Windows[1], so you are likely to
> start seeing this across all GPU vendors that support Windows. I
> think the timing depends on how long the legacy hardware support
> sticks around for each vendor.

Hm, okay.

Will using the existing explicit synchronization APIs make it work
properly? (e.g. IN_FENCE_FD + OUT_FENCE_PTR in KMS, EGL_KHR_fence_sync +
EGL_ANDROID_native_fence_sync + EGL_KHR_wait_sync in EGL)
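
For reference, the KMS half of those existing explicit-sync APIs looks
roughly like the sketch below. It assumes the plane/CRTC IDs and the
IN_FENCE_FD / OUT_FENCE_PTR property IDs have already been looked up
(e.g. via drmModeObjectGetProperties()), and it leaves out the FB_ID and
mode setup that a real atomic commit also needs; error handling is
omitted.

    #include <stdint.h>
    #include <xf86drm.h>
    #include <xf86drmMode.h>

    /* Sketch only: explicit sync in a KMS atomic commit. */
    static int commit_with_fences(int drm_fd,
                                  uint32_t plane_id, uint32_t in_fence_prop,
                                  uint32_t crtc_id, uint32_t out_fence_prop,
                                  int render_done_fd, int *out_fence_fd)
    {
        int64_t out_fence = -1;
        drmModeAtomicReq *req = drmModeAtomicAlloc();

        /* GPU -> KMS: don't scan out before rendering has finished. */
        drmModeAtomicAddProperty(req, plane_id, in_fence_prop,
                                 (uint64_t)render_done_fd);
        /* KMS -> GPU: fence that signals once the previous buffer is
         * no longer being scanned out. */
        drmModeAtomicAddProperty(req, crtc_id, out_fence_prop,
                                 (uint64_t)(uintptr_t)&out_fence);

        int ret = drmModeAtomicCommit(drm_fd, req,
                                      DRM_MODE_ATOMIC_NONBLOCK, NULL);
        drmModeAtomicFree(req);

        *out_fence_fd = (int)out_fence;
        return ret;
    }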
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 12:06                     ` Christian König
  2021-04-27 12:11                       ` Marek Olšák
@ 2021-04-27 18:38                       ` Dave Airlie
  2021-04-27 19:23                         ` Marek Olšák
  2021-04-27 20:49                         ` Jason Ekstrand
  1 sibling, 2 replies; 105+ messages in thread
From: Dave Airlie @ 2021-04-27 18:38 UTC (permalink / raw)
  To: Christian König; +Cc: ML Mesa-dev, dri-devel, Marek Olšák

On Tue, 27 Apr 2021 at 22:06, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Correct, we wouldn't have synchronization between device with and without user queues any more.
>
> That could only be a problem for A+I Laptops.

Since I think you mentioned you'd only be enabling this on newer
chipsets, won't it be a problem for A+A where one A is a generation
behind the other?

I'm not really liking where this is going, btw; it seems like an
ill-thought-out concept. If AMD is really going down the road of designing
hw that is currently Linux-incompatible, you are going to have to
accept a big part of the burden of bringing this support into more
than just amd drivers for upcoming generations of gpu.

Dave.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 18:38                       ` Dave Airlie
@ 2021-04-27 19:23                         ` Marek Olšák
  2021-04-28  6:59                           ` Christian König
  2021-04-27 20:49                         ` Jason Ekstrand
  1 sibling, 1 reply; 105+ messages in thread
From: Marek Olšák @ 2021-04-27 19:23 UTC (permalink / raw)
  To: Dave Airlie; +Cc: Christian König, dri-devel, ML Mesa-dev


[-- Attachment #1.1: Type: text/plain, Size: 1152 bytes --]

Supporting interop with any device is always possible. It depends on which
drivers we need to interoperate with and on updating them. We've already found
the path forward for amdgpu. We just need to find out how many other
drivers need to be updated and evaluate the cost/benefit aspect.

Marek

On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com> wrote:

> On Tue, 27 Apr 2021 at 22:06, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
> >
> > Correct, we wouldn't have synchronization between device with and
> without user queues any more.
> >
> > That could only be a problem for A+I Laptops.
>
> Since I think you mentioned you'd only be enabling this on newer
> chipsets, won't it be a problem for A+A where one A is a generation
> behind the other?
>
> I'm not really liking where this is going btw, seems like a ill
> thought out concept, if AMD is really going down the road of designing
> hw that is currently Linux incompatible, you are going to have to
> accept a big part of the burden in bringing this support in to more
> than just amd drivers for upcoming generations of gpu.
>
> Dave.
>

[-- Attachment #1.2: Type: text/html, Size: 1604 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 17:31                                 ` Lucas Stach
  2021-04-27 17:35                                   ` Simon Ser
@ 2021-04-27 19:41                                   ` Jason Ekstrand
  2021-04-27 21:58                                     ` Marek Olšák
  1 sibling, 1 reply; 105+ messages in thread
From: Jason Ekstrand @ 2021-04-27 19:41 UTC (permalink / raw)
  To: Lucas Stach
  Cc: Christian König, dri-devel, Marek Olšák, ML Mesa-dev

Trying to figure out which e-mail in this mess is the right one to reply to....

On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach <l.stach@pengutronix.de> wrote:
>
> Hi,
>
> Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
> > Ok. So that would only make the following use cases broken for now:
> > - amd render -> external gpu

Assuming said external GPU doesn't support memory fences.  If we do
amdgpu and i915 at the same time, that covers basically most of the
external GPU use-cases.  Of course, we'd want to convert nouveau as
well for the rest.

> > - amd video encode -> network device
>
> FWIW, "only" breaking amd render -> external gpu will make us pretty
> unhappy, as we have some cases where we are combining an AMD APU with a
> FPGA based graphics card. I can't go into the specifics of this use-
> case too much but basically the AMD graphics is rendering content that
> gets composited on top of a live video pipeline running through the
> FPGA.

I think it's worth taking a step back and asking what's being proposed
here before we freak out too much.  If we do go this route, it doesn't mean
that your FPGA use-case can't work, it just means it won't work
out of the box anymore.  You'll have to separate execution and memory
dependencies inside your FPGA driver.  That's still not great, but it's
not as bad as you maybe made it sound.

> > What about the case when we get a buffer from an external device and
> > we're supposed to make it "busy" when we are using it, and the
> > external device wants to wait until we stop using it? Is it something
> > that can happen, thus turning "external -> amd" into "external <->
> > amd"?
>
> Zero-copy texture sampling from a video input certainly appreciates
> this very much. Trying to pass the render fence through the various
> layers of userspace to be able to tell when the video input can reuse a
> buffer is a great experience in yak shaving. Allowing the video input
> to reuse the buffer as soon as the read dma_fence from the GPU is
> signaled is much more straight forward.

Oh, it's definitely worse than that.  Every window system interaction
is bi-directional.  The X server has to wait on the client before
compositing from it and the client has to wait on X before re-using
that back-buffer.  Of course, we can break that latter dependency by
doing a full CPU wait but that's going to mean either more latency or
reserving more back buffers.  There's no good clean way to claim that
any of this is one-directional.

--Jason
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 18:38                       ` Dave Airlie
  2021-04-27 19:23                         ` Marek Olšák
@ 2021-04-27 20:49                         ` Jason Ekstrand
  1 sibling, 0 replies; 105+ messages in thread
From: Jason Ekstrand @ 2021-04-27 20:49 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Christian König, Marek Olšák, dri-devel, ML Mesa-dev

On Tue, Apr 27, 2021 at 1:38 PM Dave Airlie <airlied@gmail.com> wrote:
>
> On Tue, 27 Apr 2021 at 22:06, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
> >
> > Correct, we wouldn't have synchronization between device with and without user queues any more.
> >
> > That could only be a problem for A+I Laptops.
>
> Since I think you mentioned you'd only be enabling this on newer
> chipsets, won't it be a problem for A+A where one A is a generation
> behind the other?
>
> I'm not really liking where this is going btw, seems like a ill
> thought out concept, if AMD is really going down the road of designing
> hw that is currently Linux incompatible, you are going to have to
> accept a big part of the burden in bringing this support in to more
> than just amd drivers for upcoming generations of gpu.

In case my previous e-mail sounded too enthusiastic, I'm also pensive
about this direction.  I'm not sure I'm ready to totally give up on
all of Linux WSI just yet.  We definitely want to head towards memory
fences and direct submission but I'm not convinced that throwing out
all of interop is necessary.  It's certainly a very big hammer and we
should try to figure out something less destructive, if that's
possible.  (I don't know for sure that it is.)

--Jason
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 19:41                                   ` Jason Ekstrand
@ 2021-04-27 21:58                                     ` Marek Olšák
  2021-04-28  4:01                                       ` Jason Ekstrand
  0 siblings, 1 reply; 105+ messages in thread
From: Marek Olšák @ 2021-04-27 21:58 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Christian König, ML Mesa-dev, dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 3654 bytes --]

Jason, both memory-based signalling as well as interrupt-based signalling
to the CPU would be supported by amdgpu. External devices don't need to
support memory-based sync objects. The only limitation is that they can't
convert amdgpu sync objects to dma_fence.
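
To illustrate the asymmetry mentioned earlier in the thread, the
kernel-to-user-queue direction is the easy one: a dma_fence callback can
simply write the memory value a user queue's native wait packet is already
polling on. The reverse, turning a user-queue fence into a dma_fence, has
no bounded completion guarantee, which is the whole problem. This is only
a rough kernel-side sketch with made-up names, not amdgpu code:

    #include <linux/dma-fence.h>
    #include <linux/kernel.h>

    struct fence_to_queue_bridge {
        struct dma_fence_cb cb;
        u64 *wait_addr;   /* memory the user queue's wait packet polls */
        u64 wait_value;   /* value that unblocks the queue             */
    };

    static void bridge_fence_signalled(struct dma_fence *fence,
                                       struct dma_fence_cb *cb)
    {
        struct fence_to_queue_bridge *b =
            container_of(cb, struct fence_to_queue_bridge, cb);

        WRITE_ONCE(*b->wait_addr, b->wait_value);
    }

    /* Registered with: dma_fence_add_callback(fence, &b->cb,
     *                                         bridge_fence_signalled); */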

The sad thing is that "external -> amdgpu" dependencies are really
"external <-> amdgpu" dependencies due to mutually-exclusive access
required by non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the
only interop that would initially work with those buffers. Explicitly
sync'd buffers also won't work if other drivers convert explicit fences to
dma_fence. Thus, both implicit sync and explicit sync might not work with
other drivers at all. The only interop that would initially work is
explicit fences with memory-based waiting and signalling on the external
device to keep the kernel out of the picture.

Marek


On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand <jason@jlekstrand.net> wrote:

> Trying to figure out which e-mail in this mess is the right one to reply
> to....
>
> On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach <l.stach@pengutronix.de>
> wrote:
> >
> > Hi,
> >
> > Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
> > > Ok. So that would only make the following use cases broken for now:
> > > - amd render -> external gpu
>
> Assuming said external GPU doesn't support memory fences.  If we do
> amdgpu and i915 at the same time, that covers basically most of the
> external GPU use-cases.  Of course, we'd want to convert nouveau as
> well for the rest.
>
> > > - amd video encode -> network device
> >
> > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > unhappy, as we have some cases where we are combining an AMD APU with a
> > FPGA based graphics card. I can't go into the specifics of this use-
> > case too much but basically the AMD graphics is rendering content that
> > gets composited on top of a live video pipeline running through the
> > FPGA.
>
> I think it's worth taking a step back and asking what's being here
> before we freak out too much.  If we do go this route, it doesn't mean
> that your FPGA use-case can't work, it just means it won't work
> out-of-the box anymore.  You'll have to separate execution and memory
> dependencies inside your FPGA driver.  That's still not great but it's
> not as bad as you maybe made it sound.
>
> > > What about the case when we get a buffer from an external device and
> > > we're supposed to make it "busy" when we are using it, and the
> > > external device wants to wait until we stop using it? Is it something
> > > that can happen, thus turning "external -> amd" into "external <->
> > > amd"?
> >
> > Zero-copy texture sampling from a video input certainly appreciates
> > this very much. Trying to pass the render fence through the various
> > layers of userspace to be able to tell when the video input can reuse a
> > buffer is a great experience in yak shaving. Allowing the video input
> > to reuse the buffer as soon as the read dma_fence from the GPU is
> > signaled is much more straight forward.
>
> Oh, it's definitely worse than that.  Every window system interaction
> is bi-directional.  The X server has to wait on the client before
> compositing from it and the client has to wait on X before re-using
> that back-buffer.  Of course, we can break that later dependency by
> doing a full CPU wait but that's going to mean either more latency or
> reserving more back buffers.  There's no good clean way to claim that
> any of this is one-directional.
>
> --Jason
>

[-- Attachment #1.2: Type: text/html, Size: 4458 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 21:58                                     ` Marek Olšák
@ 2021-04-28  4:01                                       ` Jason Ekstrand
  2021-04-28  5:19                                         ` Marek Olšák
  0 siblings, 1 reply; 105+ messages in thread
From: Jason Ekstrand @ 2021-04-28  4:01 UTC (permalink / raw)
  To: Marek Olšák; +Cc: Christian König, ML Mesa-dev, dri-devel

On Tue, Apr 27, 2021 at 4:59 PM Marek Olšák <maraeo@gmail.com> wrote:
>
> Jason, both memory-based signalling as well as interrupt-based signalling to the CPU would be supported by amdgpu. External devices don't need to support memory-based sync objects. The only limitation is that they can't convert amdgpu sync objects to dma_fence.

Sure.  I'm not worried about the mechanism.  We just need a word that
means "the new fence thing" and I've been throwing "memory fence"
around for that.  Other mechanisms may work as well.
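
For the glossary's sake, the "memory fence" / "value fence" model being
discussed boils down to comparing a shared payload against a target value.
A minimal userspace sketch, with nothing here being real API:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* A "memory fence" in this discussion: signalled once the shared
     * 64-bit payload reaches the target value. Timeline semantics are
     * assumed, i.e. the payload only ever increases. */
    struct memory_fence {
        _Atomic uint64_t *payload;   /* shared between producer/consumer */
        uint64_t value;              /* target value to wait for         */
    };

    static bool memory_fence_signalled(const struct memory_fence *f)
    {
        return atomic_load_explicit(f->payload,
                                    memory_order_acquire) >= f->value;
    }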

> The sad thing is that "external -> amdgpu" dependencies are really "external <-> amdgpu" dependencies due to mutually-exclusive access required by non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the only interop that would initially work with those buffers. Explicitly sync'd buffers also won't work if other drivers convert explicit fences to dma_fence. Thus, both implicit sync and explicit sync might not work with other drivers at all. The only interop that would initially work is explicit fences with memory-based waiting and signalling on the external device to keep the kernel out of the picture.

Yup.  This is where things get hard.  That said, I'm not quite ready
to give up on memory/interrupt fences just yet.

One thought that came to mind which might help would be if we added an
extremely strict concept of memory ownership.  The idea would be that
any given BO would be in one of two states at any given time:

 1. legacy: dma_fences and implicit sync work as normal, but the BO cannot
be resident in any "modern" (direct submission, ULLS, whatever you
want to call it) context

 2. modern: In this mode they should not be used by any legacy
context.  We can't strictly prevent this, unfortunately, but maybe we
can say reading produces garbage and writes may be discarded.  In this
mode, they can be bound to modern contexts.

In theory, when in "modern" mode, you could bind the same buffer in
multiple modern contexts at a time.  However, when that's the case, it
makes ownership really tricky to track.  Therefore, we might want some
sort of dma-buf create flag for "always modern" vs. "switchable" and
only allow binding to one modern context at a time when it's
switchable.

If we did this, we may be able to move any dma_fence shenanigans to
the ownership transition points.  We'd still need some sort of "wait
for fence and transition" which has a timeout.  However, then we'd be
fairly well guaranteed that the application (not just Mesa!) has
really and truly decided it's done with the buffer and we wouldn't (I
hope!) end up with the accidental edges in the dependency graph.
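
Restated as code, purely to make the idea concrete; every name below is
made up, and the fence-wait helper is a stub standing in for "wait for all
attached dma_fences with a timeout":

    #include <errno.h>
    #include <stdbool.h>

    enum bo_sync_mode {
        BO_MODE_LEGACY,   /* dma_fence + implicit sync, no modern contexts */
        BO_MODE_MODERN,   /* direct-submission contexts, no dma_fence      */
    };

    struct bo_ownership {
        enum bo_sync_mode mode;
        bool switchable;  /* "switchable" vs. "always modern" create flag  */
    };

    /* Stub: stands in for waiting on all legacy dma_fences with a timeout. */
    static bool wait_for_legacy_fences(struct bo_ownership *bo, int timeout_ms)
    {
        (void)bo; (void)timeout_ms;
        return true;
    }

    /* All dma_fence shenanigans are confined to this transition point. */
    static int bo_transition_to_modern(struct bo_ownership *bo, int timeout_ms)
    {
        if (bo->mode == BO_MODE_MODERN)
            return 0;
        if (!bo->switchable)
            return -EINVAL;
        if (!wait_for_legacy_fences(bo, timeout_ms))
            return -ETIMEDOUT;   /* app has not really let go of the BO */
        bo->mode = BO_MODE_MODERN;
        return 0;
    }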

Of course, I've not yet proven any of this correct so feel free to
tell me why it won't work. :-)  It was just one of those "about to go
to bed and had a thunk" type thoughts.

--Jason

P.S.  Daniel was 100% right when he said this discussion needs a glossary.


> Marek
>
>
> On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>>
>> Trying to figure out which e-mail in this mess is the right one to reply to....
>>
>> On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach <l.stach@pengutronix.de> wrote:
>> >
>> > Hi,
>> >
>> > Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
>> > > Ok. So that would only make the following use cases broken for now:
>> > > - amd render -> external gpu
>>
>> Assuming said external GPU doesn't support memory fences.  If we do
>> amdgpu and i915 at the same time, that covers basically most of the
>> external GPU use-cases.  Of course, we'd want to convert nouveau as
>> well for the rest.
>>
>> > > - amd video encode -> network device
>> >
>> > FWIW, "only" breaking amd render -> external gpu will make us pretty
>> > unhappy, as we have some cases where we are combining an AMD APU with a
>> > FPGA based graphics card. I can't go into the specifics of this use-
>> > case too much but basically the AMD graphics is rendering content that
>> > gets composited on top of a live video pipeline running through the
>> > FPGA.
>>
>> I think it's worth taking a step back and asking what's being here
>> before we freak out too much.  If we do go this route, it doesn't mean
>> that your FPGA use-case can't work, it just means it won't work
>> out-of-the box anymore.  You'll have to separate execution and memory
>> dependencies inside your FPGA driver.  That's still not great but it's
>> not as bad as you maybe made it sound.
>>
>> > > What about the case when we get a buffer from an external device and
>> > > we're supposed to make it "busy" when we are using it, and the
>> > > external device wants to wait until we stop using it? Is it something
>> > > that can happen, thus turning "external -> amd" into "external <->
>> > > amd"?
>> >
>> > Zero-copy texture sampling from a video input certainly appreciates
>> > this very much. Trying to pass the render fence through the various
>> > layers of userspace to be able to tell when the video input can reuse a
>> > buffer is a great experience in yak shaving. Allowing the video input
>> > to reuse the buffer as soon as the read dma_fence from the GPU is
>> > signaled is much more straight forward.
>>
>> Oh, it's definitely worse than that.  Every window system interaction
>> is bi-directional.  The X server has to wait on the client before
>> compositing from it and the client has to wait on X before re-using
>> that back-buffer.  Of course, we can break that later dependency by
>> doing a full CPU wait but that's going to mean either more latency or
>> reserving more back buffers.  There's no good clean way to claim that
>> any of this is one-directional.
>>
>> --Jason
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28  4:01                                       ` Jason Ekstrand
@ 2021-04-28  5:19                                         ` Marek Olšák
  0 siblings, 0 replies; 105+ messages in thread
From: Marek Olšák @ 2021-04-28  5:19 UTC (permalink / raw)
  To: Jason Ekstrand; +Cc: Christian König, ML Mesa-dev, dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 7111 bytes --]

On Wed., Apr. 28, 2021, 00:01 Jason Ekstrand, <jason@jlekstrand.net> wrote:

> On Tue, Apr 27, 2021 at 4:59 PM Marek Olšák <maraeo@gmail.com> wrote:
> >
> > Jason, both memory-based signalling as well as interrupt-based
> signalling to the CPU would be supported by amdgpu. External devices don't
> need to support memory-based sync objects. The only limitation is that they
> can't convert amdgpu sync objects to dma_fence.
>
> Sure.  I'm not worried about the mechanism.  We just need a word that
> means "the new fence thing" and I've been throwing "memory fence"
> around for that.  Other mechanisms may work as well.
>
> > The sad thing is that "external -> amdgpu" dependencies are really
> "external <-> amdgpu" dependencies due to mutually-exclusive access
> required by non-explicitly-sync'd buffers, so amdgpu-amdgpu interop is the
> only interop that would initially work with those buffers. Explicitly
> sync'd buffers also won't work if other drivers convert explicit fences to
> dma_fence. Thus, both implicit sync and explicit sync might not work with
> other drivers at all. The only interop that would initially work is
> explicit fences with memory-based waiting and signalling on the external
> device to keep the kernel out of the picture.
>
> Yup.  This is where things get hard.  That said, I'm not quite ready
> to give up on memory/interrupt fences just yet.
>
> One thought that came to mind which might help would be if we added an
> extremely strict concept of memory ownership.  The idea would be that
> any given BO would be in one of two states at any given time:
>
>  1. legacy: dma_fences and implicit sync works as normal but it cannot
> be resident in any "modern" (direct submission, ULLS, whatever you
> want to call it) context
>
>  2. modern: In this mode they should not be used by any legacy
> context.  We can't strictly prevent this, unfortunately, but maybe we
> can say reading produces garbage and writes may be discarded.  In this
> mode, they can be bound to modern contexts.
>
> In theory, when in "modern" mode, you could bind the same buffer in
> multiple modern contexts at a time.  However, when that's the case, it
> makes ownership really tricky to track.  Therefore, we might want some
> sort of dma-buf create flag for "always modern" vs. "switchable" and
> only allow binding to one modern context at a time when it's
> switchable.
>
> If we did this, we may be able to move any dma_fence shenanigans to
> the ownership transition points.  We'd still need some sort of "wait
> for fence and transition" which has a timeout.  However, then we'd be
> fairly well guaranteed that the application (not just Mesa!) has
> really and truly decided it's done with the buffer and we wouldn't (I
> hope!) end up with the accidental edges in the dependency graph.
>
> Of course, I've not yet proven any of this correct so feel free to
> tell me why it won't work. :-)  It was just one of those "about to go
> to bed and had a thunk" type thoughts.
>

We'd like to keep userspace outside of Mesa drivers intact and working
except for interop where we don't have much choice. At the same time,
future hw may remove support for kernel queues, so we might not have much
choice there either, depending on what the hw interface will look like.

The idea is to have an ioctl for querying a timeline semaphore buffer
associated with a shared BO, and an ioctl for querying the next wait and
signal number (e.g. n and n+1) for that semaphore. Waiting for n would be
like mutex lock and signaling would be like mutex unlock. The next process
would use the same ioctl and get n+1 and n+2, etc. There is a potential
deadlock because one process can do lock A, lock B, while another does
lock B, lock A; this can be prevented by having the ioctl return the
numbers for multiple buffers at once. This solution needs
no changes to userspace outside of Mesa drivers, and we'll also keep the BO
wait ioctl for GPU-CPU sync.
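
A rough sketch of what such an interface could look like is below. Every
struct and ioctl name is hypothetical; the only point is that one call
hands out the (wait, signal) pairs for several shared BOs atomically, so
two processes can never acquire them in opposite orders:

    #include <stdint.h>

    /* Hypothetical uAPI sketch, not actual amdgpu structures. */
    struct drm_bo_sync_point {
        uint32_t bo_handle;     /* in:  shared BO handle                   */
        uint32_t pad;
        uint64_t wait_value;    /* out: n   - wait until semaphore >= n    */
        uint64_t signal_value;  /* out: n+1 - write when done with the BO  */
    };

    struct drm_bo_sync_query {
        uint64_t points_ptr;    /* in: userspace pointer to an array of    */
        uint32_t num_points;    /*     drm_bo_sync_point, filled as one    */
        uint32_t pad;           /*     atomic batch by the kernel          */
    };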

Marek


> --Jason
>
> P.S.  Daniel was 100% right when he said this discussion needs a glossary.
>
>
> > Marek
> >
> >
> > On Tue, Apr 27, 2021 at 3:41 PM Jason Ekstrand <jason@jlekstrand.net>
> wrote:
> >>
> >> Trying to figure out which e-mail in this mess is the right one to
> reply to....
> >>
> >> On Tue, Apr 27, 2021 at 12:31 PM Lucas Stach <l.stach@pengutronix.de>
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > Am Dienstag, dem 27.04.2021 um 09:26 -0400 schrieb Marek Olšák:
> >> > > Ok. So that would only make the following use cases broken for now:
> >> > > - amd render -> external gpu
> >>
> >> Assuming said external GPU doesn't support memory fences.  If we do
> >> amdgpu and i915 at the same time, that covers basically most of the
> >> external GPU use-cases.  Of course, we'd want to convert nouveau as
> >> well for the rest.
> >>
> >> > > - amd video encode -> network device
> >> >
> >> > FWIW, "only" breaking amd render -> external gpu will make us pretty
> >> > unhappy, as we have some cases where we are combining an AMD APU with
> a
> >> > FPGA based graphics card. I can't go into the specifics of this use-
> >> > case too much but basically the AMD graphics is rendering content that
> >> > gets composited on top of a live video pipeline running through the
> >> > FPGA.
> >>
> >> I think it's worth taking a step back and asking what's being here
> >> before we freak out too much.  If we do go this route, it doesn't mean
> >> that your FPGA use-case can't work, it just means it won't work
> >> out-of-the box anymore.  You'll have to separate execution and memory
> >> dependencies inside your FPGA driver.  That's still not great but it's
> >> not as bad as you maybe made it sound.
> >>
> >> > > What about the case when we get a buffer from an external device and
> >> > > we're supposed to make it "busy" when we are using it, and the
> >> > > external device wants to wait until we stop using it? Is it
> something
> >> > > that can happen, thus turning "external -> amd" into "external <->
> >> > > amd"?
> >> >
> >> > Zero-copy texture sampling from a video input certainly appreciates
> >> > this very much. Trying to pass the render fence through the various
> >> > layers of userspace to be able to tell when the video input can reuse
> a
> >> > buffer is a great experience in yak shaving. Allowing the video input
> >> > to reuse the buffer as soon as the read dma_fence from the GPU is
> >> > signaled is much more straight forward.
> >>
> >> Oh, it's definitely worse than that.  Every window system interaction
> >> is bi-directional.  The X server has to wait on the client before
> >> compositing from it and the client has to wait on X before re-using
> >> that back-buffer.  Of course, we can break that later dependency by
> >> doing a full CPU wait but that's going to mean either more latency or
> >> reserving more back buffers.  There's no good clean way to claim that
> >> any of this is one-directional.
> >>
> >> --Jason
>

[-- Attachment #1.2: Type: text/html, Size: 8949 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 19:23                         ` Marek Olšák
@ 2021-04-28  6:59                           ` Christian König
  2021-04-28  9:07                             ` Michel Dänzer
  2021-04-28  9:54                             ` Daniel Vetter
  0 siblings, 2 replies; 105+ messages in thread
From: Christian König @ 2021-04-28  6:59 UTC (permalink / raw)
  To: Marek Olšák, Dave Airlie; +Cc: ML Mesa-dev, dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 1635 bytes --]

Hi Dave,

Am 27.04.21 um 21:23 schrieb Marek Olšák:
> Supporting interop with any device is always possible. It depends on 
> which drivers we need to interoperate with and update them. We've 
> already found the path forward for amdgpu. We just need to find out 
> how many other drivers need to be updated and evaluate the 
> cost/benefit aspect.
>
> Marek
>
> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com 
> <mailto:airlied@gmail.com>> wrote:
>
>     On Tue, 27 Apr 2021 at 22:06, Christian König
>     <ckoenig.leichtzumerken@gmail.com
>     <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>     >
>     > Correct, we wouldn't have synchronization between device with
>     and without user queues any more.
>     >
>     > That could only be a problem for A+I Laptops.
>
>     Since I think you mentioned you'd only be enabling this on newer
>     chipsets, won't it be a problem for A+A where one A is a generation
>     behind the other?
>

Crap, that is a good point as well.

>
>     I'm not really liking where this is going btw, seems like a ill
>     thought out concept, if AMD is really going down the road of designing
>     hw that is currently Linux incompatible, you are going to have to
>     accept a big part of the burden in bringing this support in to more
>     than just amd drivers for upcoming generations of gpu.
>

Well we don't really like that either, but we have no other option as 
far as I can see.

I have a couple of ideas for how to handle this in the kernel without 
dma_fences, but they all require more or less extensive changes to all 
existing drivers.

Christian.

>
>     Dave.
>


[-- Attachment #1.2: Type: text/html, Size: 3578 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28  6:59                           ` Christian König
@ 2021-04-28  9:07                             ` Michel Dänzer
  2021-04-28  9:57                               ` Daniel Vetter
  2021-05-01 22:27                               ` Marek Olšák
  2021-04-28  9:54                             ` Daniel Vetter
  1 sibling, 2 replies; 105+ messages in thread
From: Michel Dänzer @ 2021-04-28  9:07 UTC (permalink / raw)
  To: Christian König, Marek Olšák, Dave Airlie
  Cc: ML Mesa-dev, dri-devel

On 2021-04-28 8:59 a.m., Christian König wrote:
> Hi Dave,
> 
> Am 27.04.21 um 21:23 schrieb Marek Olšák:
>> Supporting interop with any device is always possible. It depends on which drivers we need to interoperate with and update them. We've already found the path forward for amdgpu. We just need to find out how many other drivers need to be updated and evaluate the cost/benefit aspect.
>>
>> Marek
>>
>> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com <mailto:airlied@gmail.com>> wrote:
>>
>>     On Tue, 27 Apr 2021 at 22:06, Christian König
>>     <ckoenig.leichtzumerken@gmail.com <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>     >
>>     > Correct, we wouldn't have synchronization between device with and without user queues any more.
>>     >
>>     > That could only be a problem for A+I Laptops.
>>
>>     Since I think you mentioned you'd only be enabling this on newer
>>     chipsets, won't it be a problem for A+A where one A is a generation
>>     behind the other?
>>
> 
> Crap, that is a good point as well.
> 
>>
>>     I'm not really liking where this is going btw, seems like a ill
>>     thought out concept, if AMD is really going down the road of designing
>>     hw that is currently Linux incompatible, you are going to have to
>>     accept a big part of the burden in bringing this support in to more
>>     than just amd drivers for upcoming generations of gpu.
>>
> 
> Well we don't really like that either, but we have no other option as far as I can see.

I don't really understand what "future hw may remove support for kernel queues" means exactly. While the per-context queues can be mapped to userspace directly, they don't *have* to be, do they? I.e. the kernel driver should be able to either intercept userspace access to the queues, or in the worst case do it all itself, and provide the existing synchronization semantics as needed?

Surely there are resource limits for the per-context queues, so the kernel driver needs to do some kind of virtualization / multiplexing anyway, or we'll get sad user faces when there's no queue available for <current hot game>.

I'm probably missing something though, awaiting enlightenment. :)


-- 
Earthling Michel Dänzer               |               https://redhat.com
Libre software enthusiast             |             Mesa and X developer
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28  6:59                           ` Christian König
  2021-04-28  9:07                             ` Michel Dänzer
@ 2021-04-28  9:54                             ` Daniel Vetter
  1 sibling, 0 replies; 105+ messages in thread
From: Daniel Vetter @ 2021-04-28  9:54 UTC (permalink / raw)
  To: Christian König; +Cc: ML Mesa-dev, dri-devel, Marek Olšák

On Wed, Apr 28, 2021 at 08:59:47AM +0200, Christian König wrote:
> Hi Dave,
> 
> Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > Supporting interop with any device is always possible. It depends on
> > which drivers we need to interoperate with and update them. We've
> > already found the path forward for amdgpu. We just need to find out how
> > many other drivers need to be updated and evaluate the cost/benefit
> > aspect.
> > 
> > Marek
> > 
> > On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com
> > <mailto:airlied@gmail.com>> wrote:
> > 
> >     On Tue, 27 Apr 2021 at 22:06, Christian König
> >     <ckoenig.leichtzumerken@gmail.com
> >     <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
> >     >
> >     > Correct, we wouldn't have synchronization between device with
> >     and without user queues any more.
> >     >
> >     > That could only be a problem for A+I Laptops.
> > 
> >     Since I think you mentioned you'd only be enabling this on newer
> >     chipsets, won't it be a problem for A+A where one A is a generation
> >     behind the other?
> > 
> 
> Crap, that is a good point as well.
> 
> > 
> >     I'm not really liking where this is going btw, seems like a ill
> >     thought out concept, if AMD is really going down the road of designing
> >     hw that is currently Linux incompatible, you are going to have to
> >     accept a big part of the burden in bringing this support in to more
> >     than just amd drivers for upcoming generations of gpu.
> > 
> 
> Well we don't really like that either, but we have no other option as far as
> I can see.
> 
> I have a couple of ideas how to handle this in the kernel without
> dma_fences, but it always require more or less changes to all existing
> drivers.

Yeah one horrible idea is to essentially do the plan we hashed out for
adding userspace fences to drm_syncobj timelines. And then add drm_syncobj
as another implicit fencing thing to dma-buf.

But:
- This is horrible. We're all agreeing that implicit sync is not a great
  idea, building an entire new world on this flawed thing doesn't sound
  like a good path forward.

- It's kernel uapi, so it's going to be forever.

- It's only fixing the correctness issue, since you have to stall for
  future/indefinite fences at the beginning of the CS ioctl. Or at the
  beginning of the atomic modeset ioctl, which kinda defeats the point of
  nonblocking.

- You still have to touch all kmd drivers.

- For performance, you still have to glue a submit thread onto all gl
  drivers.

It is horrendous.
-Daniel

> 
> Christian.
> 
> > 
> >     Dave.
> > 
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28  9:07                             ` Michel Dänzer
@ 2021-04-28  9:57                               ` Daniel Vetter
  2021-05-01 22:27                               ` Marek Olšák
  1 sibling, 0 replies; 105+ messages in thread
From: Daniel Vetter @ 2021-04-28  9:57 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Christian König, dri-devel, Marek Olšák, ML Mesa-dev

On Wed, Apr 28, 2021 at 11:07:09AM +0200, Michel Dänzer wrote:
> On 2021-04-28 8:59 a.m., Christian König wrote:
> > Hi Dave,
> > 
> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >> Supporting interop with any device is always possible. It depends on which drivers we need to interoperate with and update them. We've already found the path forward for amdgpu. We just need to find out how many other drivers need to be updated and evaluate the cost/benefit aspect.
> >>
> >> Marek
> >>
> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com <mailto:airlied@gmail.com>> wrote:
> >>
> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
> >>     <ckoenig.leichtzumerken@gmail.com <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
> >>     >
> >>     > Correct, we wouldn't have synchronization between device with and without user queues any more.
> >>     >
> >>     > That could only be a problem for A+I Laptops.
> >>
> >>     Since I think you mentioned you'd only be enabling this on newer
> >>     chipsets, won't it be a problem for A+A where one A is a generation
> >>     behind the other?
> >>
> > 
> > Crap, that is a good point as well.
> > 
> >>
> >>     I'm not really liking where this is going btw, seems like a ill
> >>     thought out concept, if AMD is really going down the road of designing
> >>     hw that is currently Linux incompatible, you are going to have to
> >>     accept a big part of the burden in bringing this support in to more
> >>     than just amd drivers for upcoming generations of gpu.
> >>
> > 
> > Well we don't really like that either, but we have no other option as far as I can see.
> 
> I don't really understand what "future hw may remove support for kernel
> queues" means exactly. While the per-context queues can be mapped to
> userspace directly, they don't *have* to be, do they? I.e. the kernel
> driver should be able to either intercept userspace access to the
> queues, or in the worst case do it all itself, and provide the existing
> synchronization semantics as needed?
> 
> Surely there are resource limits for the per-context queues, so the
> kernel driver needs to do some kind of virtualization / multi-plexing
> anyway, or we'll get sad user faces when there's no queue available for
> <current hot game>.
> 
> I'm probably missing something though, awaiting enlightenment. :)

Yeah, what's unclear to me in all this discussion is whether this is a hard
amdgpu requirement going forward, in which case you need a time machine and
lots of people to retroactively fix this, because this isn't fast to get fixed.

Or is this just musings about an ecosystem that better fits current & future
hw, for which I think we all agree on the rough direction?

The former is quite a glorious situation, and I'm with Dave here that if
your hw engineers really removed the ability to keep the ringbuffers
unmapped from userspace, then amd gets to eat a big chunk of the cost here.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 18:27                                       ` Simon Ser
@ 2021-04-28 10:01                                         ` Daniel Vetter
  0 siblings, 0 replies; 105+ messages in thread
From: Daniel Vetter @ 2021-04-28 10:01 UTC (permalink / raw)
  To: Simon Ser; +Cc: Christian König, dri-devel, ML Mesa-dev

On Tue, Apr 27, 2021 at 06:27:27PM +0000, Simon Ser wrote:
> On Tuesday, April 27th, 2021 at 8:01 PM, Alex Deucher <alexdeucher@gmail.com> wrote:
> 
> > It's an upcoming requirement for windows[1], so you are likely to
> > start seeing this across all GPU vendors that support windows. I
> > think the timing depends on how quickly the legacy hardware support
> > sticks around for each vendor.
> 
> Hm, okay.
> 
> Will using the existing explicit synchronization APIs make it work
> properly? (e.g. IN_FENCE_FD + OUT_FENCE_PTR in KMS, EGL_KHR_fence_sync +
> EGL_ANDROID_native_fence_sync + EGL_KHR_wait_sync in EGL)

If you have hw which really _only_ supports userspace direct submission
(i.e. the ringbuffer has to be in the same gpu vm as everything else by
design, and can't be protected at all with e.g. read-only pte entries)
then all that stuff would be broken.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
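
(For context, a minimal sketch of the existing explicit-sync KMS path
mentioned above. The plane/CRTC object IDs and the IN_FENCE_FD /
OUT_FENCE_PTR property IDs are placeholder parameters here; real code
looks them up via drmModeObjectGetProperties() first.)

#include <errno.h>
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Sketch only: commit one plane update with an in-fence from the producer
 * and request an out-fence that signals when the commit has completed. */
static int flip_with_explicit_sync(int drm_fd,
                                   uint32_t plane_id, uint32_t crtc_id,
                                   uint32_t fb_id,
                                   uint32_t fb_prop, uint32_t in_fence_prop,
                                   uint32_t out_fence_prop,
                                   int in_fence_fd, int *out_fence_fd)
{
        drmModeAtomicReq *req = drmModeAtomicAlloc();
        int ret;

        if (!req)
                return -ENOMEM;

        /* New framebuffer plus the fence the producer will signal. */
        drmModeAtomicAddProperty(req, plane_id, fb_prop, fb_id);
        drmModeAtomicAddProperty(req, plane_id, in_fence_prop, in_fence_fd);

        /* The kernel writes a sync_file fd here once the commit is done. */
        drmModeAtomicAddProperty(req, crtc_id, out_fence_prop,
                                 (uint64_t)(uintptr_t)out_fence_fd);

        ret = drmModeAtomicCommit(drm_fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
        drmModeAtomicFree(req);
        return ret;
}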


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-27 18:01                                     ` Alex Deucher
  2021-04-27 18:27                                       ` Simon Ser
@ 2021-04-28 10:05                                       ` Daniel Vetter
  2021-04-28 10:31                                         ` Christian König
  1 sibling, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-04-28 10:05 UTC (permalink / raw)
  To: Alex Deucher; +Cc: ML Mesa-dev, dri-devel, Christian König

On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
> >
> > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
> >
> > > > Ok. So that would only make the following use cases broken for now:
> > > >
> > > > - amd render -> external gpu
> > > > - amd video encode -> network device
> > >
> > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > unhappy
> >
> > I concur. I have quite a few users with a multi-GPU setup involving
> > AMD hardware.
> >
> > Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> > error, and not bad results on screen because nothing is synchronized
> > anymore.
> 
> It's an upcoming requirement for windows[1], so you are likely to
> start seeing this across all GPU vendors that support windows.  I
> think the timing depends on how quickly the legacy hardware support
> sticks around for each vendor.

Yeah but hw scheduling doesn't mean the hw has to be constructed to not
support isolating the ringbuffer at all.

E.g. even if the hw loses the bit to put the ringbuffer outside of the
userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
pte flags. Otherwise the entire "share address space with cpu side,
seamlessly" thing is out of the window.

And with that r/o bit on the ringbuffer you can once more force submit
through kernel space, and all the legacy dma_fence based stuff keeps
working. And we don't have to invent some horrendous userspace fence based
implicit sync mechanism in the kernel, but can instead do this transition
properly with drm_syncobj timeline explicit sync and protocol reving.

At least I think you'd have to work extra hard to create a gpu which
cannot possibly be intercepted by the kernel, even when it's designed to
support userspace direct submit only.

Or are your hw engineers more creative here and we're screwed?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28 10:05                                       ` Daniel Vetter
@ 2021-04-28 10:31                                         ` Christian König
  2021-04-28 12:21                                           ` Daniel Vetter
  2021-04-28 13:03                                           ` Alex Deucher
  0 siblings, 2 replies; 105+ messages in thread
From: Christian König @ 2021-04-28 10:31 UTC (permalink / raw)
  To: Daniel Vetter, Alex Deucher; +Cc: dri-devel, ML Mesa-dev

Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
>> On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
>>> On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
>>>
>>>>> Ok. So that would only make the following use cases broken for now:
>>>>>
>>>>> - amd render -> external gpu
>>>>> - amd video encode -> network device
>>>> FWIW, "only" breaking amd render -> external gpu will make us pretty
>>>> unhappy
>>> I concur. I have quite a few users with a multi-GPU setup involving
>>> AMD hardware.
>>>
>>> Note, if this brokenness can't be avoided, I'd prefer a to get a clear
>>> error, and not bad results on screen because nothing is synchronized
>>> anymore.
>> It's an upcoming requirement for windows[1], so you are likely to
>> start seeing this across all GPU vendors that support windows.  I
>> think the timing depends on how quickly the legacy hardware support
>> sticks around for each vendor.
> Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> support isolating the ringbuffer at all.
>
> E.g. even if the hw loses the bit to put the ringbuffer outside of the
> userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> pte flags. Otherwise the entire "share address space with cpu side,
> seamlessly" thing is out of the window.
>
> And with that r/o bit on the ringbuffer you can once more force submit
> through kernel space, and all the legacy dma_fence based stuff keeps
> working. And we don't have to invent some horrendous userspace fence based
> implicit sync mechanism in the kernel, but can instead do this transition
> properly with drm_syncobj timeline explicit sync and protocol reving.
>
> At least I think you'd have to work extra hard to create a gpu which
> cannot possibly be intercepted by the kernel, even when it's designed to
> support userspace direct submit only.
>
> Or are your hw engineers more creative here and we're screwed?

The upcoming hardware generation will have this hardware scheduler as a 
must have, but there are certain ways we can still stick to the old 
approach:

1. The new hardware scheduler currently still supports kernel queues, 
which are essentially the same as the old hardware ring buffer.

2. Mapping the top level ring buffer into the VM at least partially 
solves the problem. This way you can't manipulate the ring buffer 
content, but the location for the fence must still be writeable.

For now and for the next hardware we are safe to support the old submission 
model, but the functionality of kernel queues will sooner or later go 
away if it is only there for Linux.

So we need to work on something which works in the long term and gets us 
away from this implicit sync.

Christian.

> -Daniel



* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28 10:31                                         ` Christian König
@ 2021-04-28 12:21                                           ` Daniel Vetter
  2021-04-28 12:26                                             ` Daniel Vetter
  2021-04-28 12:45                                             ` Simon Ser
  2021-04-28 13:03                                           ` Alex Deucher
  1 sibling, 2 replies; 105+ messages in thread
From: Daniel Vetter @ 2021-04-28 12:21 UTC (permalink / raw)
  To: Christian König; +Cc: dri-devel, ML Mesa-dev

On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
> > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
> > > > 
> > > > > > Ok. So that would only make the following use cases broken for now:
> > > > > > 
> > > > > > - amd render -> external gpu
> > > > > > - amd video encode -> network device
> > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > unhappy
> > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > AMD hardware.
> > > > 
> > > > Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> > > > error, and not bad results on screen because nothing is synchronized
> > > > anymore.
> > > It's an upcoming requirement for windows[1], so you are likely to
> > > start seeing this across all GPU vendors that support windows.  I
> > > think the timing depends on how quickly the legacy hardware support
> > > sticks around for each vendor.
> > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > support isolating the ringbuffer at all.
> > 
> > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > pte flags. Otherwise the entire "share address space with cpu side,
> > seamlessly" thing is out of the window.
> > 
> > And with that r/o bit on the ringbuffer you can once more force submit
> > through kernel space, and all the legacy dma_fence based stuff keeps
> > working. And we don't have to invent some horrendous userspace fence based
> > implicit sync mechanism in the kernel, but can instead do this transition
> > properly with drm_syncobj timeline explicit sync and protocol reving.
> > 
> > At least I think you'd have to work extra hard to create a gpu which
> > cannot possibly be intercepted by the kernel, even when it's designed to
> > support userspace direct submit only.
> > 
> > Or are your hw engineers more creative here and we're screwed?
> 
> The upcomming hardware generation will have this hardware scheduler as a
> must have, but there are certain ways we can still stick to the old
> approach:
> 
> 1. The new hardware scheduler currently still supports kernel queues which
> essentially is the same as the old hardware ring buffer.
> 
> 2. Mapping the top level ring buffer into the VM at least partially solves
> the problem. This way you can't manipulate the ring buffer content, but the
> location for the fence must still be writeable.

Yeah, allowing userspace to lie about completion fences in this model is
ok. I haven't thought through the full consequences of that, but I
think it's not any worse than userspace lying about which buffers/addresses
it uses in the current model - we rely on hw vm ptes to catch that stuff.

Also it might be good to switch to a non-recoverable ctx model for these.
That's already what we do in i915 (opt-in, but all current umd use that
mode). So any hang/watchdog just kills the entire ctx and you don't have
to worry about userspace doing something funny with its ringbuffer.
Simplifies everything.

Also ofc userspace fencing is still disallowed, but since userspace would
queue up all writes to its ringbuffer through the drm/scheduler, we'd
handle dependencies through that still. Not great, but workable.

Thinking about this, not even mapping the ringbuffer r/o is required, it's
just that we must queue things through the kernel to resolve dependencies
and everything without breaking dma_fence. If userspace lies, tdr will
shoot it and the kernel stops running that context entirely.

So I think even if we have hw with a 100% userspace-submit-only model we
should still be fine. It's ofc silly, because instead of using the userspace
fences and gpu semaphores the hw scheduler understands, we still take the
detour through drm/scheduler, but at least it's not a break-the-world
event.

Or do I miss something here?

> For now and the next hardware we are save to support the old submission
> model, but the functionality of kernel queues will sooner or later go away
> if it is only for Linux.
> 
> So we need to work on something which works in the long term and get us away
> from this implicit sync.

Yeah, I think we have pretty clear consensus on that goal, just no one has yet
volunteered to get going with the winsys/wayland work to plumb drm_syncobj
through, and the kernel/mesa work to make that optionally a userspace
fence underneath. And it's for sure a lot of work.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
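
(To illustrate the model being discussed: a rough sketch of what "userspace
owns the ring, but every addition is queued through a kernel ioctl" could
look like from the UMD side. The ioctl number, struct layout and all names
below are hypothetical, purely illustrative - this is not an existing
amdgpu uAPI.)

#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical submit args: userspace has already written the packets into
 * its own ring mapping and only tells the kernel which range is ready and
 * what it depends on. */
struct xyz_submit {
        uint64_t ring_start;       /* first new byte in the userspace ring */
        uint64_t ring_end;         /* end of the new packets */
        uint64_t in_syncobjs;      /* userspace pointer to syncobj handles */
        uint32_t num_in_syncobjs;  /* dependencies, resolved via drm/scheduler */
        uint32_t out_syncobj;      /* gets a dma_fence for this range */
};

#define DRM_IOCTL_XYZ_SUBMIT _IOWR('d', 0x40, struct xyz_submit)  /* hypothetical */

static int xyz_submit(int drm_fd, struct xyz_submit *args)
{
        /* The kernel resolves the in_syncobjs, does memory management as
         * today, attaches a dma_fence to out_syncobj and only then lets the
         * hw see the new ring range; TDR kills the context if it never
         * completes. */
        return ioctl(drm_fd, DRM_IOCTL_XYZ_SUBMIT, args);
}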


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28 12:21                                           ` Daniel Vetter
@ 2021-04-28 12:26                                             ` Daniel Vetter
  2021-04-28 13:11                                               ` Christian König
  2021-04-28 12:45                                             ` Simon Ser
  1 sibling, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-04-28 12:26 UTC (permalink / raw)
  To: Christian König; +Cc: dri-devel, ML Mesa-dev

On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
> > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
> > > > > 
> > > > > > > Ok. So that would only make the following use cases broken for now:
> > > > > > > 
> > > > > > > - amd render -> external gpu
> > > > > > > - amd video encode -> network device
> > > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > > unhappy
> > > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > > AMD hardware.
> > > > > 
> > > > > Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> > > > > error, and not bad results on screen because nothing is synchronized
> > > > > anymore.
> > > > It's an upcoming requirement for windows[1], so you are likely to
> > > > start seeing this across all GPU vendors that support windows.  I
> > > > think the timing depends on how quickly the legacy hardware support
> > > > sticks around for each vendor.
> > > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > > support isolating the ringbuffer at all.
> > > 
> > > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > > pte flags. Otherwise the entire "share address space with cpu side,
> > > seamlessly" thing is out of the window.
> > > 
> > > And with that r/o bit on the ringbuffer you can once more force submit
> > > through kernel space, and all the legacy dma_fence based stuff keeps
> > > working. And we don't have to invent some horrendous userspace fence based
> > > implicit sync mechanism in the kernel, but can instead do this transition
> > > properly with drm_syncobj timeline explicit sync and protocol reving.
> > > 
> > > At least I think you'd have to work extra hard to create a gpu which
> > > cannot possibly be intercepted by the kernel, even when it's designed to
> > > support userspace direct submit only.
> > > 
> > > Or are your hw engineers more creative here and we're screwed?
> > 
> > The upcomming hardware generation will have this hardware scheduler as a
> > must have, but there are certain ways we can still stick to the old
> > approach:
> > 
> > 1. The new hardware scheduler currently still supports kernel queues which
> > essentially is the same as the old hardware ring buffer.
> > 
> > 2. Mapping the top level ring buffer into the VM at least partially solves
> > the problem. This way you can't manipulate the ring buffer content, but the
> > location for the fence must still be writeable.
> 
> Yeah allowing userspace to lie about completion fences in this model is
> ok. Though I haven't thought through full consequences of that, but I
> think it's not any worse than userspace lying about which buffers/address
> it uses in the current model - we rely on hw vm ptes to catch that stuff.
> 
> Also it might be good to switch to a non-recoverable ctx model for these.
> That's already what we do in i915 (opt-in, but all current umd use that
> mode). So any hang/watchdog just kills the entire ctx and you don't have
> to worry about userspace doing something funny with it's ringbuffer.
> Simplifies everything.
> 
> Also ofc userspace fencing still disallowed, but since userspace would
> queu up all writes to its ringbuffer through the drm/scheduler, we'd
> handle dependencies through that still. Not great, but workable.
> 
> Thinking about this, not even mapping the ringbuffer r/o is required, it's
> just that we must queue things throug the kernel to resolve dependencies
> and everything without breaking dma_fence. If userspace lies, tdr will
> shoot it and the kernel stops running that context entirely.
> 
> So I think even if we have hw with 100% userspace submit model only we
> should be still fine. It's ofc silly, because instead of using userspace
> fences and gpu semaphores the hw scheduler understands we still take the
> detour through drm/scheduler, but at least it's not a break-the-world
> event.

Also no page fault support, userptr invalidations still stall until
end-of-batch instead of just being preempted, and all that too. But I mean
there needs to be some motivation to fix this and roll out explicit sync
:-)
-Daniel

> 
> Or do I miss something here?
> 
> > For now and the next hardware we are save to support the old submission
> > model, but the functionality of kernel queues will sooner or later go away
> > if it is only for Linux.
> > 
> > So we need to work on something which works in the long term and get us away
> > from this implicit sync.
> 
> Yeah I think we have pretty clear consensus on that goal, just no one yet
> volunteered to get going with the winsys/wayland work to plumb drm_syncobj
> through, and the kernel/mesa work to make that optionally a userspace
> fence underneath. And it's for a sure a lot of work.
> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28 12:21                                           ` Daniel Vetter
  2021-04-28 12:26                                             ` Daniel Vetter
@ 2021-04-28 12:45                                             ` Simon Ser
  1 sibling, 0 replies; 105+ messages in thread
From: Simon Ser @ 2021-04-28 12:45 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Christian König, dri-devel, ML Mesa-dev

On Wednesday, April 28th, 2021 at 2:21 PM, Daniel Vetter <daniel@ffwll.ch> wrote:

> Yeah I think we have pretty clear consensus on that goal, just no one yet
> volunteered to get going with the winsys/wayland work to plumb drm_syncobj
> through, and the kernel/mesa work to make that optionally a userspace
> fence underneath. And it's for a sure a lot of work.

I'm interested in helping with the winsys/wayland bits, assuming the
following:

- We are pretty confident that drm_syncobj won't be superseded by
  something else in the near future. It seems to me like a lot of
  effort has gone into plumbing sync_file stuff all over, and it
  already needs replacing. (I mean, it'll keep working, but we have a
  better replacement now, so compositors which have decided to ignore
  explicit sync for all this time won't have to do the work twice.)
- Plumbing drm_syncobj solves the synchronization issues with upcoming
  AMD hardware, and all of this works fine in cross-vendor multi-GPU
  setups.
- Someone is willing to spend a bit of time bearing with me and
  explaining how this all works. (I only know about sync_file for now,
  I'll start reading the Vulkan bits.)

Are these points something we can agree on?

Thanks,

Simon
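
(For the drm_syncobj bits: a minimal userspace sketch of how the timeline
variant is used through libdrm, assuming drm_fd is an already-open DRM fd
and that a driver submit attaches fences to the timeline points; error
handling omitted.)

#include <stdint.h>
#include <xf86drm.h>

static int syncobj_timeline_example(int drm_fd)
{
        uint32_t syncobj;
        uint64_t point = 1, signalled = 0;
        int shareable_fd;

        /* Create a syncobj; submits can attach fences to timeline points. */
        drmSyncobjCreate(drm_fd, 0, &syncobj);

        /* Export it so another process (e.g. the compositor) can import it
         * with drmSyncobjFDToHandle() and wait on the same timeline. */
        drmSyncobjHandleToFD(drm_fd, syncobj, &shareable_fd);

        /* Wait until point 1 both has a fence and that fence has signalled;
         * WAIT_FOR_SUBMIT covers the case where the client submits late. */
        drmSyncobjTimelineWait(drm_fd, &syncobj, &point, 1, INT64_MAX,
                               DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, NULL);

        /* Query the last signalled point on the timeline. */
        drmSyncobjQuery(drm_fd, &syncobj, &signalled, 1);

        drmSyncobjDestroy(drm_fd, syncobj);
        return 0;
}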


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28 10:31                                         ` Christian König
  2021-04-28 12:21                                           ` Daniel Vetter
@ 2021-04-28 13:03                                           ` Alex Deucher
  1 sibling, 0 replies; 105+ messages in thread
From: Alex Deucher @ 2021-04-28 13:03 UTC (permalink / raw)
  To: Christian König; +Cc: dri-devel, ML Mesa-dev

On Wed, Apr 28, 2021 at 6:31 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> >> On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
> >>> On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
> >>>
> >>>>> Ok. So that would only make the following use cases broken for now:
> >>>>>
> >>>>> - amd render -> external gpu
> >>>>> - amd video encode -> network device
> >>>> FWIW, "only" breaking amd render -> external gpu will make us pretty
> >>>> unhappy
> >>> I concur. I have quite a few users with a multi-GPU setup involving
> >>> AMD hardware.
> >>>
> >>> Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> >>> error, and not bad results on screen because nothing is synchronized
> >>> anymore.
> >> It's an upcoming requirement for windows[1], so you are likely to
> >> start seeing this across all GPU vendors that support windows.  I
> >> think the timing depends on how quickly the legacy hardware support
> >> sticks around for each vendor.
> > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > support isolating the ringbuffer at all.
> >
> > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > pte flags. Otherwise the entire "share address space with cpu side,
> > seamlessly" thing is out of the window.
> >
> > And with that r/o bit on the ringbuffer you can once more force submit
> > through kernel space, and all the legacy dma_fence based stuff keeps
> > working. And we don't have to invent some horrendous userspace fence based
> > implicit sync mechanism in the kernel, but can instead do this transition
> > properly with drm_syncobj timeline explicit sync and protocol reving.
> >
> > At least I think you'd have to work extra hard to create a gpu which
> > cannot possibly be intercepted by the kernel, even when it's designed to
> > support userspace direct submit only.
> >
> > Or are your hw engineers more creative here and we're screwed?
>
> The upcomming hardware generation will have this hardware scheduler as a
> must have, but there are certain ways we can still stick to the old
> approach:
>
> 1. The new hardware scheduler currently still supports kernel queues
> which essentially is the same as the old hardware ring buffer.
>
> 2. Mapping the top level ring buffer into the VM at least partially
> solves the problem. This way you can't manipulate the ring buffer
> content, but the location for the fence must still be writeable.
>
> For now and the next hardware we are save to support the old submission
> model, but the functionality of kernel queues will sooner or later go
> away if it is only for Linux.

Even if it didn't go away completely, no one else will be using it.
This leaves a lot of under-validated execution paths that lead to
subtle bugs.  When everyone else moved to KIQ for queue management, we
stuck with MMIO for a while in Linux and we ran into tons of subtle
bugs that disappeared when we moved to KIQ.  There were lots of
assumptions about whether and how software would use the different
firmware interfaces, which impacted lots of interactions with clockgating
and powergating, to name a few.  On top of that, you need to use the scheduler to
utilize stuff like preemption properly.  Also, if you want to do stuff
like gang scheduling (UMD scheduling multiple queues together), it's
really hard to do with kernel software schedulers.

Alex

>
> So we need to work on something which works in the long term and get us
> away from this implicit sync.
>
> Christian.
>
> > -Daniel
>


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28 12:26                                             ` Daniel Vetter
@ 2021-04-28 13:11                                               ` Christian König
  2021-04-28 13:34                                                 ` Daniel Vetter
  0 siblings, 1 reply; 105+ messages in thread
From: Christian König @ 2021-04-28 13:11 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri-devel, ML Mesa-dev

Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
>> On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
>>> Am 28.04.21 um 12:05 schrieb Daniel Vetter:
>>>> On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
>>>>> On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
>>>>>> On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
>>>>>>
>>>>>>>> Ok. So that would only make the following use cases broken for now:
>>>>>>>>
>>>>>>>> - amd render -> external gpu
>>>>>>>> - amd video encode -> network device
>>>>>>> FWIW, "only" breaking amd render -> external gpu will make us pretty
>>>>>>> unhappy
>>>>>> I concur. I have quite a few users with a multi-GPU setup involving
>>>>>> AMD hardware.
>>>>>>
>>>>>> Note, if this brokenness can't be avoided, I'd prefer a to get a clear
>>>>>> error, and not bad results on screen because nothing is synchronized
>>>>>> anymore.
>>>>> It's an upcoming requirement for windows[1], so you are likely to
>>>>> start seeing this across all GPU vendors that support windows.  I
>>>>> think the timing depends on how quickly the legacy hardware support
>>>>> sticks around for each vendor.
>>>> Yeah but hw scheduling doesn't mean the hw has to be constructed to not
>>>> support isolating the ringbuffer at all.
>>>>
>>>> E.g. even if the hw loses the bit to put the ringbuffer outside of the
>>>> userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
>>>> pte flags. Otherwise the entire "share address space with cpu side,
>>>> seamlessly" thing is out of the window.
>>>>
>>>> And with that r/o bit on the ringbuffer you can once more force submit
>>>> through kernel space, and all the legacy dma_fence based stuff keeps
>>>> working. And we don't have to invent some horrendous userspace fence based
>>>> implicit sync mechanism in the kernel, but can instead do this transition
>>>> properly with drm_syncobj timeline explicit sync and protocol reving.
>>>>
>>>> At least I think you'd have to work extra hard to create a gpu which
>>>> cannot possibly be intercepted by the kernel, even when it's designed to
>>>> support userspace direct submit only.
>>>>
>>>> Or are your hw engineers more creative here and we're screwed?
>>> The upcomming hardware generation will have this hardware scheduler as a
>>> must have, but there are certain ways we can still stick to the old
>>> approach:
>>>
>>> 1. The new hardware scheduler currently still supports kernel queues which
>>> essentially is the same as the old hardware ring buffer.
>>>
>>> 2. Mapping the top level ring buffer into the VM at least partially solves
>>> the problem. This way you can't manipulate the ring buffer content, but the
>>> location for the fence must still be writeable.
>> Yeah allowing userspace to lie about completion fences in this model is
>> ok. Though I haven't thought through full consequences of that, but I
>> think it's not any worse than userspace lying about which buffers/address
>> it uses in the current model - we rely on hw vm ptes to catch that stuff.
>>
>> Also it might be good to switch to a non-recoverable ctx model for these.
>> That's already what we do in i915 (opt-in, but all current umd use that
>> mode). So any hang/watchdog just kills the entire ctx and you don't have
>> to worry about userspace doing something funny with it's ringbuffer.
>> Simplifies everything.
>>
>> Also ofc userspace fencing still disallowed, but since userspace would
>> queu up all writes to its ringbuffer through the drm/scheduler, we'd
>> handle dependencies through that still. Not great, but workable.
>>
>> Thinking about this, not even mapping the ringbuffer r/o is required, it's
>> just that we must queue things throug the kernel to resolve dependencies
>> and everything without breaking dma_fence. If userspace lies, tdr will
>> shoot it and the kernel stops running that context entirely.

Thinking more about that approach, I don't think that it will work correctly.

See, we not only need to write the fence as a signal that an IB is 
submitted, but also adjust a bunch of privileged hardware registers.

If userspace could do that from its IBs as well, then there would be nothing 
blocking it from reprogramming the page table base address, for example.

We could do those writes with the CPU as well, but that would be a huge 
performance drop because of the additional latency.

Christian.

>>
>> So I think even if we have hw with 100% userspace submit model only we
>> should be still fine. It's ofc silly, because instead of using userspace
>> fences and gpu semaphores the hw scheduler understands we still take the
>> detour through drm/scheduler, but at least it's not a break-the-world
>> event.
> Also no page fault support, userptr invalidates still stall until
> end-of-batch instead of just preempting it, and all that too. But I mean
> there needs to be some motivation to fix this and roll out explicit sync
> :-)
> -Daniel
>
>> Or do I miss something here?
>>
>>> For now and the next hardware we are save to support the old submission
>>> model, but the functionality of kernel queues will sooner or later go away
>>> if it is only for Linux.
>>>
>>> So we need to work on something which works in the long term and get us away
>>> from this implicit sync.
>> Yeah I think we have pretty clear consensus on that goal, just no one yet
>> volunteered to get going with the winsys/wayland work to plumb drm_syncobj
>> through, and the kernel/mesa work to make that optionally a userspace
>> fence underneath. And it's for a sure a lot of work.
>> -Daniel
>> -- 
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch



* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28 13:11                                               ` Christian König
@ 2021-04-28 13:34                                                 ` Daniel Vetter
  2021-04-28 13:37                                                   ` Christian König
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-04-28 13:34 UTC (permalink / raw)
  To: Christian König; +Cc: dri-devel, ML Mesa-dev

On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
> > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
> > > > > > > 
> > > > > > > > > Ok. So that would only make the following use cases broken for now:
> > > > > > > > > 
> > > > > > > > > - amd render -> external gpu
> > > > > > > > > - amd video encode -> network device
> > > > > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > > > > unhappy
> > > > > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > > > > AMD hardware.
> > > > > > > 
> > > > > > > Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> > > > > > > error, and not bad results on screen because nothing is synchronized
> > > > > > > anymore.
> > > > > > It's an upcoming requirement for windows[1], so you are likely to
> > > > > > start seeing this across all GPU vendors that support windows.  I
> > > > > > think the timing depends on how quickly the legacy hardware support
> > > > > > sticks around for each vendor.
> > > > > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > > > > support isolating the ringbuffer at all.
> > > > > 
> > > > > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > > > > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > > > > pte flags. Otherwise the entire "share address space with cpu side,
> > > > > seamlessly" thing is out of the window.
> > > > > 
> > > > > And with that r/o bit on the ringbuffer you can once more force submit
> > > > > through kernel space, and all the legacy dma_fence based stuff keeps
> > > > > working. And we don't have to invent some horrendous userspace fence based
> > > > > implicit sync mechanism in the kernel, but can instead do this transition
> > > > > properly with drm_syncobj timeline explicit sync and protocol reving.
> > > > > 
> > > > > At least I think you'd have to work extra hard to create a gpu which
> > > > > cannot possibly be intercepted by the kernel, even when it's designed to
> > > > > support userspace direct submit only.
> > > > > 
> > > > > Or are your hw engineers more creative here and we're screwed?
> > > > The upcomming hardware generation will have this hardware scheduler as a
> > > > must have, but there are certain ways we can still stick to the old
> > > > approach:
> > > > 
> > > > 1. The new hardware scheduler currently still supports kernel queues which
> > > > essentially is the same as the old hardware ring buffer.
> > > > 
> > > > 2. Mapping the top level ring buffer into the VM at least partially solves
> > > > the problem. This way you can't manipulate the ring buffer content, but the
> > > > location for the fence must still be writeable.
> > > Yeah allowing userspace to lie about completion fences in this model is
> > > ok. Though I haven't thought through full consequences of that, but I
> > > think it's not any worse than userspace lying about which buffers/address
> > > it uses in the current model - we rely on hw vm ptes to catch that stuff.
> > > 
> > > Also it might be good to switch to a non-recoverable ctx model for these.
> > > That's already what we do in i915 (opt-in, but all current umd use that
> > > mode). So any hang/watchdog just kills the entire ctx and you don't have
> > > to worry about userspace doing something funny with it's ringbuffer.
> > > Simplifies everything.
> > > 
> > > Also ofc userspace fencing still disallowed, but since userspace would
> > > queu up all writes to its ringbuffer through the drm/scheduler, we'd
> > > handle dependencies through that still. Not great, but workable.
> > > 
> > > Thinking about this, not even mapping the ringbuffer r/o is required, it's
> > > just that we must queue things throug the kernel to resolve dependencies
> > > and everything without breaking dma_fence. If userspace lies, tdr will
> > > shoot it and the kernel stops running that context entirely.
> 
> Thinking more about that approach I don't think that it will work correctly.
> 
> See we not only need to write the fence as signal that an IB is submitted,
> but also adjust a bunch of privileged hardware registers.
> 
> When userspace could do that from its IBs as well then there is nothing
> blocking it from reprogramming the page table base address for example.
> 
> We could do those writes with the CPU as well, but that would be a huge
> performance drop because of the additional latency.

That's not what I'm suggesting. I'm suggesting you have the queue and
everything in userspace, like on windows. Fences are handled exactly like
on windows too. The difference is:

- All new additions to the ringbuffer are done through a kernel ioctl
  call, using the drm/scheduler to resolve dependencies.

- Memory management is also done like today in that ioctl.

- TDR makes sure that if userspace abuses the contract (which it can, but
  it can do that already today because there's also no command parser to
  e.g. stop gpu semaphores) the entire context is shot and terminally
  killed. Userspace has to then set up a new one. This isn't how amdgpu
  recovery works right now, but i915 supports it and I think it's also the
  better model for userspace error recovery anyway.

So from hw pov this will look _exactly_ like windows, except we never page
fault.

From the sw pov this will look _exactly_ like the current kernel ringbuf model,
with exactly the same dma_fence semantics. If userspace lies, does something
stupid or otherwise breaks the uapi contract, vm ptes stop invalid access
and tdr kills it if it takes too long.

Where do you need privileged IB writes or anything like that?

Ofc the kernel needs to have some safety checks in the dma_fence timeline that
relies on the userspace ringbuffer, so that it never goes backwards or unsignals,
but that's kinda just more compat cruft to make the kernel/dma_fence path
work.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
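
(The "safety checks in the dma_fence timeline" could look roughly like the
kernel-side sketch below: sample the fence value written into the userspace
ring, clamp it so the timeline never goes backwards, and signal the
corresponding dma_fences in order. All struct and function names are
hypothetical; this is an illustration of the idea, not amdgpu code.)

#include <linux/dma-fence.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct uq_fence_ctx {
        spinlock_t lock;
        u64 last_seqno;            /* highest value accepted so far */
        struct list_head pending;  /* submitted jobs, in submission order */
};

struct uq_job {
        struct list_head node;
        u64 seqno;                 /* value written when this job is done */
        struct dma_fence fence;    /* kernel-visible completion fence */
};

/* Called from an interrupt/worker that samples the fence location in the
 * userspace ring. Even if userspace lies, the dma_fence timeline only ever
 * moves forward; a ring that never completes is handled by TDR. */
static void uq_process_completions(struct uq_fence_ctx *ctx, u64 hw_seqno)
{
        struct uq_job *job, *tmp;

        spin_lock(&ctx->lock);
        if (hw_seqno < ctx->last_seqno)  /* never let the timeline go back */
                hw_seqno = ctx->last_seqno;
        ctx->last_seqno = hw_seqno;

        list_for_each_entry_safe(job, tmp, &ctx->pending, node) {
                if (job->seqno > hw_seqno)
                        break;
                list_del(&job->node);
                dma_fence_signal(&job->fence);
                dma_fence_put(&job->fence);  /* drop the list's reference */
        }
        spin_unlock(&ctx->lock);
}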


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28 13:34                                                 ` Daniel Vetter
@ 2021-04-28 13:37                                                   ` Christian König
  2021-04-28 14:34                                                     ` Daniel Vetter
  0 siblings, 1 reply; 105+ messages in thread
From: Christian König @ 2021-04-28 13:37 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri-devel, ML Mesa-dev

Am 28.04.21 um 15:34 schrieb Daniel Vetter:
> On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
>> Am 28.04.21 um 14:26 schrieb Daniel Vetter:
>>> On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
>>>> On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
>>>>> Am 28.04.21 um 12:05 schrieb Daniel Vetter:
>>>>>> On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
>>>>>>> On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
>>>>>>>> On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
>>>>>>>>
>>>>>>>>>> Ok. So that would only make the following use cases broken for now:
>>>>>>>>>>
>>>>>>>>>> - amd render -> external gpu
>>>>>>>>>> - amd video encode -> network device
>>>>>>>>> FWIW, "only" breaking amd render -> external gpu will make us pretty
>>>>>>>>> unhappy
>>>>>>>> I concur. I have quite a few users with a multi-GPU setup involving
>>>>>>>> AMD hardware.
>>>>>>>>
>>>>>>>> Note, if this brokenness can't be avoided, I'd prefer a to get a clear
>>>>>>>> error, and not bad results on screen because nothing is synchronized
>>>>>>>> anymore.
>>>>>>> It's an upcoming requirement for windows[1], so you are likely to
>>>>>>> start seeing this across all GPU vendors that support windows.  I
>>>>>>> think the timing depends on how quickly the legacy hardware support
>>>>>>> sticks around for each vendor.
>>>>>> Yeah but hw scheduling doesn't mean the hw has to be constructed to not
>>>>>> support isolating the ringbuffer at all.
>>>>>>
>>>>>> E.g. even if the hw loses the bit to put the ringbuffer outside of the
>>>>>> userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
>>>>>> pte flags. Otherwise the entire "share address space with cpu side,
>>>>>> seamlessly" thing is out of the window.
>>>>>>
>>>>>> And with that r/o bit on the ringbuffer you can once more force submit
>>>>>> through kernel space, and all the legacy dma_fence based stuff keeps
>>>>>> working. And we don't have to invent some horrendous userspace fence based
>>>>>> implicit sync mechanism in the kernel, but can instead do this transition
>>>>>> properly with drm_syncobj timeline explicit sync and protocol reving.
>>>>>>
>>>>>> At least I think you'd have to work extra hard to create a gpu which
>>>>>> cannot possibly be intercepted by the kernel, even when it's designed to
>>>>>> support userspace direct submit only.
>>>>>>
>>>>>> Or are your hw engineers more creative here and we're screwed?
>>>>> The upcomming hardware generation will have this hardware scheduler as a
>>>>> must have, but there are certain ways we can still stick to the old
>>>>> approach:
>>>>>
>>>>> 1. The new hardware scheduler currently still supports kernel queues which
>>>>> essentially is the same as the old hardware ring buffer.
>>>>>
>>>>> 2. Mapping the top level ring buffer into the VM at least partially solves
>>>>> the problem. This way you can't manipulate the ring buffer content, but the
>>>>> location for the fence must still be writeable.
>>>> Yeah allowing userspace to lie about completion fences in this model is
>>>> ok. Though I haven't thought through full consequences of that, but I
>>>> think it's not any worse than userspace lying about which buffers/address
>>>> it uses in the current model - we rely on hw vm ptes to catch that stuff.
>>>>
>>>> Also it might be good to switch to a non-recoverable ctx model for these.
>>>> That's already what we do in i915 (opt-in, but all current umd use that
>>>> mode). So any hang/watchdog just kills the entire ctx and you don't have
>>>> to worry about userspace doing something funny with it's ringbuffer.
>>>> Simplifies everything.
>>>>
>>>> Also ofc userspace fencing still disallowed, but since userspace would
>>>> queu up all writes to its ringbuffer through the drm/scheduler, we'd
>>>> handle dependencies through that still. Not great, but workable.
>>>>
>>>> Thinking about this, not even mapping the ringbuffer r/o is required, it's
>>>> just that we must queue things throug the kernel to resolve dependencies
>>>> and everything without breaking dma_fence. If userspace lies, tdr will
>>>> shoot it and the kernel stops running that context entirely.
>> Thinking more about that approach I don't think that it will work correctly.
>>
>> See we not only need to write the fence as signal that an IB is submitted,
>> but also adjust a bunch of privileged hardware registers.
>>
>> When userspace could do that from its IBs as well then there is nothing
>> blocking it from reprogramming the page table base address for example.
>>
>> We could do those writes with the CPU as well, but that would be a huge
>> performance drop because of the additional latency.
> That's not what I'm suggesting. I'm suggesting you have the queue and
> everything in userspace, like in wondows. Fences are exactly handled like
> on windows too. The difference is:
>
> - All new additions to the ringbuffer are done through a kernel ioctl
>    call, using the drm/scheduler to resolve dependencies.
>
> - Memory management is also done like today int that ioctl.
>
> - TDR makes sure that if userspace abuses the contract (which it can, but
>    it can do that already today because there's also no command parser to
>    e.g. stop gpu semaphores) the entire context is shot and terminally
>    killed. Userspace has to then set up a new one. This isn't how amdgpu
>    recovery works right now, but i915 supports it and I think it's also the
>    better model for userspace error recovery anyway.
>
> So from hw pov this will look _exactly_ like windows, except we never page
> fault.
>
>  From sw pov this will look _exactly_ like current kernel ringbuf model,
> with exactly same dma_fence semantics. If userspace lies, does something
> stupid or otherwise breaks the uapi contract, vm ptes stop invalid access
> and tdr kills it if it takes too long.
>
> Where do you need priviledge IB writes or anything like that?

For writing the fence value and setting up the priority and VM registers.

Christian.

>
> Ofc kernel needs to have some safety checks in the dma_fence timeline that
> relies on userspace ringbuffer to never go backwards or unsignal, but
> that's kinda just more compat cruft to make the kernel/dma_fence path
> work.
>
> Cheers, Daniel



* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28 13:37                                                   ` Christian König
@ 2021-04-28 14:34                                                     ` Daniel Vetter
  2021-04-28 14:45                                                       ` Christian König
  2021-04-28 20:39                                                       ` Alex Deucher
  0 siblings, 2 replies; 105+ messages in thread
From: Daniel Vetter @ 2021-04-28 14:34 UTC (permalink / raw)
  To: Christian König; +Cc: dri-devel, ML Mesa-dev

On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
> Am 28.04.21 um 15:34 schrieb Daniel Vetter:
> > On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> > > Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > > > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
> > > > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
> > > > > > > > > 
> > > > > > > > > > > Ok. So that would only make the following use cases broken for now:
> > > > > > > > > > > 
> > > > > > > > > > > - amd render -> external gpu
> > > > > > > > > > > - amd video encode -> network device
> > > > > > > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > > > > > > unhappy
> > > > > > > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > > > > > > AMD hardware.
> > > > > > > > > 
> > > > > > > > > Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> > > > > > > > > error, and not bad results on screen because nothing is synchronized
> > > > > > > > > anymore.
> > > > > > > > It's an upcoming requirement for windows[1], so you are likely to
> > > > > > > > start seeing this across all GPU vendors that support windows.  I
> > > > > > > > think the timing depends on how quickly the legacy hardware support
> > > > > > > > sticks around for each vendor.
> > > > > > > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > > > > > > support isolating the ringbuffer at all.
> > > > > > > 
> > > > > > > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > > > > > > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > > > > > > pte flags. Otherwise the entire "share address space with cpu side,
> > > > > > > seamlessly" thing is out of the window.
> > > > > > > 
> > > > > > > And with that r/o bit on the ringbuffer you can once more force submit
> > > > > > > through kernel space, and all the legacy dma_fence based stuff keeps
> > > > > > > working. And we don't have to invent some horrendous userspace fence based
> > > > > > > implicit sync mechanism in the kernel, but can instead do this transition
> > > > > > > properly with drm_syncobj timeline explicit sync and protocol reving.
> > > > > > > 
> > > > > > > At least I think you'd have to work extra hard to create a gpu which
> > > > > > > cannot possibly be intercepted by the kernel, even when it's designed to
> > > > > > > support userspace direct submit only.
> > > > > > > 
> > > > > > > Or are your hw engineers more creative here and we're screwed?
> > > > > > The upcomming hardware generation will have this hardware scheduler as a
> > > > > > must have, but there are certain ways we can still stick to the old
> > > > > > approach:
> > > > > > 
> > > > > > 1. The new hardware scheduler currently still supports kernel queues which
> > > > > > essentially is the same as the old hardware ring buffer.
> > > > > > 
> > > > > > 2. Mapping the top level ring buffer into the VM at least partially solves
> > > > > > the problem. This way you can't manipulate the ring buffer content, but the
> > > > > > location for the fence must still be writeable.
> > > > > Yeah allowing userspace to lie about completion fences in this model is
> > > > > ok. Though I haven't thought through full consequences of that, but I
> > > > > think it's not any worse than userspace lying about which buffers/address
> > > > > it uses in the current model - we rely on hw vm ptes to catch that stuff.
> > > > > 
> > > > > Also it might be good to switch to a non-recoverable ctx model for these.
> > > > > That's already what we do in i915 (opt-in, but all current umd use that
> > > > > mode). So any hang/watchdog just kills the entire ctx and you don't have
> > > > > to worry about userspace doing something funny with it's ringbuffer.
> > > > > Simplifies everything.
> > > > > 
> > > > > Also ofc userspace fencing still disallowed, but since userspace would
> > > > > queu up all writes to its ringbuffer through the drm/scheduler, we'd
> > > > > handle dependencies through that still. Not great, but workable.
> > > > > 
> > > > > Thinking about this, not even mapping the ringbuffer r/o is required, it's
> > > > > just that we must queue things throug the kernel to resolve dependencies
> > > > > and everything without breaking dma_fence. If userspace lies, tdr will
> > > > > shoot it and the kernel stops running that context entirely.
> > > Thinking more about that approach I don't think that it will work correctly.
> > > 
> > > See we not only need to write the fence as signal that an IB is submitted,
> > > but also adjust a bunch of privileged hardware registers.
> > > 
> > > When userspace could do that from its IBs as well then there is nothing
> > > blocking it from reprogramming the page table base address for example.
> > > 
> > > We could do those writes with the CPU as well, but that would be a huge
> > > performance drop because of the additional latency.
> > That's not what I'm suggesting. I'm suggesting you have the queue and
> > everything in userspace, like in wondows. Fences are exactly handled like
> > on windows too. The difference is:
> > 
> > - All new additions to the ringbuffer are done through a kernel ioctl
> >    call, using the drm/scheduler to resolve dependencies.
> > 
> > - Memory management is also done like today int that ioctl.
> > 
> > - TDR makes sure that if userspace abuses the contract (which it can, but
> >    it can do that already today because there's also no command parser to
> >    e.g. stop gpu semaphores) the entire context is shot and terminally
> >    killed. Userspace has to then set up a new one. This isn't how amdgpu
> >    recovery works right now, but i915 supports it and I think it's also the
> >    better model for userspace error recovery anyway.
> > 
> > So from hw pov this will look _exactly_ like windows, except we never page
> > fault.
> > 
> >  From sw pov this will look _exactly_ like current kernel ringbuf model,
> > with exactly same dma_fence semantics. If userspace lies, does something
> > stupid or otherwise breaks the uapi contract, vm ptes stop invalid access
> > and tdr kills it if it takes too long.
> > 
> > Where do you need priviledge IB writes or anything like that?
> 
> For writing the fence value and setting up the priority and VM registers.

I'm confused. How does this work on windows then with pure userspace
submit? Windows userspace sets up its priorities and vm registers itself from
userspace?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28 14:34                                                     ` Daniel Vetter
@ 2021-04-28 14:45                                                       ` Christian König
  2021-04-29 11:07                                                         ` Daniel Vetter
  2021-04-28 20:39                                                       ` Alex Deucher
  1 sibling, 1 reply; 105+ messages in thread
From: Christian König @ 2021-04-28 14:45 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: dri-devel, ML Mesa-dev

Am 28.04.21 um 16:34 schrieb Daniel Vetter:
> On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
>> Am 28.04.21 um 15:34 schrieb Daniel Vetter:
>>> On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
>>>> Am 28.04.21 um 14:26 schrieb Daniel Vetter:
>>>>> On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
>>>>>> On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
>>>>>>> Am 28.04.21 um 12:05 schrieb Daniel Vetter:
>>>>>>>> On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
>>>>>>>>> On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
>>>>>>>>>> On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
>>>>>>>>>>
>>>>>>>>>>>> Ok. So that would only make the following use cases broken for now:
>>>>>>>>>>>>
>>>>>>>>>>>> - amd render -> external gpu
>>>>>>>>>>>> - amd video encode -> network device
>>>>>>>>>>> FWIW, "only" breaking amd render -> external gpu will make us pretty
>>>>>>>>>>> unhappy
>>>>>>>>>> I concur. I have quite a few users with a multi-GPU setup involving
>>>>>>>>>> AMD hardware.
>>>>>>>>>>
>>>>>>>>>> Note, if this brokenness can't be avoided, I'd prefer a to get a clear
>>>>>>>>>> error, and not bad results on screen because nothing is synchronized
>>>>>>>>>> anymore.
>>>>>>>>> It's an upcoming requirement for windows[1], so you are likely to
>>>>>>>>> start seeing this across all GPU vendors that support windows.  I
>>>>>>>>> think the timing depends on how quickly the legacy hardware support
>>>>>>>>> sticks around for each vendor.
>>>>>>>> Yeah but hw scheduling doesn't mean the hw has to be constructed to not
>>>>>>>> support isolating the ringbuffer at all.
>>>>>>>>
>>>>>>>> E.g. even if the hw loses the bit to put the ringbuffer outside of the
>>>>>>>> userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
>>>>>>>> pte flags. Otherwise the entire "share address space with cpu side,
>>>>>>>> seamlessly" thing is out of the window.
>>>>>>>>
>>>>>>>> And with that r/o bit on the ringbuffer you can once more force submit
>>>>>>>> through kernel space, and all the legacy dma_fence based stuff keeps
>>>>>>>> working. And we don't have to invent some horrendous userspace fence based
>>>>>>>> implicit sync mechanism in the kernel, but can instead do this transition
>>>>>>>> properly with drm_syncobj timeline explicit sync and protocol reving.
>>>>>>>>
>>>>>>>> At least I think you'd have to work extra hard to create a gpu which
>>>>>>>> cannot possibly be intercepted by the kernel, even when it's designed to
>>>>>>>> support userspace direct submit only.
>>>>>>>>
>>>>>>>> Or are your hw engineers more creative here and we're screwed?
>>>>>>> The upcomming hardware generation will have this hardware scheduler as a
>>>>>>> must have, but there are certain ways we can still stick to the old
>>>>>>> approach:
>>>>>>>
>>>>>>> 1. The new hardware scheduler currently still supports kernel queues which
>>>>>>> essentially is the same as the old hardware ring buffer.
>>>>>>>
>>>>>>> 2. Mapping the top level ring buffer into the VM at least partially solves
>>>>>>> the problem. This way you can't manipulate the ring buffer content, but the
>>>>>>> location for the fence must still be writeable.
>>>>>> Yeah allowing userspace to lie about completion fences in this model is
>>>>>> ok. Though I haven't thought through full consequences of that, but I
>>>>>> think it's not any worse than userspace lying about which buffers/address
>>>>>> it uses in the current model - we rely on hw vm ptes to catch that stuff.
>>>>>>
>>>>>> Also it might be good to switch to a non-recoverable ctx model for these.
>>>>>> That's already what we do in i915 (opt-in, but all current umd use that
>>>>>> mode). So any hang/watchdog just kills the entire ctx and you don't have
>>>>>> to worry about userspace doing something funny with it's ringbuffer.
>>>>>> Simplifies everything.
>>>>>>
>>>>>> Also ofc userspace fencing still disallowed, but since userspace would
>>>>>> queu up all writes to its ringbuffer through the drm/scheduler, we'd
>>>>>> handle dependencies through that still. Not great, but workable.
>>>>>>
>>>>>> Thinking about this, not even mapping the ringbuffer r/o is required, it's
>>>>>> just that we must queue things throug the kernel to resolve dependencies
>>>>>> and everything without breaking dma_fence. If userspace lies, tdr will
>>>>>> shoot it and the kernel stops running that context entirely.
>>>> Thinking more about that approach I don't think that it will work correctly.
>>>>
>>>> See we not only need to write the fence as signal that an IB is submitted,
>>>> but also adjust a bunch of privileged hardware registers.
>>>>
>>>> When userspace could do that from its IBs as well then there is nothing
>>>> blocking it from reprogramming the page table base address for example.
>>>>
>>>> We could do those writes with the CPU as well, but that would be a huge
>>>> performance drop because of the additional latency.
>>> That's not what I'm suggesting. I'm suggesting you have the queue and
>>> everything in userspace, like in wondows. Fences are exactly handled like
>>> on windows too. The difference is:
>>>
>>> - All new additions to the ringbuffer are done through a kernel ioctl
>>>     call, using the drm/scheduler to resolve dependencies.
>>>
>>> - Memory management is also done like today int that ioctl.
>>>
>>> - TDR makes sure that if userspace abuses the contract (which it can, but
>>>     it can do that already today because there's also no command parser to
>>>     e.g. stop gpu semaphores) the entire context is shot and terminally
>>>     killed. Userspace has to then set up a new one. This isn't how amdgpu
>>>     recovery works right now, but i915 supports it and I think it's also the
>>>     better model for userspace error recovery anyway.
>>>
>>> So from hw pov this will look _exactly_ like windows, except we never page
>>> fault.
>>>
>>>   From sw pov this will look _exactly_ like current kernel ringbuf model,
>>> with exactly same dma_fence semantics. If userspace lies, does something
>>> stupid or otherwise breaks the uapi contract, vm ptes stop invalid access
>>> and tdr kills it if it takes too long.
>>>
>>> Where do you need priviledge IB writes or anything like that?
>> For writing the fence value and setting up the priority and VM registers.
> I'm confused. How does this work on windows then with pure userspace
> submit? Windows userspace sets its priorties and vm registers itself from
> userspace?

The priorities and VM registers are set up by the hw scheduler on 
windows, but this comes with preemption again.

And just letting the kernel write to the ring buffer has the same 
problems as userspace fences. E.g. userspace could just overwrite the 
commands which write the fence value with NOPs.

In other words we certainly need some kind of protection for the ring 
buffer, e.g. setting it read-only and making sure that the fence can 
always be written and the queue is never preempted by the HW scheduler. 
But that protection breaks our neck in other places again.

That solution could maybe work, but it is certainly not something we 
have tested.

Christian.

> -Daniel

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28 14:34                                                     ` Daniel Vetter
  2021-04-28 14:45                                                       ` Christian König
@ 2021-04-28 20:39                                                       ` Alex Deucher
  2021-04-29 11:12                                                         ` Daniel Vetter
  1 sibling, 1 reply; 105+ messages in thread
From: Alex Deucher @ 2021-04-28 20:39 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Christian König, ML Mesa-dev, dri-devel

On Wed, Apr 28, 2021 at 10:35 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
> > Am 28.04.21 um 15:34 schrieb Daniel Vetter:
> > > On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> > > > Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > > > > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > > > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
> > > > > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
> > > > > > > > > >
> > > > > > > > > > > > Ok. So that would only make the following use cases broken for now:
> > > > > > > > > > > >
> > > > > > > > > > > > - amd render -> external gpu
> > > > > > > > > > > > - amd video encode -> network device
> > > > > > > > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > > > > > > > unhappy
> > > > > > > > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > > > > > > > AMD hardware.
> > > > > > > > > >
> > > > > > > > > > Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> > > > > > > > > > error, and not bad results on screen because nothing is synchronized
> > > > > > > > > > anymore.
> > > > > > > > > It's an upcoming requirement for windows[1], so you are likely to
> > > > > > > > > start seeing this across all GPU vendors that support windows.  I
> > > > > > > > > think the timing depends on how quickly the legacy hardware support
> > > > > > > > > sticks around for each vendor.
> > > > > > > > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > > > > > > > support isolating the ringbuffer at all.
> > > > > > > >
> > > > > > > > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > > > > > > > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > > > > > > > pte flags. Otherwise the entire "share address space with cpu side,
> > > > > > > > seamlessly" thing is out of the window.
> > > > > > > >
> > > > > > > > And with that r/o bit on the ringbuffer you can once more force submit
> > > > > > > > through kernel space, and all the legacy dma_fence based stuff keeps
> > > > > > > > working. And we don't have to invent some horrendous userspace fence based
> > > > > > > > implicit sync mechanism in the kernel, but can instead do this transition
> > > > > > > > properly with drm_syncobj timeline explicit sync and protocol reving.
> > > > > > > >
> > > > > > > > At least I think you'd have to work extra hard to create a gpu which
> > > > > > > > cannot possibly be intercepted by the kernel, even when it's designed to
> > > > > > > > support userspace direct submit only.
> > > > > > > >
> > > > > > > > Or are your hw engineers more creative here and we're screwed?
> > > > > > > The upcomming hardware generation will have this hardware scheduler as a
> > > > > > > must have, but there are certain ways we can still stick to the old
> > > > > > > approach:
> > > > > > >
> > > > > > > 1. The new hardware scheduler currently still supports kernel queues which
> > > > > > > essentially is the same as the old hardware ring buffer.
> > > > > > >
> > > > > > > 2. Mapping the top level ring buffer into the VM at least partially solves
> > > > > > > the problem. This way you can't manipulate the ring buffer content, but the
> > > > > > > location for the fence must still be writeable.
> > > > > > Yeah allowing userspace to lie about completion fences in this model is
> > > > > > ok. Though I haven't thought through full consequences of that, but I
> > > > > > think it's not any worse than userspace lying about which buffers/address
> > > > > > it uses in the current model - we rely on hw vm ptes to catch that stuff.
> > > > > >
> > > > > > Also it might be good to switch to a non-recoverable ctx model for these.
> > > > > > That's already what we do in i915 (opt-in, but all current umd use that
> > > > > > mode). So any hang/watchdog just kills the entire ctx and you don't have
> > > > > > to worry about userspace doing something funny with it's ringbuffer.
> > > > > > Simplifies everything.
> > > > > >
> > > > > > Also ofc userspace fencing still disallowed, but since userspace would
> > > > > > queu up all writes to its ringbuffer through the drm/scheduler, we'd
> > > > > > handle dependencies through that still. Not great, but workable.
> > > > > >
> > > > > > Thinking about this, not even mapping the ringbuffer r/o is required, it's
> > > > > > just that we must queue things throug the kernel to resolve dependencies
> > > > > > and everything without breaking dma_fence. If userspace lies, tdr will
> > > > > > shoot it and the kernel stops running that context entirely.
> > > > Thinking more about that approach I don't think that it will work correctly.
> > > >
> > > > See we not only need to write the fence as signal that an IB is submitted,
> > > > but also adjust a bunch of privileged hardware registers.
> > > >
> > > > When userspace could do that from its IBs as well then there is nothing
> > > > blocking it from reprogramming the page table base address for example.
> > > >
> > > > We could do those writes with the CPU as well, but that would be a huge
> > > > performance drop because of the additional latency.
> > > That's not what I'm suggesting. I'm suggesting you have the queue and
> > > everything in userspace, like in wondows. Fences are exactly handled like
> > > on windows too. The difference is:
> > >
> > > - All new additions to the ringbuffer are done through a kernel ioctl
> > >    call, using the drm/scheduler to resolve dependencies.
> > >
> > > - Memory management is also done like today int that ioctl.
> > >
> > > - TDR makes sure that if userspace abuses the contract (which it can, but
> > >    it can do that already today because there's also no command parser to
> > >    e.g. stop gpu semaphores) the entire context is shot and terminally
> > >    killed. Userspace has to then set up a new one. This isn't how amdgpu
> > >    recovery works right now, but i915 supports it and I think it's also the
> > >    better model for userspace error recovery anyway.
> > >
> > > So from hw pov this will look _exactly_ like windows, except we never page
> > > fault.
> > >
> > >  From sw pov this will look _exactly_ like current kernel ringbuf model,
> > > with exactly same dma_fence semantics. If userspace lies, does something
> > > stupid or otherwise breaks the uapi contract, vm ptes stop invalid access
> > > and tdr kills it if it takes too long.
> > >
> > > Where do you need priviledge IB writes or anything like that?
> >
> > For writing the fence value and setting up the priority and VM registers.
>
> I'm confused. How does this work on windows then with pure userspace
> submit? Windows userspace sets its priorties and vm registers itself from
> userspace?

When the user allocates usermode queues, the kernel driver sets up a
queue descriptor in the kernel which defines the location of the queue
in memory, what priority it has, what page tables it should use, etc.
User mode can then start writing commands to its queues.  When they
are ready for the hardware to start executing them, they ring a
doorbell which signals the scheduler and it maps the queue descriptors
to HW queue slots and they start executing.  The user only has access
to its queues and any buffers it has mapped in its GPU virtual
address space.  While the queues are scheduled, the user can keep
submitting work to them and they will keep executing unless they get
preempted by the scheduler due to oversubscription or a priority call
or a request from the kernel driver to preempt, etc.
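
To make that flow concrete, here is a rough sketch of what such a queue
descriptor could look like; the struct layout and all field names below are
purely illustrative and are not the actual amdgpu or firmware interface:

/* Purely illustrative sketch -- not the real amdgpu or firmware UAPI;
 * every name and field here is hypothetical. */
#include <stdint.h>

struct user_queue_desc {
	uint64_t ring_base_va;     /* GPU VA of the userspace ring buffer */
	uint32_t ring_size_dw;     /* ring size in dwords */
	uint64_t rptr_va;          /* where the HW reports its read pointer */
	uint64_t wptr_va;          /* where userspace advances the write pointer */
	uint64_t doorbell_offset;  /* doorbell page offset assigned by the kernel */
	uint32_t priority;         /* scheduling priority, set by the kernel only */
	uint32_t pasid;            /* which page tables / address space to use */
};

/*
 * Flow as described above, simplified:
 *  1. a create-queue ioctl makes the kernel fill in a descriptor like this
 *     and register it with the HW scheduler;
 *  2. userspace writes packets into the ring and bumps *wptr_va;
 *  3. userspace rings the doorbell, the scheduler maps the queue to a HW
 *     slot and it runs until preempted (oversubscription, priority, or a
 *     preemption request from the kernel driver).
 */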

Alex
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28 14:45                                                       ` Christian König
@ 2021-04-29 11:07                                                         ` Daniel Vetter
  0 siblings, 0 replies; 105+ messages in thread
From: Daniel Vetter @ 2021-04-29 11:07 UTC (permalink / raw)
  To: Christian König; +Cc: dri-devel, ML Mesa-dev

On Wed, Apr 28, 2021 at 04:45:01PM +0200, Christian König wrote:
> Am 28.04.21 um 16:34 schrieb Daniel Vetter:
> > On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
> > > Am 28.04.21 um 15:34 schrieb Daniel Vetter:
> > > > On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> > > > > Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > > > > > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > > > > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > > > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
> > > > > > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > > > Ok. So that would only make the following use cases broken for now:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > - amd render -> external gpu
> > > > > > > > > > > > > - amd video encode -> network device
> > > > > > > > > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > > > > > > > > unhappy
> > > > > > > > > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > > > > > > > > AMD hardware.
> > > > > > > > > > > 
> > > > > > > > > > > Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> > > > > > > > > > > error, and not bad results on screen because nothing is synchronized
> > > > > > > > > > > anymore.
> > > > > > > > > > It's an upcoming requirement for windows[1], so you are likely to
> > > > > > > > > > start seeing this across all GPU vendors that support windows.  I
> > > > > > > > > > think the timing depends on how quickly the legacy hardware support
> > > > > > > > > > sticks around for each vendor.
> > > > > > > > > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > > > > > > > > support isolating the ringbuffer at all.
> > > > > > > > > 
> > > > > > > > > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > > > > > > > > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > > > > > > > > pte flags. Otherwise the entire "share address space with cpu side,
> > > > > > > > > seamlessly" thing is out of the window.
> > > > > > > > > 
> > > > > > > > > And with that r/o bit on the ringbuffer you can once more force submit
> > > > > > > > > through kernel space, and all the legacy dma_fence based stuff keeps
> > > > > > > > > working. And we don't have to invent some horrendous userspace fence based
> > > > > > > > > implicit sync mechanism in the kernel, but can instead do this transition
> > > > > > > > > properly with drm_syncobj timeline explicit sync and protocol reving.
> > > > > > > > > 
> > > > > > > > > At least I think you'd have to work extra hard to create a gpu which
> > > > > > > > > cannot possibly be intercepted by the kernel, even when it's designed to
> > > > > > > > > support userspace direct submit only.
> > > > > > > > > 
> > > > > > > > > Or are your hw engineers more creative here and we're screwed?
> > > > > > > > The upcomming hardware generation will have this hardware scheduler as a
> > > > > > > > must have, but there are certain ways we can still stick to the old
> > > > > > > > approach:
> > > > > > > > 
> > > > > > > > 1. The new hardware scheduler currently still supports kernel queues which
> > > > > > > > essentially is the same as the old hardware ring buffer.
> > > > > > > > 
> > > > > > > > 2. Mapping the top level ring buffer into the VM at least partially solves
> > > > > > > > the problem. This way you can't manipulate the ring buffer content, but the
> > > > > > > > location for the fence must still be writeable.
> > > > > > > Yeah allowing userspace to lie about completion fences in this model is
> > > > > > > ok. Though I haven't thought through full consequences of that, but I
> > > > > > > think it's not any worse than userspace lying about which buffers/address
> > > > > > > it uses in the current model - we rely on hw vm ptes to catch that stuff.
> > > > > > > 
> > > > > > > Also it might be good to switch to a non-recoverable ctx model for these.
> > > > > > > That's already what we do in i915 (opt-in, but all current umd use that
> > > > > > > mode). So any hang/watchdog just kills the entire ctx and you don't have
> > > > > > > to worry about userspace doing something funny with it's ringbuffer.
> > > > > > > Simplifies everything.
> > > > > > > 
> > > > > > > Also ofc userspace fencing still disallowed, but since userspace would
> > > > > > > queu up all writes to its ringbuffer through the drm/scheduler, we'd
> > > > > > > handle dependencies through that still. Not great, but workable.
> > > > > > > 
> > > > > > > Thinking about this, not even mapping the ringbuffer r/o is required, it's
> > > > > > > just that we must queue things throug the kernel to resolve dependencies
> > > > > > > and everything without breaking dma_fence. If userspace lies, tdr will
> > > > > > > shoot it and the kernel stops running that context entirely.
> > > > > Thinking more about that approach I don't think that it will work correctly.
> > > > > 
> > > > > See we not only need to write the fence as signal that an IB is submitted,
> > > > > but also adjust a bunch of privileged hardware registers.
> > > > > 
> > > > > When userspace could do that from its IBs as well then there is nothing
> > > > > blocking it from reprogramming the page table base address for example.
> > > > > 
> > > > > We could do those writes with the CPU as well, but that would be a huge
> > > > > performance drop because of the additional latency.
> > > > That's not what I'm suggesting. I'm suggesting you have the queue and
> > > > everything in userspace, like in wondows. Fences are exactly handled like
> > > > on windows too. The difference is:
> > > > 
> > > > - All new additions to the ringbuffer are done through a kernel ioctl
> > > >     call, using the drm/scheduler to resolve dependencies.
> > > > 
> > > > - Memory management is also done like today int that ioctl.
> > > > 
> > > > - TDR makes sure that if userspace abuses the contract (which it can, but
> > > >     it can do that already today because there's also no command parser to
> > > >     e.g. stop gpu semaphores) the entire context is shot and terminally
> > > >     killed. Userspace has to then set up a new one. This isn't how amdgpu
> > > >     recovery works right now, but i915 supports it and I think it's also the
> > > >     better model for userspace error recovery anyway.
> > > > 
> > > > So from hw pov this will look _exactly_ like windows, except we never page
> > > > fault.
> > > > 
> > > >   From sw pov this will look _exactly_ like current kernel ringbuf model,
> > > > with exactly same dma_fence semantics. If userspace lies, does something
> > > > stupid or otherwise breaks the uapi contract, vm ptes stop invalid access
> > > > and tdr kills it if it takes too long.
> > > > 
> > > > Where do you need priviledge IB writes or anything like that?
> > > For writing the fence value and setting up the priority and VM registers.
> > I'm confused. How does this work on windows then with pure userspace
> > submit? Windows userspace sets its priorties and vm registers itself from
> > userspace?
> 
> The priorities and VM registers are setup from the hw scheduler on windows,
> but this comes with preemption again.

The thing is, if the hw scheduler preempts your stuff a bit occasionally,
what's the problem? Essentially it just looks like each context is its
own queue that can make forward progress.

Also, I'm assuming there's some way in the windows model to make sure that
unprivileged userspace can't change the vm registers and priorities
itself. Those work the same in both worlds.

> And just letting the kernel write to the ring buffer hast the same problems
> as userspace fences. E.g. userspace could just overwrite the command which
> write the fence value with NOPs.
> 
> In other words we certainly need some kind of protection for the ring
> buffer, e.g. setting it readonly and making sure that it can always write
> the fence and is never preempted by the HW scheduler. But that protection
> breaks our neck at different places again.

My point is: You don't need protection. With the current cs ioctl
userspace can already do all kinds of nasty stuff and break itself. GPU
pagetables and TDR make sure nothing bad happens.

So imo you don't actually need to protect anything in the ring, as long
as you don't bother supporting recoverable TDR. You just declare the
entire ring shot and ask userspace to set up a new one.

> That solution could maybe work, but it is certainly not something we have
> tested.

Again, you can run the entire hw like on windows. The only thing you add
on top is that new stuff gets added to the userspace ring through the
kernel, so that the kernel can make sure all the resulting dma_fences are
still properly ordered, won't deadlock and will complete in due time (using
TDR). Also, the entire memory management works like now, but from a hw
point of view that's also not different. It just means that page faults
are never fixed, but the response is always that there's really no page
present at that slot.

I'm really not seeing the fundamental problem, nor why exactly you need a
completely different hw model here.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28 20:39                                                       ` Alex Deucher
@ 2021-04-29 11:12                                                         ` Daniel Vetter
  2021-04-30  8:58                                                           ` Daniel Vetter
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-04-29 11:12 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Christian König, dri-devel, ML Mesa-dev

On Wed, Apr 28, 2021 at 04:39:24PM -0400, Alex Deucher wrote:
> On Wed, Apr 28, 2021 at 10:35 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
> > > Am 28.04.21 um 15:34 schrieb Daniel Vetter:
> > > > On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> > > > > Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > > > > > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > > > > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > > > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
> > > > > > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > > Ok. So that would only make the following use cases broken for now:
> > > > > > > > > > > > >
> > > > > > > > > > > > > - amd render -> external gpu
> > > > > > > > > > > > > - amd video encode -> network device
> > > > > > > > > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > > > > > > > > unhappy
> > > > > > > > > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > > > > > > > > AMD hardware.
> > > > > > > > > > >
> > > > > > > > > > > Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> > > > > > > > > > > error, and not bad results on screen because nothing is synchronized
> > > > > > > > > > > anymore.
> > > > > > > > > > It's an upcoming requirement for windows[1], so you are likely to
> > > > > > > > > > start seeing this across all GPU vendors that support windows.  I
> > > > > > > > > > think the timing depends on how quickly the legacy hardware support
> > > > > > > > > > sticks around for each vendor.
> > > > > > > > > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > > > > > > > > support isolating the ringbuffer at all.
> > > > > > > > >
> > > > > > > > > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > > > > > > > > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > > > > > > > > pte flags. Otherwise the entire "share address space with cpu side,
> > > > > > > > > seamlessly" thing is out of the window.
> > > > > > > > >
> > > > > > > > > And with that r/o bit on the ringbuffer you can once more force submit
> > > > > > > > > through kernel space, and all the legacy dma_fence based stuff keeps
> > > > > > > > > working. And we don't have to invent some horrendous userspace fence based
> > > > > > > > > implicit sync mechanism in the kernel, but can instead do this transition
> > > > > > > > > properly with drm_syncobj timeline explicit sync and protocol reving.
> > > > > > > > >
> > > > > > > > > At least I think you'd have to work extra hard to create a gpu which
> > > > > > > > > cannot possibly be intercepted by the kernel, even when it's designed to
> > > > > > > > > support userspace direct submit only.
> > > > > > > > >
> > > > > > > > > Or are your hw engineers more creative here and we're screwed?
> > > > > > > > The upcomming hardware generation will have this hardware scheduler as a
> > > > > > > > must have, but there are certain ways we can still stick to the old
> > > > > > > > approach:
> > > > > > > >
> > > > > > > > 1. The new hardware scheduler currently still supports kernel queues which
> > > > > > > > essentially is the same as the old hardware ring buffer.
> > > > > > > >
> > > > > > > > 2. Mapping the top level ring buffer into the VM at least partially solves
> > > > > > > > the problem. This way you can't manipulate the ring buffer content, but the
> > > > > > > > location for the fence must still be writeable.
> > > > > > > Yeah allowing userspace to lie about completion fences in this model is
> > > > > > > ok. Though I haven't thought through full consequences of that, but I
> > > > > > > think it's not any worse than userspace lying about which buffers/address
> > > > > > > it uses in the current model - we rely on hw vm ptes to catch that stuff.
> > > > > > >
> > > > > > > Also it might be good to switch to a non-recoverable ctx model for these.
> > > > > > > That's already what we do in i915 (opt-in, but all current umd use that
> > > > > > > mode). So any hang/watchdog just kills the entire ctx and you don't have
> > > > > > > to worry about userspace doing something funny with it's ringbuffer.
> > > > > > > Simplifies everything.
> > > > > > >
> > > > > > > Also ofc userspace fencing still disallowed, but since userspace would
> > > > > > > queu up all writes to its ringbuffer through the drm/scheduler, we'd
> > > > > > > handle dependencies through that still. Not great, but workable.
> > > > > > >
> > > > > > > Thinking about this, not even mapping the ringbuffer r/o is required, it's
> > > > > > > just that we must queue things throug the kernel to resolve dependencies
> > > > > > > and everything without breaking dma_fence. If userspace lies, tdr will
> > > > > > > shoot it and the kernel stops running that context entirely.
> > > > > Thinking more about that approach I don't think that it will work correctly.
> > > > >
> > > > > See we not only need to write the fence as signal that an IB is submitted,
> > > > > but also adjust a bunch of privileged hardware registers.
> > > > >
> > > > > When userspace could do that from its IBs as well then there is nothing
> > > > > blocking it from reprogramming the page table base address for example.
> > > > >
> > > > > We could do those writes with the CPU as well, but that would be a huge
> > > > > performance drop because of the additional latency.
> > > > That's not what I'm suggesting. I'm suggesting you have the queue and
> > > > everything in userspace, like in wondows. Fences are exactly handled like
> > > > on windows too. The difference is:
> > > >
> > > > - All new additions to the ringbuffer are done through a kernel ioctl
> > > >    call, using the drm/scheduler to resolve dependencies.
> > > >
> > > > - Memory management is also done like today int that ioctl.
> > > >
> > > > - TDR makes sure that if userspace abuses the contract (which it can, but
> > > >    it can do that already today because there's also no command parser to
> > > >    e.g. stop gpu semaphores) the entire context is shot and terminally
> > > >    killed. Userspace has to then set up a new one. This isn't how amdgpu
> > > >    recovery works right now, but i915 supports it and I think it's also the
> > > >    better model for userspace error recovery anyway.
> > > >
> > > > So from hw pov this will look _exactly_ like windows, except we never page
> > > > fault.
> > > >
> > > >  From sw pov this will look _exactly_ like current kernel ringbuf model,
> > > > with exactly same dma_fence semantics. If userspace lies, does something
> > > > stupid or otherwise breaks the uapi contract, vm ptes stop invalid access
> > > > and tdr kills it if it takes too long.
> > > >
> > > > Where do you need priviledge IB writes or anything like that?
> > >
> > > For writing the fence value and setting up the priority and VM registers.
> >
> > I'm confused. How does this work on windows then with pure userspace
> > submit? Windows userspace sets its priorties and vm registers itself from
> > userspace?
> 
> When the user allocates usermode queues, the kernel driver sets up a
> queue descriptor in the kernel which defines the location of the queue
> in memory, what priority it has, what page tables it should use, etc.
> User mode can then start writing commands to its queues.  When they
> are ready for the hardware to start executing them, they ring a
> doorbell which signals the scheduler and it maps the queue descriptors
> to HW queue slots and they start executing.  The user only has access
> to it's queues and any buffers it has mapped in it's GPU virtual
> address space.  While the queues are scheduled, the user can keep
> submitting work to them and they will keep executing unless they get
> preempted by the scheduler due to oversubscription or a priority call
> or a request from the kernel driver to preempt, etc.

Yeah, works like with our stuff.

I don't see a problem tbh. It's slightly silly taking the detour through the
kernel ioctl, and it's annoying that you still have to use drm/scheduler
to resolve dependencies instead of gpu semaphores and all that. But this
only applies to legacy winsys mode, compute (e.g. vk without winsys) can
use the full power. Just needs a flag or something when setting up the
context.

And the best part is that from hw pov this really is indistinguishable from
the full on userspace submit model.

The thing where it gets annoying is when you use one of these new cpu
instructions which do direct submit to hw and pass along the PASID
behind the scenes. That's truly something you can't intercept anymore in
the kernel and fake the legacy dma_fence world.

But what you're describing here sounds like bog standard stuff, and also
pretty easy to keep working with exactly the current model.

Ofc we'll want to push forward a more modern model that better suits
modern gpus, but I don't see any hard requirement here from the hw side.

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-29 11:12                                                         ` Daniel Vetter
@ 2021-04-30  8:58                                                           ` Daniel Vetter
  2021-04-30  9:07                                                             ` Christian König
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-04-30  8:58 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Christian König, ML Mesa-dev, dri-devel

On Thu, Apr 29, 2021 at 1:12 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Wed, Apr 28, 2021 at 04:39:24PM -0400, Alex Deucher wrote:
> > On Wed, Apr 28, 2021 at 10:35 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Wed, Apr 28, 2021 at 03:37:49PM +0200, Christian König wrote:
> > > > Am 28.04.21 um 15:34 schrieb Daniel Vetter:
> > > > > On Wed, Apr 28, 2021 at 03:11:27PM +0200, Christian König wrote:
> > > > > > Am 28.04.21 um 14:26 schrieb Daniel Vetter:
> > > > > > > On Wed, Apr 28, 2021 at 02:21:54PM +0200, Daniel Vetter wrote:
> > > > > > > > On Wed, Apr 28, 2021 at 12:31:09PM +0200, Christian König wrote:
> > > > > > > > > Am 28.04.21 um 12:05 schrieb Daniel Vetter:
> > > > > > > > > > On Tue, Apr 27, 2021 at 02:01:20PM -0400, Alex Deucher wrote:
> > > > > > > > > > > On Tue, Apr 27, 2021 at 1:35 PM Simon Ser <contact@emersion.fr> wrote:
> > > > > > > > > > > > On Tuesday, April 27th, 2021 at 7:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > > Ok. So that would only make the following use cases broken for now:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - amd render -> external gpu
> > > > > > > > > > > > > > - amd video encode -> network device
> > > > > > > > > > > > > FWIW, "only" breaking amd render -> external gpu will make us pretty
> > > > > > > > > > > > > unhappy
> > > > > > > > > > > > I concur. I have quite a few users with a multi-GPU setup involving
> > > > > > > > > > > > AMD hardware.
> > > > > > > > > > > >
> > > > > > > > > > > > Note, if this brokenness can't be avoided, I'd prefer a to get a clear
> > > > > > > > > > > > error, and not bad results on screen because nothing is synchronized
> > > > > > > > > > > > anymore.
> > > > > > > > > > > It's an upcoming requirement for windows[1], so you are likely to
> > > > > > > > > > > start seeing this across all GPU vendors that support windows.  I
> > > > > > > > > > > think the timing depends on how quickly the legacy hardware support
> > > > > > > > > > > sticks around for each vendor.
> > > > > > > > > > Yeah but hw scheduling doesn't mean the hw has to be constructed to not
> > > > > > > > > > support isolating the ringbuffer at all.
> > > > > > > > > >
> > > > > > > > > > E.g. even if the hw loses the bit to put the ringbuffer outside of the
> > > > > > > > > > userspace gpu vm, if you have pagetables I'm seriously hoping you have r/o
> > > > > > > > > > pte flags. Otherwise the entire "share address space with cpu side,
> > > > > > > > > > seamlessly" thing is out of the window.
> > > > > > > > > >
> > > > > > > > > > And with that r/o bit on the ringbuffer you can once more force submit
> > > > > > > > > > through kernel space, and all the legacy dma_fence based stuff keeps
> > > > > > > > > > working. And we don't have to invent some horrendous userspace fence based
> > > > > > > > > > implicit sync mechanism in the kernel, but can instead do this transition
> > > > > > > > > > properly with drm_syncobj timeline explicit sync and protocol reving.
> > > > > > > > > >
> > > > > > > > > > At least I think you'd have to work extra hard to create a gpu which
> > > > > > > > > > cannot possibly be intercepted by the kernel, even when it's designed to
> > > > > > > > > > support userspace direct submit only.
> > > > > > > > > >
> > > > > > > > > > Or are your hw engineers more creative here and we're screwed?
> > > > > > > > > The upcomming hardware generation will have this hardware scheduler as a
> > > > > > > > > must have, but there are certain ways we can still stick to the old
> > > > > > > > > approach:
> > > > > > > > >
> > > > > > > > > 1. The new hardware scheduler currently still supports kernel queues which
> > > > > > > > > essentially is the same as the old hardware ring buffer.
> > > > > > > > >
> > > > > > > > > 2. Mapping the top level ring buffer into the VM at least partially solves
> > > > > > > > > the problem. This way you can't manipulate the ring buffer content, but the
> > > > > > > > > location for the fence must still be writeable.
> > > > > > > > Yeah allowing userspace to lie about completion fences in this model is
> > > > > > > > ok. Though I haven't thought through full consequences of that, but I
> > > > > > > > think it's not any worse than userspace lying about which buffers/address
> > > > > > > > it uses in the current model - we rely on hw vm ptes to catch that stuff.
> > > > > > > >
> > > > > > > > Also it might be good to switch to a non-recoverable ctx model for these.
> > > > > > > > That's already what we do in i915 (opt-in, but all current umd use that
> > > > > > > > mode). So any hang/watchdog just kills the entire ctx and you don't have
> > > > > > > > to worry about userspace doing something funny with it's ringbuffer.
> > > > > > > > Simplifies everything.
> > > > > > > >
> > > > > > > > Also ofc userspace fencing still disallowed, but since userspace would
> > > > > > > > queu up all writes to its ringbuffer through the drm/scheduler, we'd
> > > > > > > > handle dependencies through that still. Not great, but workable.
> > > > > > > >
> > > > > > > > Thinking about this, not even mapping the ringbuffer r/o is required, it's
> > > > > > > > just that we must queue things throug the kernel to resolve dependencies
> > > > > > > > and everything without breaking dma_fence. If userspace lies, tdr will
> > > > > > > > shoot it and the kernel stops running that context entirely.
> > > > > > Thinking more about that approach I don't think that it will work correctly.
> > > > > >
> > > > > > See we not only need to write the fence as signal that an IB is submitted,
> > > > > > but also adjust a bunch of privileged hardware registers.
> > > > > >
> > > > > > When userspace could do that from its IBs as well then there is nothing
> > > > > > blocking it from reprogramming the page table base address for example.
> > > > > >
> > > > > > We could do those writes with the CPU as well, but that would be a huge
> > > > > > performance drop because of the additional latency.
> > > > > That's not what I'm suggesting. I'm suggesting you have the queue and
> > > > > everything in userspace, like in wondows. Fences are exactly handled like
> > > > > on windows too. The difference is:
> > > > >
> > > > > - All new additions to the ringbuffer are done through a kernel ioctl
> > > > >    call, using the drm/scheduler to resolve dependencies.
> > > > >
> > > > > - Memory management is also done like today int that ioctl.
> > > > >
> > > > > - TDR makes sure that if userspace abuses the contract (which it can, but
> > > > >    it can do that already today because there's also no command parser to
> > > > >    e.g. stop gpu semaphores) the entire context is shot and terminally
> > > > >    killed. Userspace has to then set up a new one. This isn't how amdgpu
> > > > >    recovery works right now, but i915 supports it and I think it's also the
> > > > >    better model for userspace error recovery anyway.
> > > > >
> > > > > So from hw pov this will look _exactly_ like windows, except we never page
> > > > > fault.
> > > > >
> > > > >  From sw pov this will look _exactly_ like current kernel ringbuf model,
> > > > > with exactly same dma_fence semantics. If userspace lies, does something
> > > > > stupid or otherwise breaks the uapi contract, vm ptes stop invalid access
> > > > > and tdr kills it if it takes too long.
> > > > >
> > > > > Where do you need priviledge IB writes or anything like that?
> > > >
> > > > For writing the fence value and setting up the priority and VM registers.
> > >
> > > I'm confused. How does this work on windows then with pure userspace
> > > submit? Windows userspace sets its priorties and vm registers itself from
> > > userspace?
> >
> > When the user allocates usermode queues, the kernel driver sets up a
> > queue descriptor in the kernel which defines the location of the queue
> > in memory, what priority it has, what page tables it should use, etc.
> > User mode can then start writing commands to its queues.  When they
> > are ready for the hardware to start executing them, they ring a
> > doorbell which signals the scheduler and it maps the queue descriptors
> > to HW queue slots and they start executing.  The user only has access
> > to it's queues and any buffers it has mapped in it's GPU virtual
> > address space.  While the queues are scheduled, the user can keep
> > submitting work to them and they will keep executing unless they get
> > preempted by the scheduler due to oversubscription or a priority call
> > or a request from the kernel driver to preempt, etc.
>
> Yeah, works like with our stuff.
>
> I don't see a problem tbh. It's slightly silly going the detour with the
> kernel ioctl, and it's annoying that you still have to use drm/scheduler
> to resolve dependencies instead of gpu semaphores and all that. But this
> only applies to legacy winsys mode, compute (e.g. vk without winsys) can
> use the full power. Just needs a flag or something when setting up the
> context.
>
> And best part is that from hw pov this really is indistinguishable from
> the full on userspace submit model.
>
> The thing where it gets annoying is when you use one of these new cpu
> instructions which do direct submit to hw and pass along the pasid id
> behind the scenes. That's truly something you can't intercept anymore in
> the kernel and fake the legacy dma_fence world.
>
> But what you're describing here sounds like bog standard stuff, and also
> pretty easy to keep working with exactly the current model.
>
> Ofc we'll want to push forward a more modern model that better suits
> modern gpus, but I don't see any hard requirement here from the hw side.

Adding a bit more detail on what I have in mind:

- memory management works like amdgpu does today, so all buffers are
pre-bound to the gpu vm; we keep the entire bo set marked as busy with
the bulk lru trick for every command submission.

- for the ringbuffer, userspace allocates a suitably sized bo for the
ring itself, head/tail pointers, seqno and whatever else it needs

- userspace then asks the kernel to make that into a hw context, with
all the privileges set up. The doorbell will only be mapped into the kernel
(hw can't tell the difference anyway), but if it happens to also be
visible to userspace that's no problem. We assume userspace can ring
the doorbell anytime it wants to.

- we do double memory management: One dma_fence works similarly to the
amdkfd preempt fence, except it doesn't preempt but does anything
required to make the hw context unrunnable and take it out of the hw
scheduler entirely. This might involve unmapping the doorbell if
userspace has access to it.

- but we also do classic end-of-batch fences, so that implicit fencing
and all that keeps working. The "make hw ctx unrunnable" fence must
also wait for all of these pending submissions to complete.

- for the actual end-of-batchbuffer dma_fence it's almost all faked,
but with some checks in the kernel to keep up the guarantees. The cs flow
is roughly:

1. userspace directly writes into the userspace ringbuffer. It needs
to follow the kernel's rule for this if it wants things to work
correctly, but we assume evil userspace is allowed to write whatever
it wants to the ring, and change that whenever it wants. Userspace
does not update ring head/tail pointers.

2. the cs ioctl just contains: a) the head pointer value to write to kick
off this new batch (head is the thing userspace advances, tail is where
the gpu consumes), b) in-fences, c) out-fence.

3. kernel drm/scheduler handles this like any other request and first
waits for the in-fences to all signal, then it executes the CS. For
execution it simply writes the provided head value into the ring's
metadata, and rings the doorbells. No checks. We assume userspace can
update the tail whenever it feels like, so checking the head value is
pointless anyway.

4. the entire correctness depends only upon the dma_fences
working as they should. For that we need some very strict rules on
when the end-of-batchbuffer dma_fence signals:
- the drm/scheduler must have marked the request as runnable already,
i.e. all dependencies are fulfilled. This is to prevent the fences
from signalling in the wrong order.
- the fence from the previous batch must have signalled already, again
to guarantee in-order signalling (even if userspace does something
stupid and reorders how things complete)
- the fence must never jump back to unsignalled, so the lockless
fastpath that just checks the seqno is a no-go

5. if drm/scheduler tdr decides it's taking too long we throw the
entire context away, forbid further command submission on it (through
the ioctl, userspace can keep writing to the ring whatever it wants)
and fail all in-flight buffers with an error. Non-evil userspace can
then recover by re-creating a new ringbuffer with everything.

I've pondered this now for a bit and I really can't spot the holes.
And I think it should all work, both for hw and kernel/legacy
dma_fence use-case.
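
To spell out what steps 2-5 could look like, here is a minimal sketch; all
the userring_* names, the ioctl layout and the helpers are made up for
illustration only and are not an existing amdgpu or drm/scheduler interface:

/* Illustrative sketch only: the struct layout and helper names below are
 * invented to mirror steps 2-5 above, not an existing kernel interface. */
#include <stdint.h>
#include <stdbool.h>

struct userring_cs_args {              /* step 2: the whole cs ioctl payload */
	uint64_t new_head;             /* head value to write once deps are met */
	uint64_t in_fence_handles;     /* user pointer to syncobj handle array */
	uint32_t num_in_fences;
	uint32_t out_fence_handle;     /* receives the end-of-batch dma_fence */
};

struct userring_ctx {
	volatile uint64_t *ring_head;  /* privileged ring metadata, kernel-mapped */
	volatile uint32_t *doorbell;
	bool banned;                   /* set by TDR in step 5 */
	uint64_t last_signalled;       /* last seqno signalled, for step 4 */
};

/* step 3: called by the scheduler only after all in-fences have signalled */
static void userring_run_job(struct userring_ctx *ctx, uint64_t new_head)
{
	if (ctx->banned)               /* context shot by TDR: refuse new work */
		return;

	*ctx->ring_head = new_head;    /* no validation of the ring contents:  */
	*ctx->doorbell = 1;            /* VM PTEs and TDR catch any abuse      */
}

/* step 4: an end-of-batch fence may only signal strictly in order */
static bool userring_fence_may_signal(struct userring_ctx *ctx, uint64_t seqno)
{
	return !ctx->banned && seqno == ctx->last_signalled + 1;
}

The only property the kernel really has to enforce is the one in step 4:
the faked end-of-batch fence signals strictly in submission order and never
reverts, no matter what userspace did to its ring.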
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-30  8:58                                                           ` Daniel Vetter
@ 2021-04-30  9:07                                                             ` Christian König
  2021-04-30  9:35                                                               ` Daniel Vetter
  0 siblings, 1 reply; 105+ messages in thread
From: Christian König @ 2021-04-30  9:07 UTC (permalink / raw)
  To: Daniel Vetter, Alex Deucher; +Cc: dri-devel, ML Mesa-dev

Am 30.04.21 um 10:58 schrieb Daniel Vetter:
> [SNIP]
>>> When the user allocates usermode queues, the kernel driver sets up a
>>> queue descriptor in the kernel which defines the location of the queue
>>> in memory, what priority it has, what page tables it should use, etc.
>>> User mode can then start writing commands to its queues.  When they
>>> are ready for the hardware to start executing them, they ring a
>>> doorbell which signals the scheduler and it maps the queue descriptors
>>> to HW queue slots and they start executing.  The user only has access
>>> to it's queues and any buffers it has mapped in it's GPU virtual
>>> address space.  While the queues are scheduled, the user can keep
>>> submitting work to them and they will keep executing unless they get
>>> preempted by the scheduler due to oversubscription or a priority call
>>> or a request from the kernel driver to preempt, etc.
>> Yeah, works like with our stuff.
>>
>> I don't see a problem tbh. It's slightly silly going the detour with the
>> kernel ioctl, and it's annoying that you still have to use drm/scheduler
>> to resolve dependencies instead of gpu semaphores and all that. But this
>> only applies to legacy winsys mode, compute (e.g. vk without winsys) can
>> use the full power. Just needs a flag or something when setting up the
>> context.
>>
>> And best part is that from hw pov this really is indistinguishable from
>> the full on userspace submit model.
>>
>> The thing where it gets annoying is when you use one of these new cpu
>> instructions which do direct submit to hw and pass along the pasid id
>> behind the scenes. That's truly something you can't intercept anymore in
>> the kernel and fake the legacy dma_fence world.
>>
>> But what you're describing here sounds like bog standard stuff, and also
>> pretty easy to keep working with exactly the current model.
>>
>> Ofc we'll want to push forward a more modern model that better suits
>> modern gpus, but I don't see any hard requirement here from the hw side.
> Adding a bit more detail on what I have in mind:
>
> - memory management works like amdgpu does today, so all buffers are
> pre-bound to the gpu vm, we keep the entire bo set marked as busy with
> the bulk lru trick for every command submission.
>
> - for the ringbuffer, userspace allcoates a suitably sized bo for
> ringbuffer, ring/tail/seqno and whatever else it needs
>
> - userspace then asks the kernel to make that into a hw context, with
> all the priviledges setup. Doorbell will only be mapped into kernel
> (hw can't tell the difference anyway), but if it happens to also be
> visible to userspace that's no problem. We assume userspace can ring
> the doorbell anytime it wants to.

This doesn't work in hardware. We at least need to set up a few registers 
and memory locations from inside the VM which userspace shouldn't have 
access to if we want the end-of-batch fence and ring buffer start to 
be reliable.

> - we do double memory management: One dma_fence works similar to the
> amdkfd preempt fence, except it doesn't preempt but does anything
> required to make the hw context unrunnable and take it out of the hw
> scheduler entirely. This might involve unmapping the doorbell if
> userspace has access to it.
>
> - but we also do classic end-of-batch fences, so that implicit fencing
> and all that keeps working. The "make hw ctx unrunnable" fence must
> also wait for all of these pending submissions to complete.

Taken together this doesn't work from the software side: you can either 
have preemption fences or end-of-batch fences, but never both, because the 
end-of-batch fences would gain an additional dependency on the preemption 
fences, which we currently can't express in the dma_fence framework.

In addition it can't work from the hardware side because we have a 
separation between engine and scheduler there. So we can't reliably get a 
signal inside the kernel that a batch has completed.

What we could do is get this signal in userspace, e.g. userspace 
inserts the packets into the ring buffer and then the kernel can read 
the fence value and get the IV.

But this has the same problem as user fences because it requires the 
cooperation of userspace.

We just yesterday had a meeting with the firmware developers to discuss 
the possible options and I now have even stronger doubts that this is 
doable.

We either have user queues where userspace writes the necessary commands 
directly to the ring buffer or we have kernel queues. A mixture of both 
is supported by neither the hardware nor the firmware.

Regards,
Christian.

>
> - for the actual end-of-batchbuffer dma_fence it's almost all faked,
> but with some checks in the kernel to keep up the guarantees. cs flow
> is roughly
>
> 1. userspace directly writes into the userspace ringbuffer. It needs
> to follow the kernel's rule for this if it wants things to work
> correctly, but we assume evil userspace is allowed to write whatever
> it wants to the ring, and change that whenever it wants. Userspace
> does not update ring head/tail pointers.
>
> 2. cs ioctl just contains: a) head (the thing userspace advances, tail
> is where the gpu consumes) pointer value to write to kick of this new
> batch b) in-fences b) out-fence.
>
> 3. kernel drm/scheduler handles this like any other request and first
> waits for the in-fences to all signal, then it executes the CS. For
> execution it simply writes the provided head value into the ring's
> metadata, and rings the doorbells. No checks. We assume userspace can
> update the tail whenever it feels like, so checking the head value is
> pointless anyway.
>
> 4. the entire correctness is only depending upon the dma_fences
> working as they should. For that we need some very strict rules on
> when the end-of-batchbuffer dma_fence signals:
> - the drm/scheduler must have marked the request as runnable already,
> i.e. all dependencies are fullfilled. This is to prevent the fences
> from signalling in the wrong order.
> - the fence from the previous batch must have signalled already, again
> to guarantee in-order signalling (even if userspace does something
> stupid and reorders how things complete)
> - the fence must never jump back to unsignalled, so the lockless
> fastpath that just checks the seqno is a no-go
>
> 5. if drm/scheduler tdr decides it's taking too long we throw the
> entire context away, forbit further command submission on it (through
> the ioctl, userspace can keep writing to the ring whatever it wants)
> and fail all in-flight buffers with an error. Non-evil userspace can
> then recover by re-creating a new ringbuffer with everything.
>
> I've pondered this now for a bit and I really can't spot the holes.
> And I think it should all work, both for hw and kernel/legacy
> dma_fence use-case.
> -Daniel

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-30  9:07                                                             ` Christian König
@ 2021-04-30  9:35                                                               ` Daniel Vetter
  2021-04-30 10:17                                                                 ` Daniel Stone
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-04-30  9:35 UTC (permalink / raw)
  To: Christian König; +Cc: dri-devel, ML Mesa-dev

On Fri, Apr 30, 2021 at 11:08 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 30.04.21 um 10:58 schrieb Daniel Vetter:
> > [SNIP]
> >>> When the user allocates usermode queues, the kernel driver sets up a
> >>> queue descriptor in the kernel which defines the location of the queue
> >>> in memory, what priority it has, what page tables it should use, etc.
> >>> User mode can then start writing commands to its queues.  When they
> >>> are ready for the hardware to start executing them, they ring a
> >>> doorbell which signals the scheduler and it maps the queue descriptors
> >>> to HW queue slots and they start executing.  The user only has access
> >>> to its queues and any buffers it has mapped in its GPU virtual
> >>> address space.  While the queues are scheduled, the user can keep
> >>> submitting work to them and they will keep executing unless they get
> >>> preempted by the scheduler due to oversubscription or a priority call
> >>> or a request from the kernel driver to preempt, etc.
> >> Yeah, works like with our stuff.
> >>
> >> I don't see a problem tbh. It's slightly silly going the detour with the
> >> kernel ioctl, and it's annoying that you still have to use drm/scheduler
> >> to resolve dependencies instead of gpu semaphores and all that. But this
> >> only applies to legacy winsys mode, compute (e.g. vk without winsys) can
> >> use the full power. Just needs a flag or something when setting up the
> >> context.
> >>
> >> And best part is that from hw pov this really is indistinguishable from
> >> the full on userspace submit model.
> >>
> >> The thing where it gets annoying is when you use one of these new cpu
> >> instructions which do direct submit to hw and pass along the pasid id
> >> behind the scenes. That's truly something you can't intercept anymore in
> >> the kernel and fake the legacy dma_fence world.
> >>
> >> But what you're describing here sounds like bog standard stuff, and also
> >> pretty easy to keep working with exactly the current model.
> >>
> >> Ofc we'll want to push forward a more modern model that better suits
> >> modern gpus, but I don't see any hard requirement here from the hw side.
> > Adding a bit more detail on what I have in mind:
> >
> > - memory management works like amdgpu does today, so all buffers are
> > pre-bound to the gpu vm, we keep the entire bo set marked as busy with
> > the bulk lru trick for every command submission.
> >
> > - for the ringbuffer, userspace allocates a suitably sized bo for
> > ringbuffer, ring/tail/seqno and whatever else it needs
> >
> > - userspace then asks the kernel to make that into a hw context, with
> > all the privileges setup. Doorbell will only be mapped into kernel
> > (hw can't tell the difference anyway), but if it happens to also be
> > visible to userspace that's no problem. We assume userspace can ring
> > the doorbell anytime it wants to.
>
> This doesn't work in hardware. We at least need to setup a few registers
> and memory locations from inside the VM which userspace shouldn't have
> access to when we want the end of batch fence and ring buffer start to
> be reliable.

The thing is, we don't care whether it's reliable or not. Userspace is
allowed to lie, not signal, signal the wrong thing, out of order,
everything.

The design assumes all this is possible.

So unless you can't signal at all from userspace, this works. And for
the "can't signal at all" case it just means something needs to do a
cpu busy wait and burn lots of cpu time. I hope that's not your hw
design :-)
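
To spell out what that worst case would look like -- purely
illustrative, all names invented, and obviously not something anyone
would want to ship:

static int userq_poll_thread(void *data)
{
        struct userq *q = data;

        while (!kthread_should_stop()) {
                u64 cur = READ_ONCE(*q->seqno_cpu_addr);
                struct userq_fence *f, *tmp;

                spin_lock(&q->fence_lock);
                list_for_each_entry_safe(f, tmp, &q->pending, node) {
                        if (cur < f->wait_value)
                                break;  /* list is ordered by wait_value */
                        list_del(&f->node);
                        dma_fence_signal(&f->base);
                        dma_fence_put(&f->base);  /* drop the list's reference */
                }
                spin_unlock(&q->fence_lock);

                cpu_relax();  /* burn cpu time, as advertised */
        }
        return 0;
}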

> > - we do double memory management: One dma_fence works similar to the
> > amdkfd preempt fence, except it doesn't preempt but does anything
> > required to make the hw context unrunnable and take it out of the hw
> > scheduler entirely. This might involve unmapping the doorbell if
> > userspace has access to it.
> >
> > - but we also do classic end-of-batch fences, so that implicit fencing
> > and all that keeps working. The "make hw ctx unrunnable" fence must
> > also wait for all of these pending submissions to complete.
>
> This together doesn't work from the software side, e.g. you can either
> have preemption fences or end of batch fences but never both or your end
> of batch fences would have another dependency on the preemption fences
> which we currently can't express in the dma_fence framework.

It's _not_ a preempt fence. It's a ctx unload fence. Not the same
thing. A normal preempt fence would indeed fail.

> Additional to that it can't work from the hardware side because we have
> a separation between engine and scheduler on the hardware side. So we
> can't reliable get a signal inside the kernel that a batch has completed.
>
> What we could do is to get this signal in userspace, e.g. userspace
> inserts the packets into the ring buffer and then the kernel can read
> the fence value and get the IV.
>
> But this has the same problem as user fences because it requires the
> cooperation of userspace.

Nope. Read the thing again, I'm assuming that userspace lies. The
kernel's dma_fence code compensates for that.

Also note that userspace can already lie to its heart's content with
the current IB stuff. You are already allowed to hang the gpu, submit
utter garbage, render to the wrong buffer or just scribble all over
your own IB. This isn't a new problem.

> We just yesterday had a meeting with the firmware developers to discuss
> the possible options and I now have even stronger doubts that this is
> doable.
>
> We either have user queues where userspace writes the necessary commands
> directly to the ring buffer or we have kernel queues. A mixture of both
> isn't supported in neither the hardware nor the firmware.

Yup. Please read my thing again carefully, I'm stating that userspace
writes all the necessary commands directly into the ringbuffer.

The kernel writes _nothing_ into the ringbuffer. The only thing it
does is update the head pointer to unblock that next section of the
ring, when drm/scheduler thinks that's ok to do.
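
Concretely, the run_job hook could be as dumb as this -- a rough
sketch, field and helper names invented, not lifted from any real
driver:

static struct dma_fence *userq_run_job(struct drm_sched_job *sched_job)
{
        struct userq_job *job = to_userq_job(sched_job);
        struct userq *q = job->queue;

        /* All in-fences have signalled by the time drm/scheduler calls
         * this.  Publish the head value userspace passed in the cs
         * ioctl to unblock the next section of the ring -- no
         * validation of the ring contents whatsoever. */
        WRITE_ONCE(*q->head_cpu_addr, job->new_head);

        /* The head update must be visible before the doorbell write. */
        wmb();
        writel(q->doorbell_value, q->doorbell_mmio);

        /* The returned fence is the "faked" end-of-batch fence; the
         * driver signals it strictly in order and never un-signals it. */
        return dma_fence_get(job->out_fence);
}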

This works, you're just thinking of something completely different
from what I wrote down :-)

Cheers, Daniel

>
> Regards,
> Christian.
>
> >
> > - for the actual end-of-batchbuffer dma_fence it's almost all faked,
> > but with some checks in the kernel to keep up the guarantees. cs flow
> > is roughly
> >
> > 1. userspace directly writes into the userspace ringbuffer. It needs
> > to follow the kernel's rule for this if it wants things to work
> > correctly, but we assume evil userspace is allowed to write whatever
> > it wants to the ring, and change that whenever it wants. Userspace
> > does not update ring head/tail pointers.
> >
> > 2. cs ioctl just contains: a) head (the thing userspace advances, tail
> > is where the gpu consumes) pointer value to write to kick off this new
> > batch b) in-fences c) out-fence.
> >
> > 3. kernel drm/scheduler handles this like any other request and first
> > waits for the in-fences to all signal, then it executes the CS. For
> > execution it simply writes the provided head value into the ring's
> > metadata, and rings the doorbells. No checks. We assume userspace can
> > update the tail whenever it feels like, so checking the head value is
> > pointless anyway.
> >
> > 4. the entire correctness is only depending upon the dma_fences
> > working as they should. For that we need some very strict rules on
> > when the end-of-batchbuffer dma_fence signals:
> > - the drm/scheduler must have marked the request as runnable already,
> > i.e. all dependencies are fulfilled. This is to prevent the fences
> > from signalling in the wrong order.
> > - the fence from the previous batch must have signalled already, again
> > to guarantee in-order signalling (even if userspace does something
> > stupid and reorders how things complete)
> > - the fence must never jump back to unsignalled, so the lockless
> > fastpath that just checks the seqno is a no-go
> >
> > 5. if drm/scheduler tdr decides it's taking too long we throw the
> > entire context away, forbid further command submission on it (through
> > the ioctl, userspace can keep writing to the ring whatever it wants)
> > and fail all in-flight buffers with an error. Non-evil userspace can
> > then recover by re-creating a new ringbuffer with everything.
> >
> > I've pondered this now for a bit and I really can't spot the holes.
> > And I think it should all work, both for hw and kernel/legacy
> > dma_fence use-case.
> > -Daniel
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-30  9:35                                                               ` Daniel Vetter
@ 2021-04-30 10:17                                                                 ` Daniel Stone
  0 siblings, 0 replies; 105+ messages in thread
From: Daniel Stone @ 2021-04-30 10:17 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: Christian König, dri-devel, ML Mesa-dev

Hi,

On Fri, 30 Apr 2021 at 10:35, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Fri, Apr 30, 2021 at 11:08 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
> > This doesn't work in hardware. We at least need to setup a few registers
> > and memory locations from inside the VM which userspace shouldn't have
> > access to when we want the end of batch fence and ring buffer start to
> > be reliable.
>
> The thing is, we don't care whether it's reliable or not. Userspace is
> allowed to lie, not signal, signal the wrong thing, out of order,
> everything.
>
> The design assumes all this is possible.
>
> So unless you can't signal at all from userspace, this works. And for
> the "can't signal at all" it just means something needs to do a cpu
> busy wait and burn down lots of cpu time. I hope that's not your hw
> design :-)

I've been sitting this one out so far because what the other Dan has
proposed seems totally sensible and workable to me, so I'll let him
argue it rather than confuse things.

But - yes. Our threat model does not care about a malicious client
which deliberately submits garbage and then gets the compositor to
display garbage. If that's the attack, then you could just emit noise
from your frag shader.

Cheers,
Daniel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-04-28  9:07                             ` Michel Dänzer
  2021-04-28  9:57                               ` Daniel Vetter
@ 2021-05-01 22:27                               ` Marek Olšák
  2021-05-03 14:42                                 ` Alex Deucher
  1 sibling, 1 reply; 105+ messages in thread
From: Marek Olšák @ 2021-05-01 22:27 UTC (permalink / raw)
  To: Michel Dänzer; +Cc: Christian König, dri-devel, ML Mesa-dev


[-- Attachment #1.1: Type: text/plain, Size: 3336 bytes --]

On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer <michel@daenzer.net> wrote:

> On 2021-04-28 8:59 a.m., Christian König wrote:
> > Hi Dave,
> >
> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >> Supporting interop with any device is always possible. It depends on
> which drivers we need to interoperate with and update them. We've already
> found the path forward for amdgpu. We just need to find out how many other
> drivers need to be updated and evaluate the cost/benefit aspect.
> >>
> >> Marek
> >>
> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com <mailto:
> airlied@gmail.com>> wrote:
> >>
> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
> >>     <ckoenig.leichtzumerken@gmail.com <mailto:
> ckoenig.leichtzumerken@gmail.com>> wrote:
> >>     >
> >>     > Correct, we wouldn't have synchronization between device with and
> without user queues any more.
> >>     >
> >>     > That could only be a problem for A+I Laptops.
> >>
> >>     Since I think you mentioned you'd only be enabling this on newer
> >>     chipsets, won't it be a problem for A+A where one A is a generation
> >>     behind the other?
> >>
> >
> > Crap, that is a good point as well.
> >
> >>
> >>     I'm not really liking where this is going btw, seems like a ill
> >>     thought out concept, if AMD is really going down the road of
> designing
> >>     hw that is currently Linux incompatible, you are going to have to
> >>     accept a big part of the burden in bringing this support in to more
> >>     than just amd drivers for upcoming generations of gpu.
> >>
> >
> > Well we don't really like that either, but we have no other option as
> far as I can see.
>
> I don't really understand what "future hw may remove support for kernel
> queues" means exactly. While the per-context queues can be mapped to
> userspace directly, they don't *have* to be, do they? I.e. the kernel
> driver should be able to either intercept userspace access to the queues,
> or in the worst case do it all itself, and provide the existing
> synchronization semantics as needed?
>
> Surely there are resource limits for the per-context queues, so the kernel
> driver needs to do some kind of virtualization / multi-plexing anyway, or
> we'll get sad user faces when there's no queue available for <current hot
> game>.
>
> I'm probably missing something though, awaiting enlightenment. :)
>

The hw interface for userspace is that the ring buffer is mapped to the
process address space alongside a doorbell aperture (4K page) that isn't
real memory, but when the CPU writes into it, it tells the hw scheduler
that there are new GPU commands in the ring buffer. Userspace inserts all
the wait, draw, and signal commands into the ring buffer and then "rings"
the doorbell. It's my understanding that the ring buffer and the doorbell
are always mapped in the same GPU address space as the process, which makes
it very difficult to emulate the current protected ring buffers in the
kernel. The VMID of the ring buffer is also not changeable.

The hw scheduler doesn't do any synchronization and it doesn't see any
dependencies. It only chooses which queue to execute, so it's really just a
simple queue manager handling the virtualization aspect and not much else.
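
To make that concrete, the userspace side is roughly the following.
This is only a sketch -- the packet formats, the doorbell encoding and
all the names are made up:

#include <stdint.h>

struct user_queue {
        uint32_t          *ring;       /* ring buffer BO mapped into the process */
        uint32_t           ring_size;  /* in dwords */
        uint64_t           wptr;       /* monotonically increasing write pointer */
        volatile uint64_t *doorbell;   /* page in the 4K doorbell aperture */
};

static void user_queue_submit(struct user_queue *q,
                              const uint32_t *pkts, uint32_t count)
{
        /* Wait, draw and signal packets all go into the same ring. */
        for (uint32_t i = 0; i < count; i++)
                q->ring[(q->wptr + i) % q->ring_size] = pkts[i];
        q->wptr += count;

        /* The packets must be globally visible before the doorbell write. */
        __atomic_thread_fence(__ATOMIC_RELEASE);

        /* Not real memory: this store tells the hw scheduler that the
         * ring has new commands to fetch. */
        *q->doorbell = q->wptr;
}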

Marek

[-- Attachment #1.2: Type: text/html, Size: 4379 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-01 22:27                               ` Marek Olšák
@ 2021-05-03 14:42                                 ` Alex Deucher
  2021-05-03 14:59                                   ` Jason Ekstrand
  0 siblings, 1 reply; 105+ messages in thread
From: Alex Deucher @ 2021-05-03 14:42 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Christian König, Michel Dänzer, dri-devel, ML Mesa-dev

On Sat, May 1, 2021 at 6:27 PM Marek Olšák <maraeo@gmail.com> wrote:
>
> On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer <michel@daenzer.net> wrote:
>>
>> On 2021-04-28 8:59 a.m., Christian König wrote:
>> > Hi Dave,
>> >
>> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
>> >> Supporting interop with any device is always possible. It depends on which drivers we need to interoperate with and update them. We've already found the path forward for amdgpu. We just need to find out how many other drivers need to be updated and evaluate the cost/benefit aspect.
>> >>
>> >> Marek
>> >>
>> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com <mailto:airlied@gmail.com>> wrote:
>> >>
>> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
>> >>     <ckoenig.leichtzumerken@gmail.com <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>> >>     >
>> >>     > Correct, we wouldn't have synchronization between device with and without user queues any more.
>> >>     >
>> >>     > That could only be a problem for A+I Laptops.
>> >>
>> >>     Since I think you mentioned you'd only be enabling this on newer
>> >>     chipsets, won't it be a problem for A+A where one A is a generation
>> >>     behind the other?
>> >>
>> >
>> > Crap, that is a good point as well.
>> >
>> >>
>> >>     I'm not really liking where this is going btw, seems like a ill
>> >>     thought out concept, if AMD is really going down the road of designing
>> >>     hw that is currently Linux incompatible, you are going to have to
>> >>     accept a big part of the burden in bringing this support in to more
>> >>     than just amd drivers for upcoming generations of gpu.
>> >>
>> >
>> > Well we don't really like that either, but we have no other option as far as I can see.
>>
>> I don't really understand what "future hw may remove support for kernel queues" means exactly. While the per-context queues can be mapped to userspace directly, they don't *have* to be, do they? I.e. the kernel driver should be able to either intercept userspace access to the queues, or in the worst case do it all itself, and provide the existing synchronization semantics as needed?
>>
>> Surely there are resource limits for the per-context queues, so the kernel driver needs to do some kind of virtualization / multi-plexing anyway, or we'll get sad user faces when there's no queue available for <current hot game>.
>>
>> I'm probably missing something though, awaiting enlightenment. :)
>
>
> The hw interface for userspace is that the ring buffer is mapped to the process address space alongside a doorbell aperture (4K page) that isn't real memory, but when the CPU writes into it, it tells the hw scheduler that there are new GPU commands in the ring buffer. Userspace inserts all the wait, draw, and signal commands into the ring buffer and then "rings" the doorbell. It's my understanding that the ring buffer and the doorbell are always mapped in the same GPU address space as the process, which makes it very difficult to emulate the current protected ring buffers in the kernel. The VMID of the ring buffer is also not changeable.
>

The doorbell does not have to be mapped into the process's GPU virtual
address space.  The CPU could write to it directly.  Mapping it into
the GPU's virtual address space would, however, allow you to have a
device kick off work rather than the CPU.  E.g., the GPU could kick off
its own work, or multiple devices could kick off work without CPU
involvement.

Alex


> The hw scheduler doesn't do any synchronization and it doesn't see any dependencies. It only chooses which queue to execute, so it's really just a simple queue manager handling the virtualization aspect and not much else.
>
> Marek
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-03 14:42                                 ` Alex Deucher
@ 2021-05-03 14:59                                   ` Jason Ekstrand
  2021-05-03 15:03                                     ` Christian König
  2021-05-03 15:16                                     ` Bas Nieuwenhuizen
  0 siblings, 2 replies; 105+ messages in thread
From: Jason Ekstrand @ 2021-05-03 14:59 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Christian König, Michel Dänzer, dri-devel,
	Marek Olšák, ML Mesa-dev

Sorry for the top-post but there's no good thing to reply to here...

One of the things pointed out to me recently by Daniel Vetter that I
didn't fully understand before is that dma_buf has a very subtle
second requirement beyond finite time completion:  Nothing required
for signaling a dma-fence can allocate memory.  Why?  Because the act
of allocating memory may wait on your dma-fence.  This, as it turns
out, is a massively more strict requirement than finite time
completion and, I think, throws out all of the proposals we have so
far.
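
(For what it's worth, this is exactly the class of bug the
dma_fence_begin_signalling()/dma_fence_end_signalling() lockdep
annotations are meant to catch.  A sketch of how a driver's completion
path would use them, with invented names:)

static void my_job_complete(struct my_job *job)
{
        bool cookie = dma_fence_begin_signalling();

        /* Everything in this critical section must be able to run
         * without allocating memory; a GFP_KERNEL allocation (or a
         * lock shared with one) in here produces a lockdep splat. */
        dma_fence_signal(&job->fence);

        dma_fence_end_signalling(cookie);
}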

Take, for instance, Marek's proposal for userspace involvement with
dma-fence by asking the kernel for a next serial and the kernel
trusting userspace to signal it.  That doesn't work at all if
allocating memory to trigger a dma-fence can blow up.  There's simply
no way for the kernel to trust userspace to not do ANYTHING which
might allocate memory.  I don't even think there's a way userspace can
trust itself there.  It also blows up my plan of moving the fences to
transition boundaries.

Not sure where that leaves us.

--Jason

On Mon, May 3, 2021 at 9:42 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Sat, May 1, 2021 at 6:27 PM Marek Olšák <maraeo@gmail.com> wrote:
> >
> > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer <michel@daenzer.net> wrote:
> >>
> >> On 2021-04-28 8:59 a.m., Christian König wrote:
> >> > Hi Dave,
> >> >
> >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >> >> Supporting interop with any device is always possible. It depends on which drivers we need to interoperate with and update them. We've already found the path forward for amdgpu. We just need to find out how many other drivers need to be updated and evaluate the cost/benefit aspect.
> >> >>
> >> >> Marek
> >> >>
> >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com <mailto:airlied@gmail.com>> wrote:
> >> >>
> >> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
> >> >>     <ckoenig.leichtzumerken@gmail.com <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
> >> >>     >
> >> >>     > Correct, we wouldn't have synchronization between device with and without user queues any more.
> >> >>     >
> >> >>     > That could only be a problem for A+I Laptops.
> >> >>
> >> >>     Since I think you mentioned you'd only be enabling this on newer
> >> >>     chipsets, won't it be a problem for A+A where one A is a generation
> >> >>     behind the other?
> >> >>
> >> >
> >> > Crap, that is a good point as well.
> >> >
> >> >>
> >> >>     I'm not really liking where this is going btw, seems like a ill
> >> >>     thought out concept, if AMD is really going down the road of designing
> >> >>     hw that is currently Linux incompatible, you are going to have to
> >> >>     accept a big part of the burden in bringing this support in to more
> >> >>     than just amd drivers for upcoming generations of gpu.
> >> >>
> >> >
> >> > Well we don't really like that either, but we have no other option as far as I can see.
> >>
> >> I don't really understand what "future hw may remove support for kernel queues" means exactly. While the per-context queues can be mapped to userspace directly, they don't *have* to be, do they? I.e. the kernel driver should be able to either intercept userspace access to the queues, or in the worst case do it all itself, and provide the existing synchronization semantics as needed?
> >>
> >> Surely there are resource limits for the per-context queues, so the kernel driver needs to do some kind of virtualization / multi-plexing anyway, or we'll get sad user faces when there's no queue available for <current hot game>.
> >>
> >> I'm probably missing something though, awaiting enlightenment. :)
> >
> >
> > The hw interface for userspace is that the ring buffer is mapped to the process address space alongside a doorbell aperture (4K page) that isn't real memory, but when the CPU writes into it, it tells the hw scheduler that there are new GPU commands in the ring buffer. Userspace inserts all the wait, draw, and signal commands into the ring buffer and then "rings" the doorbell. It's my understanding that the ring buffer and the doorbell are always mapped in the same GPU address space as the process, which makes it very difficult to emulate the current protected ring buffers in the kernel. The VMID of the ring buffer is also not changeable.
> >
>
> The doorbell does not have to be mapped into the process's GPU virtual
> address space.  The CPU could write to it directly.  Mapping it into
> the GPU's virtual address space would allow you to have a device kick
> off work however rather than the CPU.  E.g., the GPU could kick off
> it's own work or multiple devices could kick off work without CPU
> involvement.
>
> Alex
>
>
> > The hw scheduler doesn't do any synchronization and it doesn't see any dependencies. It only chooses which queue to execute, so it's really just a simple queue manager handling the virtualization aspect and not much else.
> >
> > Marek
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-03 14:59                                   ` Jason Ekstrand
@ 2021-05-03 15:03                                     ` Christian König
  2021-05-03 15:15                                       ` Jason Ekstrand
  2021-05-03 15:16                                     ` Bas Nieuwenhuizen
  1 sibling, 1 reply; 105+ messages in thread
From: Christian König @ 2021-05-03 15:03 UTC (permalink / raw)
  To: Jason Ekstrand, Alex Deucher
  Cc: ML Mesa-dev, Michel Dänzer, dri-devel, Marek Olšák

Am 03.05.21 um 16:59 schrieb Jason Ekstrand:
> Sorry for the top-post but there's no good thing to reply to here...
>
> One of the things pointed out to me recently by Daniel Vetter that I
> didn't fully understand before is that dma_buf has a very subtle
> second requirement beyond finite time completion:  Nothing required
> for signaling a dma-fence can allocate memory.  Why?  Because the act
> of allocating memory may wait on your dma-fence.  This, as it turns
> out, is a massively more strict requirement than finite time
> completion and, I think, throws out all of the proposals we have so
> far.
>
> Take, for instance, Marek's proposal for userspace involvement with
> dma-fence by asking the kernel for a next serial and the kernel
> trusting userspace to signal it.  That doesn't work at all if
> allocating memory to trigger a dma-fence can blow up.  There's simply
> no way for the kernel to trust userspace to not do ANYTHING which
> might allocate memory.  I don't even think there's a way userspace can
> trust itself there.  It also blows up my plan of moving the fences to
> transition boundaries.
>
> Not sure where that leaves us.

Well at least I was perfectly aware of that :)

I'm currently experimenting with some sample code which would allow 
implicit sync with user fences.

Not that I'm pushing hard in that direction, but I just want to make 
clear how simple or complex the whole thing would be.
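
Just to give an idea of the shape of it -- this is not that sample
code, merely a sketch of the smallest possible hookup, with error
handling omitted:

/* Wrap the user fence in a dma_fence of some form, then attach it to
 * the shared BO so that legacy implicit-sync consumers wait on it. */
static void attach_user_fence(struct dma_buf *dmabuf, struct dma_fence *fence)
{
        struct dma_resv *resv = dmabuf->resv;

        dma_resv_lock(resv, NULL);
        dma_resv_add_excl_fence(resv, fence);
        dma_resv_unlock(resv);
}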

Christian.

>
> --Jason
>
> On Mon, May 3, 2021 at 9:42 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Sat, May 1, 2021 at 6:27 PM Marek Olšák <maraeo@gmail.com> wrote:
>>> On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer <michel@daenzer.net> wrote:
>>>> On 2021-04-28 8:59 a.m., Christian König wrote:
>>>>> Hi Dave,
>>>>>
>>>>> Am 27.04.21 um 21:23 schrieb Marek Olšák:
>>>>>> Supporting interop with any device is always possible. It depends on which drivers we need to interoperate with and update them. We've already found the path forward for amdgpu. We just need to find out how many other drivers need to be updated and evaluate the cost/benefit aspect.
>>>>>>
>>>>>> Marek
>>>>>>
>>>>>> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com <mailto:airlied@gmail.com>> wrote:
>>>>>>
>>>>>>      On Tue, 27 Apr 2021 at 22:06, Christian König
>>>>>>      <ckoenig.leichtzumerken@gmail.com <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>>>>>>      >
>>>>>>      > Correct, we wouldn't have synchronization between device with and without user queues any more.
>>>>>>      >
>>>>>>      > That could only be a problem for A+I Laptops.
>>>>>>
>>>>>>      Since I think you mentioned you'd only be enabling this on newer
>>>>>>      chipsets, won't it be a problem for A+A where one A is a generation
>>>>>>      behind the other?
>>>>>>
>>>>> Crap, that is a good point as well.
>>>>>
>>>>>>      I'm not really liking where this is going btw, seems like a ill
>>>>>>      thought out concept, if AMD is really going down the road of designing
>>>>>>      hw that is currently Linux incompatible, you are going to have to
>>>>>>      accept a big part of the burden in bringing this support in to more
>>>>>>      than just amd drivers for upcoming generations of gpu.
>>>>>>
>>>>> Well we don't really like that either, but we have no other option as far as I can see.
>>>> I don't really understand what "future hw may remove support for kernel queues" means exactly. While the per-context queues can be mapped to userspace directly, they don't *have* to be, do they? I.e. the kernel driver should be able to either intercept userspace access to the queues, or in the worst case do it all itself, and provide the existing synchronization semantics as needed?
>>>>
>>>> Surely there are resource limits for the per-context queues, so the kernel driver needs to do some kind of virtualization / multi-plexing anyway, or we'll get sad user faces when there's no queue available for <current hot game>.
>>>>
>>>> I'm probably missing something though, awaiting enlightenment. :)
>>>
>>> The hw interface for userspace is that the ring buffer is mapped to the process address space alongside a doorbell aperture (4K page) that isn't real memory, but when the CPU writes into it, it tells the hw scheduler that there are new GPU commands in the ring buffer. Userspace inserts all the wait, draw, and signal commands into the ring buffer and then "rings" the doorbell. It's my understanding that the ring buffer and the doorbell are always mapped in the same GPU address space as the process, which makes it very difficult to emulate the current protected ring buffers in the kernel. The VMID of the ring buffer is also not changeable.
>>>
>> The doorbell does not have to be mapped into the process's GPU virtual
>> address space.  The CPU could write to it directly.  Mapping it into
>> the GPU's virtual address space would allow you to have a device kick
>> off work however rather than the CPU.  E.g., the GPU could kick off
>> it's own work or multiple devices could kick off work without CPU
>> involvement.
>>
>> Alex
>>
>>
>>> The hw scheduler doesn't do any synchronization and it doesn't see any dependencies. It only chooses which queue to execute, so it's really just a simple queue manager handling the virtualization aspect and not much else.
>>>
>>> Marek
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>> _______________________________________________
>> mesa-dev mailing list
>> mesa-dev@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-03 15:03                                     ` Christian König
@ 2021-05-03 15:15                                       ` Jason Ekstrand
  0 siblings, 0 replies; 105+ messages in thread
From: Jason Ekstrand @ 2021-05-03 15:15 UTC (permalink / raw)
  To: Christian König
  Cc: ML Mesa-dev, Michel Dänzer, dri-devel, Marek Olšák

On Mon, May 3, 2021 at 10:03 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 03.05.21 um 16:59 schrieb Jason Ekstrand:
> > Sorry for the top-post but there's no good thing to reply to here...
> >
> > One of the things pointed out to me recently by Daniel Vetter that I
> > didn't fully understand before is that dma_buf has a very subtle
> > second requirement beyond finite time completion:  Nothing required
> > for signaling a dma-fence can allocate memory.  Why?  Because the act
> > of allocating memory may wait on your dma-fence.  This, as it turns
> > out, is a massively more strict requirement than finite time
> > completion and, I think, throws out all of the proposals we have so
> > far.
> >
> > Take, for instance, Marek's proposal for userspace involvement with
> > dma-fence by asking the kernel for a next serial and the kernel
> > trusting userspace to signal it.  That doesn't work at all if
> > allocating memory to trigger a dma-fence can blow up.  There's simply
> > no way for the kernel to trust userspace to not do ANYTHING which
> > might allocate memory.  I don't even think there's a way userspace can
> > trust itself there.  It also blows up my plan of moving the fences to
> > transition boundaries.
> >
> > Not sure where that leaves us.
>
> Well at least I was perfectly aware of that :)

I'd have been a bit disappointed if this had been news to you. :-P
However, there are a number of us plebeians on the thread who need
things spelled out sometimes. :-)

> I'm currently experimenting with some sample code which would allow
> implicit sync with user fences.
>
> Not that I'm pushing hard into that directly, but I just want to make
> clear how simple or complex the whole thing would be.

I'd like to see that.  It'd be good to know what our options are.
Honestly, if we can get implicit sync somehow without tying our hands
w.r.t. how fences work in modern drivers, that opens a lot of doors.

--Jason

> Christian.
>
> >
> > --Jason
> >
> > On Mon, May 3, 2021 at 9:42 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> >> On Sat, May 1, 2021 at 6:27 PM Marek Olšák <maraeo@gmail.com> wrote:
> >>> On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer <michel@daenzer.net> wrote:
> >>>> On 2021-04-28 8:59 a.m., Christian König wrote:
> >>>>> Hi Dave,
> >>>>>
> >>>>> Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >>>>>> Supporting interop with any device is always possible. It depends on which drivers we need to interoperate with and update them. We've already found the path forward for amdgpu. We just need to find out how many other drivers need to be updated and evaluate the cost/benefit aspect.
> >>>>>>
> >>>>>> Marek
> >>>>>>
> >>>>>> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com <mailto:airlied@gmail.com>> wrote:
> >>>>>>
> >>>>>>      On Tue, 27 Apr 2021 at 22:06, Christian König
> >>>>>>      <ckoenig.leichtzumerken@gmail.com <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
> >>>>>>      >
> >>>>>>      > Correct, we wouldn't have synchronization between device with and without user queues any more.
> >>>>>>      >
> >>>>>>      > That could only be a problem for A+I Laptops.
> >>>>>>
> >>>>>>      Since I think you mentioned you'd only be enabling this on newer
> >>>>>>      chipsets, won't it be a problem for A+A where one A is a generation
> >>>>>>      behind the other?
> >>>>>>
> >>>>> Crap, that is a good point as well.
> >>>>>
> >>>>>>      I'm not really liking where this is going btw, seems like a ill
> >>>>>>      thought out concept, if AMD is really going down the road of designing
> >>>>>>      hw that is currently Linux incompatible, you are going to have to
> >>>>>>      accept a big part of the burden in bringing this support in to more
> >>>>>>      than just amd drivers for upcoming generations of gpu.
> >>>>>>
> >>>>> Well we don't really like that either, but we have no other option as far as I can see.
> >>>> I don't really understand what "future hw may remove support for kernel queues" means exactly. While the per-context queues can be mapped to userspace directly, they don't *have* to be, do they? I.e. the kernel driver should be able to either intercept userspace access to the queues, or in the worst case do it all itself, and provide the existing synchronization semantics as needed?
> >>>>
> >>>> Surely there are resource limits for the per-context queues, so the kernel driver needs to do some kind of virtualization / multi-plexing anyway, or we'll get sad user faces when there's no queue available for <current hot game>.
> >>>>
> >>>> I'm probably missing something though, awaiting enlightenment. :)
> >>>
> >>> The hw interface for userspace is that the ring buffer is mapped to the process address space alongside a doorbell aperture (4K page) that isn't real memory, but when the CPU writes into it, it tells the hw scheduler that there are new GPU commands in the ring buffer. Userspace inserts all the wait, draw, and signal commands into the ring buffer and then "rings" the doorbell. It's my understanding that the ring buffer and the doorbell are always mapped in the same GPU address space as the process, which makes it very difficult to emulate the current protected ring buffers in the kernel. The VMID of the ring buffer is also not changeable.
> >>>
> >> The doorbell does not have to be mapped into the process's GPU virtual
> >> address space.  The CPU could write to it directly.  Mapping it into
> >> the GPU's virtual address space would allow you to have a device kick
> >> off work however rather than the CPU.  E.g., the GPU could kick off
> >> it's own work or multiple devices could kick off work without CPU
> >> involvement.
> >>
> >> Alex
> >>
> >>
> >>> The hw scheduler doesn't do any synchronization and it doesn't see any dependencies. It only chooses which queue to execute, so it's really just a simple queue manager handling the virtualization aspect and not much else.
> >>>
> >>> Marek
> >>> _______________________________________________
> >>> dri-devel mailing list
> >>> dri-devel@lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >> _______________________________________________
> >> mesa-dev mailing list
> >> mesa-dev@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-03 14:59                                   ` Jason Ekstrand
  2021-05-03 15:03                                     ` Christian König
@ 2021-05-03 15:16                                     ` Bas Nieuwenhuizen
  2021-05-03 15:23                                       ` Jason Ekstrand
  1 sibling, 1 reply; 105+ messages in thread
From: Bas Nieuwenhuizen @ 2021-05-03 15:16 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Marek Olšák, Christian König, Michel Dänzer,
	dri-devel, ML Mesa-dev

On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
>
> Sorry for the top-post but there's no good thing to reply to here...
>
> One of the things pointed out to me recently by Daniel Vetter that I
> didn't fully understand before is that dma_buf has a very subtle
> second requirement beyond finite time completion:  Nothing required
> for signaling a dma-fence can allocate memory.  Why?  Because the act
> of allocating memory may wait on your dma-fence.  This, as it turns
> out, is a massively more strict requirement than finite time
> completion and, I think, throws out all of the proposals we have so
> far.
>
> Take, for instance, Marek's proposal for userspace involvement with
> dma-fence by asking the kernel for a next serial and the kernel
> trusting userspace to signal it.  That doesn't work at all if
> allocating memory to trigger a dma-fence can blow up.  There's simply
> no way for the kernel to trust userspace to not do ANYTHING which
> might allocate memory.  I don't even think there's a way userspace can
> trust itself there.  It also blows up my plan of moving the fences to
> transition boundaries.
>
> Not sure where that leaves us.

Honestly, the more I look at things, the more I think userspace-signalable
fences with a timeout are a valid solution for these issues. Especially
since (as has been mentioned countless times in this email thread)
userspace already has a lot of ways to cause timeouts and/or GPU hangs
through its GPU work.

Adding a timeout on the signaling side of a dma_fence would ensure:

- The dma_fence signals in finite time
- If the timeout path does not allocate memory, then memory allocation
is not a blocker for signaling.
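
A minimal sketch of what I mean -- the names are invented and the
dma_fence_ops are omitted:

struct user_fence {
        struct dma_fence  base;
        spinlock_t        lock;
        struct timer_list timeout;
};

static void user_fence_timeout(struct timer_list *t)
{
        struct user_fence *f = from_timer(f, t, timeout);

        /* Force-signal so the fence completes in finite time even if
         * userspace never signals.  Nothing here allocates memory. */
        if (!dma_fence_is_signaled(&f->base)) {
                dma_fence_set_error(&f->base, -ETIMEDOUT);
                dma_fence_signal(&f->base);
        }
}

static void user_fence_init(struct user_fence *f,
                            const struct dma_fence_ops *ops,
                            u64 context, u64 seqno,
                            unsigned long timeout_jiffies)
{
        spin_lock_init(&f->lock);
        dma_fence_init(&f->base, ops, &f->lock, context, seqno);
        timer_setup(&f->timeout, user_fence_timeout, 0);
        mod_timer(&f->timeout, jiffies + timeout_jiffies);
}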

Of course you lose the full dependency graph, and we need to make sure
garbage collection of fences works correctly when we have cycles.
However, the latter sounds very doable and the former sounds like it is
to some extent inevitable.

I feel like I'm missing some requirement here, given that we
immediately went to much more complicated things, but I can't find it.
Thoughts?

- Bas
>
> --Jason
>
> On Mon, May 3, 2021 at 9:42 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Sat, May 1, 2021 at 6:27 PM Marek Olšák <maraeo@gmail.com> wrote:
> > >
> > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer <michel@daenzer.net> wrote:
> > >>
> > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> > >> > Hi Dave,
> > >> >
> > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > >> >> Supporting interop with any device is always possible. It depends on which drivers we need to interoperate with and update them. We've already found the path forward for amdgpu. We just need to find out how many other drivers need to be updated and evaluate the cost/benefit aspect.
> > >> >>
> > >> >> Marek
> > >> >>
> > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com <mailto:airlied@gmail.com>> wrote:
> > >> >>
> > >> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
> > >> >>     <ckoenig.leichtzumerken@gmail.com <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
> > >> >>     >
> > >> >>     > Correct, we wouldn't have synchronization between device with and without user queues any more.
> > >> >>     >
> > >> >>     > That could only be a problem for A+I Laptops.
> > >> >>
> > >> >>     Since I think you mentioned you'd only be enabling this on newer
> > >> >>     chipsets, won't it be a problem for A+A where one A is a generation
> > >> >>     behind the other?
> > >> >>
> > >> >
> > >> > Crap, that is a good point as well.
> > >> >
> > >> >>
> > >> >>     I'm not really liking where this is going btw, seems like a ill
> > >> >>     thought out concept, if AMD is really going down the road of designing
> > >> >>     hw that is currently Linux incompatible, you are going to have to
> > >> >>     accept a big part of the burden in bringing this support in to more
> > >> >>     than just amd drivers for upcoming generations of gpu.
> > >> >>
> > >> >
> > >> > Well we don't really like that either, but we have no other option as far as I can see.
> > >>
> > >> I don't really understand what "future hw may remove support for kernel queues" means exactly. While the per-context queues can be mapped to userspace directly, they don't *have* to be, do they? I.e. the kernel driver should be able to either intercept userspace access to the queues, or in the worst case do it all itself, and provide the existing synchronization semantics as needed?
> > >>
> > >> Surely there are resource limits for the per-context queues, so the kernel driver needs to do some kind of virtualization / multi-plexing anyway, or we'll get sad user faces when there's no queue available for <current hot game>.
> > >>
> > >> I'm probably missing something though, awaiting enlightenment. :)
> > >
> > >
> > > The hw interface for userspace is that the ring buffer is mapped to the process address space alongside a doorbell aperture (4K page) that isn't real memory, but when the CPU writes into it, it tells the hw scheduler that there are new GPU commands in the ring buffer. Userspace inserts all the wait, draw, and signal commands into the ring buffer and then "rings" the doorbell. It's my understanding that the ring buffer and the doorbell are always mapped in the same GPU address space as the process, which makes it very difficult to emulate the current protected ring buffers in the kernel. The VMID of the ring buffer is also not changeable.
> > >
> >
> > The doorbell does not have to be mapped into the process's GPU virtual
> > address space.  The CPU could write to it directly.  Mapping it into
> > the GPU's virtual address space would allow you to have a device kick
> > off work however rather than the CPU.  E.g., the GPU could kick off
> > it's own work or multiple devices could kick off work without CPU
> > involvement.
> >
> > Alex
> >
> >
> > > The hw scheduler doesn't do any synchronization and it doesn't see any dependencies. It only chooses which queue to execute, so it's really just a simple queue manager handling the virtualization aspect and not much else.
> > >
> > > Marek
> > > _______________________________________________
> > > dri-devel mailing list
> > > dri-devel@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > _______________________________________________
> > mesa-dev mailing list
> > mesa-dev@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-03 15:16                                     ` Bas Nieuwenhuizen
@ 2021-05-03 15:23                                       ` Jason Ekstrand
  2021-05-03 20:36                                         ` Marek Olšák
  0 siblings, 1 reply; 105+ messages in thread
From: Jason Ekstrand @ 2021-05-03 15:23 UTC (permalink / raw)
  To: Bas Nieuwenhuizen
  Cc: Marek Olšák, Christian König, Michel Dänzer,
	dri-devel, ML Mesa-dev

On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
<bas@basnieuwenhuizen.nl> wrote:
>
> On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand <jason@jlekstrand.net> wrote:
> >
> > Sorry for the top-post but there's no good thing to reply to here...
> >
> > One of the things pointed out to me recently by Daniel Vetter that I
> > didn't fully understand before is that dma_buf has a very subtle
> > second requirement beyond finite time completion:  Nothing required
> > for signaling a dma-fence can allocate memory.  Why?  Because the act
> > of allocating memory may wait on your dma-fence.  This, as it turns
> > out, is a massively more strict requirement than finite time
> > completion and, I think, throws out all of the proposals we have so
> > far.
> >
> > Take, for instance, Marek's proposal for userspace involvement with
> > dma-fence by asking the kernel for a next serial and the kernel
> > trusting userspace to signal it.  That doesn't work at all if
> > allocating memory to trigger a dma-fence can blow up.  There's simply
> > no way for the kernel to trust userspace to not do ANYTHING which
> > might allocate memory.  I don't even think there's a way userspace can
> > trust itself there.  It also blows up my plan of moving the fences to
> > transition boundaries.
> >
> > Not sure where that leaves us.
>
> Honestly the more I look at things I think userspace-signalable fences
> with a timeout sound like they are a valid solution for these issues.
> Especially since (as has been mentioned countless times in this email
> thread) userspace already has a lot of ways to cause timeouts and or
> GPU hangs through GPU work already.
>
> Adding a timeout on the signaling side of a dma_fence would ensure:
>
> - The dma_fence signals in finite time
> -  If the timeout case does not allocate memory then memory allocation
> is not a blocker for signaling.
>
> Of course you lose the full dependency graph and we need to make sure
> garbage collection of fences works correctly when we have cycles.
> However, the latter sounds very doable and the first sounds like it is
> to some extent inevitable.
>
> I feel like I'm missing some requirement here given that we
> immediately went to much more complicated things but can't find it.
> Thoughts?

Timeouts are sufficient to protect the kernel but they make the fences
unpredictable and unreliable from a userspace PoV.  One of the big
problems we face is that, once we expose a dma_fence to userspace,
we've allowed for some pretty crazy potential dependencies that
neither userspace nor the kernel can sort out.  Say you have Marek's
"next serial, please" proposal and a multi-threaded application.
Between the time you ask the kernel for a serial and get a dma_fence
and the time you submit the work to signal that serial, your process
may get preempted, something else gets shoved in which allocates
memory, and then we end up blocking on that dma_fence.  There's no way
userspace can predict and defend itself from that.

So I think where that leaves us is that there is no safe place to
create a dma_fence except for inside the ioctl which submits the work
and only after any necessary memory has been allocated.  That's a
pretty stiff requirement.  We may still be able to interact with
userspace a bit more explicitly but I think it throws any notion of
userspace direct submit out the window.
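
In other words, the only safe shape for the submission path is roughly
the following -- a sketch with invented names, not any particular
driver:

static int submit_ioctl(struct drm_device *dev, void *data,
                        struct drm_file *file)
{
        struct my_submit_args *args = data;
        struct my_job *job;
        int ret;

        /* 1. Everything that can allocate memory or otherwise fail
         *    happens first, before any fence exists. */
        job = kzalloc(sizeof(*job), GFP_KERNEL);
        if (!job)
                return -ENOMEM;
        spin_lock_init(&job->fence_lock);
        ret = pin_and_validate_buffers(job, args);  /* also fills in job->ctx */
        if (ret)
                goto err_free;

        /* 2. Only now create the dma_fence.  From here on, nothing may
         *    allocate memory or block on anything that might. */
        dma_fence_init(&job->fence, &my_fence_ops, &job->fence_lock,
                       job->ctx->fence_context, ++job->ctx->seqno);

        /* 3. Publish the fence and hand the job to the hw/scheduler;
         *    this step is not allowed to fail. */
        push_job_to_hw(job);
        return 0;

err_free:
        kfree(job);
        return ret;
}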

--Jason


> - Bas
> >
> > --Jason
> >
> > On Mon, May 3, 2021 at 9:42 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> > >
> > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák <maraeo@gmail.com> wrote:
> > > >
> > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer <michel@daenzer.net> wrote:
> > > >>
> > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> > > >> > Hi Dave,
> > > >> >
> > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > > >> >> Supporting interop with any device is always possible. It depends on which drivers we need to interoperate with and update them. We've already found the path forward for amdgpu. We just need to find out how many other drivers need to be updated and evaluate the cost/benefit aspect.
> > > >> >>
> > > >> >> Marek
> > > >> >>
> > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com <mailto:airlied@gmail.com>> wrote:
> > > >> >>
> > > >> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
> > > >> >>     <ckoenig.leichtzumerken@gmail.com <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
> > > >> >>     >
> > > >> >>     > Correct, we wouldn't have synchronization between device with and without user queues any more.
> > > >> >>     >
> > > >> >>     > That could only be a problem for A+I Laptops.
> > > >> >>
> > > >> >>     Since I think you mentioned you'd only be enabling this on newer
> > > >> >>     chipsets, won't it be a problem for A+A where one A is a generation
> > > >> >>     behind the other?
> > > >> >>
> > > >> >
> > > >> > Crap, that is a good point as well.
> > > >> >
> > > >> >>
> > > >> >>     I'm not really liking where this is going btw, seems like a ill
> > > >> >>     thought out concept, if AMD is really going down the road of designing
> > > >> >>     hw that is currently Linux incompatible, you are going to have to
> > > >> >>     accept a big part of the burden in bringing this support in to more
> > > >> >>     than just amd drivers for upcoming generations of gpu.
> > > >> >>
> > > >> >
> > > >> > Well we don't really like that either, but we have no other option as far as I can see.
> > > >>
> > > >> I don't really understand what "future hw may remove support for kernel queues" means exactly. While the per-context queues can be mapped to userspace directly, they don't *have* to be, do they? I.e. the kernel driver should be able to either intercept userspace access to the queues, or in the worst case do it all itself, and provide the existing synchronization semantics as needed?
> > > >>
> > > >> Surely there are resource limits for the per-context queues, so the kernel driver needs to do some kind of virtualization / multi-plexing anyway, or we'll get sad user faces when there's no queue available for <current hot game>.
> > > >>
> > > >> I'm probably missing something though, awaiting enlightenment. :)
> > > >
> > > >
> > > > The hw interface for userspace is that the ring buffer is mapped to the process address space alongside a doorbell aperture (4K page) that isn't real memory, but when the CPU writes into it, it tells the hw scheduler that there are new GPU commands in the ring buffer. Userspace inserts all the wait, draw, and signal commands into the ring buffer and then "rings" the doorbell. It's my understanding that the ring buffer and the doorbell are always mapped in the same GPU address space as the process, which makes it very difficult to emulate the current protected ring buffers in the kernel. The VMID of the ring buffer is also not changeable.
> > > >
> > >
> > > The doorbell does not have to be mapped into the process's GPU virtual
> > > address space.  The CPU could write to it directly.  Mapping it into
> > > the GPU's virtual address space would allow you to have a device kick
> > > off work however rather than the CPU.  E.g., the GPU could kick off
> > > it's own work or multiple devices could kick off work without CPU
> > > involvement.
> > >
> > > Alex
> > >
> > >
> > > > The hw scheduler doesn't do any synchronization and it doesn't see any dependencies. It only chooses which queue to execute, so it's really just a simple queue manager handling the virtualization aspect and not much else.
> > > >
> > > > Marek
> > > > _______________________________________________
> > > > dri-devel mailing list
> > > > dri-devel@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > _______________________________________________
> > > mesa-dev mailing list
> > > mesa-dev@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> > _______________________________________________
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-03 15:23                                       ` Jason Ekstrand
@ 2021-05-03 20:36                                         ` Marek Olšák
  2021-05-04  3:11                                           ` Marek Olšák
  0 siblings, 1 reply; 105+ messages in thread
From: Marek Olšák @ 2021-05-03 20:36 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Christian König, Michel Dänzer, dri-devel, ML Mesa-dev


[-- Attachment #1.1: Type: text/plain, Size: 9067 bytes --]

What about direct submit from the kernel where the process still has write
access to the GPU ring buffer but doesn't use it? I think that solves your
preemption example, but leaves a potential backdoor for a process to
overwrite the signal commands, which shouldn't be a problem since we are OK
with timeouts.

Marek

On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand <jason@jlekstrand.net> wrote:

> On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
> <bas@basnieuwenhuizen.nl> wrote:
> >
> > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand <jason@jlekstrand.net>
> wrote:
> > >
> > > Sorry for the top-post but there's no good thing to reply to here...
> > >
> > > One of the things pointed out to me recently by Daniel Vetter that I
> > > didn't fully understand before is that dma_buf has a very subtle
> > > second requirement beyond finite time completion:  Nothing required
> > > for signaling a dma-fence can allocate memory.  Why?  Because the act
> > > of allocating memory may wait on your dma-fence.  This, as it turns
> > > out, is a massively more strict requirement than finite time
> > > completion and, I think, throws out all of the proposals we have so
> > > far.
> > >
> > > Take, for instance, Marek's proposal for userspace involvement with
> > > dma-fence by asking the kernel for a next serial and the kernel
> > > trusting userspace to signal it.  That doesn't work at all if
> > > allocating memory to trigger a dma-fence can blow up.  There's simply
> > > no way for the kernel to trust userspace to not do ANYTHING which
> > > might allocate memory.  I don't even think there's a way userspace can
> > > trust itself there.  It also blows up my plan of moving the fences to
> > > transition boundaries.
> > >
> > > Not sure where that leaves us.
> >
> > Honestly the more I look at things I think userspace-signalable fences
> > with a timeout sound like they are a valid solution for these issues.
> > Especially since (as has been mentioned countless times in this email
> > thread) userspace already has a lot of ways to cause timeouts and or
> > GPU hangs through GPU work already.
> >
> > Adding a timeout on the signaling side of a dma_fence would ensure:
> >
> > - The dma_fence signals in finite time
> > -  If the timeout case does not allocate memory then memory allocation
> > is not a blocker for signaling.
> >
> > Of course you lose the full dependency graph and we need to make sure
> > garbage collection of fences works correctly when we have cycles.
> > However, the latter sounds very doable and the first sounds like it is
> > to some extent inevitable.
> >
> > I feel like I'm missing some requirement here given that we
> > immediately went to much more complicated things but can't find it.
> > Thoughts?
>
> Timeouts are sufficient to protect the kernel but they make the fences
> unpredictable and unreliable from a userspace PoV.  One of the big
> problems we face is that, once we expose a dma_fence to userspace,
> we've allowed for some pretty crazy potential dependencies that
> neither userspace nor the kernel can sort out.  Say you have marek's
> "next serial, please" proposal and a multi-threaded application.
> Between time time you ask the kernel for a serial and get a dma_fence
> and submit the work to signal that serial, your process may get
> preempted, something else shoved in which allocates memory, and then
> we end up blocking on that dma_fence.  There's no way userspace can
> predict and defend itself from that.
>
> So I think where that leaves us is that there is no safe place to
> create a dma_fence except for inside the ioctl which submits the work
> and only after any necessary memory has been allocated.  That's a
> pretty stiff requirement.  We may still be able to interact with
> userspace a bit more explicitly but I think it throws any notion of
> userspace direct submit out the window.
>
> --Jason
>
>
> > - Bas
> > >
> > > --Jason
> > >
> > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher <alexdeucher@gmail.com>
> wrote:
> > > >
> > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák <maraeo@gmail.com> wrote:
> > > > >
> > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer <michel@daenzer.net>
> wrote:
> > > > >>
> > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> > > > >> > Hi Dave,
> > > > >> >
> > > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> > > > >> >> Supporting interop with any device is always possible. It
> depends on which drivers we need to interoperate with and update them.
> We've already found the path forward for amdgpu. We just need to find out
> how many other drivers need to be updated and evaluate the cost/benefit
> aspect.
> > > > >> >>
> > > > >> >> Marek
> > > > >> >>
> > > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <airlied@gmail.com
> <mailto:airlied@gmail.com>> wrote:
> > > > >> >>
> > > > >> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
> > > > >> >>     <ckoenig.leichtzumerken@gmail.com <mailto:
> ckoenig.leichtzumerken@gmail.com>> wrote:
> > > > >> >>     >
> > > > >> >>     > Correct, we wouldn't have synchronization between device
> with and without user queues any more.
> > > > >> >>     >
> > > > >> >>     > That could only be a problem for A+I Laptops.
> > > > >> >>
> > > > >> >>     Since I think you mentioned you'd only be enabling this on
> newer
> > > > >> >>     chipsets, won't it be a problem for A+A where one A is a
> generation
> > > > >> >>     behind the other?
> > > > >> >>
> > > > >> >
> > > > >> > Crap, that is a good point as well.
> > > > >> >
> > > > >> >>
> > > > >> >>     I'm not really liking where this is going btw, seems like
> a ill
> > > > >> >>     thought out concept, if AMD is really going down the road
> of designing
> > > > >> >>     hw that is currently Linux incompatible, you are going to
> have to
> > > > >> >>     accept a big part of the burden in bringing this support
> in to more
> > > > >> >>     than just amd drivers for upcoming generations of gpu.
> > > > >> >>
> > > > >> >
> > > > >> > Well we don't really like that either, but we have no other
> option as far as I can see.
> > > > >>
> > > > >> I don't really understand what "future hw may remove support for
> kernel queues" means exactly. While the per-context queues can be mapped to
> userspace directly, they don't *have* to be, do they? I.e. the kernel
> driver should be able to either intercept userspace access to the queues,
> or in the worst case do it all itself, and provide the existing
> synchronization semantics as needed?
> > > > >>
> > > > >> Surely there are resource limits for the per-context queues, so
> the kernel driver needs to do some kind of virtualization / multi-plexing
> anyway, or we'll get sad user faces when there's no queue available for
> <current hot game>.
> > > > >>
> > > > >> I'm probably missing something though, awaiting enlightenment. :)
> > > > >
> > > > >
> > > > > The hw interface for userspace is that the ring buffer is mapped
> to the process address space alongside a doorbell aperture (4K page) that
> isn't real memory, but when the CPU writes into it, it tells the hw
> scheduler that there are new GPU commands in the ring buffer. Userspace
> inserts all the wait, draw, and signal commands into the ring buffer and
> then "rings" the doorbell. It's my understanding that the ring buffer and
> the doorbell are always mapped in the same GPU address space as the
> process, which makes it very difficult to emulate the current protected
> ring buffers in the kernel. The VMID of the ring buffer is also not
> changeable.
> > > > >
> > > >
> > > > The doorbell does not have to be mapped into the process's GPU
> virtual
> > > > address space.  The CPU could write to it directly.  Mapping it into
> > > > the GPU's virtual address space would allow you to have a device kick
> > > > off work however rather than the CPU.  E.g., the GPU could kick off
> > > > it's own work or multiple devices could kick off work without CPU
> > > > involvement.
> > > >
> > > > Alex
> > > >
> > > >
> > > > > The hw scheduler doesn't do any synchronization and it doesn't see
> any dependencies. It only chooses which queue to execute, so it's really
> just a simple queue manager handling the virtualization aspect and not much
> else.
> > > > >
> > > > > Marek
> > > > > _______________________________________________
> > > > > dri-devel mailing list
> > > > > dri-devel@lists.freedesktop.org
> > > > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > > _______________________________________________
> > > > mesa-dev mailing list
> > > > mesa-dev@lists.freedesktop.org
> > > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> > > _______________________________________________
> > > dri-devel mailing list
> > > dri-devel@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
>

[-- Attachment #1.2: Type: text/html, Size: 12262 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-03 20:36                                         ` Marek Olšák
@ 2021-05-04  3:11                                           ` Marek Olšák
  2021-05-04  7:01                                             ` Christian König
  0 siblings, 1 reply; 105+ messages in thread
From: Marek Olšák @ 2021-05-04  3:11 UTC (permalink / raw)
  To: Jason Ekstrand
  Cc: Christian König, Michel Dänzer, dri-devel, ML Mesa-dev


[-- Attachment #1.1: Type: text/plain, Size: 9760 bytes --]

Proposal for a new CS ioctl, kernel pseudo code:

lock(&global_lock);
serial = get_next_serial(dev);
add_wait_command(ring, serial - 1);
add_exec_cmdbuf(ring, user_cmdbuf);
add_signal_command(ring, serial);
*ring->doorbell = FIRE;
unlock(&global_lock);

See? Just like userspace submit, but in the kernel without
concurrency/preemption. Is this now safe enough for dma_fence?
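
To make that a bit more concrete, here is a slightly expanded pseudo-C
sketch of the same ioctl, just to show where the dma_fence for the
submission would be created (inside the ioctl, after all allocations).
The helpers (get_next_serial, add_*_command, create_submission_fence,
ring->doorbell) are hypothetical, mirroring the pseudo code above, not
existing amdgpu API:

/* Pseudo-C sketch only; all driver helpers here are hypothetical. */
struct dma_fence *new_cs_submit(struct dev_ctx *dev, struct ring *ring,
                                void *user_cmdbuf)
{
        struct dma_fence *fence;
        u64 serial;

        mutex_lock(&dev->global_lock);

        serial = get_next_serial(dev);

        /* Created inside the ioctl, after any memory allocation, so it
         * only depends on work the kernel itself queues below. */
        fence = create_submission_fence(dev, serial);

        add_wait_command(ring, serial - 1);   /* wait for the previous job */
        add_exec_cmdbuf(ring, user_cmdbuf);   /* execute the user commands */
        add_signal_command(ring, serial);     /* GPU signals this serial   */

        *ring->doorbell = FIRE;               /* kick the hw scheduler     */

        mutex_unlock(&dev->global_lock);
        return fence;
}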

Marek

On Mon, May 3, 2021 at 4:36 PM Marek Olšák <maraeo@gmail.com> wrote:

> What about direct submit from the kernel where the process still has write
> access to the GPU ring buffer but doesn't use it? I think that solves your
> preemption example, but leaves a potential backdoor for a process to
> overwrite the signal commands, which shouldn't be a problem since we are OK
> with timeouts.
>
> Marek
>
> On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand <jason@jlekstrand.net>
> wrote:
>
>> On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
>> <bas@basnieuwenhuizen.nl> wrote:
>> >
>> > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand <jason@jlekstrand.net>
>> wrote:
>> > >
>> > > Sorry for the top-post but there's no good thing to reply to here...
>> > >
>> > > One of the things pointed out to me recently by Daniel Vetter that I
>> > > didn't fully understand before is that dma_buf has a very subtle
>> > > second requirement beyond finite time completion:  Nothing required
>> > > for signaling a dma-fence can allocate memory.  Why?  Because the act
>> > > of allocating memory may wait on your dma-fence.  This, as it turns
>> > > out, is a massively more strict requirement than finite time
>> > > completion and, I think, throws out all of the proposals we have so
>> > > far.
>> > >
>> > > Take, for instance, Marek's proposal for userspace involvement with
>> > > dma-fence by asking the kernel for a next serial and the kernel
>> > > trusting userspace to signal it.  That doesn't work at all if
>> > > allocating memory to trigger a dma-fence can blow up.  There's simply
>> > > no way for the kernel to trust userspace to not do ANYTHING which
>> > > might allocate memory.  I don't even think there's a way userspace can
>> > > trust itself there.  It also blows up my plan of moving the fences to
>> > > transition boundaries.
>> > >
>> > > Not sure where that leaves us.
>> >
>> > Honestly the more I look at things I think userspace-signalable fences
>> > with a timeout sound like they are a valid solution for these issues.
>> > Especially since (as has been mentioned countless times in this email
>> > thread) userspace already has a lot of ways to cause timeouts and or
>> > GPU hangs through GPU work already.
>> >
>> > Adding a timeout on the signaling side of a dma_fence would ensure:
>> >
>> > - The dma_fence signals in finite time
>> > -  If the timeout case does not allocate memory then memory allocation
>> > is not a blocker for signaling.
>> >
>> > Of course you lose the full dependency graph and we need to make sure
>> > garbage collection of fences works correctly when we have cycles.
>> > However, the latter sounds very doable and the first sounds like it is
>> > to some extent inevitable.
>> >
>> > I feel like I'm missing some requirement here given that we
>> > immediately went to much more complicated things but can't find it.
>> > Thoughts?
>>
>> Timeouts are sufficient to protect the kernel but they make the fences
>> unpredictable and unreliable from a userspace PoV.  One of the big
>> problems we face is that, once we expose a dma_fence to userspace,
>> we've allowed for some pretty crazy potential dependencies that
>> neither userspace nor the kernel can sort out.  Say you have marek's
>> "next serial, please" proposal and a multi-threaded application.
>> Between time time you ask the kernel for a serial and get a dma_fence
>> and submit the work to signal that serial, your process may get
>> preempted, something else shoved in which allocates memory, and then
>> we end up blocking on that dma_fence.  There's no way userspace can
>> predict and defend itself from that.
>>
>> So I think where that leaves us is that there is no safe place to
>> create a dma_fence except for inside the ioctl which submits the work
>> and only after any necessary memory has been allocated.  That's a
>> pretty stiff requirement.  We may still be able to interact with
>> userspace a bit more explicitly but I think it throws any notion of
>> userspace direct submit out the window.
>>
>> --Jason
>>
>>
>> > - Bas
>> > >
>> > > --Jason
>> > >
>> > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher <alexdeucher@gmail.com>
>> wrote:
>> > > >
>> > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák <maraeo@gmail.com>
>> wrote:
>> > > > >
>> > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer <michel@daenzer.net>
>> wrote:
>> > > > >>
>> > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
>> > > > >> > Hi Dave,
>> > > > >> >
>> > > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
>> > > > >> >> Supporting interop with any device is always possible. It
>> depends on which drivers we need to interoperate with and update them.
>> We've already found the path forward for amdgpu. We just need to find out
>> how many other drivers need to be updated and evaluate the cost/benefit
>> aspect.
>> > > > >> >>
>> > > > >> >> Marek
>> > > > >> >>
>> > > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie <
>> airlied@gmail.com <mailto:airlied@gmail.com>> wrote:
>> > > > >> >>
>> > > > >> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
>> > > > >> >>     <ckoenig.leichtzumerken@gmail.com <mailto:
>> ckoenig.leichtzumerken@gmail.com>> wrote:
>> > > > >> >>     >
>> > > > >> >>     > Correct, we wouldn't have synchronization between
>> device with and without user queues any more.
>> > > > >> >>     >
>> > > > >> >>     > That could only be a problem for A+I Laptops.
>> > > > >> >>
>> > > > >> >>     Since I think you mentioned you'd only be enabling this
>> on newer
>> > > > >> >>     chipsets, won't it be a problem for A+A where one A is a
>> generation
>> > > > >> >>     behind the other?
>> > > > >> >>
>> > > > >> >
>> > > > >> > Crap, that is a good point as well.
>> > > > >> >
>> > > > >> >>
>> > > > >> >>     I'm not really liking where this is going btw, seems like
>> a ill
>> > > > >> >>     thought out concept, if AMD is really going down the road
>> of designing
>> > > > >> >>     hw that is currently Linux incompatible, you are going to
>> have to
>> > > > >> >>     accept a big part of the burden in bringing this support
>> in to more
>> > > > >> >>     than just amd drivers for upcoming generations of gpu.
>> > > > >> >>
>> > > > >> >
>> > > > >> > Well we don't really like that either, but we have no other
>> option as far as I can see.
>> > > > >>
>> > > > >> I don't really understand what "future hw may remove support for
>> kernel queues" means exactly. While the per-context queues can be mapped to
>> userspace directly, they don't *have* to be, do they? I.e. the kernel
>> driver should be able to either intercept userspace access to the queues,
>> or in the worst case do it all itself, and provide the existing
>> synchronization semantics as needed?
>> > > > >>
>> > > > >> Surely there are resource limits for the per-context queues, so
>> the kernel driver needs to do some kind of virtualization / multi-plexing
>> anyway, or we'll get sad user faces when there's no queue available for
>> <current hot game>.
>> > > > >>
>> > > > >> I'm probably missing something though, awaiting enlightenment. :)
>> > > > >
>> > > > >
>> > > > > The hw interface for userspace is that the ring buffer is mapped
>> to the process address space alongside a doorbell aperture (4K page) that
>> isn't real memory, but when the CPU writes into it, it tells the hw
>> scheduler that there are new GPU commands in the ring buffer. Userspace
>> inserts all the wait, draw, and signal commands into the ring buffer and
>> then "rings" the doorbell. It's my understanding that the ring buffer and
>> the doorbell are always mapped in the same GPU address space as the
>> process, which makes it very difficult to emulate the current protected
>> ring buffers in the kernel. The VMID of the ring buffer is also not
>> changeable.
>> > > > >
>> > > >
>> > > > The doorbell does not have to be mapped into the process's GPU
>> virtual
>> > > > address space.  The CPU could write to it directly.  Mapping it into
>> > > > the GPU's virtual address space would allow you to have a device
>> kick
>> > > > off work however rather than the CPU.  E.g., the GPU could kick off
>> > > > it's own work or multiple devices could kick off work without CPU
>> > > > involvement.
>> > > >
>> > > > Alex
>> > > >
>> > > >
>> > > > > The hw scheduler doesn't do any synchronization and it doesn't
>> see any dependencies. It only chooses which queue to execute, so it's
>> really just a simple queue manager handling the virtualization aspect and
>> not much else.
>> > > > >
>> > > > > Marek
>> > > > > _______________________________________________
>> > > > > dri-devel mailing list
>> > > > > dri-devel@lists.freedesktop.org
>> > > > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
>> > > > _______________________________________________
>> > > > mesa-dev mailing list
>> > > > mesa-dev@lists.freedesktop.org
>> > > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>> > > _______________________________________________
>> > > dri-devel mailing list
>> > > dri-devel@lists.freedesktop.org
>> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>
>

[-- Attachment #1.2: Type: text/html, Size: 13160 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-04  3:11                                           ` Marek Olšák
@ 2021-05-04  7:01                                             ` Christian König
  2021-05-04  7:32                                               ` Daniel Vetter
  0 siblings, 1 reply; 105+ messages in thread
From: Christian König @ 2021-05-04  7:01 UTC (permalink / raw)
  To: Marek Olšák, Jason Ekstrand
  Cc: ML Mesa-dev, Michel Dänzer, dri-devel


[-- Attachment #1.1: Type: text/plain, Size: 13187 bytes --]

Unfortunately, as I pointed out to Daniel as well, this won't work 100% 
reliably either.

See, the signal on the ring buffer needs to be protected from manipulation 
by userspace so that we can guarantee that the hardware has really finished 
executing when the fence fires.

Protecting memory by immediate page table updates is a good first step, 
but unfortunately not sufficient (and we would need to restructure large 
parts of the driver to make this happen).

On older hardware we often had the situation that, for reliable 
invalidation, we needed the guarantee that every previous operation had 
finished executing. It's not so much of a problem, since by the time the 
next operation has started we have had the opportunity to do things in 
between the last and the next operation. Just see cache invalidation and VM 
switching, for example.

In addition to that, it doesn't really buy us anything, e.g. there is not 
much advantage to this. Writing the ring buffer in userspace and then ringing 
the doorbell in the kernel has the same overhead as doing everything in the 
kernel in the first place.

Christian.

Am 04.05.21 um 05:11 schrieb Marek Olšák:
> Proposal for a new CS ioctl, kernel pseudo code:
>
> lock(&global_lock);
> serial = get_next_serial(dev);
> add_wait_command(ring, serial - 1);
> add_exec_cmdbuf(ring, user_cmdbuf);
> add_signal_command(ring, serial);
> *ring->doorbell = FIRE;
> unlock(&global_lock);
>
> See? Just like userspace submit, but in the kernel without 
> concurrency/preemption. Is this now safe enough for dma_fence?
>
> Marek
>
> On Mon, May 3, 2021 at 4:36 PM Marek Olšák <maraeo@gmail.com 
> <mailto:maraeo@gmail.com>> wrote:
>
>     What about direct submit from the kernel where the process still
>     has write access to the GPU ring buffer but doesn't use it? I
>     think that solves your preemption example, but leaves a potential
>     backdoor for a process to overwrite the signal commands, which
>     shouldn't be a problem since we are OK with timeouts.
>
>     Marek
>
>     On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand
>     <jason@jlekstrand.net <mailto:jason@jlekstrand.net>> wrote:
>
>         On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
>         <bas@basnieuwenhuizen.nl <mailto:bas@basnieuwenhuizen.nl>> wrote:
>         >
>         > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand
>         <jason@jlekstrand.net <mailto:jason@jlekstrand.net>> wrote:
>         > >
>         > > Sorry for the top-post but there's no good thing to reply
>         to here...
>         > >
>         > > One of the things pointed out to me recently by Daniel
>         Vetter that I
>         > > didn't fully understand before is that dma_buf has a very
>         subtle
>         > > second requirement beyond finite time completion:  Nothing
>         required
>         > > for signaling a dma-fence can allocate memory. Why? 
>         Because the act
>         > > of allocating memory may wait on your dma-fence.  This, as
>         it turns
>         > > out, is a massively more strict requirement than finite time
>         > > completion and, I think, throws out all of the proposals
>         we have so
>         > > far.
>         > >
>         > > Take, for instance, Marek's proposal for userspace
>         involvement with
>         > > dma-fence by asking the kernel for a next serial and the
>         kernel
>         > > trusting userspace to signal it.  That doesn't work at all if
>         > > allocating memory to trigger a dma-fence can blow up. 
>         There's simply
>         > > no way for the kernel to trust userspace to not do
>         ANYTHING which
>         > > might allocate memory.  I don't even think there's a way
>         userspace can
>         > > trust itself there.  It also blows up my plan of moving
>         the fences to
>         > > transition boundaries.
>         > >
>         > > Not sure where that leaves us.
>         >
>         > Honestly the more I look at things I think
>         userspace-signalable fences
>         > with a timeout sound like they are a valid solution for
>         these issues.
>         > Especially since (as has been mentioned countless times in
>         this email
>         > thread) userspace already has a lot of ways to cause
>         timeouts and or
>         > GPU hangs through GPU work already.
>         >
>         > Adding a timeout on the signaling side of a dma_fence would
>         ensure:
>         >
>         > - The dma_fence signals in finite time
>         > -  If the timeout case does not allocate memory then memory
>         allocation
>         > is not a blocker for signaling.
>         >
>         > Of course you lose the full dependency graph and we need to
>         make sure
>         > garbage collection of fences works correctly when we have
>         cycles.
>         > However, the latter sounds very doable and the first sounds
>         like it is
>         > to some extent inevitable.
>         >
>         > I feel like I'm missing some requirement here given that we
>         > immediately went to much more complicated things but can't
>         find it.
>         > Thoughts?
>
>         Timeouts are sufficient to protect the kernel but they make
>         the fences
>         unpredictable and unreliable from a userspace PoV.  One of the big
>         problems we face is that, once we expose a dma_fence to userspace,
>         we've allowed for some pretty crazy potential dependencies that
>         neither userspace nor the kernel can sort out.  Say you have
>         marek's
>         "next serial, please" proposal and a multi-threaded application.
>         Between time time you ask the kernel for a serial and get a
>         dma_fence
>         and submit the work to signal that serial, your process may get
>         preempted, something else shoved in which allocates memory,
>         and then
>         we end up blocking on that dma_fence.  There's no way
>         userspace can
>         predict and defend itself from that.
>
>         So I think where that leaves us is that there is no safe place to
>         create a dma_fence except for inside the ioctl which submits
>         the work
>         and only after any necessary memory has been allocated. That's a
>         pretty stiff requirement.  We may still be able to interact with
>         userspace a bit more explicitly but I think it throws any
>         notion of
>         userspace direct submit out the window.
>
>         --Jason
>
>
>         > - Bas
>         > >
>         > > --Jason
>         > >
>         > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher
>         <alexdeucher@gmail.com <mailto:alexdeucher@gmail.com>> wrote:
>         > > >
>         > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák
>         <maraeo@gmail.com <mailto:maraeo@gmail.com>> wrote:
>         > > > >
>         > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer
>         <michel@daenzer.net <mailto:michel@daenzer.net>> wrote:
>         > > > >>
>         > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
>         > > > >> > Hi Dave,
>         > > > >> >
>         > > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
>         > > > >> >> Supporting interop with any device is always
>         possible. It depends on which drivers we need to interoperate
>         with and update them. We've already found the path forward for
>         amdgpu. We just need to find out how many other drivers need
>         to be updated and evaluate the cost/benefit aspect.
>         > > > >> >>
>         > > > >> >> Marek
>         > > > >> >>
>         > > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie
>         <airlied@gmail.com <mailto:airlied@gmail.com>
>         <mailto:airlied@gmail.com <mailto:airlied@gmail.com>>> wrote:
>         > > > >> >>
>         > > > >> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
>         > > > >> >>     <ckoenig.leichtzumerken@gmail.com
>         <mailto:ckoenig.leichtzumerken@gmail.com>
>         <mailto:ckoenig.leichtzumerken@gmail.com
>         <mailto:ckoenig.leichtzumerken@gmail.com>>> wrote:
>         > > > >> >>     >
>         > > > >> >>     > Correct, we wouldn't have synchronization
>         between device with and without user queues any more.
>         > > > >> >>     >
>         > > > >> >>     > That could only be a problem for A+I Laptops.
>         > > > >> >>
>         > > > >> >>     Since I think you mentioned you'd only be
>         enabling this on newer
>         > > > >> >>     chipsets, won't it be a problem for A+A where
>         one A is a generation
>         > > > >> >>     behind the other?
>         > > > >> >>
>         > > > >> >
>         > > > >> > Crap, that is a good point as well.
>         > > > >> >
>         > > > >> >>
>         > > > >> >>     I'm not really liking where this is going btw,
>         seems like a ill
>         > > > >> >>     thought out concept, if AMD is really going
>         down the road of designing
>         > > > >> >>     hw that is currently Linux incompatible, you
>         are going to have to
>         > > > >> >>     accept a big part of the burden in bringing
>         this support in to more
>         > > > >> >>     than just amd drivers for upcoming generations
>         of gpu.
>         > > > >> >>
>         > > > >> >
>         > > > >> > Well we don't really like that either, but we have
>         no other option as far as I can see.
>         > > > >>
>         > > > >> I don't really understand what "future hw may remove
>         support for kernel queues" means exactly. While the
>         per-context queues can be mapped to userspace directly, they
>         don't *have* to be, do they? I.e. the kernel driver should be
>         able to either intercept userspace access to the queues, or in
>         the worst case do it all itself, and provide the existing
>         synchronization semantics as needed?
>         > > > >>
>         > > > >> Surely there are resource limits for the per-context
>         queues, so the kernel driver needs to do some kind of
>         virtualization / multi-plexing anyway, or we'll get sad user
>         faces when there's no queue available for <current hot game>.
>         > > > >>
>         > > > >> I'm probably missing something though, awaiting
>         enlightenment. :)
>         > > > >
>         > > > >
>         > > > > The hw interface for userspace is that the ring buffer
>         is mapped to the process address space alongside a doorbell
>         aperture (4K page) that isn't real memory, but when the CPU
>         writes into it, it tells the hw scheduler that there are new
>         GPU commands in the ring buffer. Userspace inserts all the
>         wait, draw, and signal commands into the ring buffer and then
>         "rings" the doorbell. It's my understanding that the ring
>         buffer and the doorbell are always mapped in the same GPU
>         address space as the process, which makes it very difficult to
>         emulate the current protected ring buffers in the kernel. The
>         VMID of the ring buffer is also not changeable.
>         > > > >
>         > > >
>         > > > The doorbell does not have to be mapped into the
>         process's GPU virtual
>         > > > address space.  The CPU could write to it directly. 
>         Mapping it into
>         > > > the GPU's virtual address space would allow you to have
>         a device kick
>         > > > off work however rather than the CPU. E.g., the GPU
>         could kick off
>         > > > it's own work or multiple devices could kick off work
>         without CPU
>         > > > involvement.
>         > > >
>         > > > Alex
>         > > >
>         > > >
>         > > > > The hw scheduler doesn't do any synchronization and it
>         doesn't see any dependencies. It only chooses which queue to
>         execute, so it's really just a simple queue manager handling
>         the virtualization aspect and not much else.
>         > > > >
>         > > > > Marek
>         > > > > _______________________________________________
>         > > > > dri-devel mailing list
>         > > > > dri-devel@lists.freedesktop.org
>         <mailto:dri-devel@lists.freedesktop.org>
>         > > > >
>         https://lists.freedesktop.org/mailman/listinfo/dri-devel
>         <https://lists.freedesktop.org/mailman/listinfo/dri-devel>
>         > > > _______________________________________________
>         > > > mesa-dev mailing list
>         > > > mesa-dev@lists.freedesktop.org
>         <mailto:mesa-dev@lists.freedesktop.org>
>         > > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>         <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>
>         > > _______________________________________________
>         > > dri-devel mailing list
>         > > dri-devel@lists.freedesktop.org
>         <mailto:dri-devel@lists.freedesktop.org>
>         > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
>         <https://lists.freedesktop.org/mailman/listinfo/dri-devel>
>


[-- Attachment #1.2: Type: text/html, Size: 19503 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-04  7:01                                             ` Christian König
@ 2021-05-04  7:32                                               ` Daniel Vetter
  2021-05-04  8:09                                                 ` Christian König
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-05-04  7:32 UTC (permalink / raw)
  To: Christian König
  Cc: dri-devel, ML Mesa-dev, Michel Dänzer, Jason Ekstrand,
	Marek Olšák

On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
> Unfortunately as I pointed out to Daniel as well this won't work 100%
> reliable either.

You're claiming this, but there's no clear reason why, really, and you
didn't reply to my last mail on that sub-thread, so I really don't get
where exactly you're seeing a problem.

> See the signal on the ring buffer needs to be protected by manipulation from
> userspace so that we can guarantee that the hardware really has finished
> executing when it fires.

Nope, you don't. Userspace is already allowed to submit all kinds of random
garbage; the only things the kernel has to guarantee are:
- the dma-fence DAG stays a DAG
- dma-fence completes in finite time

Everything else is not the kernel's problem, and if userspace mixes stuff
up, like manipulating the seqno, that's OK. It can do that kind of garbage
already.

> Protecting memory by immediate page table updates is a good first step, but
> unfortunately not sufficient (and we would need to restructure large parts
> of the driver to make this happen).

This is why you need the unload-fence on top, because indeed you can't
just rely on the fences created from the userspace ring, those are
unreliable for memory management.

Btw, I thought about this some more, and I think it's probably best if we
only attach the unload-fence in the ->move(_notify) callbacks, kinda like we
already do for async copy jobs. So the overall buffer move sequence would be:

1. wait for the (untrusted for the kernel, but necessary for userspace
correctness) fake dma-fence that relies on the userspace ring

2. unload ctx

3. copy buffer

Of course, steps 2 & 3 would be done async behind a dma_fence.
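
As a rough pseudo-C sketch of that sequence (the driver_* helpers below are
made up, not actual TTM/amdgpu API; only the dma_fence calls are the usual
kernel primitives):

/* Hypothetical sketch of the buffer move sequence above. */
static struct dma_fence *move_buffer(struct driver_bo *bo)
{
        struct dma_fence *unload, *copy;

        /* 1. Wait (with a timeout, it's untrusted) for the fake fence
         *    derived from the userspace ring. Needed for userspace
         *    correctness only, never for memory management. */
        driver_wait_userspace_fence(bo, msecs_to_jiffies(1000));

        /* 2. Unload the context; this gives us a trusted unload fence. */
        unload = driver_unload_ctx_async(bo->ctx);

        /* 3. Copy the buffer once the unload fence has signalled. */
        copy = driver_copy_buffer_async(bo, unload);

        dma_fence_put(unload);
        return copy;   /* steps 2 & 3 run async behind this dma_fence */
}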

> On older hardware we often had the situation that for reliable invalidation
> we need the guarantee that every previous operation has finished executing.
> It's not so much of a problem when the next operation has already started,
> since then we had the opportunity to do things in between the last and the
> next operation. Just see cache invalidation and VM switching for example.

If you have GPU page faults you generally have synchronous TLB
invalidation, so this also shouldn't be a big problem, combined with the
unload fence at least. If you don't have synchronous TLB invalidation it
gets a bit nastier and you need to force a preemption to a kernel
context which has the required flushes across all the caches. Slightly
nasty, but the exact same thing would be required for handling page faults
anyway with the direct userspace submit model.

Again I'm not seeing a problem.

> Additional to that it doesn't really buy us anything, e.g. there is not much
> advantage to this. Writing the ring buffer in userspace and then ringing in
> the kernel has the same overhead as doing everything in the kernel in the
> first place.

It gets you dma-fence backwards compat without having to rewrite the
entire userspace ecosystem. Also, since you have the hw already designed
for a ringbuffer in userspace, it would be silly to copy that through the
CS ioctl; that's just overhead.

Also, I thought the problem you're having is that all the kernel ringbuffer
stuff is going away, so the old CS ioctl won't work anymore for sure?

Maybe also pick up that other subthread which ended with my last reply.

Cheers, Daniel


> 
> Christian.
> 
> Am 04.05.21 um 05:11 schrieb Marek Olšák:
> > Proposal for a new CS ioctl, kernel pseudo code:
> > 
> > lock(&global_lock);
> > serial = get_next_serial(dev);
> > add_wait_command(ring, serial - 1);
> > add_exec_cmdbuf(ring, user_cmdbuf);
> > add_signal_command(ring, serial);
> > *ring->doorbell = FIRE;
> > unlock(&global_lock);
> > 
> > See? Just like userspace submit, but in the kernel without
> > concurrency/preemption. Is this now safe enough for dma_fence?
> > 
> > Marek
> > 
> > On Mon, May 3, 2021 at 4:36 PM Marek Olšák <maraeo@gmail.com
> > <mailto:maraeo@gmail.com>> wrote:
> > 
> >     What about direct submit from the kernel where the process still
> >     has write access to the GPU ring buffer but doesn't use it? I
> >     think that solves your preemption example, but leaves a potential
> >     backdoor for a process to overwrite the signal commands, which
> >     shouldn't be a problem since we are OK with timeouts.
> > 
> >     Marek
> > 
> >     On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand
> >     <jason@jlekstrand.net <mailto:jason@jlekstrand.net>> wrote:
> > 
> >         On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
> >         <bas@basnieuwenhuizen.nl <mailto:bas@basnieuwenhuizen.nl>> wrote:
> >         >
> >         > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand
> >         <jason@jlekstrand.net <mailto:jason@jlekstrand.net>> wrote:
> >         > >
> >         > > Sorry for the top-post but there's no good thing to reply
> >         to here...
> >         > >
> >         > > One of the things pointed out to me recently by Daniel
> >         Vetter that I
> >         > > didn't fully understand before is that dma_buf has a very
> >         subtle
> >         > > second requirement beyond finite time completion:  Nothing
> >         required
> >         > > for signaling a dma-fence can allocate memory. Why? 
> >         Because the act
> >         > > of allocating memory may wait on your dma-fence.  This, as
> >         it turns
> >         > > out, is a massively more strict requirement than finite time
> >         > > completion and, I think, throws out all of the proposals
> >         we have so
> >         > > far.
> >         > >
> >         > > Take, for instance, Marek's proposal for userspace
> >         involvement with
> >         > > dma-fence by asking the kernel for a next serial and the
> >         kernel
> >         > > trusting userspace to signal it.  That doesn't work at all if
> >         > > allocating memory to trigger a dma-fence can blow up. 
> >         There's simply
> >         > > no way for the kernel to trust userspace to not do
> >         ANYTHING which
> >         > > might allocate memory.  I don't even think there's a way
> >         userspace can
> >         > > trust itself there.  It also blows up my plan of moving
> >         the fences to
> >         > > transition boundaries.
> >         > >
> >         > > Not sure where that leaves us.
> >         >
> >         > Honestly the more I look at things I think
> >         userspace-signalable fences
> >         > with a timeout sound like they are a valid solution for
> >         these issues.
> >         > Especially since (as has been mentioned countless times in
> >         this email
> >         > thread) userspace already has a lot of ways to cause
> >         timeouts and or
> >         > GPU hangs through GPU work already.
> >         >
> >         > Adding a timeout on the signaling side of a dma_fence would
> >         ensure:
> >         >
> >         > - The dma_fence signals in finite time
> >         > -  If the timeout case does not allocate memory then memory
> >         allocation
> >         > is not a blocker for signaling.
> >         >
> >         > Of course you lose the full dependency graph and we need to
> >         make sure
> >         > garbage collection of fences works correctly when we have
> >         cycles.
> >         > However, the latter sounds very doable and the first sounds
> >         like it is
> >         > to some extent inevitable.
> >         >
> >         > I feel like I'm missing some requirement here given that we
> >         > immediately went to much more complicated things but can't
> >         find it.
> >         > Thoughts?
> > 
> >         Timeouts are sufficient to protect the kernel but they make
> >         the fences
> >         unpredictable and unreliable from a userspace PoV.  One of the big
> >         problems we face is that, once we expose a dma_fence to userspace,
> >         we've allowed for some pretty crazy potential dependencies that
> >         neither userspace nor the kernel can sort out.  Say you have
> >         marek's
> >         "next serial, please" proposal and a multi-threaded application.
> >         Between time time you ask the kernel for a serial and get a
> >         dma_fence
> >         and submit the work to signal that serial, your process may get
> >         preempted, something else shoved in which allocates memory,
> >         and then
> >         we end up blocking on that dma_fence.  There's no way
> >         userspace can
> >         predict and defend itself from that.
> > 
> >         So I think where that leaves us is that there is no safe place to
> >         create a dma_fence except for inside the ioctl which submits
> >         the work
> >         and only after any necessary memory has been allocated. That's a
> >         pretty stiff requirement.  We may still be able to interact with
> >         userspace a bit more explicitly but I think it throws any
> >         notion of
> >         userspace direct submit out the window.
> > 
> >         --Jason
> > 
> > 
> >         > - Bas
> >         > >
> >         > > --Jason
> >         > >
> >         > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher
> >         <alexdeucher@gmail.com <mailto:alexdeucher@gmail.com>> wrote:
> >         > > >
> >         > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák
> >         <maraeo@gmail.com <mailto:maraeo@gmail.com>> wrote:
> >         > > > >
> >         > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer
> >         <michel@daenzer.net <mailto:michel@daenzer.net>> wrote:
> >         > > > >>
> >         > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> >         > > > >> > Hi Dave,
> >         > > > >> >
> >         > > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >         > > > >> >> Supporting interop with any device is always
> >         possible. It depends on which drivers we need to interoperate
> >         with and update them. We've already found the path forward for
> >         amdgpu. We just need to find out how many other drivers need
> >         to be updated and evaluate the cost/benefit aspect.
> >         > > > >> >>
> >         > > > >> >> Marek
> >         > > > >> >>
> >         > > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie
> >         <airlied@gmail.com <mailto:airlied@gmail.com>
> >         <mailto:airlied@gmail.com <mailto:airlied@gmail.com>>> wrote:
> >         > > > >> >>
> >         > > > >> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
> >         > > > >> >>     <ckoenig.leichtzumerken@gmail.com
> >         <mailto:ckoenig.leichtzumerken@gmail.com>
> >         <mailto:ckoenig.leichtzumerken@gmail.com
> >         <mailto:ckoenig.leichtzumerken@gmail.com>>> wrote:
> >         > > > >> >>     >
> >         > > > >> >>     > Correct, we wouldn't have synchronization
> >         between device with and without user queues any more.
> >         > > > >> >>     >
> >         > > > >> >>     > That could only be a problem for A+I Laptops.
> >         > > > >> >>
> >         > > > >> >>     Since I think you mentioned you'd only be
> >         enabling this on newer
> >         > > > >> >>     chipsets, won't it be a problem for A+A where
> >         one A is a generation
> >         > > > >> >>     behind the other?
> >         > > > >> >>
> >         > > > >> >
> >         > > > >> > Crap, that is a good point as well.
> >         > > > >> >
> >         > > > >> >>
> >         > > > >> >>     I'm not really liking where this is going btw,
> >         seems like a ill
> >         > > > >> >>     thought out concept, if AMD is really going
> >         down the road of designing
> >         > > > >> >>     hw that is currently Linux incompatible, you
> >         are going to have to
> >         > > > >> >>     accept a big part of the burden in bringing
> >         this support in to more
> >         > > > >> >>     than just amd drivers for upcoming generations
> >         of gpu.
> >         > > > >> >>
> >         > > > >> >
> >         > > > >> > Well we don't really like that either, but we have
> >         no other option as far as I can see.
> >         > > > >>
> >         > > > >> I don't really understand what "future hw may remove
> >         support for kernel queues" means exactly. While the
> >         per-context queues can be mapped to userspace directly, they
> >         don't *have* to be, do they? I.e. the kernel driver should be
> >         able to either intercept userspace access to the queues, or in
> >         the worst case do it all itself, and provide the existing
> >         synchronization semantics as needed?
> >         > > > >>
> >         > > > >> Surely there are resource limits for the per-context
> >         queues, so the kernel driver needs to do some kind of
> >         virtualization / multi-plexing anyway, or we'll get sad user
> >         faces when there's no queue available for <current hot game>.
> >         > > > >>
> >         > > > >> I'm probably missing something though, awaiting
> >         enlightenment. :)
> >         > > > >
> >         > > > >
> >         > > > > The hw interface for userspace is that the ring buffer
> >         is mapped to the process address space alongside a doorbell
> >         aperture (4K page) that isn't real memory, but when the CPU
> >         writes into it, it tells the hw scheduler that there are new
> >         GPU commands in the ring buffer. Userspace inserts all the
> >         wait, draw, and signal commands into the ring buffer and then
> >         "rings" the doorbell. It's my understanding that the ring
> >         buffer and the doorbell are always mapped in the same GPU
> >         address space as the process, which makes it very difficult to
> >         emulate the current protected ring buffers in the kernel. The
> >         VMID of the ring buffer is also not changeable.
> >         > > > >
> >         > > >
> >         > > > The doorbell does not have to be mapped into the
> >         process's GPU virtual
> >         > > > address space.  The CPU could write to it directly. 
> >         Mapping it into
> >         > > > the GPU's virtual address space would allow you to have
> >         a device kick
> >         > > > off work however rather than the CPU. E.g., the GPU
> >         could kick off
> >         > > > it's own work or multiple devices could kick off work
> >         without CPU
> >         > > > involvement.
> >         > > >
> >         > > > Alex
> >         > > >
> >         > > >
> >         > > > > The hw scheduler doesn't do any synchronization and it
> >         doesn't see any dependencies. It only chooses which queue to
> >         execute, so it's really just a simple queue manager handling
> >         the virtualization aspect and not much else.
> >         > > > >
> >         > > > > Marek
> >         > > > > _______________________________________________
> >         > > > > dri-devel mailing list
> >         > > > > dri-devel@lists.freedesktop.org
> >         <mailto:dri-devel@lists.freedesktop.org>
> >         > > > >
> >         https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >         <https://lists.freedesktop.org/mailman/listinfo/dri-devel>
> >         > > > _______________________________________________
> >         > > > mesa-dev mailing list
> >         > > > mesa-dev@lists.freedesktop.org
> >         <mailto:mesa-dev@lists.freedesktop.org>
> >         > > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >         <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>
> >         > > _______________________________________________
> >         > > dri-devel mailing list
> >         > > dri-devel@lists.freedesktop.org
> >         <mailto:dri-devel@lists.freedesktop.org>
> >         > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >         <https://lists.freedesktop.org/mailman/listinfo/dri-devel>
> > 
> 

> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-04  7:32                                               ` Daniel Vetter
@ 2021-05-04  8:09                                                 ` Christian König
  2021-05-04  8:27                                                   ` Daniel Vetter
  0 siblings, 1 reply; 105+ messages in thread
From: Christian König @ 2021-05-04  8:09 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: dri-devel, ML Mesa-dev, Michel Dänzer, Jason Ekstrand,
	Marek Olšák

Am 04.05.21 um 09:32 schrieb Daniel Vetter:
> On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
>> Unfortunately as I pointed out to Daniel as well this won't work 100%
>> reliable either.
> You're claiming this, but there's no clear reason why really, and you
> did't reply to my last mail on that sub-thread, so I really don't get
> where exactly you're seeing a problem.

Yeah, it's rather hard to explain without pointing out how the hardware 
works in detail.

>> See the signal on the ring buffer needs to be protected by manipulation from
>> userspace so that we can guarantee that the hardware really has finished
>> executing when it fires.
> Nope you don't. Userspace is already allowed to submit all kinds of random
> garbage, the only thing the kernel has to guarnatee is:
> - the dma-fence DAG stays a DAG
> - dma-fence completes in finite time
>
> Everything else is not the kernel's problem, and if userspace mixes stuff
> up like manipulates the seqno, that's ok. It can do that kind of garbage
> already.
>
>> Protecting memory by immediate page table updates is a good first step, but
>> unfortunately not sufficient (and we would need to restructure large parts
>> of the driver to make this happen).
> This is why you need the unload-fence on top, because indeed you can't
> just rely on the fences created from the userspace ring, those are
> unreliable for memory management.

And that's exactly the problem! We can't provide a reliable unload-fence, 
and the user fences are unreliable for that purpose.

I've talked this through at length with our hardware/firmware guy last 
Thursday, but we couldn't find a solution either.

We can have a preemption fence for the kernel which says: Hey, this queue 
was scheduled away, you can touch its hardware descriptor, control 
registers, page tables, TLB, memory, GWS, GDS, OA etc. again. 
But that one is only triggered on preemption, and then we have the same 
ordering problems once more.

Or we can have an end-of-operation fence for userspace which says: Hey, 
this queue has finished its batch of execution. But this one can be 
manipulated from userspace to either finish too early (very, very bad for 
invalidations and memory management) or finish too late/never (deadlock 
prone, but fixable by a timeout).

What we could do is use the preemption fence to emulate the unload 
fence, e.g. something like:
1. Preempt the queue at fixed intervals (let's say 100ms).
2. While preempted, check whether we have reached the checkpoint in question 
by looking at the hardware descriptor.
3. If we have reached the checkpoint, signal the unload fence.
4. If we haven't reached the checkpoint, resume the queue again.

The problem is that this might introduce a delay of up to 100ms before 
the unload fence signals, and preempt/resume has such a hefty overhead 
that we would waste a horrible amount of time on it.
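
As a rough pseudo-C sketch of that polling loop (the driver_* helpers are 
made up; only the dma_fence and msleep calls are the usual kernel 
primitives):

/* Hypothetical sketch of the 100ms preempt/check/resume loop above. */
static void emulate_unload_fence(struct user_queue *q,
                                 struct dma_fence *unload_fence,
                                 u64 checkpoint)
{
        while (!dma_fence_is_signaled(unload_fence)) {
                msleep(100);                      /* 1. fixed interval      */
                driver_preempt_queue(q);          /*    preempt the queue   */

                /* 2. check the checkpoint in the hardware descriptor */
                if (driver_read_hw_checkpoint(q) >= checkpoint)
                        dma_fence_signal(unload_fence);   /* 3. signal it   */
                else
                        driver_resume_queue(q);           /* 4. resume      */
        }
}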

>
> btw I thought some more, and I think it's probably best if we only attach
> the unload-fence in the ->move(_notify) callbacks. Kinda like we already
> do for async copy jobs. So the overall buffer move sequence would be:
>
> 1. wait for (untrusted for kernel, but necessary for userspace
> correctness) fake dma-fence that rely on the userspace ring
>
> 2. unload ctx
>
> 3. copy buffer
>
> Ofc 2&3 would be done async behind a dma_fence.
>
>> On older hardware we often had the situation that for reliable invalidation
>> we need the guarantee that every previous operation has finished executing.
>> It's not so much of a problem when the next operation has already started,
>> since then we had the opportunity to do things in between the last and the
>> next operation. Just see cache invalidation and VM switching for example.
> If you have gpu page faults you generally have synchronous tlb
> invalidation,

Please tell that to our hardware engineers :)

We have two modes of operation; see the whole XNACK on/off discussion on 
the amd-gfx mailing list.

> so this also shouldn't be a big problem. Combined with the
> unload fence at least. If you don't have synchronous tlb invalidate it
> gets a bit more nasty and you need to force a preemption to a kernel
> context which has the required flushes across all the caches. Slightly
> nasty, but the exact same thing would be required for handling page faults
> anyway with the direct userspace submit model.
>
> Again I'm not seeing a problem.
>
>> Additional to that it doesn't really buy us anything, e.g. there is not much
>> advantage to this. Writing the ring buffer in userspace and then ringing in
>> the kernel has the same overhead as doing everything in the kernel in the
>> first place.
> It gets you dma-fence backwards compat without having to rewrite the
> entire userspace ecosystem. Also since you have the hw already designed
> for ringbuffer in userspace it would be silly to copy that through the cs
> ioctl, that's just overhead.
>
> Also I thought the problem you're having is that all the kernel ringbuf
> stuff is going away, so the old cs ioctl wont work anymore for sure?

We still have a bit more time for this. As I learned from our firmware 
engineer last Thursday, the Windows side is running into similar problems 
to ours.

> Maybe also pick up that other subthread which ended with my last reply.

I will send out another proposal for how to handle user fences shortly.

Cheers,
Christian.

>
> Cheers, Daniel
>
>
>> Christian.
>>
>> Am 04.05.21 um 05:11 schrieb Marek Olšák:
>>> Proposal for a new CS ioctl, kernel pseudo code:
>>>
>>> lock(&global_lock);
>>> serial = get_next_serial(dev);
>>> add_wait_command(ring, serial - 1);
>>> add_exec_cmdbuf(ring, user_cmdbuf);
>>> add_signal_command(ring, serial);
>>> *ring->doorbell = FIRE;
>>> unlock(&global_lock);
>>>
>>> See? Just like userspace submit, but in the kernel without
>>> concurrency/preemption. Is this now safe enough for dma_fence?
>>>
>>> Marek
>>>
>>> On Mon, May 3, 2021 at 4:36 PM Marek Olšák <maraeo@gmail.com
>>> <mailto:maraeo@gmail.com>> wrote:
>>>
>>>      What about direct submit from the kernel where the process still
>>>      has write access to the GPU ring buffer but doesn't use it? I
>>>      think that solves your preemption example, but leaves a potential
>>>      backdoor for a process to overwrite the signal commands, which
>>>      shouldn't be a problem since we are OK with timeouts.
>>>
>>>      Marek
>>>
>>>      On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand
>>>      <jason@jlekstrand.net <mailto:jason@jlekstrand.net>> wrote:
>>>
>>>          On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
>>>          <bas@basnieuwenhuizen.nl <mailto:bas@basnieuwenhuizen.nl>> wrote:
>>>          >
>>>          > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand
>>>          <jason@jlekstrand.net <mailto:jason@jlekstrand.net>> wrote:
>>>          > >
>>>          > > Sorry for the top-post but there's no good thing to reply
>>>          to here...
>>>          > >
>>>          > > One of the things pointed out to me recently by Daniel
>>>          Vetter that I
>>>          > > didn't fully understand before is that dma_buf has a very
>>>          subtle
>>>          > > second requirement beyond finite time completion:  Nothing
>>>          required
>>>          > > for signaling a dma-fence can allocate memory. Why?
>>>          Because the act
>>>          > > of allocating memory may wait on your dma-fence.  This, as
>>>          it turns
>>>          > > out, is a massively more strict requirement than finite time
>>>          > > completion and, I think, throws out all of the proposals
>>>          we have so
>>>          > > far.
>>>          > >
>>>          > > Take, for instance, Marek's proposal for userspace
>>>          involvement with
>>>          > > dma-fence by asking the kernel for a next serial and the
>>>          kernel
>>>          > > trusting userspace to signal it.  That doesn't work at all if
>>>          > > allocating memory to trigger a dma-fence can blow up.
>>>          There's simply
>>>          > > no way for the kernel to trust userspace to not do
>>>          ANYTHING which
>>>          > > might allocate memory.  I don't even think there's a way
>>>          userspace can
>>>          > > trust itself there.  It also blows up my plan of moving
>>>          the fences to
>>>          > > transition boundaries.
>>>          > >
>>>          > > Not sure where that leaves us.
>>>          >
>>>          > Honestly the more I look at things I think
>>>          userspace-signalable fences
>>>          > with a timeout sound like they are a valid solution for
>>>          these issues.
>>>          > Especially since (as has been mentioned countless times in
>>>          this email
>>>          > thread) userspace already has a lot of ways to cause
>>>          timeouts and or
>>>          > GPU hangs through GPU work already.
>>>          >
>>>          > Adding a timeout on the signaling side of a dma_fence would
>>>          ensure:
>>>          >
>>>          > - The dma_fence signals in finite time
>>>          > -  If the timeout case does not allocate memory then memory
>>>          allocation
>>>          > is not a blocker for signaling.
>>>          >
>>>          > Of course you lose the full dependency graph and we need to
>>>          make sure
>>>          > garbage collection of fences works correctly when we have
>>>          cycles.
>>>          > However, the latter sounds very doable and the first sounds
>>>          like it is
>>>          > to some extent inevitable.
>>>          >
>>>          > I feel like I'm missing some requirement here given that we
>>>          > immediately went to much more complicated things but can't
>>>          find it.
>>>          > Thoughts?
>>>
>>>          Timeouts are sufficient to protect the kernel but they make
>>>          the fences
>>>          unpredictable and unreliable from a userspace PoV.  One of the big
>>>          problems we face is that, once we expose a dma_fence to userspace,
>>>          we've allowed for some pretty crazy potential dependencies that
>>>          neither userspace nor the kernel can sort out.  Say you have
>>>          marek's
>>>          "next serial, please" proposal and a multi-threaded application.
>>>          Between time time you ask the kernel for a serial and get a
>>>          dma_fence
>>>          and submit the work to signal that serial, your process may get
>>>          preempted, something else shoved in which allocates memory,
>>>          and then
>>>          we end up blocking on that dma_fence.  There's no way
>>>          userspace can
>>>          predict and defend itself from that.
>>>
>>>          So I think where that leaves us is that there is no safe place to
>>>          create a dma_fence except for inside the ioctl which submits
>>>          the work
>>>          and only after any necessary memory has been allocated. That's a
>>>          pretty stiff requirement.  We may still be able to interact with
>>>          userspace a bit more explicitly but I think it throws any
>>>          notion of
>>>          userspace direct submit out the window.
>>>
>>>          --Jason
>>>
>>>
>>>          > - Bas
>>>          > >
>>>          > > --Jason
>>>          > >
>>>          > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher
>>>          <alexdeucher@gmail.com <mailto:alexdeucher@gmail.com>> wrote:
>>>          > > >
>>>          > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák
>>>          <maraeo@gmail.com <mailto:maraeo@gmail.com>> wrote:
>>>          > > > >
>>>          > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer
>>>          <michel@daenzer.net <mailto:michel@daenzer.net>> wrote:
>>>          > > > >>
>>>          > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
>>>          > > > >> > Hi Dave,
>>>          > > > >> >
>>>          > > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
>>>          > > > >> >> Supporting interop with any device is always
>>>          possible. It depends on which drivers we need to interoperate
>>>          with and update them. We've already found the path forward for
>>>          amdgpu. We just need to find out how many other drivers need
>>>          to be updated and evaluate the cost/benefit aspect.
>>>          > > > >> >>
>>>          > > > >> >> Marek
>>>          > > > >> >>
>>>          > > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie
>>>          <airlied@gmail.com <mailto:airlied@gmail.com>
>>>          <mailto:airlied@gmail.com <mailto:airlied@gmail.com>>> wrote:
>>>          > > > >> >>
>>>          > > > >> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
>>>          > > > >> >>     <ckoenig.leichtzumerken@gmail.com
>>>          <mailto:ckoenig.leichtzumerken@gmail.com>
>>>          <mailto:ckoenig.leichtzumerken@gmail.com
>>>          <mailto:ckoenig.leichtzumerken@gmail.com>>> wrote:
>>>          > > > >> >>     >
>>>          > > > >> >>     > Correct, we wouldn't have synchronization
>>>          between device with and without user queues any more.
>>>          > > > >> >>     >
>>>          > > > >> >>     > That could only be a problem for A+I Laptops.
>>>          > > > >> >>
>>>          > > > >> >>     Since I think you mentioned you'd only be
>>>          enabling this on newer
>>>          > > > >> >>     chipsets, won't it be a problem for A+A where
>>>          one A is a generation
>>>          > > > >> >>     behind the other?
>>>          > > > >> >>
>>>          > > > >> >
>>>          > > > >> > Crap, that is a good point as well.
>>>          > > > >> >
>>>          > > > >> >>
>>>          > > > >> >>     I'm not really liking where this is going btw,
>>>          seems like a ill
>>>          > > > >> >>     thought out concept, if AMD is really going
>>>          down the road of designing
>>>          > > > >> >>     hw that is currently Linux incompatible, you
>>>          are going to have to
>>>          > > > >> >>     accept a big part of the burden in bringing
>>>          this support in to more
>>>          > > > >> >>     than just amd drivers for upcoming generations
>>>          of gpu.
>>>          > > > >> >>
>>>          > > > >> >
>>>          > > > >> > Well we don't really like that either, but we have
>>>          no other option as far as I can see.
>>>          > > > >>
>>>          > > > >> I don't really understand what "future hw may remove
>>>          support for kernel queues" means exactly. While the
>>>          per-context queues can be mapped to userspace directly, they
>>>          don't *have* to be, do they? I.e. the kernel driver should be
>>>          able to either intercept userspace access to the queues, or in
>>>          the worst case do it all itself, and provide the existing
>>>          synchronization semantics as needed?
>>>          > > > >>
>>>          > > > >> Surely there are resource limits for the per-context
>>>          queues, so the kernel driver needs to do some kind of
>>>          virtualization / multi-plexing anyway, or we'll get sad user
>>>          faces when there's no queue available for <current hot game>.
>>>          > > > >>
>>>          > > > >> I'm probably missing something though, awaiting
>>>          enlightenment. :)
>>>          > > > >
>>>          > > > >
>>>          > > > > The hw interface for userspace is that the ring buffer
>>>          is mapped to the process address space alongside a doorbell
>>>          aperture (4K page) that isn't real memory, but when the CPU
>>>          writes into it, it tells the hw scheduler that there are new
>>>          GPU commands in the ring buffer. Userspace inserts all the
>>>          wait, draw, and signal commands into the ring buffer and then
>>>          "rings" the doorbell. It's my understanding that the ring
>>>          buffer and the doorbell are always mapped in the same GPU
>>>          address space as the process, which makes it very difficult to
>>>          emulate the current protected ring buffers in the kernel. The
>>>          VMID of the ring buffer is also not changeable.
>>>          > > > >
>>>          > > >
>>>          > > > The doorbell does not have to be mapped into the
>>>          process's GPU virtual
>>>          > > > address space.  The CPU could write to it directly.
>>>          Mapping it into
>>>          > > > the GPU's virtual address space would allow you to have
>>>          a device kick
>>>          > > > off work however rather than the CPU. E.g., the GPU
>>>          could kick off
>>>          > > > it's own work or multiple devices could kick off work
>>>          without CPU
>>>          > > > involvement.
>>>          > > >
>>>          > > > Alex
>>>          > > >
>>>          > > >
>>>          > > > > The hw scheduler doesn't do any synchronization and it
>>>          doesn't see any dependencies. It only chooses which queue to
>>>          execute, so it's really just a simple queue manager handling
>>>          the virtualization aspect and not much else.
>>>          > > > >
>>>          > > > > Marek
>>>          > > > > _______________________________________________
>>>          > > > > dri-devel mailing list
>>>          > > > > dri-devel@lists.freedesktop.org
>>>          <mailto:dri-devel@lists.freedesktop.org>
>>>          > > > >
>>>          https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>          <https://lists.freedesktop.org/mailman/listinfo/dri-devel>
>>>          > > > _______________________________________________
>>>          > > > mesa-dev mailing list
>>>          > > > mesa-dev@lists.freedesktop.org
>>>          <mailto:mesa-dev@lists.freedesktop.org>
>>>          > > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>          <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>
>>>          > > _______________________________________________
>>>          > > dri-devel mailing list
>>>          > > dri-devel@lists.freedesktop.org
>>>          <mailto:dri-devel@lists.freedesktop.org>
>>>          > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>          <https://lists.freedesktop.org/mailman/listinfo/dri-devel>
>>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-04  8:09                                                 ` Christian König
@ 2021-05-04  8:27                                                   ` Daniel Vetter
  2021-05-04  9:14                                                     ` Christian König
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-05-04  8:27 UTC (permalink / raw)
  To: Christian König
  Cc: dri-devel, ML Mesa-dev, Michel Dänzer, Jason Ekstrand,
	Marek Olšák

On Tue, May 4, 2021 at 10:09 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 04.05.21 um 09:32 schrieb Daniel Vetter:
> > On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
> >> Unfortunately as I pointed out to Daniel as well this won't work 100%
> >> reliable either.
> > You're claiming this, but there's no clear reason why really, and you
> > did't reply to my last mail on that sub-thread, so I really don't get
> > where exactly you're seeing a problem.
>
> Yeah, it's rather hard to explain without pointing out how the hardware
> works in detail.
>
> >> See the signal on the ring buffer needs to be protected by manipulation from
> >> userspace so that we can guarantee that the hardware really has finished
> >> executing when it fires.
> > Nope you don't. Userspace is already allowed to submit all kinds of random
> > garbage, the only thing the kernel has to guarnatee is:
> > - the dma-fence DAG stays a DAG
> > - dma-fence completes in finite time
> >
> > Everything else is not the kernel's problem, and if userspace mixes stuff
> > up like manipulates the seqno, that's ok. It can do that kind of garbage
> > already.
> >
> >> Protecting memory by immediate page table updates is a good first step, but
> >> unfortunately not sufficient (and we would need to restructure large parts
> >> of the driver to make this happen).
> > This is why you need the unload-fence on top, because indeed you can't
> > just rely on the fences created from the userspace ring, those are
> > unreliable for memory management.
>
> And exactly that's the problem! We can't provide a reliable unload-fence
> and the user fences are unreliable for that.
>
> I've talked this through lengthy with our hardware/firmware guy last
> Thursday but couldn't find a solution either.
>
> We can have a preemption fence for the kernel which says: Hey this queue
> was scheduled away you can touch it's hardware descriptor, control
> registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again.
> But that one is only triggered on preemption and then we have the same
> ordering problems once more.
>
> Or we can have a end of operation fence for userspace which says: Hey
> this queue has finished it's batch of execution, but this one is
> manipulable from userspace in both finish to early (very very bad for
> invalidations and memory management) or finish to late/never (deadlock
> prone but fixable by timeout).
>
> What we could do is to use the preemption fence to emulate the unload
> fence, e.g. something like:
> 1. Preempt the queue in fixed intervals (let's say 100ms).
> 2. While preempted check if we have reached the checkpoint in question
> by looking at the hardware descriptor.
> 3. If we have reached the checkpoint signal the unload fence.
> 4. If we haven't reached the checkpoint resume the queue again.
>
> The problem is that this might introduce a maximum of 100ms delay before
> signaling the unload fence and preempt/resume has such a hefty overhead
> that we waste a horrible amount of time on it.

So your hw can preempt? That's good enough.

The unload fence is just
1. wait for all dma_fences that are based on the userspace ring. This
is unreliable, but we don't care because tdr will make it reliable.
And once tdr has shot down a context we'll force-unload and trash it
completely, which solves the problem.
2. preempt the context, which /should/ now be stuck waiting for more
commands to be stuffed into the ringbuffer, which means your
preemption is hopefully fast enough to not matter. If your hw takes
forever to preempt an idle ring, I can't help you :-)

Also, if userspace lies to us and keeps pushing crap into the ring
after it's supposed to be idle: Userspace is already allowed to waste
gpu time. If you're too worried about this, set a fairly aggressive
preempt timeout on the unload fence, and kill the context if it takes
longer than what preempting an idle ring should take (because that
would indicate broken/evil userspace).

Again, I'm not seeing the problem. Except if your hw is really
completely busted to the point where it can't even support userspace
ringbuffers properly and with sufficient performance :-P

Of course if you issue the preempt context request before the
userspace fences have finished (or tdr has cleaned up the mess) like you
do in your proposal, then it will be ridiculously expensive and/or
won't work. So just don't do that.
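
To make the overall flow concrete, here is a rough pseudo-code sketch in
the same spirit as the buffer move sequence quoted below. All the helpers
(wait_userspace_fences_with_tdr(), preempt_ctx(), kill_ctx(),
copy_buffer()), the placeholder types and the timeout constant are
assumptions, not real driver interfaces, and the whole thing is assumed
to run async behind the unload dma_fence:

static int move_buffer(struct ctx *ctx, struct buffer *bo)
{
        /* 1. Wait for the untrusted fences from the userspace ring.
         *    Unreliable on their own, but tdr turns a lying fence into
         *    a context kill, so the wait terminates either way.       */
        wait_userspace_fences_with_tdr(bo);

        /* 2. Preempt the context, which should now be idle. Use an
         *    aggressive timeout: an idle ring must preempt quickly,
         *    anything slower indicates broken/evil userspace.         */
        if (!preempt_ctx(ctx, PREEMPT_TIMEOUT_MS))
                kill_ctx(ctx);

        /* 3. Copy the buffer contents to the new placement.           */
        return copy_buffer(bo);
}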

> > btw I thought some more, and I think it's probably best if we only attach
> > the unload-fence in the ->move(_notify) callbacks. Kinda like we already
> > do for async copy jobs. So the overall buffer move sequence would be:
> >
> > 1. wait for (untrusted for kernel, but necessary for userspace
> > correctness) fake dma-fence that rely on the userspace ring
> >
> > 2. unload ctx
> >
> > 3. copy buffer
> >
> > Ofc 2&3 would be done async behind a dma_fence.
> >
> >> On older hardware we often had the situation that for reliable invalidation
> >> we need the guarantee that every previous operation has finished executing.
> >> It's not so much of a problem when the next operation has already started,
> >> since then we had the opportunity to do things in between the last and the
> >> next operation. Just see cache invalidation and VM switching for example.
> > If you have gpu page faults you generally have synchronous tlb
> > invalidation,
>
> Please tell that our hardware engineers :)
>
> We have two modes of operation, see the whole XNACK on/off discussion on
> the amdgfx mailing list.

I didn't find this anywhere with a quick search. Pointers to the archive
would help (lore.kernel.org/amd-gfx is the best imo).

> > so this also shouldn't be a big problem. Combined with the
> > unload fence at least. If you don't have synchronous tlb invalidate it
> > gets a bit more nasty and you need to force a preemption to a kernel
> > context which has the required flushes across all the caches. Slightly
> > nasty, but the exact same thing would be required for handling page faults
> > anyway with the direct userspace submit model.
> >
> > Again I'm not seeing a problem.
> >
> >> Additional to that it doesn't really buy us anything, e.g. there is not much
> >> advantage to this. Writing the ring buffer in userspace and then ringing in
> >> the kernel has the same overhead as doing everything in the kernel in the
> >> first place.
> > It gets you dma-fence backwards compat without having to rewrite the
> > entire userspace ecosystem. Also since you have the hw already designed
> > for ringbuffer in userspace it would be silly to copy that through the cs
> > ioctl, that's just overhead.
> >
> > Also I thought the problem you're having is that all the kernel ringbuf
> > stuff is going away, so the old cs ioctl wont work anymore for sure?
>
> We still have a bit more time for this. As I learned from our firmware
> engineer last Thursday the Windows side is running into similar problems
> as we do.

This story sounds familiar; I've heard it a few times here at Intel
too, on various things where we complained and then Windows hit the
same issues :-)

E.g. I've just learned that all the things we've discussed around gpu
page faults vs 3d workloads, and how you need to reserve some CUs for
guaranteed 3d forward progress or take even worse measures, are also
something they're hitting on Windows. Apparently they fixed it by only
running either 3d or compute workloads at any one time, never both.

> > Maybe also pick up that other subthread which ended with my last reply.
>
> I will send out another proposal for how to handle user fences shortly.

Maybe let's discuss this here first before we commit to requiring all
userspace to upgrade to user fences ... I do agree that we want to go
there too, but breaking all the compositors is probably not the best
option.

Cheers, Daniel


>
> Cheers,
> Christian.
>
> >
> > Cheers, Daniel
> >
> >
> >> Christian.
> >>
> >> Am 04.05.21 um 05:11 schrieb Marek Olšák:
> >>> Proposal for a new CS ioctl, kernel pseudo code:
> >>>
> >>> lock(&global_lock);
> >>> serial = get_next_serial(dev);
> >>> add_wait_command(ring, serial - 1);
> >>> add_exec_cmdbuf(ring, user_cmdbuf);
> >>> add_signal_command(ring, serial);
> >>> *ring->doorbell = FIRE;
> >>> unlock(&global_lock);
> >>>
> >>> See? Just like userspace submit, but in the kernel without
> >>> concurrency/preemption. Is this now safe enough for dma_fence?
> >>>
> >>> Marek
> >>>
> >>> On Mon, May 3, 2021 at 4:36 PM Marek Olšák <maraeo@gmail.com
> >>> <mailto:maraeo@gmail.com>> wrote:
> >>>
> >>>      What about direct submit from the kernel where the process still
> >>>      has write access to the GPU ring buffer but doesn't use it? I
> >>>      think that solves your preemption example, but leaves a potential
> >>>      backdoor for a process to overwrite the signal commands, which
> >>>      shouldn't be a problem since we are OK with timeouts.
> >>>
> >>>      Marek
> >>>
> >>>      On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand
> >>>      <jason@jlekstrand.net <mailto:jason@jlekstrand.net>> wrote:
> >>>
> >>>          On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
> >>>          <bas@basnieuwenhuizen.nl <mailto:bas@basnieuwenhuizen.nl>> wrote:
> >>>          >
> >>>          > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand
> >>>          <jason@jlekstrand.net <mailto:jason@jlekstrand.net>> wrote:
> >>>          > >
> >>>          > > Sorry for the top-post but there's no good thing to reply
> >>>          to here...
> >>>          > >
> >>>          > > One of the things pointed out to me recently by Daniel
> >>>          Vetter that I
> >>>          > > didn't fully understand before is that dma_buf has a very
> >>>          subtle
> >>>          > > second requirement beyond finite time completion:  Nothing
> >>>          required
> >>>          > > for signaling a dma-fence can allocate memory. Why?
> >>>          Because the act
> >>>          > > of allocating memory may wait on your dma-fence.  This, as
> >>>          it turns
> >>>          > > out, is a massively more strict requirement than finite time
> >>>          > > completion and, I think, throws out all of the proposals
> >>>          we have so
> >>>          > > far.
> >>>          > >
> >>>          > > Take, for instance, Marek's proposal for userspace
> >>>          involvement with
> >>>          > > dma-fence by asking the kernel for a next serial and the
> >>>          kernel
> >>>          > > trusting userspace to signal it.  That doesn't work at all if
> >>>          > > allocating memory to trigger a dma-fence can blow up.
> >>>          There's simply
> >>>          > > no way for the kernel to trust userspace to not do
> >>>          ANYTHING which
> >>>          > > might allocate memory.  I don't even think there's a way
> >>>          userspace can
> >>>          > > trust itself there.  It also blows up my plan of moving
> >>>          the fences to
> >>>          > > transition boundaries.
> >>>          > >
> >>>          > > Not sure where that leaves us.
> >>>          >
> >>>          > Honestly the more I look at things I think
> >>>          userspace-signalable fences
> >>>          > with a timeout sound like they are a valid solution for
> >>>          these issues.
> >>>          > Especially since (as has been mentioned countless times in
> >>>          this email
> >>>          > thread) userspace already has a lot of ways to cause
> >>>          timeouts and or
> >>>          > GPU hangs through GPU work already.
> >>>          >
> >>>          > Adding a timeout on the signaling side of a dma_fence would
> >>>          ensure:
> >>>          >
> >>>          > - The dma_fence signals in finite time
> >>>          > -  If the timeout case does not allocate memory then memory
> >>>          allocation
> >>>          > is not a blocker for signaling.
> >>>          >
> >>>          > Of course you lose the full dependency graph and we need to
> >>>          make sure
> >>>          > garbage collection of fences works correctly when we have
> >>>          cycles.
> >>>          > However, the latter sounds very doable and the first sounds
> >>>          like it is
> >>>          > to some extent inevitable.
> >>>          >
> >>>          > I feel like I'm missing some requirement here given that we
> >>>          > immediately went to much more complicated things but can't
> >>>          find it.
> >>>          > Thoughts?
> >>>
> >>>          Timeouts are sufficient to protect the kernel but they make
> >>>          the fences
> >>>          unpredictable and unreliable from a userspace PoV.  One of the big
> >>>          problems we face is that, once we expose a dma_fence to userspace,
> >>>          we've allowed for some pretty crazy potential dependencies that
> >>>          neither userspace nor the kernel can sort out.  Say you have
> >>>          marek's
> >>>          "next serial, please" proposal and a multi-threaded application.
> >>>          Between time time you ask the kernel for a serial and get a
> >>>          dma_fence
> >>>          and submit the work to signal that serial, your process may get
> >>>          preempted, something else shoved in which allocates memory,
> >>>          and then
> >>>          we end up blocking on that dma_fence.  There's no way
> >>>          userspace can
> >>>          predict and defend itself from that.
> >>>
> >>>          So I think where that leaves us is that there is no safe place to
> >>>          create a dma_fence except for inside the ioctl which submits
> >>>          the work
> >>>          and only after any necessary memory has been allocated. That's a
> >>>          pretty stiff requirement.  We may still be able to interact with
> >>>          userspace a bit more explicitly but I think it throws any
> >>>          notion of
> >>>          userspace direct submit out the window.
> >>>
> >>>          --Jason
> >>>
> >>>
> >>>          > - Bas
> >>>          > >
> >>>          > > --Jason
> >>>          > >
> >>>          > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher
> >>>          <alexdeucher@gmail.com <mailto:alexdeucher@gmail.com>> wrote:
> >>>          > > >
> >>>          > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák
> >>>          <maraeo@gmail.com <mailto:maraeo@gmail.com>> wrote:
> >>>          > > > >
> >>>          > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer
> >>>          <michel@daenzer.net <mailto:michel@daenzer.net>> wrote:
> >>>          > > > >>
> >>>          > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
> >>>          > > > >> > Hi Dave,
> >>>          > > > >> >
> >>>          > > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
> >>>          > > > >> >> Supporting interop with any device is always
> >>>          possible. It depends on which drivers we need to interoperate
> >>>          with and update them. We've already found the path forward for
> >>>          amdgpu. We just need to find out how many other drivers need
> >>>          to be updated and evaluate the cost/benefit aspect.
> >>>          > > > >> >>
> >>>          > > > >> >> Marek
> >>>          > > > >> >>
> >>>          > > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie
> >>>          <airlied@gmail.com <mailto:airlied@gmail.com>
> >>>          <mailto:airlied@gmail.com <mailto:airlied@gmail.com>>> wrote:
> >>>          > > > >> >>
> >>>          > > > >> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
> >>>          > > > >> >>     <ckoenig.leichtzumerken@gmail.com
> >>>          <mailto:ckoenig.leichtzumerken@gmail.com>
> >>>          <mailto:ckoenig.leichtzumerken@gmail.com
> >>>          <mailto:ckoenig.leichtzumerken@gmail.com>>> wrote:
> >>>          > > > >> >>     >
> >>>          > > > >> >>     > Correct, we wouldn't have synchronization
> >>>          between device with and without user queues any more.
> >>>          > > > >> >>     >
> >>>          > > > >> >>     > That could only be a problem for A+I Laptops.
> >>>          > > > >> >>
> >>>          > > > >> >>     Since I think you mentioned you'd only be
> >>>          enabling this on newer
> >>>          > > > >> >>     chipsets, won't it be a problem for A+A where
> >>>          one A is a generation
> >>>          > > > >> >>     behind the other?
> >>>          > > > >> >>
> >>>          > > > >> >
> >>>          > > > >> > Crap, that is a good point as well.
> >>>          > > > >> >
> >>>          > > > >> >>
> >>>          > > > >> >>     I'm not really liking where this is going btw,
> >>>          seems like a ill
> >>>          > > > >> >>     thought out concept, if AMD is really going
> >>>          down the road of designing
> >>>          > > > >> >>     hw that is currently Linux incompatible, you
> >>>          are going to have to
> >>>          > > > >> >>     accept a big part of the burden in bringing
> >>>          this support in to more
> >>>          > > > >> >>     than just amd drivers for upcoming generations
> >>>          of gpu.
> >>>          > > > >> >>
> >>>          > > > >> >
> >>>          > > > >> > Well we don't really like that either, but we have
> >>>          no other option as far as I can see.
> >>>          > > > >>
> >>>          > > > >> I don't really understand what "future hw may remove
> >>>          support for kernel queues" means exactly. While the
> >>>          per-context queues can be mapped to userspace directly, they
> >>>          don't *have* to be, do they? I.e. the kernel driver should be
> >>>          able to either intercept userspace access to the queues, or in
> >>>          the worst case do it all itself, and provide the existing
> >>>          synchronization semantics as needed?
> >>>          > > > >>
> >>>          > > > >> Surely there are resource limits for the per-context
> >>>          queues, so the kernel driver needs to do some kind of
> >>>          virtualization / multi-plexing anyway, or we'll get sad user
> >>>          faces when there's no queue available for <current hot game>.
> >>>          > > > >>
> >>>          > > > >> I'm probably missing something though, awaiting
> >>>          enlightenment. :)
> >>>          > > > >
> >>>          > > > >
> >>>          > > > > The hw interface for userspace is that the ring buffer
> >>>          is mapped to the process address space alongside a doorbell
> >>>          aperture (4K page) that isn't real memory, but when the CPU
> >>>          writes into it, it tells the hw scheduler that there are new
> >>>          GPU commands in the ring buffer. Userspace inserts all the
> >>>          wait, draw, and signal commands into the ring buffer and then
> >>>          "rings" the doorbell. It's my understanding that the ring
> >>>          buffer and the doorbell are always mapped in the same GPU
> >>>          address space as the process, which makes it very difficult to
> >>>          emulate the current protected ring buffers in the kernel. The
> >>>          VMID of the ring buffer is also not changeable.
> >>>          > > > >
> >>>          > > >
> >>>          > > > The doorbell does not have to be mapped into the
> >>>          process's GPU virtual
> >>>          > > > address space.  The CPU could write to it directly.
> >>>          Mapping it into
> >>>          > > > the GPU's virtual address space would allow you to have
> >>>          a device kick
> >>>          > > > off work however rather than the CPU. E.g., the GPU
> >>>          could kick off
> >>>          > > > it's own work or multiple devices could kick off work
> >>>          without CPU
> >>>          > > > involvement.
> >>>          > > >
> >>>          > > > Alex
> >>>          > > >
> >>>          > > >
> >>>          > > > > The hw scheduler doesn't do any synchronization and it
> >>>          doesn't see any dependencies. It only chooses which queue to
> >>>          execute, so it's really just a simple queue manager handling
> >>>          the virtualization aspect and not much else.
> >>>          > > > >
> >>>          > > > > Marek
> >>>          > > > > _______________________________________________
> >>>          > > > > dri-devel mailing list
> >>>          > > > > dri-devel@lists.freedesktop.org
> >>>          <mailto:dri-devel@lists.freedesktop.org>
> >>>          > > > >
> >>>          https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >>>          <https://lists.freedesktop.org/mailman/listinfo/dri-devel>
> >>>          > > > _______________________________________________
> >>>          > > > mesa-dev mailing list
> >>>          > > > mesa-dev@lists.freedesktop.org
> >>>          <mailto:mesa-dev@lists.freedesktop.org>
> >>>          > > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
> >>>          <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>
> >>>          > > _______________________________________________
> >>>          > > dri-devel mailing list
> >>>          > > dri-devel@lists.freedesktop.org
> >>>          <mailto:dri-devel@lists.freedesktop.org>
> >>>          > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >>>          <https://lists.freedesktop.org/mailman/listinfo/dri-devel>
> >>>
> >> _______________________________________________
> >> dri-devel mailing list
> >> dri-devel@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-04  8:27                                                   ` Daniel Vetter
@ 2021-05-04  9:14                                                     ` Christian König
  2021-05-04  9:47                                                       ` Daniel Vetter
  0 siblings, 1 reply; 105+ messages in thread
From: Christian König @ 2021-05-04  9:14 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: dri-devel, ML Mesa-dev, Michel Dänzer, Jason Ekstrand,
	Marek Olšák

Am 04.05.21 um 10:27 schrieb Daniel Vetter:
> On Tue, May 4, 2021 at 10:09 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 04.05.21 um 09:32 schrieb Daniel Vetter:
>>> On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
>>>> Unfortunately as I pointed out to Daniel as well this won't work 100%
>>>> reliable either.
>>> You're claiming this, but there's no clear reason why really, and you
>>> did't reply to my last mail on that sub-thread, so I really don't get
>>> where exactly you're seeing a problem.
>> Yeah, it's rather hard to explain without pointing out how the hardware
>> works in detail.
>>
>>>> See the signal on the ring buffer needs to be protected by manipulation from
>>>> userspace so that we can guarantee that the hardware really has finished
>>>> executing when it fires.
>>> Nope you don't. Userspace is already allowed to submit all kinds of random
>>> garbage, the only thing the kernel has to guarnatee is:
>>> - the dma-fence DAG stays a DAG
>>> - dma-fence completes in finite time
>>>
>>> Everything else is not the kernel's problem, and if userspace mixes stuff
>>> up like manipulates the seqno, that's ok. It can do that kind of garbage
>>> already.
>>>
>>>> Protecting memory by immediate page table updates is a good first step, but
>>>> unfortunately not sufficient (and we would need to restructure large parts
>>>> of the driver to make this happen).
>>> This is why you need the unload-fence on top, because indeed you can't
>>> just rely on the fences created from the userspace ring, those are
>>> unreliable for memory management.
>> And exactly that's the problem! We can't provide a reliable unload-fence
>> and the user fences are unreliable for that.
>>
>> I've talked this through lengthy with our hardware/firmware guy last
>> Thursday but couldn't find a solution either.
>>
>> We can have a preemption fence for the kernel which says: Hey this queue
>> was scheduled away you can touch it's hardware descriptor, control
>> registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again.
>> But that one is only triggered on preemption and then we have the same
>> ordering problems once more.
>>
>> Or we can have a end of operation fence for userspace which says: Hey
>> this queue has finished it's batch of execution, but this one is
>> manipulable from userspace in both finish to early (very very bad for
>> invalidations and memory management) or finish to late/never (deadlock
>> prone but fixable by timeout).
>>
>> What we could do is to use the preemption fence to emulate the unload
>> fence, e.g. something like:
>> 1. Preempt the queue in fixed intervals (let's say 100ms).
>> 2. While preempted check if we have reached the checkpoint in question
>> by looking at the hardware descriptor.
>> 3. If we have reached the checkpoint signal the unload fence.
>> 4. If we haven't reached the checkpoint resume the queue again.
>>
>> The problem is that this might introduce a maximum of 100ms delay before
>> signaling the unload fence and preempt/resume has such a hefty overhead
>> that we waste a horrible amount of time on it.
> So your hw can preempt? That's good enough.
>
> The unload fence is just
> 1. wait for all dma_fence that are based on the userspace ring. This
> is unreliable, but we don't care because tdr will make it reliable.
> And once tdr shot down a context we'll force-unload and thrash it
> completely, which solves the problem.
> 2. preempt the context, which /should/ now be stuck waiting for more
> commands to be stuffed into the ringbuffer. Which means your
> preemption is hopefully fast enough to not matter. If your hw takes
> forever to preempt an idle ring, I can't help you :-)

Yeah, it just takes too long for the preemption to complete to be really 
useful for the feature we are discussing here.

As I said, when the kernel requests to preempt a queue we can easily 
expect a delay of ~100ms until that comes back. For compute that is 
even in the multiple-seconds range.

The "preemption" feature is really called suspend and made just for the 
case when we want to put a process to sleep or need to forcefully kill 
it for misbehavior or stuff like that. It is not meant to be used in 
normal operation.

If we only attach it on ->move then yeah, that's maybe a last-resort 
possibility to do it this way, but I think in that case we could rather 
stick with kernel submissions.

> Also, if userspace lies to us and keeps pushing crap into the ring
> after it's supposed to be idle: Userspace is already allowed to waste
> gpu time. If you're too worried about this set a fairly aggressive
> preempt timeout on the unload fence, and kill the context if it takes
> longer than what preempting an idle ring should take (because that
> would indicate broken/evil userspace).

I think you have the wrong expectation here. It is perfectly valid and 
expected for userspace to keep writing commands into the ring buffer.

After all when one frame is completed they want to immediately start 
rendering the next one.

> Again, I'm not seeing the problem. Except if your hw is really
> completely busted to the point where it can't even support userspace
> ringbuffers properly and with sufficient performance :-P
>
> Of course if you issue the preempt context request before the
> userspace fences have finished (or tdr cleaned up the mess) like you
> do in your proposal, then it will be ridiculously expensive and/or
> wont work. So just don't do that.
>
>>> btw I thought some more, and I think it's probably best if we only attach
>>> the unload-fence in the ->move(_notify) callbacks. Kinda like we already
>>> do for async copy jobs. So the overall buffer move sequence would be:
>>>
>>> 1. wait for (untrusted for kernel, but necessary for userspace
>>> correctness) fake dma-fence that rely on the userspace ring
>>>
>>> 2. unload ctx
>>>
>>> 3. copy buffer
>>>
>>> Ofc 2&3 would be done async behind a dma_fence.
>>>
>>>> On older hardware we often had the situation that for reliable invalidation
>>>> we need the guarantee that every previous operation has finished executing.
>>>> It's not so much of a problem when the next operation has already started,
>>>> since then we had the opportunity to do things in between the last and the
>>>> next operation. Just see cache invalidation and VM switching for example.
>>> If you have gpu page faults you generally have synchronous tlb
>>> invalidation,
>> Please tell that our hardware engineers :)
>>
>> We have two modes of operation, see the whole XNACK on/off discussion on
>> the amdgfx mailing list.
> I didn't find this anywhere with a quick search. Pointers to archive
> (lore.kernel.org/amd-gfx is the best imo).

Can't find that offhand either, but see the amdgpu_noretry module option.

It basically tells the hardware whether retry page faults should be 
supported or not, because this whole TLB shutdown thing when they are 
supported is extremely costly.
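
Just to make "module option" concrete: it is an ordinary integer module 
parameter. The snippet below is a generic illustration only; the exact 
name, default and value semantics in the real amdgpu code may differ:

#include <linux/module.h>

/* Illustration in the style of amdgpu's noretry option, not the
 * actual driver code.                                               */
static int noretry = -1;  /* assumed: -1 = auto, 0 = retry on, 1 = off */
module_param(noretry, int, 0444);
MODULE_PARM_DESC(noretry,
        "Disable retry page faults (-1 = auto, 0 = enable, 1 = disable)");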

>>> so this also shouldn't be a big problem. Combined with the
>>> unload fence at least. If you don't have synchronous tlb invalidate it
>>> gets a bit more nasty and you need to force a preemption to a kernel
>>> context which has the required flushes across all the caches. Slightly
>>> nasty, but the exact same thing would be required for handling page faults
>>> anyway with the direct userspace submit model.
>>>
>>> Again I'm not seeing a problem.
>>>
>>>> Additional to that it doesn't really buy us anything, e.g. there is not much
>>>> advantage to this. Writing the ring buffer in userspace and then ringing in
>>>> the kernel has the same overhead as doing everything in the kernel in the
>>>> first place.
>>> It gets you dma-fence backwards compat without having to rewrite the
>>> entire userspace ecosystem. Also since you have the hw already designed
>>> for ringbuffer in userspace it would be silly to copy that through the cs
>>> ioctl, that's just overhead.
>>>
>>> Also I thought the problem you're having is that all the kernel ringbuf
>>> stuff is going away, so the old cs ioctl wont work anymore for sure?
>> We still have a bit more time for this. As I learned from our firmware
>> engineer last Thursday the Windows side is running into similar problems
>> as we do.
> This story sounds familiar, I've heard it a few times here at intel
> too on various things where we complained and then windows hit the
> same issues too :-)
>
> E.g. I've just learned that all the things we've discussed around gpu
> page faults vs 3d workloads and how you need to reserve some CU for 3d
> guaranteed forward progress or even worse measures is also something
> they're hitting on Windows. Apparently they fixed it by only running
> 3d or compute workloads at the same time, but not both.

I'm not even sure if we are going to see user fences on Windows with the 
next hw generation.

Before we can continue with this discussion we need to figure out how to 
make the hardware reliable.

In other words, if we had explicit user fences everywhere, how would we 
handle timeouts and misbehaving processes? As it turns out, they haven't 
figured this out on Windows yet either.

>
>>> Maybe also pick up that other subthread which ended with my last reply.
>> I will send out another proposal for how to handle user fences shortly.
> Maybe let's discuss this here first before we commit to requiring all
> userspace to upgrade to user fences ... I do agree that we want to go
> there too, but breaking all the compositors is probably not the best
> option.

I was thinking more about handling it all in the kernel.

Christian.

>
> Cheers, Daniel
>
>
>> Cheers,
>> Christian.
>>
>>> Cheers, Daniel
>>>
>>>
>>>> Christian.
>>>>
>>>> Am 04.05.21 um 05:11 schrieb Marek Olšák:
>>>>> Proposal for a new CS ioctl, kernel pseudo code:
>>>>>
>>>>> lock(&global_lock);
>>>>> serial = get_next_serial(dev);
>>>>> add_wait_command(ring, serial - 1);
>>>>> add_exec_cmdbuf(ring, user_cmdbuf);
>>>>> add_signal_command(ring, serial);
>>>>> *ring->doorbell = FIRE;
>>>>> unlock(&global_lock);
>>>>>
>>>>> See? Just like userspace submit, but in the kernel without
>>>>> concurrency/preemption. Is this now safe enough for dma_fence?
>>>>>
>>>>> Marek
>>>>>
>>>>> On Mon, May 3, 2021 at 4:36 PM Marek Olšák <maraeo@gmail.com
>>>>> <mailto:maraeo@gmail.com>> wrote:
>>>>>
>>>>>       What about direct submit from the kernel where the process still
>>>>>       has write access to the GPU ring buffer but doesn't use it? I
>>>>>       think that solves your preemption example, but leaves a potential
>>>>>       backdoor for a process to overwrite the signal commands, which
>>>>>       shouldn't be a problem since we are OK with timeouts.
>>>>>
>>>>>       Marek
>>>>>
>>>>>       On Mon, May 3, 2021 at 11:23 AM Jason Ekstrand
>>>>>       <jason@jlekstrand.net <mailto:jason@jlekstrand.net>> wrote:
>>>>>
>>>>>           On Mon, May 3, 2021 at 10:16 AM Bas Nieuwenhuizen
>>>>>           <bas@basnieuwenhuizen.nl <mailto:bas@basnieuwenhuizen.nl>> wrote:
>>>>>           >
>>>>>           > On Mon, May 3, 2021 at 5:00 PM Jason Ekstrand
>>>>>           <jason@jlekstrand.net <mailto:jason@jlekstrand.net>> wrote:
>>>>>           > >
>>>>>           > > Sorry for the top-post but there's no good thing to reply
>>>>>           to here...
>>>>>           > >
>>>>>           > > One of the things pointed out to me recently by Daniel
>>>>>           Vetter that I
>>>>>           > > didn't fully understand before is that dma_buf has a very
>>>>>           subtle
>>>>>           > > second requirement beyond finite time completion:  Nothing
>>>>>           required
>>>>>           > > for signaling a dma-fence can allocate memory. Why?
>>>>>           Because the act
>>>>>           > > of allocating memory may wait on your dma-fence.  This, as
>>>>>           it turns
>>>>>           > > out, is a massively more strict requirement than finite time
>>>>>           > > completion and, I think, throws out all of the proposals
>>>>>           we have so
>>>>>           > > far.
>>>>>           > >
>>>>>           > > Take, for instance, Marek's proposal for userspace
>>>>>           involvement with
>>>>>           > > dma-fence by asking the kernel for a next serial and the
>>>>>           kernel
>>>>>           > > trusting userspace to signal it.  That doesn't work at all if
>>>>>           > > allocating memory to trigger a dma-fence can blow up.
>>>>>           There's simply
>>>>>           > > no way for the kernel to trust userspace to not do
>>>>>           ANYTHING which
>>>>>           > > might allocate memory.  I don't even think there's a way
>>>>>           userspace can
>>>>>           > > trust itself there.  It also blows up my plan of moving
>>>>>           the fences to
>>>>>           > > transition boundaries.
>>>>>           > >
>>>>>           > > Not sure where that leaves us.
>>>>>           >
>>>>>           > Honestly the more I look at things I think
>>>>>           userspace-signalable fences
>>>>>           > with a timeout sound like they are a valid solution for
>>>>>           these issues.
>>>>>           > Especially since (as has been mentioned countless times in
>>>>>           this email
>>>>>           > thread) userspace already has a lot of ways to cause
>>>>>           timeouts and or
>>>>>           > GPU hangs through GPU work already.
>>>>>           >
>>>>>           > Adding a timeout on the signaling side of a dma_fence would
>>>>>           ensure:
>>>>>           >
>>>>>           > - The dma_fence signals in finite time
>>>>>           > -  If the timeout case does not allocate memory then memory
>>>>>           allocation
>>>>>           > is not a blocker for signaling.
>>>>>           >
>>>>>           > Of course you lose the full dependency graph and we need to
>>>>>           make sure
>>>>>           > garbage collection of fences works correctly when we have
>>>>>           cycles.
>>>>>           > However, the latter sounds very doable and the first sounds
>>>>>           like it is
>>>>>           > to some extent inevitable.
>>>>>           >
>>>>>           > I feel like I'm missing some requirement here given that we
>>>>>           > immediately went to much more complicated things but can't
>>>>>           find it.
>>>>>           > Thoughts?
>>>>>
>>>>>           Timeouts are sufficient to protect the kernel but they make
>>>>>           the fences
>>>>>           unpredictable and unreliable from a userspace PoV.  One of the big
>>>>>           problems we face is that, once we expose a dma_fence to userspace,
>>>>>           we've allowed for some pretty crazy potential dependencies that
>>>>>           neither userspace nor the kernel can sort out.  Say you have
>>>>>           marek's
>>>>>           "next serial, please" proposal and a multi-threaded application.
>>>>>           Between time time you ask the kernel for a serial and get a
>>>>>           dma_fence
>>>>>           and submit the work to signal that serial, your process may get
>>>>>           preempted, something else shoved in which allocates memory,
>>>>>           and then
>>>>>           we end up blocking on that dma_fence.  There's no way
>>>>>           userspace can
>>>>>           predict and defend itself from that.
>>>>>
>>>>>           So I think where that leaves us is that there is no safe place to
>>>>>           create a dma_fence except for inside the ioctl which submits
>>>>>           the work
>>>>>           and only after any necessary memory has been allocated. That's a
>>>>>           pretty stiff requirement.  We may still be able to interact with
>>>>>           userspace a bit more explicitly but I think it throws any
>>>>>           notion of
>>>>>           userspace direct submit out the window.
>>>>>
>>>>>           --Jason
>>>>>
>>>>>
>>>>>           > - Bas
>>>>>           > >
>>>>>           > > --Jason
>>>>>           > >
>>>>>           > > On Mon, May 3, 2021 at 9:42 AM Alex Deucher
>>>>>           <alexdeucher@gmail.com <mailto:alexdeucher@gmail.com>> wrote:
>>>>>           > > >
>>>>>           > > > On Sat, May 1, 2021 at 6:27 PM Marek Olšák
>>>>>           <maraeo@gmail.com <mailto:maraeo@gmail.com>> wrote:
>>>>>           > > > >
>>>>>           > > > > On Wed, Apr 28, 2021 at 5:07 AM Michel Dänzer
>>>>>           <michel@daenzer.net <mailto:michel@daenzer.net>> wrote:
>>>>>           > > > >>
>>>>>           > > > >> On 2021-04-28 8:59 a.m., Christian König wrote:
>>>>>           > > > >> > Hi Dave,
>>>>>           > > > >> >
>>>>>           > > > >> > Am 27.04.21 um 21:23 schrieb Marek Olšák:
>>>>>           > > > >> >> Supporting interop with any device is always
>>>>>           possible. It depends on which drivers we need to interoperate
>>>>>           with and update them. We've already found the path forward for
>>>>>           amdgpu. We just need to find out how many other drivers need
>>>>>           to be updated and evaluate the cost/benefit aspect.
>>>>>           > > > >> >>
>>>>>           > > > >> >> Marek
>>>>>           > > > >> >>
>>>>>           > > > >> >> On Tue, Apr 27, 2021 at 2:38 PM Dave Airlie
>>>>>           <airlied@gmail.com <mailto:airlied@gmail.com>
>>>>>           <mailto:airlied@gmail.com <mailto:airlied@gmail.com>>> wrote:
>>>>>           > > > >> >>
>>>>>           > > > >> >>     On Tue, 27 Apr 2021 at 22:06, Christian König
>>>>>           > > > >> >>     <ckoenig.leichtzumerken@gmail.com
>>>>>           <mailto:ckoenig.leichtzumerken@gmail.com>
>>>>>           <mailto:ckoenig.leichtzumerken@gmail.com
>>>>>           <mailto:ckoenig.leichtzumerken@gmail.com>>> wrote:
>>>>>           > > > >> >>     >
>>>>>           > > > >> >>     > Correct, we wouldn't have synchronization
>>>>>           between device with and without user queues any more.
>>>>>           > > > >> >>     >
>>>>>           > > > >> >>     > That could only be a problem for A+I Laptops.
>>>>>           > > > >> >>
>>>>>           > > > >> >>     Since I think you mentioned you'd only be
>>>>>           enabling this on newer
>>>>>           > > > >> >>     chipsets, won't it be a problem for A+A where
>>>>>           one A is a generation
>>>>>           > > > >> >>     behind the other?
>>>>>           > > > >> >>
>>>>>           > > > >> >
>>>>>           > > > >> > Crap, that is a good point as well.
>>>>>           > > > >> >
>>>>>           > > > >> >>
>>>>>           > > > >> >>     I'm not really liking where this is going btw,
>>>>>           seems like a ill
>>>>>           > > > >> >>     thought out concept, if AMD is really going
>>>>>           down the road of designing
>>>>>           > > > >> >>     hw that is currently Linux incompatible, you
>>>>>           are going to have to
>>>>>           > > > >> >>     accept a big part of the burden in bringing
>>>>>           this support in to more
>>>>>           > > > >> >>     than just amd drivers for upcoming generations
>>>>>           of gpu.
>>>>>           > > > >> >>
>>>>>           > > > >> >
>>>>>           > > > >> > Well we don't really like that either, but we have
>>>>>           no other option as far as I can see.
>>>>>           > > > >>
>>>>>           > > > >> I don't really understand what "future hw may remove
>>>>>           support for kernel queues" means exactly. While the
>>>>>           per-context queues can be mapped to userspace directly, they
>>>>>           don't *have* to be, do they? I.e. the kernel driver should be
>>>>>           able to either intercept userspace access to the queues, or in
>>>>>           the worst case do it all itself, and provide the existing
>>>>>           synchronization semantics as needed?
>>>>>           > > > >>
>>>>>           > > > >> Surely there are resource limits for the per-context
>>>>>           queues, so the kernel driver needs to do some kind of
>>>>>           virtualization / multi-plexing anyway, or we'll get sad user
>>>>>           faces when there's no queue available for <current hot game>.
>>>>>           > > > >>
>>>>>           > > > >> I'm probably missing something though, awaiting
>>>>>           enlightenment. :)
>>>>>           > > > >
>>>>>           > > > >
>>>>>           > > > > The hw interface for userspace is that the ring buffer
>>>>>           is mapped to the process address space alongside a doorbell
>>>>>           aperture (4K page) that isn't real memory, but when the CPU
>>>>>           writes into it, it tells the hw scheduler that there are new
>>>>>           GPU commands in the ring buffer. Userspace inserts all the
>>>>>           wait, draw, and signal commands into the ring buffer and then
>>>>>           "rings" the doorbell. It's my understanding that the ring
>>>>>           buffer and the doorbell are always mapped in the same GPU
>>>>>           address space as the process, which makes it very difficult to
>>>>>           emulate the current protected ring buffers in the kernel. The
>>>>>           VMID of the ring buffer is also not changeable.
>>>>>           > > > >
>>>>>           > > >
>>>>>           > > > The doorbell does not have to be mapped into the
>>>>>           process's GPU virtual
>>>>>           > > > address space.  The CPU could write to it directly.
>>>>>           Mapping it into
>>>>>           > > > the GPU's virtual address space would allow you to have
>>>>>           a device kick
>>>>>           > > > off work however rather than the CPU. E.g., the GPU
>>>>>           could kick off
>>>>>           > > > it's own work or multiple devices could kick off work
>>>>>           without CPU
>>>>>           > > > involvement.
>>>>>           > > >
>>>>>           > > > Alex
>>>>>           > > >
>>>>>           > > >
>>>>>           > > > > The hw scheduler doesn't do any synchronization and it
>>>>>           doesn't see any dependencies. It only chooses which queue to
>>>>>           execute, so it's really just a simple queue manager handling
>>>>>           the virtualization aspect and not much else.
>>>>>           > > > >
>>>>>           > > > > Marek
>>>>>           > > > > _______________________________________________
>>>>>           > > > > dri-devel mailing list
>>>>>           > > > > dri-devel@lists.freedesktop.org
>>>>>           <mailto:dri-devel@lists.freedesktop.org>
>>>>>           > > > >
>>>>>           https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>           <https://lists.freedesktop.org/mailman/listinfo/dri-devel>
>>>>>           > > > _______________________________________________
>>>>>           > > > mesa-dev mailing list
>>>>>           > > > mesa-dev@lists.freedesktop.org
>>>>>           <mailto:mesa-dev@lists.freedesktop.org>
>>>>>           > > > https://lists.freedesktop.org/mailman/listinfo/mesa-dev
>>>>>           <https://lists.freedesktop.org/mailman/listinfo/mesa-dev>
>>>>>           > > _______________________________________________
>>>>>           > > dri-devel mailing list
>>>>>           > > dri-devel@lists.freedesktop.org
>>>>>           <mailto:dri-devel@lists.freedesktop.org>
>>>>>           > > https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>           <https://lists.freedesktop.org/mailman/listinfo/dri-devel>
>>>>>
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-04  9:14                                                     ` Christian König
@ 2021-05-04  9:47                                                       ` Daniel Vetter
  2021-05-04 10:53                                                         ` Christian König
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-05-04  9:47 UTC (permalink / raw)
  To: Christian König
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

On Tue, May 04, 2021 at 11:14:06AM +0200, Christian König wrote:
> Am 04.05.21 um 10:27 schrieb Daniel Vetter:
> > On Tue, May 4, 2021 at 10:09 AM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> > > Am 04.05.21 um 09:32 schrieb Daniel Vetter:
> > > > On Tue, May 04, 2021 at 09:01:23AM +0200, Christian König wrote:
> > > > > Unfortunately as I pointed out to Daniel as well this won't work 100%
> > > > > reliable either.
> > > > You're claiming this, but there's no clear reason why really, and you
> > > > did't reply to my last mail on that sub-thread, so I really don't get
> > > > where exactly you're seeing a problem.
> > > Yeah, it's rather hard to explain without pointing out how the hardware
> > > works in detail.
> > > 
> > > > > See the signal on the ring buffer needs to be protected by manipulation from
> > > > > userspace so that we can guarantee that the hardware really has finished
> > > > > executing when it fires.
> > > > Nope you don't. Userspace is already allowed to submit all kinds of random
> > > > garbage, the only thing the kernel has to guarnatee is:
> > > > - the dma-fence DAG stays a DAG
> > > > - dma-fence completes in finite time
> > > > 
> > > > Everything else is not the kernel's problem, and if userspace mixes stuff
> > > > up like manipulates the seqno, that's ok. It can do that kind of garbage
> > > > already.
> > > > 
> > > > > Protecting memory by immediate page table updates is a good first step, but
> > > > > unfortunately not sufficient (and we would need to restructure large parts
> > > > > of the driver to make this happen).
> > > > This is why you need the unload-fence on top, because indeed you can't
> > > > just rely on the fences created from the userspace ring, those are
> > > > unreliable for memory management.
> > > And exactly that's the problem! We can't provide a reliable unload-fence
> > > and the user fences are unreliable for that.
> > > 
> > > I've talked this through lengthy with our hardware/firmware guy last
> > > Thursday but couldn't find a solution either.
> > > 
> > > We can have a preemption fence for the kernel which says: Hey this queue
> > > was scheduled away you can touch it's hardware descriptor, control
> > > registers, page tables, TLB, memory, GWS, GDS, OA etc etc etc... again.
> > > But that one is only triggered on preemption and then we have the same
> > > ordering problems once more.
> > > 
> > > Or we can have a end of operation fence for userspace which says: Hey
> > > this queue has finished it's batch of execution, but this one is
> > > manipulable from userspace in both finish to early (very very bad for
> > > invalidations and memory management) or finish to late/never (deadlock
> > > prone but fixable by timeout).
> > > 
> > > What we could do is to use the preemption fence to emulate the unload
> > > fence, e.g. something like:
> > > 1. Preempt the queue in fixed intervals (let's say 100ms).
> > > 2. While preempted check if we have reached the checkpoint in question
> > > by looking at the hardware descriptor.
> > > 3. If we have reached the checkpoint signal the unload fence.
> > > 4. If we haven't reached the checkpoint resume the queue again.
> > > 
> > > The problem is that this might introduce a maximum of 100ms delay before
> > > signaling the unload fence and preempt/resume has such a hefty overhead
> > > that we waste a horrible amount of time on it.
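
A rough sketch of the polling scheme in steps 1-4 above; every function name (queue_suspend, queue_read_checkpoint, unload_fence_signal) and the 100ms period are hypothetical, and the sketch only illustrates where the suspend/resume overhead comes from.

#include <stdint.h>

struct uq;                                     /* opaque user queue          */
void queue_suspend(struct uq *q);              /* "preemption"/suspend, slow */
void queue_resume(struct uq *q);
uint64_t queue_read_checkpoint(struct uq *q);  /* read the hw descriptor     */
void unload_fence_signal(struct uq *q);
void sleep_ms(unsigned int ms);

static void emulate_unload_fence(struct uq *q, uint64_t checkpoint)
{
    for (;;) {
        sleep_ms(100);                         /* 1. fixed interval           */
        queue_suspend(q);                      /*    suspend is the slow part */
        if (queue_read_checkpoint(q) >= checkpoint) {  /* 2. check descriptor */
            unload_fence_signal(q);            /* 3. checkpoint reached       */
            return;                            /*    (resume policy left open)*/
        }
        queue_resume(q);                       /* 4. not reached yet, resume  */
    }
}
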
> > So your hw can preempt? That's good enough.
> > 
> > The unload fence is just
> > 1. wait for all dma_fence that are based on the userspace ring. This
> > is unreliable, but we don't care because tdr will make it reliable.
> > And once tdr shot down a context we'll force-unload and thrash it
> > completely, which solves the problem.
> > 2. preempt the context, which /should/ now be stuck waiting for more
> > commands to be stuffed into the ringbuffer. Which means your
> > preemption is hopefully fast enough to not matter. If your hw takes
> > forever to preempt an idle ring, I can't help you :-)
> 
> Yeah, it just takes to long for the preemption to complete to be really
> useful for the feature we are discussing here.
> 
> As I said when the kernel requests to preempt a queue we can easily expect a
> timeout of ~100ms until that comes back. For compute that is even in the
> multiple seconds range.

100ms for preempting an idle request sounds like broken hw to me. Of
course preempting something that actually runs takes a while, that's
nothing new. But it's also not the thing we're talking about here. Is this
100ms an actual number from hw for an actual idle ringbuffer?

> The "preemption" feature is really called suspend and made just for the case
> when we want to put a process to sleep or need to forcefully kill it for
> misbehavior or stuff like that. It is not meant to be used in normal
> operation.
> 
> If we only attach it on ->move then yeah maybe a last resort possibility to
> do it this way, but I think in that case we could rather stick with kernel
> submissions.

Well this is a hybrid userspace ring + kernel augmented submit mode, so you
can keep dma-fences working. Because the dma-fence stuff won't work with
pure userspace submit, I think that conclusion is rather solid. Once more,
even after this long thread here.
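
As a sketch of one way such a hybrid submit could look (not necessarily what is meant here), the kernel could publish the write pointer itself and hand back a TDR-backed fence; all names below are invented for illustration.

#include <stdint.h>

struct dma_fence;                     /* stand-in for the kernel's type      */
struct uq;                            /* a userspace-visible hw queue        */

struct hybrid_submit {
    uint64_t new_wptr;                /* write pointer after userspace filled
                                       * its packets into the mapped ring    */
};

struct dma_fence *fence_create_tdr_backed(struct uq *q, uint64_t seqno);
void kernel_ring_doorbell(struct uq *q, uint64_t new_wptr);

static struct dma_fence *hybrid_submit(struct uq *q,
                                       const struct hybrid_submit *s)
{
    /* Userspace wrote the packets; the kernel only publishes the new write
     * pointer and returns a fence that the TDR/timeout machinery will
     * force-complete if the queue never reaches it. */
    struct dma_fence *f = fence_create_tdr_backed(q, s->new_wptr);

    kernel_ring_doorbell(q, s->new_wptr);
    return f;
}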

> > Also, if userspace lies to us and keeps pushing crap into the ring
> > after it's supposed to be idle: Userspace is already allowed to waste
> > gpu time. If you're too worried about this set a fairly aggressive
> > preempt timeout on the unload fence, and kill the context if it takes
> > longer than what preempting an idle ring should take (because that
> > would indicate broken/evil userspace).
> 
> I think you have the wrong expectation here. It is perfectly valid and
> expected for userspace to keep writing commands into the ring buffer.
> 
> After all when one frame is completed they want to immediately start
> rendering the next one.

Sure, for the true userspace direct submit model. But with that you don't
get dma-fence, which means this gpu will not work for 3d accel on any
current linux desktop.

Which sucks, hence some hybrid model of using the userspace ring and
kernel augmented submit is needed. Which was my idea.

> > Again, I'm not seeing the problem. Except if your hw is really
> > completely busted to the point where it can't even support userspace
> > ringbuffers properly and with sufficient performance :-P
> > 
> > Of course if you issue the preempt context request before the
> > userspace fences have finished (or tdr cleaned up the mess) like you
> > do in your proposal, then it will be ridiculously expensive and/or
> > wont work. So just don't do that.
> > 
> > > > btw I thought some more, and I think it's probably best if we only attach
> > > > the unload-fence in the ->move(_notify) callbacks. Kinda like we already
> > > > do for async copy jobs. So the overall buffer move sequence would be:
> > > > 
> > > > 1. wait for (untrusted for kernel, but necessary for userspace
> > > > correctness) fake dma-fence that rely on the userspace ring
> > > > 
> > > > 2. unload ctx
> > > > 
> > > > 3. copy buffer
> > > > 
> > > > Ofc 2&3 would be done async behind a dma_fence.
> > > > 
> > > > > On older hardware we often had the situation that for reliable invalidation
> > > > > we need the guarantee that every previous operation has finished executing.
> > > > > It's not so much of a problem when the next operation has already started,
> > > > > since then we had the opportunity to do things in between the last and the
> > > > > next operation. Just see cache invalidation and VM switching for example.
> > > > If you have gpu page faults you generally have synchronous tlb
> > > > invalidation,
> > > Please tell that our hardware engineers :)
> > > 
> > > We have two modes of operation, see the whole XNACK on/off discussion on
> > > the amdgfx mailing list.
> > I didn't find this anywhere with a quick search. Pointers to archive
> > (lore.kernel.org/amd-gfx is the best imo).
> 
> Can't find that of hand either, but see the amdgpu_noretry module option.
> 
> It basically tells the hardware if retry page faults should be supported or
> not because this whole TLB shutdown thing when they are supported is
> extremely costly.

Hm so synchronous tlb shootdown is a lot more costly when you allow
retrying of page faults?

That sounds bad, because for full hmm mode you need to be able to retry
pagefaults. Well at least the PASID/ATS/IOMMU side will do that, and might just
hang your gpu for a long time while it's waiting for the va->pa lookup
response to return. So retrying lookups shouldn't be any different really.

And you also need fairly fast synchronous tlb shootdown for hmm. So if
your hw has a problem with both together that sounds bad.

> > > > so this also shouldn't be a big problem. Combined with the
> > > > unload fence at least. If you don't have synchronous tlb invalidate it
> > > > gets a bit more nasty and you need to force a preemption to a kernel
> > > > context which has the required flushes across all the caches. Slightly
> > > > nasty, but the exact same thing would be required for handling page faults
> > > > anyway with the direct userspace submit model.
> > > > 
> > > > Again I'm not seeing a problem.
> > > > 
> > > > > Additional to that it doesn't really buy us anything, e.g. there is not much
> > > > > advantage to this. Writing the ring buffer in userspace and then ringing in
> > > > > the kernel has the same overhead as doing everything in the kernel in the
> > > > > first place.
> > > > It gets you dma-fence backwards compat without having to rewrite the
> > > > entire userspace ecosystem. Also since you have the hw already designed
> > > > for ringbuffer in userspace it would be silly to copy that through the cs
> > > > ioctl, that's just overhead.
> > > > 
> > > > Also I thought the problem you're having is that all the kernel ringbuf
> > > > stuff is going away, so the old cs ioctl wont work anymore for sure?
> > > We still have a bit more time for this. As I learned from our firmware
> > > engineer last Thursday the Windows side is running into similar problems
> > > as we do.
> > This story sounds familiar, I've heard it a few times here at intel
> > too on various things where we complained and then windows hit the
> > same issues too :-)
> > 
> > E.g. I've just learned that all the things we've discussed around gpu
> > page faults vs 3d workloads and how you need to reserve some CU for 3d
> > guaranteed forward progress or even worse measures is also something
> > they're hitting on Windows. Apparently they fixed it by only running
> > 3d or compute workloads at the same time, but not both.
> 
> I'm not even sure if we are going to see user fences on Windows with the
> next hw generation.
> 
> Before we can continue with this discussion we need to figure out how to get
> the hardware reliable first.
> 
> In other words if we would have explicit user fences everywhere, how would
> we handle timeouts and misbehaving processes? As it turned out they haven't
> figured this out on Windows yet either.

Lol.

> > > > Maybe also pick up that other subthread which ended with my last reply.
> > > I will send out another proposal for how to handle user fences shortly.
> > Maybe let's discuss this here first before we commit to requiring all
> > userspace to upgrade to user fences ... I do agree that we want to go
> > there too, but breaking all the compositors is probably not the best
> > option.
> 
> I was more thinking about handling it all in the kernel.

Yeah can do, just means that you also have to copy the ringbuffer stuff
over from userspace to the kernel.

It also means that there are more differences in how your userspace works
between full userspace mode (necessary for compute) and legacy dma-fence
mode (necessary for desktop 3d). Which is especially big fun for vulkan,
since that will have to do both.

But then amd is still hanging onto the amdgpu vs amdkfd split, so you're
going for max pain in this area anyway :-P
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-04  9:47                                                       ` Daniel Vetter
@ 2021-05-04 10:53                                                         ` Christian König
  2021-05-04 11:13                                                           ` Daniel Vetter
  0 siblings, 1 reply; 105+ messages in thread
From: Christian König @ 2021-05-04 10:53 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: dri-devel, ML Mesa-dev, Michel Dänzer, Jason Ekstrand,
	Marek Olšák

Am 04.05.21 um 11:47 schrieb Daniel Vetter:
> [SNIP]
>> Yeah, it just takes to long for the preemption to complete to be really
>> useful for the feature we are discussing here.
>>
>> As I said when the kernel requests to preempt a queue we can easily expect a
>> timeout of ~100ms until that comes back. For compute that is even in the
>> multiple seconds range.
> 100ms for preempting an idle request sounds like broken hw to me. Of
> course preemting something that actually runs takes a while, that's
> nothing new. But it's also not the thing we're talking about here. Is this
> 100ms actual numbers from hw for an actual idle ringbuffer?

Well 100ms is just an example of the scheduler granularity. Let me 
explain in a wider context.

The hardware can have X queues mapped at the same time, and every Y time 
interval the hardware scheduler checks whether those queues have changed; 
only if they have changed are the necessary steps to reload them started.

Multiple queues can be rendering at the same time, so you can have X as 
a high-priority queue that is active and just waiting for a signal to start, 
the client rendering one frame after another, and a third background 
compute task mining bitcoins for you.

As long as everything is static this is perfectly performant. Adding a 
queue to the list of active queues is also relatively simple, but taking 
one down requires you to wait until we are sure the hardware has seen 
the change and reloaded the queues.

Think of it as an RCU grace period. This is simply not something which 
is made to be used constantly, but rather just at process termination.
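
A hedged sketch of why unmapping is expensive under this model; the run-list and pass-counter helpers below are invented, and the point is only the grace-period-like wait.

#include <stdint.h>

struct run_list;
void run_list_remove(struct run_list *rl, uint32_t queue_id);
uint64_t hw_sched_pass_counter(void);  /* bumped once per scheduler pass */
void sleep_ms(unsigned int ms);

static void unmap_queue_sync(struct run_list *rl, uint32_t queue_id)
{
    uint64_t seen = hw_sched_pass_counter();

    run_list_remove(rl, queue_id);

    /* Like waiting for an RCU grace period: the removal only takes effect
     * once the hw scheduler has completed a full pass over the updated run
     * list, which is what makes teardown (and OOM/eviction) expensive. */
    while (hw_sched_pass_counter() < seen + 2)
        sleep_ms(1);
}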

>> The "preemption" feature is really called suspend and made just for the case
>> when we want to put a process to sleep or need to forcefully kill it for
>> misbehavior or stuff like that. It is not meant to be used in normal
>> operation.
>>
>> If we only attach it on ->move then yeah maybe a last resort possibility to
>> do it this way, but I think in that case we could rather stick with kernel
>> submissions.
> Well this is a hybrid userspace ring + kernel augmeted submit mode, so you
> can keep dma-fences working. Because the dma-fence stuff wont work with
> pure userspace submit, I think that conclusion is rather solid. Once more
> even after this long thread here.

When assisted with unload fences, then yes. Problem is that I can't see 
how we could implement those in a performant way currently.

>>> Also, if userspace lies to us and keeps pushing crap into the ring
>>> after it's supposed to be idle: Userspace is already allowed to waste
>>> gpu time. If you're too worried about this set a fairly aggressive
>>> preempt timeout on the unload fence, and kill the context if it takes
>>> longer than what preempting an idle ring should take (because that
>>> would indicate broken/evil userspace).
>> I think you have the wrong expectation here. It is perfectly valid and
>> expected for userspace to keep writing commands into the ring buffer.
>>
>> After all when one frame is completed they want to immediately start
>> rendering the next one.
> Sure, for the true userspace direct submit model. But with that you don't
> get dma-fence, which means this gpu will not work for 3d accel on any
> current linux desktop.

I'm not sure of that. I've looked a bit into how we could add user 
fences to dma_resv objects and that isn't that hard after all.

> Which sucks, hence some hybrid model of using the userspace ring and
> kernel augmented submit is needed. Which was my idea.

Yeah, I think when our firmware folks would really remove the kernel 
queue and we still don't have

>
>> [SNIP]
>> Can't find that of hand either, but see the amdgpu_noretry module option.
>>
>> It basically tells the hardware if retry page faults should be supported or
>> not because this whole TLB shutdown thing when they are supported is
>> extremely costly.
> Hm so synchronous tlb shootdown is a lot more costly when you allow
> retrying of page faults?

Partially correct, yes.

See, when you have retry page faults enabled and unmap something, you need 
to make sure that everyone who could have potentially translated that 
page and still has a TLB entry is either invalidated or waited on until 
the access has completed.

Since every CU could be using that memory location, this takes ages to 
complete compared to the normal invalidation, where you just invalidate 
the L1/L2 and are done.

On top of that, the recovery adds some extra overhead to every memory 
access, so even without a fault you are quite a bit slower if this is 
enabled.

> That sounds bad, because for full hmm mode you need to be able to retry
> pagefaults. Well at least the PASID/ATS/IOMMU side will do that, and might just
> hang your gpu for a long time while it's waiting for the va->pa lookup
> response to return. So retrying lookups shouldn't be any different really.
>
> And you also need fairly fast synchronous tlb shootdown for hmm. So if
> your hw has a problem with both together that sounds bad.

Completely agree. And since it was my job to validate the implementation 
on Vega10 I was also the first one to realize that.

Felix, a couple of others and I have been trying to work around those 
restrictions ever since.

> I was more thinking about handling it all in the kernel.
> Yeah can do, just means that you also have to copy the ringbuffer stuff
> over from userspace to the kernel.

That is my least worry. The IBs are just addr+length, so no more than 
16 bytes for each IB.
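
As a tiny illustration of what "just addr+length" means for the copy, a per-IB descriptor could look roughly like the struct below; the exact layout is invented here, not the amdgpu UAPI.

#include <stdint.h>

struct ib_chunk {
    uint64_t gpu_va;     /* GPU virtual address of the command stream */
    uint32_t size_dw;    /* length in dwords                          */
    uint32_t flags;      /* padding / submission flags                */
};  /* 16 bytes per IB, cheap to copy across the ioctl boundary */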

> It also means that there's more differences in how your userspace works
> between full userspace mode (necessary for compute) and legacy dma-fence
> mode (necessary for desktop 3d). Which is especially big fun for vulkan,
> since that will have to do both.

That is the bigger problem.

Christian.

>
> But then amd is still hanging onto the amdgpu vs amdkfd split, so you're
> going for max pain in this area anyway :-P
> -Daniel


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-04 10:53                                                         ` Christian König
@ 2021-05-04 11:13                                                           ` Daniel Vetter
  2021-05-04 12:48                                                             ` Christian König
  0 siblings, 1 reply; 105+ messages in thread
From: Daniel Vetter @ 2021-05-04 11:13 UTC (permalink / raw)
  To: Christian König
  Cc: dri-devel, ML Mesa-dev, Michel Dänzer, Jason Ekstrand,
	Marek Olšák

On Tue, May 4, 2021 at 12:53 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 04.05.21 um 11:47 schrieb Daniel Vetter:
> > [SNIP]
> >> Yeah, it just takes to long for the preemption to complete to be really
> >> useful for the feature we are discussing here.
> >>
> >> As I said when the kernel requests to preempt a queue we can easily expect a
> >> timeout of ~100ms until that comes back. For compute that is even in the
> >> multiple seconds range.
> > 100ms for preempting an idle request sounds like broken hw to me. Of
> > course preemting something that actually runs takes a while, that's
> > nothing new. But it's also not the thing we're talking about here. Is this
> > 100ms actual numbers from hw for an actual idle ringbuffer?
>
> Well 100ms is just an example of the scheduler granularity. Let me
> explain in a wider context.
>
> The hardware can have X queues mapped at the same time and every Y time
> interval the hardware scheduler checks if those queues have changed and
> only if they have changed the necessary steps to reload them are started.
>
> Multiple queues can be rendering at the same time, so you can have X as
> a high priority queue active and just waiting for a signal to start and
> the client rendering one frame after another and a third background
> compute task mining bitcoins for you.
>
> As long as everything is static this is perfectly performant. Adding a
> queue to the list of active queues is also relatively simple, but taking
> one down requires you to wait until we are sure the hardware has seen
> the change and reloaded the queues.
>
> Think of it as an RCU grace period. This is simply not something which
> is made to be used constantly, but rather just at process termination.

Uh ... that indeed sounds rather broken.

Otoh it's just a dma_fence that we'd inject as this unload-fence. So
by and large everyone should already be able to cope with it taking a
bit longer. So from a design pov I don't see a huge problem, but I
guess you guys won't be happy since it means on amd hw there will be
random unsightly stalls in desktop linux usage.

> >> The "preemption" feature is really called suspend and made just for the case
> >> when we want to put a process to sleep or need to forcefully kill it for
> >> misbehavior or stuff like that. It is not meant to be used in normal
> >> operation.
> >>
> >> If we only attach it on ->move then yeah maybe a last resort possibility to
> >> do it this way, but I think in that case we could rather stick with kernel
> >> submissions.
> > Well this is a hybrid userspace ring + kernel augmeted submit mode, so you
> > can keep dma-fences working. Because the dma-fence stuff wont work with
> > pure userspace submit, I think that conclusion is rather solid. Once more
> > even after this long thread here.
>
> When assisted with unload fences, then yes. Problem is that I can't see
> how we could implement those performant currently.

Is there really no way to fix fw here? Like if process start/teardown
takes 100ms, that's going to suck no matter what.

> >>> Also, if userspace lies to us and keeps pushing crap into the ring
> >>> after it's supposed to be idle: Userspace is already allowed to waste
> >>> gpu time. If you're too worried about this set a fairly aggressive
> >>> preempt timeout on the unload fence, and kill the context if it takes
> >>> longer than what preempting an idle ring should take (because that
> >>> would indicate broken/evil userspace).
> >> I think you have the wrong expectation here. It is perfectly valid and
> >> expected for userspace to keep writing commands into the ring buffer.
> >>
> >> After all when one frame is completed they want to immediately start
> >> rendering the next one.
> > Sure, for the true userspace direct submit model. But with that you don't
> > get dma-fence, which means this gpu will not work for 3d accel on any
> > current linux desktop.
>
> I'm not sure of that. I've looked a bit into how we could add user
> fences to dma_resv objects and that isn't that hard after all.

I think as a proof of concept it's fine, but as an actual solution ...
pls no. Two reasons:
- implicit sync is bad
- this doesn't fix anything for explicit sync using dma_fence in terms
of sync_file or drm_syncobj.

So if we go with the route of papering over this in the kernel, then
it'll be a ton more work than just hacking something into dma_resv.

> > Which sucks, hence some hybrid model of using the userspace ring and
> > kernel augmented submit is needed. Which was my idea.
>
> Yeah, I think when our firmware folks would really remove the kernel
> queue and we still don't have

Yeah I think the kernel queue can be removed. But the price is that you
need reasonably fast preemption of idle contexts.

I really can't understand how this can take multiple ms, something
feels very broken in the design of the fw (since obviously the hw can
preempt an idle context to another one pretty fast, or you'd render
any multi-client desktop as a slideshow at best).

> >
> >> [SNIP]
> >> Can't find that of hand either, but see the amdgpu_noretry module option.
> >>
> >> It basically tells the hardware if retry page faults should be supported or
> >> not because this whole TLB shutdown thing when they are supported is
> >> extremely costly.
> > Hm so synchronous tlb shootdown is a lot more costly when you allow
> > retrying of page faults?
>
> Partially correct, yes.
>
> See when you have retry page faults enabled and unmap something you need
> to make sure that everybody which could have potentially translated that
> page and has a TLB is either invalidated or waited until the access is
> completed.
>
> Since every CU could be using a memory location that takes ages to
> completed compared to the normal invalidation where you just invalidate
> the L1/L2 and are done.
>
> Additional to that the recovery adds some extra overhead to every memory
> access, so even without a fault you are quite a bit slower if this is
> enabled.

Well yes it's complicated, and it's even more fun when the tlb
invalidate comes in through the IOMMU via ATS.

But also, if you don't, your hw is just broken from a security pov: no
page fault handling for you. So it's really not optional.

> > That sounds bad, because for full hmm mode you need to be able to retry
> > pagefaults. Well at least the PASID/ATS/IOMMU side will do that, and might just
> > hang your gpu for a long time while it's waiting for the va->pa lookup
> > response to return. So retrying lookups shouldn't be any different really.
> >
> > And you also need fairly fast synchronous tlb shootdown for hmm. So if
> > your hw has a problem with both together that sounds bad.
>
> Completely agree. And since it was my job to validate the implementation
> on Vega10 I was also the first one to realize that.
>
> Felix, a couple of others and me are trying to work around those
> restrictions ever since.
>
> > I was more thinking about handling it all in the kernel.
> > Yeah can do, just means that you also have to copy the ringbuffer stuff
> > over from userspace to the kernel.
>
> That is my least worry. The IBs are just addr+length., so no more than
> 16 bytes for each IB.

Ah ok, maybe I'm biased from drm/i915 where an ib launch + seqno is
rather long, because the hw folks keep piling more workarounds and
additional flushes on top. Like on some hw the recommended w/a was to
just issue 32 gpu cache flushes or something like that (otherwise the
seqno write could arrive before the gpu actually finished flushing)
:-/

Cheers, Daniel

> > It also means that there's more differences in how your userspace works
> > between full userspace mode (necessary for compute) and legacy dma-fence
> > mode (necessary for desktop 3d). Which is especially big fun for vulkan,
> > since that will have to do both.
>
> That is the bigger problem.
>
> Christian.
>
> >
> > But then amd is still hanging onto the amdgpu vs amdkfd split, so you're
> > going for max pain in this area anyway :-P
> > -Daniel
>


-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-04 11:13                                                           ` Daniel Vetter
@ 2021-05-04 12:48                                                             ` Christian König
  2021-05-04 16:44                                                               ` Daniel Vetter
  2021-05-04 17:16                                                               ` Marek Olšák
  0 siblings, 2 replies; 105+ messages in thread
From: Christian König @ 2021-05-04 12:48 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: dri-devel, ML Mesa-dev, Michel Dänzer, Jason Ekstrand,
	Marek Olšák

Am 04.05.21 um 13:13 schrieb Daniel Vetter:
> On Tue, May 4, 2021 at 12:53 PM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Am 04.05.21 um 11:47 schrieb Daniel Vetter:
>>> [SNIP]
>>>> Yeah, it just takes to long for the preemption to complete to be really
>>>> useful for the feature we are discussing here.
>>>>
>>>> As I said when the kernel requests to preempt a queue we can easily expect a
>>>> timeout of ~100ms until that comes back. For compute that is even in the
>>>> multiple seconds range.
>>> 100ms for preempting an idle request sounds like broken hw to me. Of
>>> course preemting something that actually runs takes a while, that's
>>> nothing new. But it's also not the thing we're talking about here. Is this
>>> 100ms actual numbers from hw for an actual idle ringbuffer?
>> Well 100ms is just an example of the scheduler granularity. Let me
>> explain in a wider context.
>>
>> The hardware can have X queues mapped at the same time and every Y time
>> interval the hardware scheduler checks if those queues have changed and
>> only if they have changed the necessary steps to reload them are started.
>>
>> Multiple queues can be rendering at the same time, so you can have X as
>> a high priority queue active and just waiting for a signal to start and
>> the client rendering one frame after another and a third background
>> compute task mining bitcoins for you.
>>
>> As long as everything is static this is perfectly performant. Adding a
>> queue to the list of active queues is also relatively simple, but taking
>> one down requires you to wait until we are sure the hardware has seen
>> the change and reloaded the queues.
>>
>> Think of it as an RCU grace period. This is simply not something which
>> is made to be used constantly, but rather just at process termination.
> Uh ... that indeed sounds rather broken.

Well I wouldn't call it broken. It's just not made for the use case we 
are trying to abuse it for.

> Otoh it's just a dma_fence that'd we'd inject as this unload-fence.

Yeah, exactly. That's why it isn't much of a problem for process 
termination or freeing memory.

> So by and large everyone should already be able to cope with it taking a
> bit longer. So from a design pov I don't see a huge problem, but I
> guess you guys wont be happy since it means on amd hw there will be
> random unsightly stalls in desktop linux usage.
>
>>>> The "preemption" feature is really called suspend and made just for the case
>>>> when we want to put a process to sleep or need to forcefully kill it for
>>>> misbehavior or stuff like that. It is not meant to be used in normal
>>>> operation.
>>>>
>>>> If we only attach it on ->move then yeah maybe a last resort possibility to
>>>> do it this way, but I think in that case we could rather stick with kernel
>>>> submissions.
>>> Well this is a hybrid userspace ring + kernel augmeted submit mode, so you
>>> can keep dma-fences working. Because the dma-fence stuff wont work with
>>> pure userspace submit, I think that conclusion is rather solid. Once more
>>> even after this long thread here.
>> When assisted with unload fences, then yes. Problem is that I can't see
>> how we could implement those performant currently.
> Is there really no way to fix fw here? Like if process start/teardown
> takes 100ms, that's going to suck no matter what.

As I said, adding the queue is unproblematic, and teardown just results in 
a bit more waiting to free things up.

More problematic are overcommit, swapping and OOM situations, which need to 
wait for the hw scheduler to come back and tell us that the queue is now 
unmapped.

>>>>> Also, if userspace lies to us and keeps pushing crap into the ring
>>>>> after it's supposed to be idle: Userspace is already allowed to waste
>>>>> gpu time. If you're too worried about this set a fairly aggressive
>>>>> preempt timeout on the unload fence, and kill the context if it takes
>>>>> longer than what preempting an idle ring should take (because that
>>>>> would indicate broken/evil userspace).
>>>> I think you have the wrong expectation here. It is perfectly valid and
>>>> expected for userspace to keep writing commands into the ring buffer.
>>>>
>>>> After all when one frame is completed they want to immediately start
>>>> rendering the next one.
>>> Sure, for the true userspace direct submit model. But with that you don't
>>> get dma-fence, which means this gpu will not work for 3d accel on any
>>> current linux desktop.
>> I'm not sure of that. I've looked a bit into how we could add user
>> fences to dma_resv objects and that isn't that hard after all.
> I think as a proof of concept it's fine, but as an actual solution ...
> pls no. Two reasons:
> - implicit sync is bad

Well can't disagree with that :) But I think we can't avoid supporting it.

> - this doesn't fix anything for explicit sync using dma_fence in terms
> of sync_file or drm_syncobj.

Exactly.

Whether we do implicit sync or explicit sync is orthogonal to the problem 
that the sync must somehow be made reliable.

So when we sync and hit a timeout, the waiter should just continue, but 
whoever failed to signal will be punished.

But since this isn't solved on Windows I don't see how we can solve it 
on Linux either.
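
A minimal sketch of that "waiter continues, signaler gets punished" policy, assuming a user fence is just a memory location plus an expected value; poll_for_value and ctx_mark_guilty are invented names, not an existing API.

#include <stdbool.h>
#include <stdint.h>

struct ctx;                                 /* the context expected to signal */

struct user_fence {
    uint64_t   *addr;        /* memory location the GPU writes               */
    uint64_t    wait_value;  /* value that means "signalled"                 */
    struct ctx *signaler;    /* whoever promised to write it                 */
};

bool poll_for_value(uint64_t *addr, uint64_t value, unsigned int timeout_ms);
void ctx_mark_guilty(struct ctx *c);        /* e.g. ban or reset the offender */

static void user_fence_wait(struct user_fence *f, unsigned int timeout_ms)
{
    if (poll_for_value(f->addr, f->wait_value, timeout_ms))
        return;                             /* signalled in time              */

    /* Timed out: the waiter continues anyway so it cannot be blocked
     * forever, and the context that failed to signal gets punished. */
    ctx_mark_guilty(f->signaler);
}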

> So if we go with the route of papering over this in the kernel, then
> it'll be a ton more work than just hacking something into dma_resv.

I'm just now prototyping that and at least for the driver parts it 
doesn't look that hard after all.

>>> Which sucks, hence some hybrid model of using the userspace ring and
>>> kernel augmented submit is needed. Which was my idea.
>> Yeah, I think when our firmware folks would really remove the kernel
>> queue and we still don't have
> Yeah I think kernel queue can be removed. But the price is that you
> need reasonable fast preempt of idle contexts.
>
> I really can't understand how this can take multiple ms, something
> feels very broken in the design of the fw (since obviously the hw can
> preempt an idle context to another one pretty fast, or you'd render
> any multi-client desktop as a slideshow at best).

Well, the hardware doesn't preempt an idle context. See, you can have a 
number of active ("mapped" in the fw terminology) contexts, and contexts 
are usually kept active even when they are idle.

So when a multi-client desktop switches between contexts, that is 
rather fast, but when the kernel asks for a context to be unmapped, that 
can take rather long.


>
>>>> [SNIP]
>>>> Can't find that of hand either, but see the amdgpu_noretry module option.
>>>>
>>>> It basically tells the hardware if retry page faults should be supported or
>>>> not because this whole TLB shutdown thing when they are supported is
>>>> extremely costly.
>>> Hm so synchronous tlb shootdown is a lot more costly when you allow
>>> retrying of page faults?
>> Partially correct, yes.
>>
>> See when you have retry page faults enabled and unmap something you need
>> to make sure that everybody which could have potentially translated that
>> page and has a TLB is either invalidated or waited until the access is
>> completed.
>>
>> Since every CU could be using a memory location that takes ages to
>> completed compared to the normal invalidation where you just invalidate
>> the L1/L2 and are done.
>>
>> Additional to that the recovery adds some extra overhead to every memory
>> access, so even without a fault you are quite a bit slower if this is
>> enabled.
> Well yes it's complicated, and it's even more fun when the tlb
> invalidate comes in through the IOMMU through ATS.
>
> But also if you don't your hw is just broken from a security pov, no
> page fault handling for you. So it's really not optional.

Yeah, but that is also a known issue. You either have retry faults and 
live with the extra overhead or you disable them and go with the kernel 
based submission approach.

>
>>> That sounds bad, because for full hmm mode you need to be able to retry
>>> pagefaults. Well at least the PASID/ATS/IOMMU side will do that, and might just
>>> hang your gpu for a long time while it's waiting for the va->pa lookup
>>> response to return. So retrying lookups shouldn't be any different really.
>>>
>>> And you also need fairly fast synchronous tlb shootdown for hmm. So if
>>> your hw has a problem with both together that sounds bad.
>> Completely agree. And since it was my job to validate the implementation
>> on Vega10 I was also the first one to realize that.
>>
>> Felix, a couple of others and me are trying to work around those
>> restrictions ever since.
>>
>>> I was more thinking about handling it all in the kernel.
>>> Yeah can do, just means that you also have to copy the ringbuffer stuff
>>> over from userspace to the kernel.
>> That is my least worry. The IBs are just addr+length., so no more than
>> 16 bytes for each IB.
> Ah ok, maybe I'm biased from drm/i915 where an ib launch + seqno is
> rather long, because the hw folks keep piling more workarounds and
> additional flushes on top. Like on some hw the recommended w/a was to
> just issue 32 gpu cache flushes or something like that (otherwise the
> seqno write could arrive before the gpu actually finished flushing)
> :-/

Well, I once had a conversation with a hw engineer who wanted to split 
up the TLB invalidations into 1 GiB chunks :)

That would have meant we would need to emit 2^17 different invalidation 
requests on the kernel ring buffer....

Christian.


>
> Cheers, Daniel
>
>>> It also means that there's more differences in how your userspace works
>>> between full userspace mode (necessary for compute) and legacy dma-fence
>>> mode (necessary for desktop 3d). Which is especially big fun for vulkan,
>>> since that will have to do both.
>> That is the bigger problem.
>>
>> Christian.
>>
>>> But then amd is still hanging onto the amdgpu vs amdkfd split, so you're
>>> going for max pain in this area anyway :-P
>>> -Daniel
>


^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-04 12:48                                                             ` Christian König
@ 2021-05-04 16:44                                                               ` Daniel Vetter
  2021-05-04 17:16                                                               ` Marek Olšák
  1 sibling, 0 replies; 105+ messages in thread
From: Daniel Vetter @ 2021-05-04 16:44 UTC (permalink / raw)
  To: Christian König
  Cc: Marek Olšák, Michel Dänzer, dri-devel,
	Jason Ekstrand, ML Mesa-dev

On Tue, May 04, 2021 at 02:48:35PM +0200, Christian König wrote:
> Am 04.05.21 um 13:13 schrieb Daniel Vetter:
> > On Tue, May 4, 2021 at 12:53 PM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> > > Am 04.05.21 um 11:47 schrieb Daniel Vetter:
> > > > [SNIP]
> > > > > Yeah, it just takes to long for the preemption to complete to be really
> > > > > useful for the feature we are discussing here.
> > > > > 
> > > > > As I said when the kernel requests to preempt a queue we can easily expect a
> > > > > timeout of ~100ms until that comes back. For compute that is even in the
> > > > > multiple seconds range.
> > > > 100ms for preempting an idle request sounds like broken hw to me. Of
> > > > course preemting something that actually runs takes a while, that's
> > > > nothing new. But it's also not the thing we're talking about here. Is this
> > > > 100ms actual numbers from hw for an actual idle ringbuffer?
> > > Well 100ms is just an example of the scheduler granularity. Let me
> > > explain in a wider context.
> > > 
> > > The hardware can have X queues mapped at the same time and every Y time
> > > interval the hardware scheduler checks if those queues have changed and
> > > only if they have changed the necessary steps to reload them are started.
> > > 
> > > Multiple queues can be rendering at the same time, so you can have X as
> > > a high priority queue active and just waiting for a signal to start and
> > > the client rendering one frame after another and a third background
> > > compute task mining bitcoins for you.
> > > 
> > > As long as everything is static this is perfectly performant. Adding a
> > > queue to the list of active queues is also relatively simple, but taking
> > > one down requires you to wait until we are sure the hardware has seen
> > > the change and reloaded the queues.
> > > 
> > > Think of it as an RCU grace period. This is simply not something which
> > > is made to be used constantly, but rather just at process termination.
> > Uh ... that indeed sounds rather broken.
> 
> Well I wouldn't call it broken. It's just not made for the use case we are
> trying to abuse it for.
> 
> > Otoh it's just a dma_fence that'd we'd inject as this unload-fence.
> 
> Yeah, exactly that's why it isn't much of a problem for process termination
> or freeing memory.

Ok so your hw really hates the unload fence. On ours the various queues
are a bit more explicit, so largely unload/preempt is the same as context
switch and pretty quick. Afaik at least.

Still baffled that you can't fix this in fw, but oh well. Judging from how
fast our fw team moves I'm not surprised :-/

Anyway so next plan: Make this work exactly like hmm:
1. wait for the user fence as a dma-fence fake thing, tdr makes this safe
2. remove pte
3. do synchronous tlb flush

Tada, no more 100ms stall in your buffer move callbacks. And feel free to
pack up 2&3 into an async worker or something if it takes too long and
treating it as a bo move dma_fence is better. Also that way you might be
able to batch up the tlb flushing if it's too damn expensive, by
collecting them all under a single dma_fence (and starting a new tlb flush
cycle every time ->enable_signalling gets called).
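
A hedged sketch of that move path, with placeholder names rather than real amdgpu/ttm API; it only walks through steps 1-3 and notes where the async/batching variant would hook in.

struct bo;
struct dma_fence;                              /* stand-in for the kernel type */

struct dma_fence *bo_user_fence_proxy(struct bo *bo); /* step 1: tdr-backed    */
void wait_fence(struct dma_fence *f);
void remove_ptes(struct bo *bo);               /* step 2                        */
void tlb_flush_sync(struct bo *bo);            /* step 3                        */

static void bo_move_prepare(struct bo *bo)
{
    /* 1. Waiting on the userspace-ring fence is unreliable by itself, but
     *    TDR force-completes it in finite time, so it is safe here. */
    wait_fence(bo_user_fence_proxy(bo));

    /* 2. + 3. Could also run asynchronously behind a bo-move dma_fence, and
     *    several flushes could be batched under one fence whose
     *    enable_signaling callback starts the next flush cycle. */
    remove_ptes(bo);
    tlb_flush_sync(bo);
}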

As long as you nack any gpu faults and don't try to fill them for these
legacy contexts that support dma-fence there's no harm in using the hw
facilities.

Ofc if you're now telling me your synchronous tlb flush is also 100ms,
then maybe just throw the hw out the window, and accept that the
millisecond anything evicts anything (good luck with userptr) the screen
freezes for a bit.

> > So by and large everyone should already be able to cope with it taking a
> > bit longer. So from a design pov I don't see a huge problem, but I
> > guess you guys wont be happy since it means on amd hw there will be
> > random unsightly stalls in desktop linux usage.
> > 
> > > > > The "preemption" feature is really called suspend and made just for the case
> > > > > when we want to put a process to sleep or need to forcefully kill it for
> > > > > misbehavior or stuff like that. It is not meant to be used in normal
> > > > > operation.
> > > > > 
> > > > > If we only attach it on ->move then yeah maybe a last resort possibility to
> > > > > do it this way, but I think in that case we could rather stick with kernel
> > > > > submissions.
> > > > Well this is a hybrid userspace ring + kernel augmeted submit mode, so you
> > > > can keep dma-fences working. Because the dma-fence stuff wont work with
> > > > pure userspace submit, I think that conclusion is rather solid. Once more
> > > > even after this long thread here.
> > > When assisted with unload fences, then yes. Problem is that I can't see
> > > how we could implement those performant currently.
> > Is there really no way to fix fw here? Like if process start/teardown
> > takes 100ms, that's going to suck no matter what.
> 
> As I said adding the queue is unproblematic and teardown just results in a
> bit more waiting to free things up.
> 
> Problematic is more overcommit swapping and OOM situations which need to
> wait for the hw scheduler to come back and tell us that the queue is now
> unmapped.
> 
> > > > > > Also, if userspace lies to us and keeps pushing crap into the ring
> > > > > > after it's supposed to be idle: Userspace is already allowed to waste
> > > > > > gpu time. If you're too worried about this set a fairly aggressive
> > > > > > preempt timeout on the unload fence, and kill the context if it takes
> > > > > > longer than what preempting an idle ring should take (because that
> > > > > > would indicate broken/evil userspace).
> > > > > I think you have the wrong expectation here. It is perfectly valid and
> > > > > expected for userspace to keep writing commands into the ring buffer.
> > > > > 
> > > > > After all when one frame is completed they want to immediately start
> > > > > rendering the next one.
> > > > Sure, for the true userspace direct submit model. But with that you don't
> > > > get dma-fence, which means this gpu will not work for 3d accel on any
> > > > current linux desktop.
> > > I'm not sure of that. I've looked a bit into how we could add user
> > > fences to dma_resv objects and that isn't that hard after all.
> > I think as a proof of concept it's fine, but as an actual solution ...
> > pls no. Two reasons:
> > - implicit sync is bad
> 
> Well can't disagree with that :) But I think we can't avoid supporting it.
> 
> > - this doesn't fix anything for explicit sync using dma_fence in terms
> > of sync_file or drm_syncobj.
> 
> Exactly.
> 
> If we do implicit sync or explicit sync is orthogonal to the problems that
> sync must be made reliable somehow.
> 
> So when we sync and timeout the waiter should just continue, but whoever
> failed to signal will be punished.
> 
> But since this isn't solved on Windows I don't see how we can solve it on
> Linux either.
> 
> > So if we go with the route of papering over this in the kernel, then
> > it'll be a ton more work than just hacking something into dma_resv.
> 
> I'm just now prototyping that and at least for the driver parts it doesn't
> look that hard after all.
> 
> > > > Which sucks, hence some hybrid model of using the userspace ring and
> > > > kernel augmented submit is needed. Which was my idea.
> > > Yeah, I think when our firmware folks would really remove the kernel
> > > queue and we still don't have
> > Yeah I think kernel queue can be removed. But the price is that you
> > need reasonable fast preempt of idle contexts.
> > 
> > I really can't understand how this can take multiple ms, something
> > feels very broken in the design of the fw (since obviously the hw can
> > preempt an idle context to another one pretty fast, or you'd render
> > any multi-client desktop as a slideshow at best).
> 
> Well the hardware doesn't preempt and idle context. See you can have a
> number of active ("mapped" in the fw terminology) contexts and idle contexts
> are usually kept active even when they are idle.
> 
> So when multi-client desktop switches between context then that is rather
> fast, but when the kernel asks for a context to be unmapped that can take
> rather long.
> 
> 
> > 
> > > > > [SNIP]
> > > > > Can't find that of hand either, but see the amdgpu_noretry module option.
> > > > > 
> > > > > It basically tells the hardware if retry page faults should be supported or
> > > > > not because this whole TLB shutdown thing when they are supported is
> > > > > extremely costly.
> > > > Hm so synchronous tlb shootdown is a lot more costly when you allow
> > > > retrying of page faults?
> > > Partially correct, yes.
> > > 
> > > See when you have retry page faults enabled and unmap something you need
> > > to make sure that everybody which could have potentially translated that
> > > page and has a TLB is either invalidated or waited until the access is
> > > completed.
> > > 
> > > Since every CU could be using a memory location that takes ages to
> > > completed compared to the normal invalidation where you just invalidate
> > > the L1/L2 and are done.
> > > 
> > > Additional to that the recovery adds some extra overhead to every memory
> > > access, so even without a fault you are quite a bit slower if this is
> > > enabled.
> > Well yes it's complicated, and it's even more fun when the tlb
> > invalidate comes in through the IOMMU through ATS.
> > 
> > But also if you don't your hw is just broken from a security pov, no
> > page fault handling for you. So it's really not optional.
> 
> Yeah, but that is also a known issue. You either have retry faults and live
> with the extra overhead or you disable them and go with the kernel based
> submission approach.

Well, kernel-based submit is out with your new hw, it sounds like, so retry
faults and sync tlb invalidate are the price you have to pay. There's no
"both ways pls" here :-)

> > > > That sounds bad, because for full hmm mode you need to be able to retry
> > > > pagefaults. Well at least the PASID/ATS/IOMMU side will do that, and might just
> > > > hang your gpu for a long time while it's waiting for the va->pa lookup
> > > > response to return. So retrying lookups shouldn't be any different really.
> > > > 
> > > > And you also need fairly fast synchronous tlb shootdown for hmm. So if
> > > > your hw has a problem with both together that sounds bad.
> > > Completely agree. And since it was my job to validate the implementation
> > > on Vega10 I was also the first one to realize that.
> > > 
> > > Felix, a couple of others and me are trying to work around those
> > > restrictions ever since.
> > > 
> > > > I was more thinking about handling it all in the kernel.
> > > > Yeah can do, just means that you also have to copy the ringbuffer stuff
> > > > over from userspace to the kernel.
> > > That is my least worry. The IBs are just addr+length., so no more than
> > > 16 bytes for each IB.
> > Ah ok, maybe I'm biased from drm/i915 where an ib launch + seqno is
> > rather long, because the hw folks keep piling more workarounds and
> > additional flushes on top. Like on some hw the recommended w/a was to
> > just issue 32 gpu cache flushes or something like that (otherwise the
> > seqno write could arrive before the gpu actually finished flushing)
> > :-/
> 
> Well I once had a conversation with a hw engineer which wanted to split up
> the TLB in validations into 1Gib chunks :)
> 
> That would have mean we would need to emit 2^17 different invalidation
> requests on the kernel ring buffer....

Well on the cpu side you invalidate tlbs as ranges, but there's a fallback
to just flush the entire thing if the range flush is too much. So it's not
entirely bonkers, just that the global flush needs to be there still.
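
A toy illustration of that CPU-side pattern; the helpers are stand-ins (not the kernel's actual flush_tlb_* calls) and the threshold only loosely mirrors x86's default per-page flush ceiling.

#include <stdint.h>

#define FULL_FLUSH_CEILING 33UL   /* roughly x86's tlb_single_page_flush_ceiling */

void invalidate_tlb_page(uintptr_t va);   /* per-page invalidation, stand-in */
void invalidate_tlb_all(void);            /* global flush, stand-in          */

static void invalidate_tlb_range(uintptr_t start, uintptr_t end,
                                 unsigned long page_size)
{
    unsigned long pages = (end - start) / page_size;

    if (pages > FULL_FLUSH_CEILING) {
        invalidate_tlb_all();             /* range too big: fall back to the
                                           * global flush                    */
        return;
    }
    for (uintptr_t va = start; va < end; va += page_size)
        invalidate_tlb_page(va);
}
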
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-04 12:48                                                             ` Christian König
  2021-05-04 16:44                                                               ` Daniel Vetter
@ 2021-05-04 17:16                                                               ` Marek Olšák
  2021-05-04 21:06                                                                 ` Jason Ekstrand
  1 sibling, 1 reply; 105+ messages in thread
From: Marek Olšák @ 2021-05-04 17:16 UTC (permalink / raw)
  To: Christian König
  Cc: dri-devel, ML Mesa-dev, Michel Dänzer, Jason Ekstrand


[-- Attachment #1.1: Type: text/plain, Size: 11131 bytes --]

I see some mentions of XNACK and recoverable page faults. Note that all
gaming AMD hw that has userspace queues doesn't have XNACK, so there is no
overhead in compute units. My understanding is that recoverable page faults
are still supported without XNACK, but instead of the compute unit
replaying the faulting instruction, the L1 cache does that. Anyway, the
point is that XNACK is totally irrelevant here.

Marek

On Tue., May 4, 2021, 08:48 Christian König, <
ckoenig.leichtzumerken@gmail.com> wrote:

> Am 04.05.21 um 13:13 schrieb Daniel Vetter:
> > On Tue, May 4, 2021 at 12:53 PM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >> Am 04.05.21 um 11:47 schrieb Daniel Vetter:
> >>> [SNIP]
> >>>> Yeah, it just takes to long for the preemption to complete to be
> really
> >>>> useful for the feature we are discussing here.
> >>>>
> >>>> As I said when the kernel requests to preempt a queue we can easily
> expect a
> >>>> timeout of ~100ms until that comes back. For compute that is even in
> the
> >>>> multiple seconds range.
> >>> 100ms for preempting an idle request sounds like broken hw to me. Of
> >>> course preemting something that actually runs takes a while, that's
> >>> nothing new. But it's also not the thing we're talking about here. Is
> this
> >>> 100ms actual numbers from hw for an actual idle ringbuffer?
> >> Well 100ms is just an example of the scheduler granularity. Let me
> >> explain in a wider context.
> >>
> >> The hardware can have X queues mapped at the same time and every Y time
> >> interval the hardware scheduler checks if those queues have changed and
> >> only if they have changed the necessary steps to reload them are started.
> >>
> >> Multiple queues can be rendering at the same time, so you can have X as
> >> a high priority queue active and just waiting for a signal to start and
> >> the client rendering one frame after another and a third background
> >> compute task mining bitcoins for you.
> >>
> >> As long as everything is static this is perfectly performant. Adding a
> >> queue to the list of active queues is also relatively simple, but taking
> >> one down requires you to wait until we are sure the hardware has seen
> >> the change and reloaded the queues.
> >>
> >> Think of it as an RCU grace period. This is simply not something which
> >> is made to be used constantly, but rather just at process termination.
> > Uh ... that indeed sounds rather broken.
>
> Well I wouldn't call it broken. It's just not made for the use case we
> are trying to abuse it for.
>
> > Otoh it's just a dma_fence that we'd inject as this unload-fence.
>
> Yeah, exactly that's why it isn't much of a problem for process
> termination or freeing memory.
>
> > So by and large everyone should already be able to cope with it taking a
> > bit longer. So from a design pov I don't see a huge problem, but I
> > guess you guys won't be happy since it means on amd hw there will be
> > random unsightly stalls in desktop linux usage.
> >
> >>>> The "preemption" feature is really called suspend and made just for
> the case
> >>>> when we want to put a process to sleep or need to forcefully kill it
> for
> >>>> misbehavior or stuff like that. It is not meant to be used in normal
> >>>> operation.
> >>>>
> >>>> If we only attach it on ->move then yeah maybe a last resort
> >>>> possibility to do it this way, but I think in that case we could
> >>>> rather stick with kernel submissions.
> >>> Well this is a hybrid userspace ring + kernel augmented submit mode, so
> >>> you can keep dma-fences working. Because the dma-fence stuff won't work
> >>> with pure userspace submit, I think that conclusion is rather solid.
> >>> Once more even after this long thread here.
> >> When assisted with unload fences, then yes. The problem is that I can't
> >> see how we could implement those in a performant way currently.
> > Is there really no way to fix fw here? Like if process start/teardown
> > takes 100ms, that's going to suck no matter what.
>
> As I said adding the queue is unproblematic and teardown just results in
> a bit more waiting to free things up.
>
> More problematic are overcommit, swapping and OOM situations, which need to
> wait for the hw scheduler to come back and tell us that the queue is now
> unmapped.
>
> >>>>> Also, if userspace lies to us and keeps pushing crap into the ring
> >>>>> after it's supposed to be idle: Userspace is already allowed to waste
> >>>>> gpu time. If you're too worried about this set a fairly aggressive
> >>>>> preempt timeout on the unload fence, and kill the context if it takes
> >>>>> longer than what preempting an idle ring should take (because that
> >>>>> would indicate broken/evil userspace).
> >>>> I think you have the wrong expectation here. It is perfectly valid and
> >>>> expected for userspace to keep writing commands into the ring buffer.
> >>>>
> >>>> After all when one frame is completed they want to immediately start
> >>>> rendering the next one.
> >>> Sure, for the true userspace direct submit model. But with that you
> >>> don't get dma-fence, which means this gpu will not work for 3d accel
> >>> on any current linux desktop.
> >> I'm not sure of that. I've looked a bit into how we could add user
> >> fences to dma_resv objects and that isn't that hard after all.
> > I think as a proof of concept it's fine, but as an actual solution ...
> > pls no. Two reasons:
> > - implicit sync is bad
>
> Well can't disagree with that :) But I think we can't avoid supporting it.
>
> > - this doesn't fix anything for explicit sync using dma_fence in terms
> > of sync_file or drm_syncobj.
>
> Exactly.
>
> Whether we do implicit sync or explicit sync is orthogonal to the problem
> that sync must be made reliable somehow.
>
> So when we sync and timeout the waiter should just continue, but whoever
> failed to signal will be punished.
>
> But since this isn't solved on Windows I don't see how we can solve it
> on Linux either.
>
> > So if we go with the route of papering over this in the kernel, then
> > it'll be a ton more work than just hacking something into dma_resv.
>
> I'm just now prototyping that and at least for the driver parts it
> doesn't look that hard after all.
>
> >>> Which sucks, hence some hybrid model of using the userspace ring and
> >>> kernel augmented submit is needed. Which was my idea.
> >> Yeah, I think when our firmware folks would really remove the kernel
> >> queue and we still don't have
> > Yeah I think kernel queue can be removed. But the price is that you
> > need reasonably fast preempt of idle contexts.
> >
> > I really can't understand how this can take multiple ms, something
> > feels very broken in the design of the fw (since obviously the hw can
> > preempt an idle context to another one pretty fast, or you'd render
> > any multi-client desktop as a slideshow at best).
>
> Well the hardware doesn't preempt an idle context. See you can have a
> number of active ("mapped" in the fw terminology) contexts and idle
> contexts are usually kept active even when they are idle.
>
> So when a multi-client desktop switches between contexts then that is
> rather fast, but when the kernel asks for a context to be unmapped that
> can take rather long.
>
>
> >
> >>>> [SNIP]
> >>>> Can't find that off hand either, but see the amdgpu_noretry module
> >>>> option.
> >>>>
> >>>> It basically tells the hardware if retry page faults should be
> >>>> supported or not, because this whole TLB shootdown thing when they are
> >>>> supported is extremely costly.
> >>> Hm so synchronous tlb shootdown is a lot more costly when you allow
> >>> retrying of page faults?
> >> Partially correct, yes.
> >>
> >> See, when you have retry page faults enabled and unmap something, you
> >> need to make sure that everybody who could have potentially translated
> >> that page and has a TLB entry either gets invalidated or is waited on
> >> until the access has completed.
> >>
> >> And every CU could be using a memory location, so that takes ages to
> >> complete compared to the normal invalidation where you just invalidate
> >> the L1/L2 and are done.
> >>
> >> In addition to that, the recovery adds some extra overhead to every memory
> >> access, so even without a fault you are quite a bit slower if this is
> >> enabled.
> > Well yes it's complicated, and it's even more fun when the tlb
> > invalidate comes in through the IOMMU through ATS.
> >
> > But also if you don't, your hw is just broken from a security pov, no
> > page fault handling for you. So it's really not optional.
>
> Yeah, but that is also a known issue. You either have retry faults and
> live with the extra overhead or you disable them and go with the kernel
> based submission approach.
>
> >
> >>> That sounds bad, because for full hmm mode you need to be able to retry
> >>> pagefaults. Well at least the PASID/ATS/IOMMU side will do that, and
> >>> might just hang your gpu for a long time while it's waiting for the
> >>> va->pa lookup response to return. So retrying lookups shouldn't be any
> >>> different really.
> >>>
> >>> And you also need fairly fast synchronous tlb shootdown for hmm. So if
> >>> your hw has a problem with both together that sounds bad.
> >> Completely agree. And since it was my job to validate the implementation
> >> on Vega10 I was also the first one to realize that.
> >>
> >> Felix, a couple of others and I have been trying to work around those
> >> restrictions ever since.
> >>
> >>> I was more thinking about handling it all in the kernel.
> >>> Yeah can do, just means that you also have to copy the ringbuffer stuff
> >>> over from userspace to the kernel.
> >> That is my least worry. The IBs are just addr+length, so no more than
> >> 16 bytes for each IB.
> > Ah ok, maybe I'm biased from drm/i915 where an ib launch + seqno is
> > rather long, because the hw folks keep piling more workarounds and
> > additional flushes on top. Like on some hw the recommended w/a was to
> > just issue 32 gpu cache flushes or something like that (otherwise the
> > seqno write could arrive before the gpu actually finished flushing)
> > :-/
>
> Well I once had a conversation with a hw engineer who wanted to split
> up the TLB invalidations into 1 GiB chunks :)
>
> That would have meant we would need to emit 2^17 different invalidation
> requests on the kernel ring buffer....
>
> Christian.
>
>
> >
> > Cheers, Daniel
> >
> >>> It also means that there are more differences in how your userspace
> >>> works between full userspace mode (necessary for compute) and legacy
> >>> dma-fence mode (necessary for desktop 3d). Which is especially big fun
> >>> for vulkan, since that will have to do both.
> >> That is the bigger problem.
> >>
> >> Christian.
> >>
> >>> But then amd is still hanging onto the amdgpu vs amdkfd split, so
> >>> you're going for max pain in this area anyway :-P
> >>> -Daniel
> >
>
>

[-- Attachment #1.2: Type: text/html, Size: 13279 bytes --]

[-- Attachment #2: Type: text/plain, Size: 160 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: [Mesa-dev] [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal
  2021-05-04 17:16                                                               ` Marek Olšák
@ 2021-05-04 21:06                                                                 ` Jason Ekstrand
  0 siblings, 0 replies; 105+ messages in thread
From: Jason Ekstrand @ 2021-05-04 21:06 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Christian König, Michel Dänzer, dri-devel, ML Mesa-dev

On Tue, May 4, 2021 at 12:16 PM Marek Olšák <maraeo@gmail.com> wrote:
>
> I see some mentions of XNACK and recoverable page faults. Note that all gaming AMD hw that has userspace queues doesn't have XNACK, so there is no overhead in compute units. My understanding is that recoverable page faults are still supported without XNACK, but instead of the compute unit replaying the faulting instruction, the L1 cache does that. Anyway, the point is that XNACK is totally irrelevant here.
>
> Marek
>
> On Tue., May 4, 2021, 08:48 Christian König, <ckoenig.leichtzumerken@gmail.com> wrote:
>>
>> Am 04.05.21 um 13:13 schrieb Daniel Vetter:
>> > On Tue, May 4, 2021 at 12:53 PM Christian König
>> > <ckoenig.leichtzumerken@gmail.com> wrote:
>> >> Am 04.05.21 um 11:47 schrieb Daniel Vetter:
>> >>> [SNIP]
>> >>>> Yeah, it just takes too long for the preemption to complete to be really
>> >>>> useful for the feature we are discussing here.
>> >>>>
>> >>>> As I said when the kernel requests to preempt a queue we can easily expect a
>> >>>> timeout of ~100ms until that comes back. For compute that is even in the
>> >>>> multiple seconds range.
>> >>> 100ms for preempting an idle request sounds like broken hw to me. Of
>> >>> course preempting something that actually runs takes a while, that's
>> >>> nothing new. But it's also not the thing we're talking about here. Is this
>> >>> 100ms an actual number from hw for an actual idle ringbuffer?
>> >> Well 100ms is just an example of the scheduler granularity. Let me
>> >> explain in a wider context.
>> >>
>> >> The hardware can have X queues mapped at the same time and every Y time
>> >> interval the hardware scheduler checks if those queues have changed and
>> >> only if they have changed the necessary steps to reload them are started.
>> >>
>> >> Multiple queues can be rendering at the same time, so you can have X as
>> >> a high priority queue active and just waiting for a signal to start and
>> >> the client rendering one frame after another and a third background
>> >> compute task mining bitcoins for you.
>> >>
>> >> As long as everything is static this is perfectly performant. Adding a
>> >> queue to the list of active queues is also relatively simple, but taking
>> >> one down requires you to wait until we are sure the hardware has seen
>> >> the change and reloaded the queues.
>> >>
>> >> Think of it as an RCU grace period. This is simply not something which
>> >> is made to be used constantly, but rather just at process termination.
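
If I'm following the description, the expensive part is roughly this shape
(a hand-wavy sketch with invented names, not real firmware or driver code):

    #include <stdbool.h>
    #include <stdint.h>

    /* All names invented; this only mirrors the described flow. */
    struct hw_queue { uint64_t mqd_gpu_va; bool mapped; };

    static void publish_runlist(struct hw_queue *q, unsigned n) { (void)q; (void)n; }
    static bool fw_has_acked_runlist(void) { return true; /* fence/IRQ in reality */ }

    /* Removing a queue: update the runlist, then wait until the firmware
     * scheduler has actually observed the change and saved/unmapped the
     * queue -- conceptually an RCU-style grace period. This wait is the
     * part that can take ~100ms (or seconds for compute). */
    static void unmap_queue(struct hw_queue *list, unsigned count, unsigned victim)
    {
        list[victim].mapped = false;
        publish_runlist(list, count);
        while (!fw_has_acked_runlist())
            ;  /* sleep/poll until the grace period has elapsed */
    }
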
>> > Uh ... that indeed sounds rather broken.
>>
>> Well I wouldn't call it broken. It's just not made for the use case we
>> are trying to abuse it for.
>>
>> > Otoh it's just a dma_fence that we'd inject as this unload-fence.
>>
>> Yeah, exactly that's why it isn't much of a problem for process
>> termination or freeing memory.
>>
>> > So by and large everyone should already be able to cope with it taking a
>> > bit longer. So from a design pov I don't see a huge problem, but I
>> > guess you guys won't be happy since it means on amd hw there will be
>> > random unsightly stalls in desktop linux usage.
>> >
>> >>>> The "preemption" feature is really called suspend and made just for the case
>> >>>> when we want to put a process to sleep or need to forcefully kill it for
>> >>>> misbehavior or stuff like that. It is not meant to be used in normal
>> >>>> operation.
>> >>>>
>> >>>> If we only attach it on ->move then yeah maybe a last resort possibility to
>> >>>> do it this way, but I think in that case we could rather stick with kernel
>> >>>> submissions.
>> >>> Well this is a hybrid userspace ring + kernel augmented submit mode, so you
>> >>> can keep dma-fences working. Because the dma-fence stuff won't work with
>> >>> pure userspace submit, I think that conclusion is rather solid. Once more
>> >>> even after this long thread here.
>> >> When assisted with unload fences, then yes. The problem is that I can't
>> >> see how we could implement those in a performant way currently.
>> > Is there really no way to fix fw here? Like if process start/teardown
>> > takes 100ms, that's going to suck no matter what.
>>
>> As I said adding the queue is unproblematic and teardown just results in
>> a bit more waiting to free things up.
>>
>> More problematic are overcommit, swapping and OOM situations, which need to
>> wait for the hw scheduler to come back and tell us that the queue is now
>> unmapped.
>>
>> >>>>> Also, if userspace lies to us and keeps pushing crap into the ring
>> >>>>> after it's supposed to be idle: Userspace is already allowed to waste
>> >>>>> gpu time. If you're too worried about this set a fairly aggressive
>> >>>>> preempt timeout on the unload fence, and kill the context if it takes
>> >>>>> longer than what preempting an idle ring should take (because that
>> >>>>> would indicate broken/evil userspace).
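
(In code, that suggestion reads to me as something like the following -- a
sketch only, all helpers and numbers invented, not a real driver patch:)

    #include <stdbool.h>
    #include <stdint.h>

    /* Invented stand-ins for the real driver objects and helpers. */
    struct gpu_ctx { bool banned; };
    static void     request_preempt(struct gpu_ctx *c)       { (void)c; }
    static bool     unload_fence_signaled(struct gpu_ctx *c) { (void)c; return true; }
    static uint64_t now_ms(void)                             { return 0; }
    static void     ban_context(struct gpu_ctx *c)           { c->banned = true; }

    /* An idle ring should preempt almost instantly, so the timeout can be
     * aggressive; blowing through it is treated as broken/evil userspace. */
    #define IDLE_PREEMPT_TIMEOUT_MS 10

    static void preempt_or_ban(struct gpu_ctx *c)
    {
        uint64_t deadline = now_ms() + IDLE_PREEMPT_TIMEOUT_MS;

        request_preempt(c);
        while (!unload_fence_signaled(c)) {
            if (now_ms() > deadline) {
                ban_context(c);  /* don't keep waiting on a liar */
                return;
            }
        }
    }
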
>> >>>> I think you have the wrong expectation here. It is perfectly valid and
>> >>>> expected for userspace to keep writing commands into the ring buffer.
>> >>>>
>> >>>> After all when one frame is completed they want to immediately start
>> >>>> rendering the next one.
>> >>> Sure, for the true userspace direct submit model. But with that you don't
>> >>> get dma-fence, which means this gpu will not work for 3d accel on any
>> >>> current linux desktop.
>> >> I'm not sure of that. I've looked a bit into how we could add user
>> >> fences to dma_resv objects and that isn't that hard after all.
>> > I think as a proof of concept it's fine, but as an actual solution ...
>> > pls no. Two reasons:

I'm looking forward to seeing the prototype because...

>> > - implicit sync is bad
>>
>> Well can't disagree with that :) But I think we can't avoid supporting it.
>>
>> > - this doesn't fix anything for explicit sync using dma_fence in terms
>> > of sync_file or drm_syncobj.
>>
>> Exactly.
>>
>> Whether we do implicit sync or explicit sync is orthogonal to the problem
>> that sync must be made reliable somehow.

Regardless of implicit vs. explicit sync, the fundamental problem we
have to solve is the same.  I'm moderately hopeful that if Christian
has an idea for how to do it with dma_resv, maybe we can translate
that in a semi-generic way to syncobj.  Yes, I realize I just waved my
hands and made all the big problems go away.  Except I really didn't.
I made them all Christian's problems. :-P

--Jason


>> So when we sync and timeout the waiter should just continue, but whoever
>> failed to signal will be punished.
>>
>> But since this isn't solved on Windows I don't see how we can solve it
>> on Linux either.
>>
>> > So if we go with the route of papering over this in the kernel, then
>> > it'll be a ton more work than just hacking something into dma_resv.
>>
>> I'm just now prototyping that and at least for the driver parts it
>> doesn't look that hard after all.
>>
>> >>> Which sucks, hence some hybrid model of using the userspace ring and
>> >>> kernel augmented submit is needed. Which was my idea.
>> >> Yeah, I think when our firmware folks would really remove the kernel
>> >> queue and we still don't have
>> > Yeah I think kernel queue can be removed. But the price is that you
>> > need reasonably fast preempt of idle contexts.
>> >
>> > I really can't understand how this can take multiple ms, something
>> > feels very broken in the design of the fw (since obviously the hw can
>> > preempt an idle context to another one pretty fast, or you'd render
>> > any multi-client desktop as a slideshow at best).
>>
>> Well the hardware doesn't preempt an idle context. See you can have a
>> number of active ("mapped" in the fw terminology) contexts and idle
>> contexts are usually kept active even when they are idle.
>>
>> So when a multi-client desktop switches between contexts then that is
>> rather fast, but when the kernel asks for a context to be unmapped that
>> can take rather long.
>>
>>
>> >
>> >>>> [SNIP]
>> >>>> Can't find that off hand either, but see the amdgpu_noretry module option.
>> >>>>
>> >>>> It basically tells the hardware if retry page faults should be supported or
>> >>>> not, because this whole TLB shootdown thing when they are supported is
>> >>>> extremely costly.
>> >>> Hm so synchronous tlb shootdown is a lot more costly when you allow
>> >>> retrying of page faults?
>> >> Partially correct, yes.
>> >>
>> >> See, when you have retry page faults enabled and unmap something, you
>> >> need to make sure that everybody who could have potentially translated
>> >> that page and has a TLB entry either gets invalidated or is waited on
>> >> until the access has completed.
>> >>
>> >> And every CU could be using a memory location, so that takes ages to
>> >> complete compared to the normal invalidation where you just invalidate
>> >> the L1/L2 and are done.
>> >>
>> >> In addition to that, the recovery adds some extra overhead to every memory
>> >> access, so even without a fault you are quite a bit slower if this is
>> >> enabled.
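
(My attempt at restating the above as code -- purely illustrative, the
helpers are stand-ins and not the real amdgpu/KFD interfaces:)

    #include <stdbool.h>
    #include <stdint.h>

    /* Stand-in helpers, not the real hw/driver interfaces. */
    static void clear_ptes(uint64_t va, uint64_t size)  { (void)va; (void)size; }
    static void invalidate_l1_l2(void)                  { }
    static bool in_flight_accesses_drained(uint64_t va) { (void)va; return true; }

    /* Without retry faults: clear the PTEs, invalidate L1/L2, done.
     * With retry faults enabled, the unmap must additionally wait until
     * every CU that might still hold a translation has finished (or
     * re-faulted), because an in-flight access can complete long after
     * the invalidate -- that extra wait is the expensive part. */
    static void unmap_range(uint64_t va, uint64_t size, bool retry_faults_enabled)
    {
        clear_ptes(va, size);
        invalidate_l1_l2();
        if (retry_faults_enabled)
            while (!in_flight_accesses_drained(va))
                ;  /* fence/interrupt wait in a real driver */
    }
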
>> > Well yes it's complicated, and it's even more fun when the tlb
>> > invalidate comes in through the IOMMU through ATS.
>> >
>> > But also if you don't, your hw is just broken from a security pov, no
>> > page fault handling for you. So it's really not optional.
>>
>> Yeah, but that is also a known issue. You either have retry faults and
>> live with the extra overhead or you disable them and go with the kernel
>> based submission approach.
>>
>> >
>> >>> That sounds bad, because for full hmm mode you need to be able to retry
>> >>> pagefaults. Well at least the PASID/ATS/IOMMU side will do that, and might just
>> >>> hang your gpu for a long time while it's waiting for the va->pa lookup
>> >>> response to return. So retrying lookups shouldn't be any different really.
>> >>>
>> >>> And you also need fairly fast synchronous tlb shootdown for hmm. So if
>> >>> your hw has a problem with both together that sounds bad.
>> >> Completely agree. And since it was my job to validate the implementation
>> >> on Vega10 I was also the first one to realize that.
>> >>
>> >> Felix, a couple of others and I have been trying to work around those
>> >> restrictions ever since.
>> >>
>> >>> I was more thinking about handling it all in the kernel.
>> >>> Yeah can do, just means that you also have to copy the ringbuffer stuff
>> >>> over from userspace to the kernel.
>> >> That is my least worry. The IBs are just addr+length, so no more than
>> >> 16 bytes for each IB.
>> > Ah ok, maybe I'm biased from drm/i915 where an ib launch + seqno is
>> > rather long, because the hw folks keep piling more workarounds and
>> > additional flushes on top. Like on some hw the recommended w/a was to
>> > just issue 32 gpu cache flushes or something like that (otherwise the
>> > seqno write could arrive before the gpu actually finished flushing)
>> > :-/
>>
>> Well I once had a conversation with a hw engineer who wanted to split
>> up the TLB invalidations into 1 GiB chunks :)
>>
>> That would have meant we would need to emit 2^17 different invalidation
>> requests on the kernel ring buffer....
>>
>> Christian.
>>
>>
>> >
>> > Cheers, Daniel
>> >
>> >>> It also means that there are more differences in how your userspace works
>> >>> between full userspace mode (necessary for compute) and legacy dma-fence
>> >>> mode (necessary for desktop 3d). Which is especially big fun for vulkan,
>> >>> since that will have to do both.
>> >> That is the bigger problem.
>> >>
>> >> Christian.
>> >>
>> >>> But then amd is still hanging onto the amdgpu vs amdkfd split, so you're
>> >>> going for max pain in this area anyway :-P
>> >>> -Daniel
>> >
>>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


end of thread, other threads:[~2021-05-04 21:06 UTC | newest]

Thread overview: 105+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-19 10:47 [RFC] Linux Graphics Next: Explicit fences everywhere and no BO fences - initial proposal Marek Olšák
2021-04-19 15:48 ` Jason Ekstrand
2021-04-20  2:25   ` Marek Olšák
2021-04-20 10:15   ` [Mesa-dev] " Christian König
2021-04-20 10:34     ` Daniel Vetter
2021-04-20 11:03       ` Marek Olšák
2021-04-20 11:16         ` Daniel Vetter
2021-04-20 11:59           ` Christian König
2021-04-20 14:09             ` Daniel Vetter
2021-04-20 16:24               ` Jason Ekstrand
2021-04-20 16:19             ` Jason Ekstrand
2021-04-20 12:01 ` Daniel Vetter
2021-04-20 12:19   ` [Mesa-dev] " Christian König
2021-04-20 13:03   ` Daniel Stone
2021-04-20 14:04     ` Daniel Vetter
2021-04-20 12:42 ` Daniel Stone
2021-04-20 15:45   ` Jason Ekstrand
2021-04-20 17:44     ` Daniel Stone
2021-04-20 18:00       ` [Mesa-dev] " Christian König
2021-04-20 18:15         ` Daniel Stone
2021-04-20 19:03           ` Bas Nieuwenhuizen
2021-04-20 19:18             ` Daniel Stone
2021-04-20 18:53       ` Daniel Vetter
2021-04-20 19:14         ` Daniel Stone
2021-04-20 19:29           ` Daniel Vetter
2021-04-20 20:32             ` Daniel Stone
2021-04-26 20:59               ` Marek Olšák
2021-04-27  8:02                 ` Daniel Vetter
2021-04-27 11:49                   ` Marek Olšák
2021-04-27 12:06                     ` Christian König
2021-04-27 12:11                       ` Marek Olšák
2021-04-27 12:15                         ` Daniel Vetter
2021-04-27 12:27                           ` Christian König
2021-04-27 12:46                           ` Marek Olšák
2021-04-27 12:50                             ` Christian König
2021-04-27 13:26                               ` Marek Olšák
2021-04-27 15:13                                 ` Christian König
2021-04-27 17:31                                 ` Lucas Stach
2021-04-27 17:35                                   ` Simon Ser
2021-04-27 18:01                                     ` Alex Deucher
2021-04-27 18:27                                       ` Simon Ser
2021-04-28 10:01                                         ` Daniel Vetter
2021-04-28 10:05                                       ` Daniel Vetter
2021-04-28 10:31                                         ` Christian König
2021-04-28 12:21                                           ` Daniel Vetter
2021-04-28 12:26                                             ` Daniel Vetter
2021-04-28 13:11                                               ` Christian König
2021-04-28 13:34                                                 ` Daniel Vetter
2021-04-28 13:37                                                   ` Christian König
2021-04-28 14:34                                                     ` Daniel Vetter
2021-04-28 14:45                                                       ` Christian König
2021-04-29 11:07                                                         ` Daniel Vetter
2021-04-28 20:39                                                       ` Alex Deucher
2021-04-29 11:12                                                         ` Daniel Vetter
2021-04-30  8:58                                                           ` Daniel Vetter
2021-04-30  9:07                                                             ` Christian König
2021-04-30  9:35                                                               ` Daniel Vetter
2021-04-30 10:17                                                                 ` Daniel Stone
2021-04-28 12:45                                             ` Simon Ser
2021-04-28 13:03                                           ` Alex Deucher
2021-04-27 19:41                                   ` Jason Ekstrand
2021-04-27 21:58                                     ` Marek Olšák
2021-04-28  4:01                                       ` Jason Ekstrand
2021-04-28  5:19                                         ` Marek Olšák
2021-04-27 18:38                       ` Dave Airlie
2021-04-27 19:23                         ` Marek Olšák
2021-04-28  6:59                           ` Christian König
2021-04-28  9:07                             ` Michel Dänzer
2021-04-28  9:57                               ` Daniel Vetter
2021-05-01 22:27                               ` Marek Olšák
2021-05-03 14:42                                 ` Alex Deucher
2021-05-03 14:59                                   ` Jason Ekstrand
2021-05-03 15:03                                     ` Christian König
2021-05-03 15:15                                       ` Jason Ekstrand
2021-05-03 15:16                                     ` Bas Nieuwenhuizen
2021-05-03 15:23                                       ` Jason Ekstrand
2021-05-03 20:36                                         ` Marek Olšák
2021-05-04  3:11                                           ` Marek Olšák
2021-05-04  7:01                                             ` Christian König
2021-05-04  7:32                                               ` Daniel Vetter
2021-05-04  8:09                                                 ` Christian König
2021-05-04  8:27                                                   ` Daniel Vetter
2021-05-04  9:14                                                     ` Christian König
2021-05-04  9:47                                                       ` Daniel Vetter
2021-05-04 10:53                                                         ` Christian König
2021-05-04 11:13                                                           ` Daniel Vetter
2021-05-04 12:48                                                             ` Christian König
2021-05-04 16:44                                                               ` Daniel Vetter
2021-05-04 17:16                                                               ` Marek Olšák
2021-05-04 21:06                                                                 ` Jason Ekstrand
2021-04-28  9:54                             ` Daniel Vetter
2021-04-27 20:49                         ` Jason Ekstrand
2021-04-27 12:12                     ` Daniel Vetter
2021-04-20 19:16         ` Jason Ekstrand
2021-04-20 19:27           ` Daniel Vetter
2021-04-20 14:53 ` Daniel Stone
2021-04-20 14:58   ` [Mesa-dev] " Christian König
2021-04-20 15:07     ` Daniel Stone
2021-04-20 15:16       ` Christian König
2021-04-20 15:49         ` Daniel Stone
2021-04-20 16:25           ` Marek Olšák
2021-04-20 16:42             ` Jacob Lifshay
2021-04-20 18:03             ` Daniel Stone
2021-04-20 18:39             ` Daniel Vetter
2021-04-20 19:20               ` Marek Olšák
