dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
@ 2023-06-27 13:23 André Almeida
  2023-06-27 16:09 ` Randy Dunlap
                   ` (5 more replies)
  0 siblings, 6 replies; 41+ messages in thread
From: André Almeida @ 2023-06-27 13:23 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, Randy Dunlap, André Almeida,
	'Marek Olšák',
	Michel Dänzer, Timur Kristóf, Pekka Paalanen,
	Samuel Pitoiset, kernel-dev, alexander.deucher, Pekka Paalanen,
	christian.koenig

Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.

Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---

v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/

Changes:
 - Grammar fixes (Randy)

 Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 65fb3036a580..3cbffa25ed93 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
 mmapped regular files. Threads cause additional pain with signal
 handling as well.
 
+Device reset
+============
+
+The GPU stack is really complex and is prone to errors, from hardware bugs,
+faulty applications and everything in between the many layers. Some errors
+require resetting the device in order to make the device usable again. This
+sections describes the expectations for DRM and usermode drivers when a
+device resets and how to propagate the reset status.
+
+Kernel Mode Driver
+------------------
+
+The KMD is responsible for checking if the device needs a reset, and to perform
+it as needed. Usually a hang is detected when a job gets stuck executing. KMD
+should keep track of resets, because userspace can query any time about the
+reset stats for an specific context. This is needed to propagate to the rest of
+the stack that a reset has happened. Currently, this is implemented by each
+driver separately, with no common DRM interface.
+
+User Mode Driver
+----------------
+
+The UMD should check before submitting new commands to the KMD if the device has
+been reset, and this can be checked more often if the UMD requires it. After
+detecting a reset, UMD will then proceed to report it to the application using
+the appropriate API error code, as explained in the section below about
+robustness.
+
+Robustness
+----------
+
+The only way to try to keep an application working after a reset is if it
+complies with the robustness aspects of the graphical API that it is using.
+
+Graphical APIs provide ways to applications to deal with device resets. However,
+there is no guarantee that the app will use such features correctly, and the
+UMD can implement policies to close the app if it is a repeating offender,
+likely in a broken loop. This is done to ensure that it does not keep blocking
+the user interface from being correctly displayed. This should be done even if
+the app is correct but happens to trigger some bug in the hardware/driver.
+
+OpenGL
+~~~~~~
+
+Apps using OpenGL should use the available robust interfaces, like the
+extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
+interface tells if a reset has happened, and if so, all the context state is
+considered lost and the app proceeds by creating new ones. If it is possible to
+determine that robustness is not in use, the UMD will terminate the app when a
+reset is detected, giving that the contexts are lost and the app won't be able
+to figure this out and recreate the contexts.
+
+Vulkan
+~~~~~~
+
+Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
+This error code means, among other things, that a device reset has happened and
+it needs to recreate the contexts to keep going.
+
+Reporting causes of resets
+--------------------------
+
+Apart from propagating the reset through the stack so apps can recover, it's
+really useful for driver developers to learn more about what caused the reset in
+first place. DRM devices should make use of devcoredump to store relevant
+information about the reset, so this information can be added to user bug
+reports.
+
 .. _drm_driver_ioctl:
 
 IOCTL Support on Device Nodes
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-27 13:23 [PATCH v5 1/1] drm/doc: Document DRM device reset expectations André Almeida
@ 2023-06-27 16:09 ` Randy Dunlap
  2023-06-27 17:47 ` Christian König
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: Randy Dunlap @ 2023-06-27 16:09 UTC (permalink / raw)
  To: André Almeida, dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, Pekka Paalanen,
	'Marek Olšák',
	Michel Dänzer, Timur Kristóf, Pekka Paalanen,
	Samuel Pitoiset, kernel-dev, alexander.deucher, christian.koenig

Hi André,

I have just a few more below:

On 6/27/23 06:23, André Almeida wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
> 
> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> 
> v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> 
> Changes:
>  - Grammar fixes (Randy)
> 
>  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>  
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a

   section

> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context. This is needed to propagate to the rest of

               for a specific

stats or status?

> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible to
> +determine that robustness is not in use, the UMD will terminate the app when a
> +reset is detected, giving that the contexts are lost and the app won't be able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant

   the first place.

> +information about the reset, so this information can be added to user bug
> +reports.
> +
>  .. _drm_driver_ioctl:
>  
>  IOCTL Support on Device Nodes

and with those addressed:

Reviewed-by: Randy Dunlap <rdunlap@infradead.org>

Thanks for adding the documentation.

-- 
~Randy

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-27 13:23 [PATCH v5 1/1] drm/doc: Document DRM device reset expectations André Almeida
  2023-06-27 16:09 ` Randy Dunlap
@ 2023-06-27 17:47 ` Christian König
  2023-06-27 21:17   ` André Almeida
  2023-06-27 18:57 ` Marek Olšák
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Christian König @ 2023-06-27 17:47 UTC (permalink / raw)
  To: André Almeida, dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, Pekka Paalanen, kernel-dev,
	'Marek Olšák',
	Michel Dänzer, Randy Dunlap, Pekka Paalanen,
	Timur Kristóf, Samuel Pitoiset, alexander.deucher,
	christian.koenig

Am 27.06.23 um 15:23 schrieb André Almeida:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>
> v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
>
> Changes:
>   - Grammar fixes (Randy)
>
>   Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>   1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>   mmapped regular files. Threads cause additional pain with signal
>   handling as well.
>   
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context.

Maybe drop the part "for a specific context". Essentially the reset 
query could use global counters instead and we won't need the context 
any more here.

Apart from that this sounds good to me, feel free to add my rb.

Regards,
Christian.

>   This is needed to propagate to the rest of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible to
> +determine that robustness is not in use, the UMD will terminate the app when a
> +reset is detected, giving that the contexts are lost and the app won't be able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
>   .. _drm_driver_ioctl:
>   
>   IOCTL Support on Device Nodes


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-27 13:23 [PATCH v5 1/1] drm/doc: Document DRM device reset expectations André Almeida
  2023-06-27 16:09 ` Randy Dunlap
  2023-06-27 17:47 ` Christian König
@ 2023-06-27 18:57 ` Marek Olšák
  2023-06-27 21:31   ` André Almeida
  2023-06-30 14:48 ` Sebastian Wick
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Marek Olšák @ 2023-06-27 18:57 UTC (permalink / raw)
  To: André Almeida
  Cc: Pierre-Eric Pelloux-Prayer, Samuel Pitoiset, Pekka Paalanen,
	kernel-dev, Michel Dänzer, Randy Dunlap,
	Linux Kernel Mailing List, amd-gfx mailing list, Pekka Paalanen,
	Timur Kristóf, dri-devel, Deucher, Alexander,
	Christian König

[-- Attachment #1: Type: text/plain, Size: 4808 bytes --]

On Tue, Jun 27, 2023, 09:23 André Almeida <andrealmeid@igalia.com> wrote:

> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>
> v4:
> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
>
> Changes:
>  - Grammar fixes (Randy)
>
>  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst
> b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third
> handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware
> bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again.
> This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to
> perform
> +it as needed. Usually a hang is detected when a job gets stuck executing.
> KMD
> +should keep track of resets, because userspace can query any time about
> the
> +reset stats for an specific context. This is needed to propagate to the
> rest of
> +the stack that a reset has happened. Currently, this is implemented by
> each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the
> device has
> +been reset, and this can be checked more often if the UMD requires it.
> After
> +detecting a reset, UMD will then proceed to report it to the application
> using
> +the appropriate API error code, as explained in the section below about
> +robustness.
>

The UMD won't check the device status before every command submission due
to ioctl overhead. Instead, the KMD should skip command submission and
return an error that it was skipped.

The only case where that won't be applicable is user queues where drivers
don't call into the kernel to submit work, but they do call into the kernel
to create a dma_fence. In that case, the call to create a dma_fence can
fail with an error.

Marek

+
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is
> using.
> +
> +Graphical APIs provide ways to applications to deal with device resets.
> However,
> +there is no guarantee that the app will use such features correctly, and
> the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep
> blocking
> +the user interface from being correctly displayed. This should be done
> even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES).
> This
> +interface tells if a reset has happened, and if so, all the context state
> is
> +considered lost and the app proceeds by creating new ones. If it is
> possible to
> +determine that robustness is not in use, the UMD will terminate the app
> when a
> +reset is detected, giving that the contexts are lost and the app won't be
> able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for
> submissions.
> +This error code means, among other things, that a device reset has
> happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover,
> it's
> +really useful for driver developers to learn more about what caused the
> reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
>  .. _drm_driver_ioctl:
>
>  IOCTL Support on Device Nodes
> --
> 2.41.0
>
>

[-- Attachment #2: Type: text/html, Size: 5849 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-27 17:47 ` Christian König
@ 2023-06-27 21:17   ` André Almeida
  2023-06-29 13:11     ` André Almeida
  0 siblings, 1 reply; 41+ messages in thread
From: André Almeida @ 2023-06-27 21:17 UTC (permalink / raw)
  To: Christian König
  Cc: pierre-eric.pelloux-prayer, Samuel Pitoiset, Pekka Paalanen,
	kernel-dev, 'Marek Olšák',
	Michel Dänzer, Randy Dunlap, linux-kernel, dri-devel,
	Pekka Paalanen, Timur Kristóf, amd-gfx, alexander.deucher,
	christian.koenig

Em 27/06/2023 14:47, Christian König escreveu:
> Am 27.06.23 um 15:23 schrieb André Almeida:
>> Create a section that specifies how to deal with DRM device resets for
>> kernel and userspace drivers.
>>
>> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>> ---
>>
>> v4: 
>> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
>>
>> Changes:
>>   - Grammar fixes (Randy)
>>
>>   Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>>   1 file changed, 68 insertions(+)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst 
>> b/Documentation/gpu/drm-uapi.rst
>> index 65fb3036a580..3cbffa25ed93 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a 
>> third handler for
>>   mmapped regular files. Threads cause additional pain with signal
>>   handling as well.
>> +Device reset
>> +============
>> +
>> +The GPU stack is really complex and is prone to errors, from hardware 
>> bugs,
>> +faulty applications and everything in between the many layers. Some 
>> errors
>> +require resetting the device in order to make the device usable 
>> again. This
>> +sections describes the expectations for DRM and usermode drivers when a
>> +device resets and how to propagate the reset status.
>> +
>> +Kernel Mode Driver
>> +------------------
>> +
>> +The KMD is responsible for checking if the device needs a reset, and 
>> to perform
>> +it as needed. Usually a hang is detected when a job gets stuck 
>> executing. KMD
>> +should keep track of resets, because userspace can query any time 
>> about the
>> +reset stats for an specific context.
> 
> Maybe drop the part "for a specific context". Essentially the reset 
> query could use global counters instead and we won't need the context 
> any more here.
> 

Right, I wrote like this to reflect how it's currently implemented.

If follow correctly what you meant, KMD could always notify the global 
count for UMD, and we would move to the UMD the responsibility to manage 
the reset counters, right? This would also simplify my 
DRM_IOCTL_GET_RESET proposal. I'll apply your suggestion to the next doc 
version.

> Apart from that this sounds good to me, feel free to add my rb.
> 
> Regards,
> Christian.
> 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-27 18:57 ` Marek Olšák
@ 2023-06-27 21:31   ` André Almeida
  2023-06-28  0:36     ` Marek Olšák
  0 siblings, 1 reply; 41+ messages in thread
From: André Almeida @ 2023-06-27 21:31 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Pierre-Eric Pelloux-Prayer, Samuel Pitoiset, Pekka Paalanen,
	kernel-dev, Michel Dänzer, Randy Dunlap,
	Linux Kernel Mailing List, amd-gfx mailing list, Pekka Paalanen,
	Timur Kristóf, dri-devel, Deucher, Alexander,
	Christian König

Hi Marek,

Em 27/06/2023 15:57, Marek Olšák escreveu:
> On Tue, Jun 27, 2023, 09:23 André Almeida <andrealmeid@igalia.com 
> <mailto:andrealmeid@igalia.com>> wrote:
> 
>     +User Mode Driver
>     +----------------
>     +
>     +The UMD should check before submitting new commands to the KMD if
>     the device has
>     +been reset, and this can be checked more often if the UMD requires
>     it. After
>     +detecting a reset, UMD will then proceed to report it to the
>     application using
>     +the appropriate API error code, as explained in the section below about
>     +robustness.
> 
> 
> The UMD won't check the device status before every command submission 
> due to ioctl overhead. Instead, the KMD should skip command submission 
> and return an error that it was skipped.

I wrote like this because when reading the source code for 
vk::check_status()[0] and Gallium's si_flush_gfx_cs()[1], I was under 
the impression that UMD checks the reset status before every 
submission/flush.

Is your comment about of how things are currently implemented, or how 
they would ideally work? Either way I can apply your suggestion, I just 
want to make it clear.

[0] 
https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/vulkan/runtime/vk_device.h#L142
[1] 
https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/gallium/drivers/radeonsi/si_gfx_cs.c#L83

> 
> The only case where that won't be applicable is user queues where 
> drivers don't call into the kernel to submit work, but they do call into 
> the kernel to create a dma_fence. In that case, the call to create a 
> dma_fence can fail with an error.
> 
> Marek


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-27 21:31   ` André Almeida
@ 2023-06-28  0:36     ` Marek Olšák
  0 siblings, 0 replies; 41+ messages in thread
From: Marek Olšák @ 2023-06-28  0:36 UTC (permalink / raw)
  To: André Almeida
  Cc: Pierre-Eric Pelloux-Prayer, Samuel Pitoiset, Pekka Paalanen,
	kernel-dev, Michel Dänzer, Randy Dunlap,
	Linux Kernel Mailing List, amd-gfx mailing list, Pekka Paalanen,
	Timur Kristóf, dri-devel, Deucher, Alexander,
	Christian König

[-- Attachment #1: Type: text/plain, Size: 2128 bytes --]

On Tue, Jun 27, 2023 at 5:31 PM André Almeida <andrealmeid@igalia.com>
wrote:

> Hi Marek,
>
> Em 27/06/2023 15:57, Marek Olšák escreveu:
> > On Tue, Jun 27, 2023, 09:23 André Almeida <andrealmeid@igalia.com
> > <mailto:andrealmeid@igalia.com>> wrote:
> >
> >     +User Mode Driver
> >     +----------------
> >     +
> >     +The UMD should check before submitting new commands to the KMD if
> >     the device has
> >     +been reset, and this can be checked more often if the UMD requires
> >     it. After
> >     +detecting a reset, UMD will then proceed to report it to the
> >     application using
> >     +the appropriate API error code, as explained in the section below
> about
> >     +robustness.
> >
> >
> > The UMD won't check the device status before every command submission
> > due to ioctl overhead. Instead, the KMD should skip command submission
> > and return an error that it was skipped.
>
> I wrote like this because when reading the source code for
> vk::check_status()[0] and Gallium's si_flush_gfx_cs()[1], I was under
> the impression that UMD checks the reset status before every
> submission/flush.
>

It only does that before every command submission when the context is
robust. When it's not robust, radeonsi doesn't do anything.


>
> Is your comment about of how things are currently implemented, or how
> they would ideally work? Either way I can apply your suggestion, I just
> want to make it clear.
>

Yes. Ideally, we would get the reply whether the context is lost from the
CS ioctl. This is not currently implemented.

Marek


>
> [0]
>
> https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/vulkan/runtime/vk_device.h#L142
> [1]
>
> https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/gallium/drivers/radeonsi/si_gfx_cs.c#L83
>
> >
> > The only case where that won't be applicable is user queues where
> > drivers don't call into the kernel to submit work, but they do call into
> > the kernel to create a dma_fence. In that case, the call to create a
> > dma_fence can fail with an error.
> >
> > Marek
>
>

[-- Attachment #2: Type: text/html, Size: 3407 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-27 21:17   ` André Almeida
@ 2023-06-29 13:11     ` André Almeida
  0 siblings, 0 replies; 41+ messages in thread
From: André Almeida @ 2023-06-29 13:11 UTC (permalink / raw)
  To: Christian König
  Cc: pierre-eric.pelloux-prayer, Samuel Pitoiset, Pekka Paalanen,
	kernel-dev, 'Marek Olšák',
	Michel Dänzer, Randy Dunlap, linux-kernel, dri-devel,
	Pekka Paalanen, Timur Kristóf, amd-gfx, alexander.deucher,
	christian.koenig



Em 27/06/2023 18:17, André Almeida escreveu:
> Em 27/06/2023 14:47, Christian König escreveu:
>> Am 27.06.23 um 15:23 schrieb André Almeida:
>>> Create a section that specifies how to deal with DRM device resets for
>>> kernel and userspace drivers.
>>>
>>> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>> ---
>>>
>>> v4: 
>>> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
>>>
>>> Changes:
>>>   - Grammar fixes (Randy)
>>>
>>>   Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>>>   1 file changed, 68 insertions(+)
>>>
>>> diff --git a/Documentation/gpu/drm-uapi.rst 
>>> b/Documentation/gpu/drm-uapi.rst
>>> index 65fb3036a580..3cbffa25ed93 100644
>>> --- a/Documentation/gpu/drm-uapi.rst
>>> +++ b/Documentation/gpu/drm-uapi.rst
>>> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a 
>>> third handler for
>>>   mmapped regular files. Threads cause additional pain with signal
>>>   handling as well.
>>> +Device reset
>>> +============
>>> +
>>> +The GPU stack is really complex and is prone to errors, from 
>>> hardware bugs,
>>> +faulty applications and everything in between the many layers. Some 
>>> errors
>>> +require resetting the device in order to make the device usable 
>>> again. This
>>> +sections describes the expectations for DRM and usermode drivers when a
>>> +device resets and how to propagate the reset status.
>>> +
>>> +Kernel Mode Driver
>>> +------------------
>>> +
>>> +The KMD is responsible for checking if the device needs a reset, and 
>>> to perform
>>> +it as needed. Usually a hang is detected when a job gets stuck 
>>> executing. KMD
>>> +should keep track of resets, because userspace can query any time 
>>> about the
>>> +reset stats for an specific context.
>>
>> Maybe drop the part "for a specific context". Essentially the reset 
>> query could use global counters instead and we won't need the context 
>> any more here.
>>
> 
> Right, I wrote like this to reflect how it's currently implemented.
> 
> If follow correctly what you meant, KMD could always notify the global 
> count for UMD, and we would move to the UMD the responsibility to manage 
> the reset counters, right? This would also simplify my 
> DRM_IOCTL_GET_RESET proposal. I'll apply your suggestion to the next doc 
> version.
> 

Actually, if we drop the context identifier we would lose the ability to 
track which is the guilty context. Vulkan API doesn't seem to care about 
this, but OpenGL does.

>> Apart from that this sounds good to me, feel free to add my rb.
>>
>> Regards,
>> Christian.
>>
>>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-27 13:23 [PATCH v5 1/1] drm/doc: Document DRM device reset expectations André Almeida
                   ` (2 preceding siblings ...)
  2023-06-27 18:57 ` Marek Olšák
@ 2023-06-30 14:48 ` Sebastian Wick
  2023-06-30 14:59   ` Alex Deucher
  2023-07-25  2:55 ` Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations) André Almeida
  2023-08-04 13:03 ` [PATCH v5 1/1] drm/doc: Document DRM device reset expectations Daniel Vetter
  5 siblings, 1 reply; 41+ messages in thread
From: Sebastian Wick @ 2023-06-30 14:48 UTC (permalink / raw)
  To: André Almeida
  Cc: pierre-eric.pelloux-prayer, Samuel Pitoiset, Pekka Paalanen,
	Marek Olšák, Michel Dänzer, Randy Dunlap,
	linux-kernel, amd-gfx, Pekka Paalanen, Timur Kristóf,
	dri-devel, kernel-dev, alexander.deucher, christian.koenig

On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
>
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>
> v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
>
> Changes:
>  - Grammar fixes (Randy)
>
>  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context. This is needed to propagate to the rest of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.

I still don't think it's good to let the kernel arbitrarily kill
processes that it thinks are not well-behaved based on some heuristics
and policy.

Can't this be outsourced to user space? Expose the information about
processes causing a device and let e.g. systemd deal with coming up
with a policy and with killing stuff.

> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible to
> +determine that robustness is not in use, the UMD will terminate the app when a
> +reset is detected, giving that the contexts are lost and the app won't be able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
>  .. _drm_driver_ioctl:
>
>  IOCTL Support on Device Nodes
> --
> 2.41.0
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-30 14:48 ` Sebastian Wick
@ 2023-06-30 14:59   ` Alex Deucher
  2023-06-30 15:11     ` Michel Dänzer
  2023-06-30 15:21     ` Sebastian Wick
  0 siblings, 2 replies; 41+ messages in thread
From: Alex Deucher @ 2023-06-30 14:59 UTC (permalink / raw)
  To: Sebastian Wick
  Cc: pierre-eric.pelloux-prayer, André Almeida,
	Marek Olšák, Michel Dänzer, dri-devel,
	Randy Dunlap, linux-kernel, Samuel Pitoiset, Pekka Paalanen,
	Timur Kristóf, amd-gfx, kernel-dev, alexander.deucher,
	Pekka Paalanen, christian.koenig

On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
<sebastian.wick@redhat.com> wrote:
>
> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
> >
> > Create a section that specifies how to deal with DRM device resets for
> > kernel and userspace drivers.
> >
> > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > ---
> >
> > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> >
> > Changes:
> >  - Grammar fixes (Randy)
> >
> >  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 68 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 65fb3036a580..3cbffa25ed93 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> >  mmapped regular files. Threads cause additional pain with signal
> >  handling as well.
> >
> > +Device reset
> > +============
> > +
> > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > +faulty applications and everything in between the many layers. Some errors
> > +require resetting the device in order to make the device usable again. This
> > +sections describes the expectations for DRM and usermode drivers when a
> > +device resets and how to propagate the reset status.
> > +
> > +Kernel Mode Driver
> > +------------------
> > +
> > +The KMD is responsible for checking if the device needs a reset, and to perform
> > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > +should keep track of resets, because userspace can query any time about the
> > +reset stats for an specific context. This is needed to propagate to the rest of
> > +the stack that a reset has happened. Currently, this is implemented by each
> > +driver separately, with no common DRM interface.
> > +
> > +User Mode Driver
> > +----------------
> > +
> > +The UMD should check before submitting new commands to the KMD if the device has
> > +been reset, and this can be checked more often if the UMD requires it. After
> > +detecting a reset, UMD will then proceed to report it to the application using
> > +the appropriate API error code, as explained in the section below about
> > +robustness.
> > +
> > +Robustness
> > +----------
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. However,
> > +there is no guarantee that the app will use such features correctly, and the
> > +UMD can implement policies to close the app if it is a repeating offender,
> > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > +the user interface from being correctly displayed. This should be done even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
>
> I still don't think it's good to let the kernel arbitrarily kill
> processes that it thinks are not well-behaved based on some heuristics
> and policy.
>
> Can't this be outsourced to user space? Expose the information about
> processes causing a device and let e.g. systemd deal with coming up
> with a policy and with killing stuff.

I don't think it's the kernel doing the killing, it would be the UMD.
E.g., if the app is guilty and doesn't support robustness the UMD can
just call exit().

Alex

>
> > +
> > +OpenGL
> > +~~~~~~
> > +
> > +Apps using OpenGL should use the available robust interfaces, like the
> > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > +interface tells if a reset has happened, and if so, all the context state is
> > +considered lost and the app proceeds by creating new ones. If it is possible to
> > +determine that robustness is not in use, the UMD will terminate the app when a
> > +reset is detected, giving that the contexts are lost and the app won't be able
> > +to figure this out and recreate the contexts.
> > +
> > +Vulkan
> > +~~~~~~
> > +
> > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > +This error code means, among other things, that a device reset has happened and
> > +it needs to recreate the contexts to keep going.
> > +
> > +Reporting causes of resets
> > +--------------------------
> > +
> > +Apart from propagating the reset through the stack so apps can recover, it's
> > +really useful for driver developers to learn more about what caused the reset in
> > +first place. DRM devices should make use of devcoredump to store relevant
> > +information about the reset, so this information can be added to user bug
> > +reports.
> > +
> >  .. _drm_driver_ioctl:
> >
> >  IOCTL Support on Device Nodes
> > --
> > 2.41.0
> >
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-30 14:59   ` Alex Deucher
@ 2023-06-30 15:11     ` Michel Dänzer
  2023-06-30 20:32       ` Marek Olšák
  2023-06-30 15:21     ` Sebastian Wick
  1 sibling, 1 reply; 41+ messages in thread
From: Michel Dänzer @ 2023-06-30 15:11 UTC (permalink / raw)
  To: Alex Deucher, Sebastian Wick
  Cc: pierre-eric.pelloux-prayer, amd-gfx, André Almeida,
	Marek Olšák, Timur Kristóf, Randy Dunlap,
	linux-kernel, dri-devel, Pekka Paalanen, Samuel Pitoiset,
	kernel-dev, alexander.deucher, Pekka Paalanen, christian.koenig

On 6/30/23 16:59, Alex Deucher wrote:
> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> <sebastian.wick@redhat.com> wrote:
>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
>>>
>>> +Robustness
>>> +----------
>>> +
>>> +The only way to try to keep an application working after a reset is if it
>>> +complies with the robustness aspects of the graphical API that it is using.
>>> +
>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>> +there is no guarantee that the app will use such features correctly, and the
>>> +UMD can implement policies to close the app if it is a repeating offender,
>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>> +the user interface from being correctly displayed. This should be done even if
>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>
>> I still don't think it's good to let the kernel arbitrarily kill
>> processes that it thinks are not well-behaved based on some heuristics
>> and policy.
>>
>> Can't this be outsourced to user space? Expose the information about
>> processes causing a device and let e.g. systemd deal with coming up
>> with a policy and with killing stuff.
> 
> I don't think it's the kernel doing the killing, it would be the UMD.
> E.g., if the app is guilty and doesn't support robustness the UMD can
> just call exit().

It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.


[0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".

-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-30 14:59   ` Alex Deucher
  2023-06-30 15:11     ` Michel Dänzer
@ 2023-06-30 15:21     ` Sebastian Wick
  1 sibling, 0 replies; 41+ messages in thread
From: Sebastian Wick @ 2023-06-30 15:21 UTC (permalink / raw)
  To: Alex Deucher
  Cc: pierre-eric.pelloux-prayer, André Almeida,
	Marek Olšák, Michel Dänzer, dri-devel,
	Randy Dunlap, linux-kernel, Samuel Pitoiset, Pekka Paalanen,
	Timur Kristóf, amd-gfx, kernel-dev, alexander.deucher,
	Pekka Paalanen, christian.koenig

On Fri, Jun 30, 2023 at 4:59 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> <sebastian.wick@redhat.com> wrote:
> >
> > On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
> > >
> > > Create a section that specifies how to deal with DRM device resets for
> > > kernel and userspace drivers.
> > >
> > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > ---
> > >
> > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> > >
> > > Changes:
> > >  - Grammar fixes (Randy)
> > >
> > >  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 68 insertions(+)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 65fb3036a580..3cbffa25ed93 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> > >  mmapped regular files. Threads cause additional pain with signal
> > >  handling as well.
> > >
> > > +Device reset
> > > +============
> > > +
> > > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > > +faulty applications and everything in between the many layers. Some errors
> > > +require resetting the device in order to make the device usable again. This
> > > +sections describes the expectations for DRM and usermode drivers when a
> > > +device resets and how to propagate the reset status.
> > > +
> > > +Kernel Mode Driver
> > > +------------------
> > > +
> > > +The KMD is responsible for checking if the device needs a reset, and to perform
> > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > > +should keep track of resets, because userspace can query any time about the
> > > +reset stats for an specific context. This is needed to propagate to the rest of
> > > +the stack that a reset has happened. Currently, this is implemented by each
> > > +driver separately, with no common DRM interface.
> > > +
> > > +User Mode Driver
> > > +----------------
> > > +
> > > +The UMD should check before submitting new commands to the KMD if the device has
> > > +been reset, and this can be checked more often if the UMD requires it. After
> > > +detecting a reset, UMD will then proceed to report it to the application using
> > > +the appropriate API error code, as explained in the section below about
> > > +robustness.
> > > +
> > > +Robustness
> > > +----------
> > > +
> > > +The only way to try to keep an application working after a reset is if it
> > > +complies with the robustness aspects of the graphical API that it is using.
> > > +
> > > +Graphical APIs provide ways to applications to deal with device resets. However,
> > > +there is no guarantee that the app will use such features correctly, and the
> > > +UMD can implement policies to close the app if it is a repeating offender,
> > > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > > +the user interface from being correctly displayed. This should be done even if
> > > +the app is correct but happens to trigger some bug in the hardware/driver.
> >
> > I still don't think it's good to let the kernel arbitrarily kill
> > processes that it thinks are not well-behaved based on some heuristics
> > and policy.
> >
> > Can't this be outsourced to user space? Expose the information about
> > processes causing a device and let e.g. systemd deal with coming up
> > with a policy and with killing stuff.
>
> I don't think it's the kernel doing the killing, it would be the UMD.
> E.g., if the app is guilty and doesn't support robustness the UMD can
> just call exit().

Ah, right, completely skipped over the UMD part. That makes more sense.
>
> Alex
>
> >
> > > +
> > > +OpenGL
> > > +~~~~~~
> > > +
> > > +Apps using OpenGL should use the available robust interfaces, like the
> > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > > +interface tells if a reset has happened, and if so, all the context state is
> > > +considered lost and the app proceeds by creating new ones. If it is possible to
> > > +determine that robustness is not in use, the UMD will terminate the app when a
> > > +reset is detected, giving that the contexts are lost and the app won't be able
> > > +to figure this out and recreate the contexts.
> > > +
> > > +Vulkan
> > > +~~~~~~
> > > +
> > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > > +This error code means, among other things, that a device reset has happened and
> > > +it needs to recreate the contexts to keep going.
> > > +
> > > +Reporting causes of resets
> > > +--------------------------
> > > +
> > > +Apart from propagating the reset through the stack so apps can recover, it's
> > > +really useful for driver developers to learn more about what caused the reset in
> > > +first place. DRM devices should make use of devcoredump to store relevant
> > > +information about the reset, so this information can be added to user bug
> > > +reports.
> > > +
> > >  .. _drm_driver_ioctl:
> > >
> > >  IOCTL Support on Device Nodes
> > > --
> > > 2.41.0
> > >
> >
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-30 15:11     ` Michel Dänzer
@ 2023-06-30 20:32       ` Marek Olšák
  2023-07-03  7:12         ` Michel Dänzer
  0 siblings, 1 reply; 41+ messages in thread
From: Marek Olšák @ 2023-06-30 20:32 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: pierre-eric.pelloux-prayer, Sebastian Wick, amd-gfx,
	André Almeida, Timur Kristóf, Randy Dunlap,
	linux-kernel, dri-devel, alexander.deucher, Pekka Paalanen,
	Samuel Pitoiset, kernel-dev, Pekka Paalanen, christian.koenig

[-- Attachment #1: Type: text/plain, Size: 2661 bytes --]

That's a terrible idea. Ignoring API calls would be identical to a freeze.
You might as well disable GPU recovery because the result would be the same.

There are 2 scenarios:
- robust contexts: report the GPU reset status and skip API calls; let the
app recreate the context to recover
- non-robust contexts: call exit(1) immediately, which is the best way to
recover

Marek

On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org>
wrote:

> On 6/30/23 16:59, Alex Deucher wrote:
> > On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> > <sebastian.wick@redhat.com> wrote:
> >> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com>
> wrote:
> >>>
> >>> +Robustness
> >>> +----------
> >>> +
> >>> +The only way to try to keep an application working after a reset is
> if it
> >>> +complies with the robustness aspects of the graphical API that it is
> using.
> >>> +
> >>> +Graphical APIs provide ways to applications to deal with device
> resets. However,
> >>> +there is no guarantee that the app will use such features correctly,
> and the
> >>> +UMD can implement policies to close the app if it is a repeating
> offender,
> >>> +likely in a broken loop. This is done to ensure that it does not keep
> blocking
> >>> +the user interface from being correctly displayed. This should be
> done even if
> >>> +the app is correct but happens to trigger some bug in the
> hardware/driver.
> >>
> >> I still don't think it's good to let the kernel arbitrarily kill
> >> processes that it thinks are not well-behaved based on some heuristics
> >> and policy.
> >>
> >> Can't this be outsourced to user space? Expose the information about
> >> processes causing a device and let e.g. systemd deal with coming up
> >> with a policy and with killing stuff.
> >
> > I don't think it's the kernel doing the killing, it would be the UMD.
> > E.g., if the app is guilty and doesn't support robustness the UMD can
> > just call exit().
>
> It would be safer to just ignore API calls[0], similarly to what is done
> until the application destroys the context with robustness. Calling exit()
> likely results in losing any unsaved work, whereas at least some
> applications might otherwise allow saving the work by other means.
>
>
> [0] Possibly accompanied by a one-time message to stderr along the lines
> of "GPU reset detected but robustness not enabled in context, ignoring
> OpenGL API calls".
>
> --
> Earthling Michel Dänzer            |                  https://redhat.com
> Libre software enthusiast          |         Mesa and Xwayland developer
>
>

[-- Attachment #2: Type: text/html, Size: 3539 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-30 20:32       ` Marek Olšák
@ 2023-07-03  7:12         ` Michel Dänzer
  2023-07-03  8:49           ` Pekka Paalanen
  2023-07-04  2:34           ` Marek Olšák
  0 siblings, 2 replies; 41+ messages in thread
From: Michel Dänzer @ 2023-07-03  7:12 UTC (permalink / raw)
  To: Marek Olšák
  Cc: pierre-eric.pelloux-prayer, Sebastian Wick, André Almeida,
	Timur Kristóf, Randy Dunlap, linux-kernel, amd-gfx,
	Pekka Paalanen, Samuel Pitoiset, dri-devel, kernel-dev,
	alexander.deucher, Pekka Paalanen, christian.koenig

On 6/30/23 22:32, Marek Olšák wrote:
> On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>> wrote:
>> On 6/30/23 16:59, Alex Deucher wrote:
>>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com>> wrote:
>>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com <mailto:andrealmeid@igalia.com>> wrote:
>>>>>
>>>>> +Robustness
>>>>> +----------
>>>>> +
>>>>> +The only way to try to keep an application working after a reset is if it
>>>>> +complies with the robustness aspects of the graphical API that it is using.
>>>>> +
>>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>>>> +there is no guarantee that the app will use such features correctly, and the
>>>>> +UMD can implement policies to close the app if it is a repeating offender,
>>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>>>> +the user interface from being correctly displayed. This should be done even if
>>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>>>
>>>> I still don't think it's good to let the kernel arbitrarily kill
>>>> processes that it thinks are not well-behaved based on some heuristics
>>>> and policy.
>>>>
>>>> Can't this be outsourced to user space? Expose the information about
>>>> processes causing a device and let e.g. systemd deal with coming up
>>>> with a policy and with killing stuff.
>>>
>>> I don't think it's the kernel doing the killing, it would be the UMD.
>>> E.g., if the app is guilty and doesn't support robustness the UMD can
>>> just call exit().
>>
>> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
> 
> That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.

No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.


> - non-robust contexts: call exit(1) immediately, which is the best way to recover

That's not the UMD's call to make.


>>     [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-07-03  7:12         ` Michel Dänzer
@ 2023-07-03  8:49           ` Pekka Paalanen
  2023-07-03 15:00             ` André Almeida
  2023-07-04  2:34           ` Marek Olšák
  1 sibling, 1 reply; 41+ messages in thread
From: Pekka Paalanen @ 2023-07-03  8:49 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: pierre-eric.pelloux-prayer, Sebastian Wick, André Almeida,
	Marek Olšák, Timur Kristóf, Randy Dunlap,
	linux-kernel, amd-gfx, Samuel Pitoiset, dri-devel, kernel-dev,
	alexander.deucher, christian.koenig

[-- Attachment #1: Type: text/plain, Size: 3767 bytes --]

On Mon, 3 Jul 2023 09:12:29 +0200
Michel Dänzer <michel.daenzer@mailbox.org> wrote:

> On 6/30/23 22:32, Marek Olšák wrote:
> > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>> wrote:  
> >> On 6/30/23 16:59, Alex Deucher wrote:  
> >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com>> wrote:  
> >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com <mailto:andrealmeid@igalia.com>> wrote:  
> >>>>>
> >>>>> +Robustness
> >>>>> +----------
> >>>>> +
> >>>>> +The only way to try to keep an application working after a reset is if it
> >>>>> +complies with the robustness aspects of the graphical API that it is using.
> >>>>> +
> >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> >>>>> +there is no guarantee that the app will use such features correctly, and the
> >>>>> +UMD can implement policies to close the app if it is a repeating offender,
> >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >>>>> +the user interface from being correctly displayed. This should be done even if
> >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.  
> >>>>
> >>>> I still don't think it's good to let the kernel arbitrarily kill
> >>>> processes that it thinks are not well-behaved based on some heuristics
> >>>> and policy.
> >>>>
> >>>> Can't this be outsourced to user space? Expose the information about
> >>>> processes causing a device and let e.g. systemd deal with coming up
> >>>> with a policy and with killing stuff.  
> >>>
> >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >>> just call exit().  
> >>
> >> It would be safer to just ignore API calls[0], similarly to what
> >> is done until the application destroys the context with
> >> robustness. Calling exit() likely results in losing any unsaved
> >> work, whereas at least some applications might otherwise allow
> >> saving the work by other means.  
> > 
> > That's a terrible idea. Ignoring API calls would be identical to a
> > freeze. You might as well disable GPU recovery because the result
> > would be the same.  
> 
> No GPU recovery would affect everything using the GPU, whereas this
> affects only non-robust applications.
> 
> 
> > - non-robust contexts: call exit(1) immediately, which is the best
> > way to recover  
> 
> That's not the UMD's call to make.
> 
> 
> >>     [0] Possibly accompanied by a one-time message to stderr along
> >> the lines of "GPU reset detected but robustness not enabled in
> >> context, ignoring OpenGL API calls".  
> 

Hi,

Michel does have a point. It's not just games and display servers that
use GPU, but productivity tools as well. They may have periodic
autosave in anticipation of crashes, but being able to do the final
save before quitting would be nice. UMD killing the process would be
new behaviour, right? Previously either application's GPU thread hangs
or various API calls return errors, but it didn't kill the process, did
it?

If an application freezes, that's "no problem"; the end user can just
continue using everything else. Alt-tab away etc. if the app was
fullscreen. I do that already with games on even Xorg.

If a display server freezes, that's a desktop-wide problem, but so is
killing it.

OTOH, if UMD really does need to terminate the process, then please do
it in a way that causes a crash report to be recorded. _exit() with an
error code is not it.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-07-03  8:49           ` Pekka Paalanen
@ 2023-07-03 15:00             ` André Almeida
  2023-07-04  7:42               ` Pekka Paalanen
  0 siblings, 1 reply; 41+ messages in thread
From: André Almeida @ 2023-07-03 15:00 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: pierre-eric.pelloux-prayer, Sebastian Wick, Randy Dunlap,
	Marek Olšák, Timur Kristóf, Michel Dänzer,
	linux-kernel, amd-gfx, Samuel Pitoiset, dri-devel, kernel-dev,
	alexander.deucher, christian.koenig



Em 03/07/2023 05:49, Pekka Paalanen escreveu:
> On Mon, 3 Jul 2023 09:12:29 +0200
> Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> 
>> On 6/30/23 22:32, Marek Olšák wrote:
>>> On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>> wrote:
>>>> On 6/30/23 16:59, Alex Deucher wrote:
>>>>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>>>>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com>> wrote:
>>>>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com <mailto:andrealmeid@igalia.com>> wrote:
>>>>>>>
>>>>>>> +Robustness
>>>>>>> +----------
>>>>>>> +
>>>>>>> +The only way to try to keep an application working after a reset is if it
>>>>>>> +complies with the robustness aspects of the graphical API that it is using.
>>>>>>> +
>>>>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>>>>>> +there is no guarantee that the app will use such features correctly, and the
>>>>>>> +UMD can implement policies to close the app if it is a repeating offender,
>>>>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>>>>>> +the user interface from being correctly displayed. This should be done even if
>>>>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>>>>>
>>>>>> I still don't think it's good to let the kernel arbitrarily kill
>>>>>> processes that it thinks are not well-behaved based on some heuristics
>>>>>> and policy.
>>>>>>
>>>>>> Can't this be outsourced to user space? Expose the information about
>>>>>> processes causing a device and let e.g. systemd deal with coming up
>>>>>> with a policy and with killing stuff.
>>>>>
>>>>> I don't think it's the kernel doing the killing, it would be the UMD.
>>>>> E.g., if the app is guilty and doesn't support robustness the UMD can
>>>>> just call exit().
>>>>
>>>> It would be safer to just ignore API calls[0], similarly to what
>>>> is done until the application destroys the context with
>>>> robustness. Calling exit() likely results in losing any unsaved
>>>> work, whereas at least some applications might otherwise allow
>>>> saving the work by other means.
>>>
>>> That's a terrible idea. Ignoring API calls would be identical to a
>>> freeze. You might as well disable GPU recovery because the result
>>> would be the same.
>>
>> No GPU recovery would affect everything using the GPU, whereas this
>> affects only non-robust applications.
>>
>>
>>> - non-robust contexts: call exit(1) immediately, which is the best
>>> way to recover
>>
>> That's not the UMD's call to make.
>>
>>
>>>>      [0] Possibly accompanied by a one-time message to stderr along
>>>> the lines of "GPU reset detected but robustness not enabled in
>>>> context, ignoring OpenGL API calls".
>>
> 
> Hi,
> 
> Michel does have a point. It's not just games and display servers that
> use GPU, but productivity tools as well. They may have periodic
> autosave in anticipation of crashes, but being able to do the final
> save before quitting would be nice. UMD killing the process would be
> new behaviour, right? Previously either application's GPU thread hangs
> or various API calls return errors, but it didn't kill the process, did
> it?
> 

In Intel's Iris, UMD may call abort() for the reset guilty application:

https://elixir.bootlin.com/mesa/mesa-23.0.4/source/src/gallium/drivers/iris/iris_batch.c#L1063

I was pretty sure this was the same for RadeonSI, but I failed to find 
the code for this, so I might be wrong.

> If an application freezes, that's "no problem"; the end user can just
> continue using everything else. Alt-tab away etc. if the app was
> fullscreen. I do that already with games on even Xorg.
> 
> If a display server freezes, that's a desktop-wide problem, but so is
> killing it.
> 

Interesting, what GPU do you use? In my experience (AMD RX 5600 XT), 
hanging the GPU usually means that the rest of applications/compositor 
can't use the GPU either, freezing all user interactions. So killing the 
guilty app is one effective solution currently, but ignoring calls may 
help as well.

> OTOH, if UMD really does need to terminate the process, then please do
> it in a way that causes a crash report to be recorded. _exit() with an
> error code is not it.
> 

In the "Reporting causes of resets" subsection of this document I can 
add something for UMD as well.

> 
> Thanks,
> pq

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-07-03  7:12         ` Michel Dänzer
  2023-07-03  8:49           ` Pekka Paalanen
@ 2023-07-04  2:34           ` Marek Olšák
  2023-07-04  2:38             ` Randy Dunlap
  2023-07-04  7:54             ` Michel Dänzer
  1 sibling, 2 replies; 41+ messages in thread
From: Marek Olšák @ 2023-07-04  2:34 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Pierre-Eric Pelloux-Prayer, Sebastian Wick, André Almeida,
	Timur Kristóf, Randy Dunlap, Linux Kernel Mailing List,
	amd-gfx mailing list, Pekka Paalanen, Samuel Pitoiset, dri-devel,
	kernel-dev, Deucher, Alexander, Pekka Paalanen,
	Christian König

[-- Attachment #1: Type: text/plain, Size: 3546 bytes --]

On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org> wrote:

> On 6/30/23 22:32, Marek Olšák wrote:
> > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <
> michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>> wrote:
> >> On 6/30/23 16:59, Alex Deucher wrote:
> >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com>> wrote:
> >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com
> <mailto:andrealmeid@igalia.com>> wrote:
> >>>>>
> >>>>> +Robustness
> >>>>> +----------
> >>>>> +
> >>>>> +The only way to try to keep an application working after a reset is
> if it
> >>>>> +complies with the robustness aspects of the graphical API that it
> is using.
> >>>>> +
> >>>>> +Graphical APIs provide ways to applications to deal with device
> resets. However,
> >>>>> +there is no guarantee that the app will use such features
> correctly, and the
> >>>>> +UMD can implement policies to close the app if it is a repeating
> offender,
> >>>>> +likely in a broken loop. This is done to ensure that it does not
> keep blocking
> >>>>> +the user interface from being correctly displayed. This should be
> done even if
> >>>>> +the app is correct but happens to trigger some bug in the
> hardware/driver.
> >>>>
> >>>> I still don't think it's good to let the kernel arbitrarily kill
> >>>> processes that it thinks are not well-behaved based on some heuristics
> >>>> and policy.
> >>>>
> >>>> Can't this be outsourced to user space? Expose the information about
> >>>> processes causing a device and let e.g. systemd deal with coming up
> >>>> with a policy and with killing stuff.
> >>>
> >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >>> just call exit().
> >>
> >> It would be safer to just ignore API calls[0], similarly to what is
> done until the application destroys the context with robustness. Calling
> exit() likely results in losing any unsaved work, whereas at least some
> applications might otherwise allow saving the work by other means.
> >
> > That's a terrible idea. Ignoring API calls would be identical to a
> freeze. You might as well disable GPU recovery because the result would be
> the same.
>
> No GPU recovery would affect everything using the GPU, whereas this
> affects only non-robust applications.
>

which is currently the majority.


>
> > - non-robust contexts: call exit(1) immediately, which is the best way
> to recover
>
> That's not the UMD's call to make.
>

That's absolutely the UMD's call to make because that's mandated by the hw
and API design and only driver devs know this, which this thread is a proof
of. The default behavior is to skip all command submission if a non-robust
context is lost, which looks like a freeze. That's required to prevent
infinite hangs from the same context and can be caused by the side effects
of the GPU reset itself, not by the cause of the previous hang. The only
way out of that is killing the process.

Marek


>
> >>     [0] Possibly accompanied by a one-time message to stderr along the
> lines of "GPU reset detected but robustness not enabled in context,
> ignoring OpenGL API calls".
>
>
> --
> Earthling Michel Dänzer            |                  https://redhat.com
> Libre software enthusiast          |         Mesa and Xwayland developer
>
>

[-- Attachment #2: Type: text/html, Size: 5459 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-07-04  2:34           ` Marek Olšák
@ 2023-07-04  2:38             ` Randy Dunlap
  2023-07-04  2:44               ` Marek Olšák
  2023-07-04  7:54             ` Michel Dänzer
  1 sibling, 1 reply; 41+ messages in thread
From: Randy Dunlap @ 2023-07-04  2:38 UTC (permalink / raw)
  To: Marek Olšák, Michel Dänzer
  Cc: Pierre-Eric Pelloux-Prayer, Sebastian Wick, André Almeida,
	Timur Kristóf, Linux Kernel Mailing List,
	amd-gfx mailing list, Pekka Paalanen, Samuel Pitoiset, dri-devel,
	kernel-dev, Deucher, Alexander, Pekka Paalanen,
	Christian König



On 7/3/23 19:34, Marek Olšák wrote:
> 
> 
> On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>> wrote:
> 

Marek,
Please stop sending html emails to the mailing lists.
The mailing list software drops them.

Please set your email interface to use plain text mode instead.
Thanks.

-- 
~Randy

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-07-04  2:38             ` Randy Dunlap
@ 2023-07-04  2:44               ` Marek Olšák
  2023-07-04  2:48                 ` Randy Dunlap
  0 siblings, 1 reply; 41+ messages in thread
From: Marek Olšák @ 2023-07-04  2:44 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Pierre-Eric Pelloux-Prayer, Sebastian Wick, André Almeida,
	Timur Kristóf, Michel Dänzer,
	Linux Kernel Mailing List, amd-gfx mailing list, Pekka Paalanen,
	Samuel Pitoiset, dri-devel, kernel-dev, Deucher, Alexander,
	Pekka Paalanen, Christian König

[-- Attachment #1: Type: text/plain, Size: 561 bytes --]

On Mon, Jul 3, 2023, 22:38 Randy Dunlap <rdunlap@infradead.org> wrote:

>
>
> On 7/3/23 19:34, Marek Olšák wrote:
> >
> >
> > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org
> <mailto:michel.daenzer@mailbox.org>> wrote:
> >
>
> Marek,
> Please stop sending html emails to the mailing lists.
> The mailing list software drops them.
>
> Please set your email interface to use plain text mode instead.
> Thanks.
>

The mobile Gmail app doesn't support plain text, which I use frequently.

Marek


> --
> ~Randy
>

[-- Attachment #2: Type: text/html, Size: 1344 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-07-04  2:44               ` Marek Olšák
@ 2023-07-04  2:48                 ` Randy Dunlap
  0 siblings, 0 replies; 41+ messages in thread
From: Randy Dunlap @ 2023-07-04  2:48 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Pierre-Eric Pelloux-Prayer, Sebastian Wick, André Almeida,
	Timur Kristóf, Michel Dänzer,
	Linux Kernel Mailing List, amd-gfx mailing list, Pekka Paalanen,
	Samuel Pitoiset, dri-devel, kernel-dev, Deucher, Alexander,
	Pekka Paalanen, Christian König



On 7/3/23 19:44, Marek Olšák wrote:
> 
> 
> On Mon, Jul 3, 2023, 22:38 Randy Dunlap <rdunlap@infradead.org <mailto:rdunlap@infradead.org>> wrote:
> 
> 
> 
>     On 7/3/23 19:34, Marek Olšák wrote:
>     >
>     >
>     > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org> <mailto:michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>>> wrote:
>     >
> 
>     Marek,
>     Please stop sending html emails to the mailing lists.
>     The mailing list software drops them.
> 
>     Please set your email interface to use plain text mode instead.
>     Thanks.
> 
> 
> The mobile Gmail app doesn't support plain text, which I use frequently.

Perhaps you should consider some other mobile app for kernel discussions.

E.g., it looks like the K-9 mail app works with gmail.

-- 
~Randy

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-07-03 15:00             ` André Almeida
@ 2023-07-04  7:42               ` Pekka Paalanen
  0 siblings, 0 replies; 41+ messages in thread
From: Pekka Paalanen @ 2023-07-04  7:42 UTC (permalink / raw)
  To: André Almeida
  Cc: pierre-eric.pelloux-prayer, Sebastian Wick, Randy Dunlap,
	Marek Olšák, Timur Kristóf, Michel Dänzer,
	linux-kernel, amd-gfx, Samuel Pitoiset, dri-devel, kernel-dev,
	alexander.deucher, christian.koenig

[-- Attachment #1: Type: text/plain, Size: 1394 bytes --]

On Mon, 3 Jul 2023 12:00:22 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> Em 03/07/2023 05:49, Pekka Paalanen escreveu:

> > If an application freezes, that's "no problem"; the end user can just
> > continue using everything else. Alt-tab away etc. if the app was
> > fullscreen. I do that already with games on even Xorg.
> > 
> > If a display server freezes, that's a desktop-wide problem, but so is
> > killing it.
> >   
> 
> Interesting, what GPU do you use? In my experience (AMD RX 5600 XT), 
> hanging the GPU usually means that the rest of applications/compositor 
> can't use the GPU either, freezing all user interactions. So killing the 
> guilty app is one effective solution currently, but ignoring calls may 
> help as well.

I don't know if what I'm seeing is a GPU hang or just e.g. Proton
getting somehow stuck, all I see is a game freezing. I just Alt+tab
back to Steam, force-stop it, and then all is fine again. This is how
it should work regardless of why a game freezes.

However, even if it was a GPU hang, if I am on a display server that
actually handles GPU resets, I don't see why the rest of the desktop
would not be able to recover. Individual apps are each to their own,
but at the very least non-GPU apps and the DE itself should not have
any problem (DE components can simply be restarted automatically).


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-07-04  2:34           ` Marek Olšák
  2023-07-04  2:38             ` Randy Dunlap
@ 2023-07-04  7:54             ` Michel Dänzer
  2023-07-05  6:30               ` Marek Olšák
  1 sibling, 1 reply; 41+ messages in thread
From: Michel Dänzer @ 2023-07-04  7:54 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Pierre-Eric Pelloux-Prayer, Sebastian Wick, André Almeida,
	Timur Kristóf, dri-devel, Randy Dunlap,
	Linux Kernel Mailing List, amd-gfx mailing list, Pekka Paalanen,
	Samuel Pitoiset, kernel-dev, Deucher, Alexander, Pekka Paalanen,
	Christian König

On 7/4/23 04:34, Marek Olšák wrote:
> On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>> wrote:
>     On 6/30/23 22:32, Marek Olšák wrote:
>     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org> <mailto:michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>>> wrote:
>     >> On 6/30/23 16:59, Alex Deucher wrote:
>     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>     >>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com> <mailto:sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com>>> wrote:
>     >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com <mailto:andrealmeid@igalia.com> <mailto:andrealmeid@igalia.com <mailto:andrealmeid@igalia.com>>> wrote:
>     >>>>>
>     >>>>> +Robustness
>     >>>>> +----------
>     >>>>> +
>     >>>>> +The only way to try to keep an application working after a reset is if it
>     >>>>> +complies with the robustness aspects of the graphical API that it is using.
>     >>>>> +
>     >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>     >>>>> +there is no guarantee that the app will use such features correctly, and the
>     >>>>> +UMD can implement policies to close the app if it is a repeating offender,
>     >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>     >>>>> +the user interface from being correctly displayed. This should be done even if
>     >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>     >>>>
>     >>>> I still don't think it's good to let the kernel arbitrarily kill
>     >>>> processes that it thinks are not well-behaved based on some heuristics
>     >>>> and policy.
>     >>>>
>     >>>> Can't this be outsourced to user space? Expose the information about
>     >>>> processes causing a device and let e.g. systemd deal with coming up
>     >>>> with a policy and with killing stuff.
>     >>>
>     >>> I don't think it's the kernel doing the killing, it would be the UMD.
>     >>> E.g., if the app is guilty and doesn't support robustness the UMD can
>     >>> just call exit().
>     >>
>     >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
>     >
>     > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.
> 
>     No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.
> 
> which is currently the majority.

Not sure where you're going with this. Applications need to use robustness to be able to recover from a GPU hang, and the GPU needs to be reset for that. So disabling GPU reset is not the same as what we're discussing here.


>     > - non-robust contexts: call exit(1) immediately, which is the best way to recover
> 
>     That's not the UMD's call to make.
> 
> That's absolutely the UMD's call to make because that's mandated by the hw and API design

Can you point us to a spec which mandates that the process must be killed in this case?


> and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process.

The UMD killing the process is not the only way out of that, and doing so is overreach on its part. The UMD is but one out of many components in a process, not the main one or a special one. It doesn't get to decide when the process must die, certainly not under circumstances where it must be able to continue while ignoring API calls (that's required for robustness).


>     >>     [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-07-04  7:54             ` Michel Dänzer
@ 2023-07-05  6:30               ` Marek Olšák
  2023-07-05  7:32                 ` Michel Dänzer
  0 siblings, 1 reply; 41+ messages in thread
From: Marek Olšák @ 2023-07-05  6:30 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Pierre-Eric Pelloux-Prayer, Sebastian Wick, André Almeida,
	Timur Kristóf, dri-devel, Randy Dunlap,
	Linux Kernel Mailing List, amd-gfx mailing list, Pekka Paalanen,
	Samuel Pitoiset, kernel-dev, Deucher, Alexander, Pekka Paalanen,
	Christian König

[-- Attachment #1: Type: text/plain, Size: 5685 bytes --]

On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daenzer@mailbox.org> wrote:

> On 7/4/23 04:34, Marek Olšák wrote:
> > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org
> <mailto:michel.daenzer@mailbox.org>> wrote:
> >     On 6/30/23 22:32, Marek Olšák wrote:
> >     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <
> michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org> <mailto:
> michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>>> wrote:
> >     >> On 6/30/23 16:59, Alex Deucher wrote:
> >     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >     >>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com>
> <mailto:sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com>>>
> wrote:
> >     >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <
> andrealmeid@igalia.com <mailto:andrealmeid@igalia.com> <mailto:
> andrealmeid@igalia.com <mailto:andrealmeid@igalia.com>>> wrote:
> >     >>>>>
> >     >>>>> +Robustness
> >     >>>>> +----------
> >     >>>>> +
> >     >>>>> +The only way to try to keep an application working after a
> reset is if it
> >     >>>>> +complies with the robustness aspects of the graphical API
> that it is using.
> >     >>>>> +
> >     >>>>> +Graphical APIs provide ways to applications to deal with
> device resets. However,
> >     >>>>> +there is no guarantee that the app will use such features
> correctly, and the
> >     >>>>> +UMD can implement policies to close the app if it is a
> repeating offender,
> >     >>>>> +likely in a broken loop. This is done to ensure that it does
> not keep blocking
> >     >>>>> +the user interface from being correctly displayed. This
> should be done even if
> >     >>>>> +the app is correct but happens to trigger some bug in the
> hardware/driver.
> >     >>>>
> >     >>>> I still don't think it's good to let the kernel arbitrarily kill
> >     >>>> processes that it thinks are not well-behaved based on some
> heuristics
> >     >>>> and policy.
> >     >>>>
> >     >>>> Can't this be outsourced to user space? Expose the information
> about
> >     >>>> processes causing a device and let e.g. systemd deal with
> coming up
> >     >>>> with a policy and with killing stuff.
> >     >>>
> >     >>> I don't think it's the kernel doing the killing, it would be the
> UMD.
> >     >>> E.g., if the app is guilty and doesn't support robustness the
> UMD can
> >     >>> just call exit().
> >     >>
> >     >> It would be safer to just ignore API calls[0], similarly to what
> is done until the application destroys the context with robustness. Calling
> exit() likely results in losing any unsaved work, whereas at least some
> applications might otherwise allow saving the work by other means.
> >     >
> >     > That's a terrible idea. Ignoring API calls would be identical to a
> freeze. You might as well disable GPU recovery because the result would be
> the same.
> >
> >     No GPU recovery would affect everything using the GPU, whereas this
> affects only non-robust applications.
> >
> > which is currently the majority.
>
> Not sure where you're going with this. Applications need to use robustness
> to be able to recover from a GPU hang, and the GPU needs to be reset for
> that. So disabling GPU reset is not the same as what we're discussing here.
>
>
> >     > - non-robust contexts: call exit(1) immediately, which is the best
> way to recover
> >
> >     That's not the UMD's call to make.
> >
> > That's absolutely the UMD's call to make because that's mandated by the
> hw and API design
>
> Can you point us to a spec which mandates that the process must be killed
> in this case?
>
>
> > and only driver devs know this, which this thread is a proof of. The
> default behavior is to skip all command submission if a non-robust context
> is lost, which looks like a freeze. That's required to prevent infinite
> hangs from the same context and can be caused by the side effects of the
> GPU reset itself, not by the cause of the previous hang. The only way out
> of that is killing the process.
>
> The UMD killing the process is not the only way out of that, and doing so
> is overreach on its part. The UMD is but one out of many components in a
> process, not the main one or a special one. It doesn't get to decide when
> the process must die, certainly not under circumstances where it must be
> able to continue while ignoring API calls (that's required for robustness).
>

You're mixing things up. Robust apps don't any special action from a UMD.
Only non-robust apps need to be killed for proper recovery with the only
other alternative being not updating the window/screen, which is not
user-friendly because the user who's never heard of GPU hangs has no
fucking idea why it's frozen and what do with it. It doesn't matter that
you can debug it because you're not the average user. Also it's already
used and required by our customers on Android because killing a process
returns the user to the desktop screen and can generate a crash dump
instead of keeping the app output frozen, and they agree that this is the
best user experience given the circumstances.

Also if the ML ignores html, that's fine.

Marek


>
> >     >>     [0] Possibly accompanied by a one-time message to stderr
> along the lines of "GPU reset detected but robustness not enabled in
> context, ignoring OpenGL API calls".
>
>
> --
> Earthling Michel Dänzer            |                  https://redhat.com
> Libre software enthusiast          |         Mesa and Xwayland developer
>
>

[-- Attachment #2: Type: text/html, Size: 8361 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-07-05  6:30               ` Marek Olšák
@ 2023-07-05  7:32                 ` Michel Dänzer
  2023-07-05 15:53                   ` Marek Olšák
  0 siblings, 1 reply; 41+ messages in thread
From: Michel Dänzer @ 2023-07-05  7:32 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Pierre-Eric Pelloux-Prayer, Sebastian Wick, André Almeida,
	Timur Kristóf, Randy Dunlap, Linux Kernel Mailing List,
	dri-devel, Pekka Paalanen, Samuel Pitoiset, amd-gfx mailing list,
	kernel-dev, Deucher, Alexander, Pekka Paalanen,
	Christian König

On 7/5/23 08:30, Marek Olšák wrote:
> On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>     On 7/4/23 04:34, Marek Olšák wrote:
>     > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org > wrote:
>     >     On 6/30/23 22:32, Marek Olšák wrote:
>     >     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>     >     >> On 6/30/23 16:59, Alex Deucher wrote:
>     >     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>     >     >>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com> wrote:
>     >     >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
>     >     >>>>>
>     >     >>>>> +Robustness
>     >     >>>>> +----------
>     >     >>>>> +
>     >     >>>>> +The only way to try to keep an application working after a reset is if it
>     >     >>>>> +complies with the robustness aspects of the graphical API that it is using.
>     >     >>>>> +
>     >     >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>     >     >>>>> +there is no guarantee that the app will use such features correctly, and the
>     >     >>>>> +UMD can implement policies to close the app if it is a repeating offender,
>     >     >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>     >     >>>>> +the user interface from being correctly displayed. This should be done even if
>     >     >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>     >     >>>>
>     >     >>>> I still don't think it's good to let the kernel arbitrarily kill
>     >     >>>> processes that it thinks are not well-behaved based on some heuristics
>     >     >>>> and policy.
>     >     >>>>
>     >     >>>> Can't this be outsourced to user space? Expose the information about
>     >     >>>> processes causing a device and let e.g. systemd deal with coming up
>     >     >>>> with a policy and with killing stuff.
>     >     >>>
>     >     >>> I don't think it's the kernel doing the killing, it would be the UMD.
>     >     >>> E.g., if the app is guilty and doesn't support robustness the UMD can
>     >     >>> just call exit().
>     >     >>
>     >     >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
>     >     >
>     >     > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.
>     >
>     >     No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.
>     >
>     > which is currently the majority.
> 
>     Not sure where you're going with this. Applications need to use robustness to be able to recover from a GPU hang, and the GPU needs to be reset for that. So disabling GPU reset is not the same as what we're discussing here.
> 
> 
>     >     > - non-robust contexts: call exit(1) immediately, which is the best way to recover
>     >
>     >     That's not the UMD's call to make.
>     >
>     > That's absolutely the UMD's call to make because that's mandated by the hw and API design
> 
>     Can you point us to a spec which mandates that the process must be killed in this case?
> 
> 
>     > and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process.
> 
>     The UMD killing the process is not the only way out of that, and doing so is overreach on its part. The UMD is but one out of many components in a process, not the main one or a special one. It doesn't get to decide when the process must die, certainly not under circumstances where it must be able to continue while ignoring API calls (that's required for robustness).
> 
> 
> You're mixing things up. Robust apps don't any special action from a UMD. Only non-robust apps need to be killed for proper recovery with the only other alternative being not updating the window/screen,

I'm saying they don't "need" to be killed, since the UMD must be able to keep going while ignoring API calls (or it couldn't support robustness). It's a choice, one which is not for the UMD to make.


> Also it's already used and required by our customers on Android because killing a process returns the user to the desktop screen and can generate a crash dump instead of keeping the app output frozen, and they agree that this is the best user experience given the circumstances.

Then some appropriate Android component needs to make that call. The UMD is not it.


>     >     >>     [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-07-05  7:32                 ` Michel Dänzer
@ 2023-07-05 15:53                   ` Marek Olšák
  0 siblings, 0 replies; 41+ messages in thread
From: Marek Olšák @ 2023-07-05 15:53 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Pierre-Eric Pelloux-Prayer, Sebastian Wick, André Almeida,
	Timur Kristóf, Randy Dunlap, Linux Kernel Mailing List,
	dri-devel, Pekka Paalanen, Samuel Pitoiset, amd-gfx mailing list,
	kernel-dev, Deucher, Alexander, Pekka Paalanen,
	Christian König

On Wed, Jul 5, 2023 at 3:32 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>
> On 7/5/23 08:30, Marek Olšák wrote:
> > On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> >     On 7/4/23 04:34, Marek Olšák wrote:
> >     > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org > wrote:
> >     >     On 6/30/23 22:32, Marek Olšák wrote:
> >     >     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> >     >     >> On 6/30/23 16:59, Alex Deucher wrote:
> >     >     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >     >     >>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com> wrote:
> >     >     >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
> >     >     >>>>>
> >     >     >>>>> +Robustness
> >     >     >>>>> +----------
> >     >     >>>>> +
> >     >     >>>>> +The only way to try to keep an application working after a reset is if it
> >     >     >>>>> +complies with the robustness aspects of the graphical API that it is using.
> >     >     >>>>> +
> >     >     >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> >     >     >>>>> +there is no guarantee that the app will use such features correctly, and the
> >     >     >>>>> +UMD can implement policies to close the app if it is a repeating offender,
> >     >     >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >     >     >>>>> +the user interface from being correctly displayed. This should be done even if
> >     >     >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
> >     >     >>>>
> >     >     >>>> I still don't think it's good to let the kernel arbitrarily kill
> >     >     >>>> processes that it thinks are not well-behaved based on some heuristics
> >     >     >>>> and policy.
> >     >     >>>>
> >     >     >>>> Can't this be outsourced to user space? Expose the information about
> >     >     >>>> processes causing a device and let e.g. systemd deal with coming up
> >     >     >>>> with a policy and with killing stuff.
> >     >     >>>
> >     >     >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >     >     >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >     >     >>> just call exit().
> >     >     >>
> >     >     >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
> >     >     >
> >     >     > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.
> >     >
> >     >     No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.
> >     >
> >     > which is currently the majority.
> >
> >     Not sure where you're going with this. Applications need to use robustness to be able to recover from a GPU hang, and the GPU needs to be reset for that. So disabling GPU reset is not the same as what we're discussing here.
> >
> >
> >     >     > - non-robust contexts: call exit(1) immediately, which is the best way to recover
> >     >
> >     >     That's not the UMD's call to make.
> >     >
> >     > That's absolutely the UMD's call to make because that's mandated by the hw and API design
> >
> >     Can you point us to a spec which mandates that the process must be killed in this case?
> >
> >
> >     > and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process.
> >
> >     The UMD killing the process is not the only way out of that, and doing so is overreach on its part. The UMD is but one out of many components in a process, not the main one or a special one. It doesn't get to decide when the process must die, certainly not under circumstances where it must be able to continue while ignoring API calls (that's required for robustness).
> >
> >
> > You're mixing things up. Robust apps don't any special action from a UMD. Only non-robust apps need to be killed for proper recovery with the only other alternative being not updating the window/screen,
>
> I'm saying they don't "need" to be killed, since the UMD must be able to keep going while ignoring API calls (or it couldn't support robustness). It's a choice, one which is not for the UMD to make.
>
>
> > Also it's already used and required by our customers on Android because killing a process returns the user to the desktop screen and can generate a crash dump instead of keeping the app output frozen, and they agree that this is the best user experience given the circumstances.
>
> Then some appropriate Android component needs to make that call. The UMD is not it.

We can change it once Android and Linux have a better way to handle
non-robust apps. Until then, generating a core dump after a GPU crash
produces the best outcome for users and developers.

Marek

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)
  2023-06-27 13:23 [PATCH v5 1/1] drm/doc: Document DRM device reset expectations André Almeida
                   ` (3 preceding siblings ...)
  2023-06-30 14:48 ` Sebastian Wick
@ 2023-07-25  2:55 ` André Almeida
  2023-07-25  7:02   ` Simon Ser
  2023-07-25  8:03   ` Michel Dänzer
  2023-08-04 13:03 ` [PATCH v5 1/1] drm/doc: Document DRM device reset expectations Daniel Vetter
  5 siblings, 2 replies; 41+ messages in thread
From: André Almeida @ 2023-07-25  2:55 UTC (permalink / raw)
  To: dri-devel
  Cc: pierre-eric.pelloux-prayer, Samuel Pitoiset, Randy Dunlap,
	Pekka Paalanen, 'Marek Olšák',
	Michel Dänzer, Timur Kristóf, Pekka Paalanen, amd-gfx,
	linux-kernel, kernel-dev, alexander.deucher, christian.koenig

Hi everyone,

It's not clear what we should do about non-robust OpenGL apps after GPU 
resets, so I'll try to summarize the topic, show some options and my 
proposal to move forward on that.

Em 27/06/2023 10:23, André Almeida escreveu:
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
Depending on the OpenGL version, there are different robustness API 
available:

- OpenGL ABR extension [0]
- OpenGL KHR extension [1]
- OpenGL ES extension  [2]

Apps written in OpenGL should use whatever version is available for them 
to make the app robust for GPU resets. That usually means calling 
GetGraphicsResetStatusARB(), checking the status, and if it encounter 
something different from NO_ERROR, that means that a reset has happened, 
the context is considered lost and should be recreated. If an app follow 
this, it will likely succeed recovering a reset.

What should non-robustness apps do then? They certainly will not be 
notified if a reset happens, and thus can't recover if their context is 
lost. OpenGL specification does not explicitly define what should be 
done in such situations[3], and I believe that usually when the spec 
mandates to close the app, it would explicitly note it.

However, in reality there are different types of device resets, causing 
different results. A reset can be precise enough to damage only the 
guilty context, and keep others alive.

Given that, I believe drivers have the following options:

a) Kill all non-robust apps after a reset. This may lead to lose work 
from innocent applications.

b) Ignore all non-robust apps OpenGL calls. That means that applications 
would still be alive, but the user interface would be freeze. The user 
would need to close it manually anyway, but in some corner cases, the 
app could autosave some work or the user might be able to interact with 
it using some alternative method (command line?).

c) Kill just the affected non-robust applications. To do that, the 
driver need to be 100% sure on the impact of its resets.

RadeonSI currently implements a), as can be seen at [4], while Iris 
implements what I think it's c)[5].

For the user experience point-of-view, c) is clearly the best option, 
but it's the hardest to archive. There's not much gain on having b) over 
a), perhaps it could be an optional env var for such corner case 
applications.

My proposal for the documentation is: implement a) if nothing else is 
available, have a MESA_NO_RESET_KILL for people that want b), ideally 
implement c) if the driver is able to know for sure that the non-guilty 
apps can still work after a reset.

Thanks,
     André

[0] https://registry.khronos.org/OpenGL/extensions/ARB/ARB_robustness.txt
[1] https://registry.khronos.org/OpenGL/extensions/KHR/KHR_robustness.txt
[2] https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt
[3] https://registry.khronos.org/OpenGL/specs/gl/glspec46.core.pdf
[4] 
https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c#L1657
[5] 
https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/drivers/iris/iris_batch.c#L842

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)
  2023-07-25  2:55 ` Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations) André Almeida
@ 2023-07-25  7:02   ` Simon Ser
  2023-07-25  8:03   ` Michel Dänzer
  1 sibling, 0 replies; 41+ messages in thread
From: Simon Ser @ 2023-07-25  7:02 UTC (permalink / raw)
  To: André Almeida
  Cc: pierre-eric.pelloux-prayer, Samuel Pitoiset, Randy Dunlap,
	Pekka Paalanen, 'Marek Olšák',
	Michel Dänzer, Timur Kristóf, Pekka Paalanen, amd-gfx,
	linux-kernel, dri-devel, kernel-dev, alexander.deucher,
	christian.koenig

On Tuesday, July 25th, 2023 at 04:55, André Almeida <andrealmeid@igalia.com> wrote:

> It's not clear what we should do about non-robust OpenGL apps after GPU
> resets, so I'll try to summarize the topic, show some options and my
> proposal to move forward on that.
> 
> Em 27/06/2023 10:23, André Almeida escreveu:
> 
> > +Robustness
> > +----------
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. However,
> > +there is no guarantee that the app will use such features correctly, and the
> > +UMD can implement policies to close the app if it is a repeating offender,
> > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > +the user interface from being correctly displayed. This should be done even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
> > +
> 
> Depending on the OpenGL version, there are different robustness API
> available:
> 
> - OpenGL ABR extension [0]
> - OpenGL KHR extension [1]
> - OpenGL ES extension [2]
> 
> Apps written in OpenGL should use whatever version is available for them
> to make the app robust for GPU resets. That usually means calling
> GetGraphicsResetStatusARB(), checking the status, and if it encounter
> something different from NO_ERROR, that means that a reset has happened,
> the context is considered lost and should be recreated. If an app follow
> this, it will likely succeed recovering a reset.
> 
> What should non-robustness apps do then? They certainly will not be
> notified if a reset happens, and thus can't recover if their context is
> lost. OpenGL specification does not explicitly define what should be
> done in such situations[3], and I believe that usually when the spec
> mandates to close the app, it would explicitly note it.
> 
> However, in reality there are different types of device resets, causing
> different results. A reset can be precise enough to damage only the
> guilty context, and keep others alive.
> 
> Given that, I believe drivers have the following options:
> 
> a) Kill all non-robust apps after a reset. This may lead to lose work
> from innocent applications.
> 
> b) Ignore all non-robust apps OpenGL calls. That means that applications
> would still be alive, but the user interface would be freeze. The user
> would need to close it manually anyway, but in some corner cases, the
> app could autosave some work or the user might be able to interact with
> it using some alternative method (command line?).
> 
> c) Kill just the affected non-robust applications. To do that, the
> driver need to be 100% sure on the impact of its resets.

We've discussed this a while back on #dri-devel IIRC. I think the best
experience would be for the Wayland compositor to gray out apps which
lost their GL context, and display an information dialog to explain
what happened to the user and a button to kill the app. I'm not exactly
sure how that would translate to a kernel or Mesa uAPI, and if there's
appetite to do a lot of work to get "the best GPU reset UX" (IOW: maybe
it's not worth all of the trouble).

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)
  2023-07-25  2:55 ` Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations) André Almeida
  2023-07-25  7:02   ` Simon Ser
@ 2023-07-25  8:03   ` Michel Dänzer
  2023-07-25 13:02     ` André Almeida
  2023-07-25 15:05     ` Marek Olšák
  1 sibling, 2 replies; 41+ messages in thread
From: Michel Dänzer @ 2023-07-25  8:03 UTC (permalink / raw)
  To: André Almeida, dri-devel
  Cc: pierre-eric.pelloux-prayer, Pekka Paalanen,
	'Marek Olšák',
	Timur Kristóf, Randy Dunlap, linux-kernel, Samuel Pitoiset,
	Pekka Paalanen, amd-gfx, kernel-dev, alexander.deucher,
	christian.koenig

On 7/25/23 04:55, André Almeida wrote:
> Hi everyone,
> 
> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
> 
> Em 27/06/2023 10:23, André Almeida escreveu:
>> +Robustness
>> +----------
>> +
>> +The only way to try to keep an application working after a reset is if it
>> +complies with the robustness aspects of the graphical API that it is using.
>> +
>> +Graphical APIs provide ways to applications to deal with device resets. However,
>> +there is no guarantee that the app will use such features correctly, and the
>> +UMD can implement policies to close the app if it is a repeating offender,
>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>> +the user interface from being correctly displayed. This should be done even if
>> +the app is correct but happens to trigger some bug in the hardware/driver.
>> +
> Depending on the OpenGL version, there are different robustness API available:
> 
> - OpenGL ABR extension [0]
> - OpenGL KHR extension [1]
> - OpenGL ES extension  [2]
> 
> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset.
> 
> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it.
> 
> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
> 
> Given that, I believe drivers have the following options:
> 
> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications.
> 
> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
> 
> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets.
> 
> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5].
> 
> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications.

I disagree on these conclusions.

c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.


> [0] https://registry.khronos.org/OpenGL/extensions/ARB/ARB_robustness.txt
> [1] https://registry.khronos.org/OpenGL/extensions/KHR/KHR_robustness.txt
> [2] https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt
> [3] https://registry.khronos.org/OpenGL/specs/gl/glspec46.core.pdf
> [4] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c#L1657
> [5] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/drivers/iris/iris_batch.c#L842

-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)
  2023-07-25  8:03   ` Michel Dänzer
@ 2023-07-25 13:02     ` André Almeida
  2023-07-26  8:07       ` Michel Dänzer
  2023-07-25 15:05     ` Marek Olšák
  1 sibling, 1 reply; 41+ messages in thread
From: André Almeida @ 2023-07-25 13:02 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: pierre-eric.pelloux-prayer, amd-gfx, Pekka Paalanen,
	'Marek Olšák',
	Timur Kristóf, Randy Dunlap, linux-kernel, dri-devel,
	Pekka Paalanen, Samuel Pitoiset, kernel-dev, alexander.deucher,
	christian.koenig

Hi Michel,

Em 25/07/2023 05:03, Michel Dänzer escreveu:
> On 7/25/23 04:55, André Almeida wrote:
>> Hi everyone,
>>
>> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
>>
>> Em 27/06/2023 10:23, André Almeida escreveu:
>>> +Robustness
>>> +----------
>>> +
>>> +The only way to try to keep an application working after a reset is if it
>>> +complies with the robustness aspects of the graphical API that it is using.
>>> +
>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>> +there is no guarantee that the app will use such features correctly, and the
>>> +UMD can implement policies to close the app if it is a repeating offender,
>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>> +the user interface from being correctly displayed. This should be done even if
>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>> +
>> Depending on the OpenGL version, there are different robustness API available:
>>
>> - OpenGL ABR extension [0]
>> - OpenGL KHR extension [1]
>> - OpenGL ES extension  [2]
>>
>> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset.
>>
>> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it.
>>
>> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
>>
>> Given that, I believe drivers have the following options:
>>
>> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications.
>>
>> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
>>
>> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets.
>>
>> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5].
>>
>> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications.
> 
> I disagree on these conclusions.
> 
> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.
> 

Thank you for the feedback. How do you think the documentation should 
look like for this part?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)
  2023-07-25  8:03   ` Michel Dänzer
  2023-07-25 13:02     ` André Almeida
@ 2023-07-25 15:05     ` Marek Olšák
  2023-07-25 17:00       ` Michel Dänzer
  1 sibling, 1 reply; 41+ messages in thread
From: Marek Olšák @ 2023-07-25 15:05 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: pierre-eric.pelloux-prayer, amd-gfx, Pekka Paalanen,
	Timur Kristóf, Randy Dunlap, linux-kernel, Samuel Pitoiset,
	Pekka Paalanen, dri-devel, kernel-dev, alexander.deucher,
	André Almeida, christian.koenig

On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
<michel.daenzer@mailbox.org> wrote:
>
> On 7/25/23 04:55, André Almeida wrote:
> > Hi everyone,
> >
> > It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
> >
> > Em 27/06/2023 10:23, André Almeida escreveu:
> >> +Robustness
> >> +----------
> >> +
> >> +The only way to try to keep an application working after a reset is if it
> >> +complies with the robustness aspects of the graphical API that it is using.
> >> +
> >> +Graphical APIs provide ways to applications to deal with device resets. However,
> >> +there is no guarantee that the app will use such features correctly, and the
> >> +UMD can implement policies to close the app if it is a repeating offender,
> >> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >> +the user interface from being correctly displayed. This should be done even if
> >> +the app is correct but happens to trigger some bug in the hardware/driver.
> >> +
> > Depending on the OpenGL version, there are different robustness API available:
> >
> > - OpenGL ABR extension [0]
> > - OpenGL KHR extension [1]
> > - OpenGL ES extension  [2]
> >
> > Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset.
> >
> > What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it.
> >
> > However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
> >
> > Given that, I believe drivers have the following options:
> >
> > a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications.
> >
> > b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
> >
> > c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets.
> >
> > RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5].
> >
> > For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications.
>
> I disagree on these conclusions.
>
> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.

That's not true. I recommend that you enable b) with your driver and
then hang the GPU under different scenarios and see the result. Then
enable a) and do the same and compare.

Options a) and c) can be merged into one because they are not separate
options to choose from.

If Wayland wanted to grey out lost apps, they would appear as robust
contexts in gallium, but the reset status would be piped through the
Wayland protocol instead of the GL API.

Marek



Marek

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)
  2023-07-25 15:05     ` Marek Olšák
@ 2023-07-25 17:00       ` Michel Dänzer
  2023-07-26  7:55         ` Timur Kristóf
  0 siblings, 1 reply; 41+ messages in thread
From: Michel Dänzer @ 2023-07-25 17:00 UTC (permalink / raw)
  To: Marek Olšák
  Cc: pierre-eric.pelloux-prayer, Pekka Paalanen, Timur Kristóf,
	dri-devel, Randy Dunlap, linux-kernel, amd-gfx, Pekka Paalanen,
	Samuel Pitoiset, kernel-dev, alexander.deucher,
	André Almeida, christian.koenig

On 7/25/23 17:05, Marek Olšák wrote:
> On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
> <michel.daenzer@mailbox.org> wrote:
>> On 7/25/23 04:55, André Almeida wrote:
>>> Hi everyone,
>>>
>>> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
>>>
>>> Em 27/06/2023 10:23, André Almeida escreveu:
>>>> +Robustness
>>>> +----------
>>>> +
>>>> +The only way to try to keep an application working after a reset is if it
>>>> +complies with the robustness aspects of the graphical API that it is using.
>>>> +
>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>>> +there is no guarantee that the app will use such features correctly, and the
>>>> +UMD can implement policies to close the app if it is a repeating offender,
>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>>> +the user interface from being correctly displayed. This should be done even if
>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>>> +
>>> Depending on the OpenGL version, there are different robustness API available:
>>>
>>> - OpenGL ABR extension [0]
>>> - OpenGL KHR extension [1]
>>> - OpenGL ES extension  [2]
>>>
>>> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset.
>>>
>>> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it.
>>>
>>> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
>>>
>>> Given that, I believe drivers have the following options:
>>>
>>> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications.
>>>
>>> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
>>>
>>> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets.
>>>
>>> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5].
>>>
>>> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications.
>>
>> I disagree on these conclusions.
>>
>> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.
> 
> That's not true.

Which part of what I wrote are you referring to?


> I recommend that you enable b) with your driver and then hang the GPU under different scenarios and see the result.

I've been doing GPU driver development for over two decades, I'm perfectly aware what it means. It doesn't change what I wrote above.


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)
  2023-07-25 17:00       ` Michel Dänzer
@ 2023-07-26  7:55         ` Timur Kristóf
  0 siblings, 0 replies; 41+ messages in thread
From: Timur Kristóf @ 2023-07-26  7:55 UTC (permalink / raw)
  To: Michel Dänzer, Marek Olšák
  Cc: pierre-eric.pelloux-prayer, Pekka Paalanen, dri-devel,
	Randy Dunlap, linux-kernel, amd-gfx, Pekka Paalanen,
	Samuel Pitoiset, kernel-dev, alexander.deucher,
	André Almeida, christian.koenig

On Tue, 2023-07-25 at 19:00 +0200, Michel Dänzer wrote:
> On 7/25/23 17:05, Marek Olšák wrote:
> > On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
> > <michel.daenzer@mailbox.org> wrote:
> > > On 7/25/23 04:55, André Almeida wrote:
> > > > Hi everyone,
> > > > 
> > > > It's not clear what we should do about non-robust OpenGL apps
> > > > after GPU resets, so I'll try to summarize the topic, show some
> > > > options and my proposal to move forward on that.
> > > > 
> > > > Em 27/06/2023 10:23, André Almeida escreveu:
> > > > > +Robustness
> > > > > +----------
> > > > > +
> > > > > +The only way to try to keep an application working after a
> > > > > reset is if it
> > > > > +complies with the robustness aspects of the graphical API
> > > > > that it is using.
> > > > > +
> > > > > +Graphical APIs provide ways to applications to deal with
> > > > > device resets. However,
> > > > > +there is no guarantee that the app will use such features
> > > > > correctly, and the
> > > > > +UMD can implement policies to close the app if it is a
> > > > > repeating offender,
> > > > > +likely in a broken loop. This is done to ensure that it does
> > > > > not keep blocking
> > > > > +the user interface from being correctly displayed. This
> > > > > should be done even if
> > > > > +the app is correct but happens to trigger some bug in the
> > > > > hardware/driver.
> > > > > +
> > > > Depending on the OpenGL version, there are different robustness
> > > > API available:
> > > > 
> > > > - OpenGL ABR extension [0]
> > > > - OpenGL KHR extension [1]
> > > > - OpenGL ES extension  [2]
> > > > 
> > > > Apps written in OpenGL should use whatever version is available
> > > > for them to make the app robust for GPU resets. That usually
> > > > means calling GetGraphicsResetStatusARB(), checking the status,
> > > > and if it encounter something different from NO_ERROR, that
> > > > means that a reset has happened, the context is considered lost
> > > > and should be recreated. If an app follow this, it will likely
> > > > succeed recovering a reset.
> > > > 
> > > > What should non-robustness apps do then? They certainly will
> > > > not be notified if a reset happens, and thus can't recover if
> > > > their context is lost. OpenGL specification does not explicitly
> > > > define what should be done in such situations[3], and I believe
> > > > that usually when the spec mandates to close the app, it would
> > > > explicitly note it.
> > > > 
> > > > However, in reality there are different types of device resets,
> > > > causing different results. A reset can be precise enough to
> > > > damage only the guilty context, and keep others alive.
> > > > 
> > > > Given that, I believe drivers have the following options:
> > > > 
> > > > a) Kill all non-robust apps after a reset. This may lead to
> > > > lose work from innocent applications.
> > > > 
> > > > b) Ignore all non-robust apps OpenGL calls. That means that
> > > > applications would still be alive, but the user interface would
> > > > be freeze. The user would need to close it manually anyway, but
> > > > in some corner cases, the app could autosave some work or the
> > > > user might be able to interact with it using some alternative
> > > > method (command line?).
> > > > 
> > > > c) Kill just the affected non-robust applications. To do that,
> > > > the driver need to be 100% sure on the impact of its resets.
> > > > 
> > > > RadeonSI currently implements a), as can be seen at [4], while
> > > > Iris implements what I think it's c)[5].
> > > > 
> > > > For the user experience point-of-view, c) is clearly the best
> > > > option, but it's the hardest to archive. There's not much gain
> > > > on having b) over a), perhaps it could be an optional env var
> > > > for such corner case applications.
> > > 
> > > I disagree on these conclusions.
> > > 
> > > c) is certainly better than a), but it's not "clearly the best"
> > > in all cases. The OpenGL UMD is not a privileged/special
> > > component and is in no position to decide whether or not the
> > > process as a whole (only some thread(s) of which may use OpenGL
> > > at all) gets to continue running or not.
> > 
> > That's not true.
> 
> Which part of what I wrote are you referring to?
> 
> 
> > I recommend that you enable b) with your driver and then hang the
> > GPU under different scenarios and see the result.
> 
> I've been doing GPU driver development for over two decades, I'm
> perfectly aware what it means. It doesn't change what I wrote above.
> 

Michel, I understand that you disagree with the proposed solutions in
this email thread but from your mails it is unclear to me what exactly
is the solution that you would actually recommend, can you please
clarify?

Thanks,
Timur


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)
  2023-07-25 13:02     ` André Almeida
@ 2023-07-26  8:07       ` Michel Dänzer
  2023-08-02  7:38         ` Marek Olšák
  0 siblings, 1 reply; 41+ messages in thread
From: Michel Dänzer @ 2023-07-26  8:07 UTC (permalink / raw)
  To: André Almeida
  Cc: pierre-eric.pelloux-prayer, Samuel Pitoiset, Pekka Paalanen,
	'Marek Olšák',
	Timur Kristóf, Randy Dunlap, linux-kernel, amd-gfx,
	Pekka Paalanen, dri-devel, kernel-dev, alexander.deucher,
	christian.koenig

On 7/25/23 15:02, André Almeida wrote:
> Em 25/07/2023 05:03, Michel Dänzer escreveu:
>> On 7/25/23 04:55, André Almeida wrote:
>>> Hi everyone,
>>>
>>> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
>>>
>>> Em 27/06/2023 10:23, André Almeida escreveu:
>>>> +Robustness
>>>> +----------
>>>> +
>>>> +The only way to try to keep an application working after a reset is if it
>>>> +complies with the robustness aspects of the graphical API that it is using.
>>>> +
>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>>> +there is no guarantee that the app will use such features correctly, and the
>>>> +UMD can implement policies to close the app if it is a repeating offender,
>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>>> +the user interface from being correctly displayed. This should be done even if
>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>>> +
>>> Depending on the OpenGL version, there are different robustness API available:
>>>
>>> - OpenGL ABR extension [0]
>>> - OpenGL KHR extension [1]
>>> - OpenGL ES extension  [2]
>>>
>>> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset.
>>>
>>> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it.
>>>
>>> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
>>>
>>> Given that, I believe drivers have the following options:
>>>
>>> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications.
>>>
>>> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
>>>
>>> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets.
>>>
>>> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5].
>>>
>>> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications.
>>
>> I disagree on these conclusions.
>>
>> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.
>>
> 
> Thank you for the feedback. How do you think the documentation should look like for this part?

The initial paragraph about robustness should say "keep OpenGL working" instead of "keep an application working". If an OpenGL context stops working, it doesn't necessarily mean the application stops working altogether.


If the application doesn't use the robustness extensions, your option b) is what should happen by default whenever possible. And it really has to be possible if the robustness extensions are supported.


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)
  2023-07-26  8:07       ` Michel Dänzer
@ 2023-08-02  7:38         ` Marek Olšák
  2023-08-02  8:34           ` Michel Dänzer
  0 siblings, 1 reply; 41+ messages in thread
From: Marek Olšák @ 2023-08-02  7:38 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: pierre-eric.pelloux-prayer, Samuel Pitoiset, André Almeida,
	Timur Kristóf, Randy Dunlap, linux-kernel, amd-gfx,
	Pekka Paalanen, dri-devel, kernel-dev, alexander.deucher,
	Pekka Paalanen, christian.koenig

A screen that doesn't update isn't usable. Killing the window system
and returning to the login screen is one option. Killing the window
system manually from a terminal or over ssh and then returning to the
login screen is another option, but 99% of users either hard-reset the
machine or do sysrq+REISUB anyway because it's faster that way. Those
are all your options. If we don't do the kill, users might decide to
do a hard reset with an unsync'd file system, which can cause more
damage.

The precedent from the CPU land is pretty strong here. There is
SIGSEGV for invalid CPU memory access and SIGILL for invalid CPU
instructions, yet we do nothing for invalid GPU memory access and
invalid GPU instructions. Sending a terminating signal from the kernel
would be the most natural thing to do. Instead, we just keep a frozen
GUI to keep users helpless, or we continue command submission and then
the hanging app can cause an infinite cycle of GPU hangs and resets,
making the GPU unusable until somebody kills the app over ssh.

That's why GL/Vulkan robustness is required - either robust apps, or a
robust compositor that greys out lost windows and pops up a diagnostic
message with a list of actions to choose from. That's the direction we
should be taking. Non-robust apps under a non-robust compositor should
just be killed if they crash the GPU.


Marek

On Wed, Jul 26, 2023 at 4:07 AM Michel Dänzer
<michel.daenzer@mailbox.org> wrote:
>
> On 7/25/23 15:02, André Almeida wrote:
> > Em 25/07/2023 05:03, Michel Dänzer escreveu:
> >> On 7/25/23 04:55, André Almeida wrote:
> >>> Hi everyone,
> >>>
> >>> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
> >>>
> >>> Em 27/06/2023 10:23, André Almeida escreveu:
> >>>> +Robustness
> >>>> +----------
> >>>> +
> >>>> +The only way to try to keep an application working after a reset is if it
> >>>> +complies with the robustness aspects of the graphical API that it is using.
> >>>> +
> >>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> >>>> +there is no guarantee that the app will use such features correctly, and the
> >>>> +UMD can implement policies to close the app if it is a repeating offender,
> >>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >>>> +the user interface from being correctly displayed. This should be done even if
> >>>> +the app is correct but happens to trigger some bug in the hardware/driver.
> >>>> +
> >>> Depending on the OpenGL version, there are different robustness API available:
> >>>
> >>> - OpenGL ABR extension [0]
> >>> - OpenGL KHR extension [1]
> >>> - OpenGL ES extension  [2]
> >>>
> >>> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset.
> >>>
> >>> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it.
> >>>
> >>> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
> >>>
> >>> Given that, I believe drivers have the following options:
> >>>
> >>> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications.
> >>>
> >>> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
> >>>
> >>> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets.
> >>>
> >>> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5].
> >>>
> >>> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications.
> >>
> >> I disagree on these conclusions.
> >>
> >> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.
> >>
> >
> > Thank you for the feedback. How do you think the documentation should look like for this part?
>
> The initial paragraph about robustness should say "keep OpenGL working" instead of "keep an application working". If an OpenGL context stops working, it doesn't necessarily mean the application stops working altogether.
>
>
> If the application doesn't use the robustness extensions, your option b) is what should happen by default whenever possible. And it really has to be possible if the robustness extensions are supported.
>
>
> --
> Earthling Michel Dänzer            |                  https://redhat.com
> Libre software enthusiast          |         Mesa and Xwayland developer
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations)
  2023-08-02  7:38         ` Marek Olšák
@ 2023-08-02  8:34           ` Michel Dänzer
  0 siblings, 0 replies; 41+ messages in thread
From: Michel Dänzer @ 2023-08-02  8:34 UTC (permalink / raw)
  To: Marek Olšák
  Cc: pierre-eric.pelloux-prayer, André Almeida,
	Timur Kristóf, dri-devel, Randy Dunlap, linux-kernel,
	Samuel Pitoiset, Pekka Paalanen, amd-gfx, kernel-dev,
	alexander.deucher, Pekka Paalanen, christian.koenig

On 8/2/23 09:38, Marek Olšák wrote:
> 
> The precedent from the CPU land is pretty strong here. There is
> SIGSEGV for invalid CPU memory access and SIGILL for invalid CPU
> instructions, yet we do nothing for invalid GPU memory access and
> invalid GPU instructions. Sending a terminating signal from the kernel
> would be the most natural thing to do.

After an unhandled SIGSEGV or SIGILL, the process is in an inconsistent state and cannot safely continue executing. That's why the process is terminated by default in those cases.

The same is not true when an OpenGL context stops working. Any threads / other parts of the process not using that OpenGL context continue working normally. And any attempts to use that OpenGL context can be safely ignored (or the OpenGL implementation couldn't support the robustness extensions).


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-06-27 13:23 [PATCH v5 1/1] drm/doc: Document DRM device reset expectations André Almeida
                   ` (4 preceding siblings ...)
  2023-07-25  2:55 ` Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations) André Almeida
@ 2023-08-04 13:03 ` Daniel Vetter
  2023-08-08 12:13   ` Sebastian Wick
  5 siblings, 1 reply; 41+ messages in thread
From: Daniel Vetter @ 2023-08-04 13:03 UTC (permalink / raw)
  To: André Almeida
  Cc: pierre-eric.pelloux-prayer, Samuel Pitoiset, Randy Dunlap,
	Pekka Paalanen, 'Marek Olšák',
	Michel Dänzer, Timur Kristóf, linux-kernel, amd-gfx,
	dri-devel, kernel-dev, alexander.deucher, Pekka Paalanen,
	christian.koenig

On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
> 
> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> 
> v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> 
> Changes:
>  - Grammar fixes (Randy)
> 
>  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>  
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context. This is needed to propagate to the rest of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,

Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
offender killing is more a policy where the kernel enforces policy, and no
longer up to userspace to dtrt (because very clearly userspace is not
really doing the right thing anymore when it's just hanging the gpu in an
endless loop). Also maybe tune it down further to something like "the
kernel driver may implemnent ..."

In my opinion the umd shouldn't implement these kind of magic guesses, the
entire point of robustness apis is to delegate responsibility for
correctly recovering to the application. And the kernel is left with
enforcing fair resource usage policies (which eventually might be a
cgroups limit on how much gpu time you're allowed to waste with gpu
resets).

> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible to
> +determine that robustness is not in use, the UMD will terminate the app when a
> +reset is detected, giving that the contexts are lost and the app won't be able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.

Since we do not seem to have a solid consensus in the community about
non-robust userspace, maybe we could just document that lack of consensus
to unblock this patch? Something like this:

Non-Robust Userspace
--------------------

Userspace that doesn't support robust interfaces (like an non-robust
OpenGL context or API without any robustness support like libva) leave the
robustness handling entirely to the userspace driver. There is no strong
community consensus on what the userspace driver should do in that case,
since all reasonable approaches have some clear downsides.

With the s/UMD/KMD/ further up and maybe something added to record the
non-robustness non-consensus:

Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Cheers, Daniel



> +
>  .. _drm_driver_ioctl:
>  
>  IOCTL Support on Device Nodes
> -- 
> 2.41.0
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-08-04 13:03 ` [PATCH v5 1/1] drm/doc: Document DRM device reset expectations Daniel Vetter
@ 2023-08-08 12:13   ` Sebastian Wick
  2023-08-08 17:03     ` Marek Olšák
  0 siblings, 1 reply; 41+ messages in thread
From: Sebastian Wick @ 2023-08-08 12:13 UTC (permalink / raw)
  To: André Almeida, dri-devel, amd-gfx, linux-kernel, kernel-dev,
	alexander.deucher, christian.koenig, pierre-eric.pelloux-prayer,
	Simon Ser, Rob Clark, Pekka Paalanen, Daniel Stone,
	Marek Olšák, Dave Airlie, Michel Dänzer,
	Samuel Pitoiset, Timur Kristóf, Bas Nieuwenhuizen,
	Randy Dunlap, Pekka Paalanen

On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> > Create a section that specifies how to deal with DRM device resets for
> > kernel and userspace drivers.
> >
> > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > ---
> >
> > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> >
> > Changes:
> >  - Grammar fixes (Randy)
> >
> >  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 68 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 65fb3036a580..3cbffa25ed93 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> >  mmapped regular files. Threads cause additional pain with signal
> >  handling as well.
> >
> > +Device reset
> > +============
> > +
> > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > +faulty applications and everything in between the many layers. Some errors
> > +require resetting the device in order to make the device usable again. This
> > +sections describes the expectations for DRM and usermode drivers when a
> > +device resets and how to propagate the reset status.
> > +
> > +Kernel Mode Driver
> > +------------------
> > +
> > +The KMD is responsible for checking if the device needs a reset, and to perform
> > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > +should keep track of resets, because userspace can query any time about the
> > +reset stats for an specific context. This is needed to propagate to the rest of
> > +the stack that a reset has happened. Currently, this is implemented by each
> > +driver separately, with no common DRM interface.
> > +
> > +User Mode Driver
> > +----------------
> > +
> > +The UMD should check before submitting new commands to the KMD if the device has
> > +been reset, and this can be checked more often if the UMD requires it. After
> > +detecting a reset, UMD will then proceed to report it to the application using
> > +the appropriate API error code, as explained in the section below about
> > +robustness.
> > +
> > +Robustness
> > +----------
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. However,
> > +there is no guarantee that the app will use such features correctly, and the
> > +UMD can implement policies to close the app if it is a repeating offender,
>
> Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
> offender killing is more a policy where the kernel enforces policy, and no
> longer up to userspace to dtrt (because very clearly userspace is not
> really doing the right thing anymore when it's just hanging the gpu in an
> endless loop). Also maybe tune it down further to something like "the
> kernel driver may implemnent ..."
>
> In my opinion the umd shouldn't implement these kind of magic guesses, the
> entire point of robustness apis is to delegate responsibility for
> correctly recovering to the application. And the kernel is left with
> enforcing fair resource usage policies (which eventually might be a
> cgroups limit on how much gpu time you're allowed to waste with gpu
> resets).

Killing apps that the kernel thinks are misbehaving really doesn't
seem like a good idea to me. What if the process is a service getting
restarted after getting killed? What if killing that process leaves
the system in a bad state?

Can't the kernel provide some information to user space so that e.g.
systemd can handle those situations?

> > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > +the user interface from being correctly displayed. This should be done even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
> > +
> > +OpenGL
> > +~~~~~~
> > +
> > +Apps using OpenGL should use the available robust interfaces, like the
> > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > +interface tells if a reset has happened, and if so, all the context state is
> > +considered lost and the app proceeds by creating new ones. If it is possible to
> > +determine that robustness is not in use, the UMD will terminate the app when a
> > +reset is detected, giving that the contexts are lost and the app won't be able
> > +to figure this out and recreate the contexts.
> > +
> > +Vulkan
> > +~~~~~~
> > +
> > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > +This error code means, among other things, that a device reset has happened and
> > +it needs to recreate the contexts to keep going.
> > +
> > +Reporting causes of resets
> > +--------------------------
> > +
> > +Apart from propagating the reset through the stack so apps can recover, it's
> > +really useful for driver developers to learn more about what caused the reset in
> > +first place. DRM devices should make use of devcoredump to store relevant
> > +information about the reset, so this information can be added to user bug
> > +reports.
>
> Since we do not seem to have a solid consensus in the community about
> non-robust userspace, maybe we could just document that lack of consensus
> to unblock this patch? Something like this:
>
> Non-Robust Userspace
> --------------------
>
> Userspace that doesn't support robust interfaces (like an non-robust
> OpenGL context or API without any robustness support like libva) leave the
> robustness handling entirely to the userspace driver. There is no strong
> community consensus on what the userspace driver should do in that case,
> since all reasonable approaches have some clear downsides.
>
> With the s/UMD/KMD/ further up and maybe something added to record the
> non-robustness non-consensus:
>
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> Cheers, Daniel
>
>
>
> > +
> >  .. _drm_driver_ioctl:
> >
> >  IOCTL Support on Device Nodes
> > --
> > 2.41.0
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-08-08 12:13   ` Sebastian Wick
@ 2023-08-08 17:03     ` Marek Olšák
  2023-08-09  7:35       ` Michel Dänzer
  0 siblings, 1 reply; 41+ messages in thread
From: Marek Olšák @ 2023-08-08 17:03 UTC (permalink / raw)
  To: Sebastian Wick
  Cc: pierre-eric.pelloux-prayer, Samuel Pitoiset, Randy Dunlap,
	André Almeida, Michel Dänzer, Pekka Paalanen,
	linux-kernel, amd-gfx, Timur Kristóf, dri-devel, kernel-dev,
	alexander.deucher, Pekka Paalanen, christian.koenig

It's the same situation as SIGSEGV. A process can catch the signal,
but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
to catch the GPU error and prevent the process termination. If you
don't use the API, you'll get undefined behavior, which means anything
can happen, including process termination.



Marek

On Tue, Aug 8, 2023 at 8:14 AM Sebastian Wick <sebastian.wick@redhat.com> wrote:
>
> On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> > > Create a section that specifies how to deal with DRM device resets for
> > > kernel and userspace drivers.
> > >
> > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > ---
> > >
> > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> > >
> > > Changes:
> > >  - Grammar fixes (Randy)
> > >
> > >  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 68 insertions(+)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 65fb3036a580..3cbffa25ed93 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> > >  mmapped regular files. Threads cause additional pain with signal
> > >  handling as well.
> > >
> > > +Device reset
> > > +============
> > > +
> > > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > > +faulty applications and everything in between the many layers. Some errors
> > > +require resetting the device in order to make the device usable again. This
> > > +sections describes the expectations for DRM and usermode drivers when a
> > > +device resets and how to propagate the reset status.
> > > +
> > > +Kernel Mode Driver
> > > +------------------
> > > +
> > > +The KMD is responsible for checking if the device needs a reset, and to perform
> > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > > +should keep track of resets, because userspace can query any time about the
> > > +reset stats for an specific context. This is needed to propagate to the rest of
> > > +the stack that a reset has happened. Currently, this is implemented by each
> > > +driver separately, with no common DRM interface.
> > > +
> > > +User Mode Driver
> > > +----------------
> > > +
> > > +The UMD should check before submitting new commands to the KMD if the device has
> > > +been reset, and this can be checked more often if the UMD requires it. After
> > > +detecting a reset, UMD will then proceed to report it to the application using
> > > +the appropriate API error code, as explained in the section below about
> > > +robustness.
> > > +
> > > +Robustness
> > > +----------
> > > +
> > > +The only way to try to keep an application working after a reset is if it
> > > +complies with the robustness aspects of the graphical API that it is using.
> > > +
> > > +Graphical APIs provide ways to applications to deal with device resets. However,
> > > +there is no guarantee that the app will use such features correctly, and the
> > > +UMD can implement policies to close the app if it is a repeating offender,
> >
> > Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
> > offender killing is more a policy where the kernel enforces policy, and no
> > longer up to userspace to dtrt (because very clearly userspace is not
> > really doing the right thing anymore when it's just hanging the gpu in an
> > endless loop). Also maybe tune it down further to something like "the
> > kernel driver may implemnent ..."
> >
> > In my opinion the umd shouldn't implement these kind of magic guesses, the
> > entire point of robustness apis is to delegate responsibility for
> > correctly recovering to the application. And the kernel is left with
> > enforcing fair resource usage policies (which eventually might be a
> > cgroups limit on how much gpu time you're allowed to waste with gpu
> > resets).
>
> Killing apps that the kernel thinks are misbehaving really doesn't
> seem like a good idea to me. What if the process is a service getting
> restarted after getting killed? What if killing that process leaves
> the system in a bad state?
>
> Can't the kernel provide some information to user space so that e.g.
> systemd can handle those situations?
>
> > > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > > +the user interface from being correctly displayed. This should be done even if
> > > +the app is correct but happens to trigger some bug in the hardware/driver.
> > > +
> > > +OpenGL
> > > +~~~~~~
> > > +
> > > +Apps using OpenGL should use the available robust interfaces, like the
> > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > > +interface tells if a reset has happened, and if so, all the context state is
> > > +considered lost and the app proceeds by creating new ones. If it is possible to
> > > +determine that robustness is not in use, the UMD will terminate the app when a
> > > +reset is detected, giving that the contexts are lost and the app won't be able
> > > +to figure this out and recreate the contexts.
> > > +
> > > +Vulkan
> > > +~~~~~~
> > > +
> > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > > +This error code means, among other things, that a device reset has happened and
> > > +it needs to recreate the contexts to keep going.
> > > +
> > > +Reporting causes of resets
> > > +--------------------------
> > > +
> > > +Apart from propagating the reset through the stack so apps can recover, it's
> > > +really useful for driver developers to learn more about what caused the reset in
> > > +first place. DRM devices should make use of devcoredump to store relevant
> > > +information about the reset, so this information can be added to user bug
> > > +reports.
> >
> > Since we do not seem to have a solid consensus in the community about
> > non-robust userspace, maybe we could just document that lack of consensus
> > to unblock this patch? Something like this:
> >
> > Non-Robust Userspace
> > --------------------
> >
> > Userspace that doesn't support robust interfaces (like an non-robust
> > OpenGL context or API without any robustness support like libva) leave the
> > robustness handling entirely to the userspace driver. There is no strong
> > community consensus on what the userspace driver should do in that case,
> > since all reasonable approaches have some clear downsides.
> >
> > With the s/UMD/KMD/ further up and maybe something added to record the
> > non-robustness non-consensus:
> >
> > Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> >
> > Cheers, Daniel
> >
> >
> >
> > > +
> > >  .. _drm_driver_ioctl:
> > >
> > >  IOCTL Support on Device Nodes
> > > --
> > > 2.41.0
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> >
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-08-08 17:03     ` Marek Olšák
@ 2023-08-09  7:35       ` Michel Dänzer
  2023-08-09 19:15         ` Marek Olšák
  0 siblings, 1 reply; 41+ messages in thread
From: Michel Dänzer @ 2023-08-09  7:35 UTC (permalink / raw)
  To: Marek Olšák, Sebastian Wick
  Cc: pierre-eric.pelloux-prayer, André Almeida,
	Timur Kristóf, dri-devel, Randy Dunlap, linux-kernel,
	Samuel Pitoiset, Pekka Paalanen, amd-gfx, kernel-dev,
	alexander.deucher, Pekka Paalanen, christian.koenig

On 8/8/23 19:03, Marek Olšák wrote:
> It's the same situation as SIGSEGV. A process can catch the signal,
> but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
> to catch the GPU error and prevent the process termination. If you
> don't use the API, you'll get undefined behavior, which means anything
> can happen, including process termination.

Got a spec reference for that?

I know the spec allows process termination in response to e.g. out of bounds buffer access by the application (which corresponds to SIGSEGV). There are other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec says:

    If the reset notification behavior is NO_RESET_NOTIFICATION_ARB,
    then the implementation will never deliver notification of reset
    events, and GetGraphicsResetStatusARB will always return
    NO_ERROR[fn1].
       [fn1: In this case it is recommended that implementations should
        not allow loss of context state no matter what events occur.
        However, this is only a recommendation, and cannot be relied
        upon by applications.]

No mention of process termination, that rather sounds to me like the GL implementation should do its best to keep the application running.


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-08-09  7:35       ` Michel Dänzer
@ 2023-08-09 19:15         ` Marek Olšák
  2023-08-10  7:33           ` Michel Dänzer
  0 siblings, 1 reply; 41+ messages in thread
From: Marek Olšák @ 2023-08-09 19:15 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: pierre-eric.pelloux-prayer, Sebastian Wick, André Almeida,
	Timur Kristóf, dri-devel, Randy Dunlap, linux-kernel,
	Samuel Pitoiset, Pekka Paalanen, amd-gfx, kernel-dev,
	alexander.deucher, Pekka Paalanen, christian.koenig

On Wed, Aug 9, 2023 at 3:35 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>
> On 8/8/23 19:03, Marek Olšák wrote:
> > It's the same situation as SIGSEGV. A process can catch the signal,
> > but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
> > to catch the GPU error and prevent the process termination. If you
> > don't use the API, you'll get undefined behavior, which means anything
> > can happen, including process termination.
>
> Got a spec reference for that?
>
> I know the spec allows process termination in response to e.g. out of bounds buffer access by the application (which corresponds to SIGSEGV). There are other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec says:
>
>     If the reset notification behavior is NO_RESET_NOTIFICATION_ARB,
>     then the implementation will never deliver notification of reset
>     events, and GetGraphicsResetStatusARB will always return
>     NO_ERROR[fn1].
>        [fn1: In this case it is recommended that implementations should
>         not allow loss of context state no matter what events occur.
>         However, this is only a recommendation, and cannot be relied
>         upon by applications.]
>
> No mention of process termination, that rather sounds to me like the GL implementation should do its best to keep the application running.

It basically says that we can do anything.

A frozen window or flipping between 2 random frames can't be described
as "keeping the application running". That's the worst user
experience. I will not accept it.

A window system can force-enable robustness for its non-robust apps
and control that. That's the best possible user experience and it's
achievable everywhere. Everything else doesn't matter.

Marek




Marek

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations
  2023-08-09 19:15         ` Marek Olšák
@ 2023-08-10  7:33           ` Michel Dänzer
  0 siblings, 0 replies; 41+ messages in thread
From: Michel Dänzer @ 2023-08-10  7:33 UTC (permalink / raw)
  To: Marek Olšák
  Cc: pierre-eric.pelloux-prayer, Sebastian Wick, amd-gfx,
	André Almeida, Timur Kristóf, Randy Dunlap,
	linux-kernel, dri-devel, Pekka Paalanen, Samuel Pitoiset,
	kernel-dev, alexander.deucher, Pekka Paalanen, christian.koenig

On 8/9/23 21:15, Marek Olšák wrote:
> On Wed, Aug 9, 2023 at 3:35 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>> On 8/8/23 19:03, Marek Olšák wrote:
>>> It's the same situation as SIGSEGV. A process can catch the signal,
>>> but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
>>> to catch the GPU error and prevent the process termination. If you
>>> don't use the API, you'll get undefined behavior, which means anything
>>> can happen, including process termination.
>>
>> Got a spec reference for that?
>>
>> I know the spec allows process termination in response to e.g. out of bounds buffer access by the application (which corresponds to SIGSEGV). There are other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec says:
>>
>>     If the reset notification behavior is NO_RESET_NOTIFICATION_ARB,
>>     then the implementation will never deliver notification of reset
>>     events, and GetGraphicsResetStatusARB will always return
>>     NO_ERROR[fn1].
>>        [fn1: In this case it is recommended that implementations should
>>         not allow loss of context state no matter what events occur.
>>         However, this is only a recommendation, and cannot be relied
>>         upon by applications.]
>>
>> No mention of process termination, that rather sounds to me like the GL implementation should do its best to keep the application running.
> 
> It basically says that we can do anything.

Not really? If program termination is a possible outcome, the spec otherwise mentions that explicitly, ala "including program termination".


> A frozen window or flipping between 2 random frames can't be described
> as "keeping the application running".

This assumes that an application which uses OpenGL cannot have any other purpose than using OpenGL.


-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2023-08-10  7:33 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-27 13:23 [PATCH v5 1/1] drm/doc: Document DRM device reset expectations André Almeida
2023-06-27 16:09 ` Randy Dunlap
2023-06-27 17:47 ` Christian König
2023-06-27 21:17   ` André Almeida
2023-06-29 13:11     ` André Almeida
2023-06-27 18:57 ` Marek Olšák
2023-06-27 21:31   ` André Almeida
2023-06-28  0:36     ` Marek Olšák
2023-06-30 14:48 ` Sebastian Wick
2023-06-30 14:59   ` Alex Deucher
2023-06-30 15:11     ` Michel Dänzer
2023-06-30 20:32       ` Marek Olšák
2023-07-03  7:12         ` Michel Dänzer
2023-07-03  8:49           ` Pekka Paalanen
2023-07-03 15:00             ` André Almeida
2023-07-04  7:42               ` Pekka Paalanen
2023-07-04  2:34           ` Marek Olšák
2023-07-04  2:38             ` Randy Dunlap
2023-07-04  2:44               ` Marek Olšák
2023-07-04  2:48                 ` Randy Dunlap
2023-07-04  7:54             ` Michel Dänzer
2023-07-05  6:30               ` Marek Olšák
2023-07-05  7:32                 ` Michel Dänzer
2023-07-05 15:53                   ` Marek Olšák
2023-06-30 15:21     ` Sebastian Wick
2023-07-25  2:55 ` Non-robust apps and resets (was Re: [PATCH v5 1/1] drm/doc: Document DRM device reset expectations) André Almeida
2023-07-25  7:02   ` Simon Ser
2023-07-25  8:03   ` Michel Dänzer
2023-07-25 13:02     ` André Almeida
2023-07-26  8:07       ` Michel Dänzer
2023-08-02  7:38         ` Marek Olšák
2023-08-02  8:34           ` Michel Dänzer
2023-07-25 15:05     ` Marek Olšák
2023-07-25 17:00       ` Michel Dänzer
2023-07-26  7:55         ` Timur Kristóf
2023-08-04 13:03 ` [PATCH v5 1/1] drm/doc: Document DRM device reset expectations Daniel Vetter
2023-08-08 12:13   ` Sebastian Wick
2023-08-08 17:03     ` Marek Olšák
2023-08-09  7:35       ` Michel Dänzer
2023-08-09 19:15         ` Marek Olšák
2023-08-10  7:33           ` Michel Dänzer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).