All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-08 18:04 Shashank Sharma
  2022-03-08 18:04 ` [PATCH v2 2/2] drm/amdgpu: add work function for GPU reset event Shashank Sharma
                   ` (5 more replies)
  0 siblings, 6 replies; 95+ messages in thread
From: Shashank Sharma @ 2022-03-08 18:04 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexandar Deucher, Amaranath Somalapuram, Christian Koenig,
	dri-devel, Shashank Sharma

From: Shashank Sharma <shashank.sharma@amd.com>

This patch adds a new sysfs event, which will indicate
the userland about a GPU reset, and can also provide
some information like:
- process ID of the process involved with the GPU reset
- process name of the involved process
- the GPU status info (using flags)

This patch also introduces the first flag of the flags
bitmap, which can be appended as and when required.

V2: Addressed review comments from Christian and Amar
   - move the reset information structure to DRM layer
   - drop _ctx from struct name
   - make pid 32 bit(than 64)
   - set flag when VRAM invalid (than valid)
   - add process name as well (Amar)

Cc: Alexandar Deucher <alexander.deucher@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Amaranath Somalapuram <amaranath.somalapuram@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
---
 drivers/gpu/drm/drm_sysfs.c | 31 +++++++++++++++++++++++++++++++
 include/drm/drm_sysfs.h     | 10 ++++++++++
 2 files changed, 41 insertions(+)

diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
index 430e00b16eec..840994810910 100644
--- a/drivers/gpu/drm/drm_sysfs.c
+++ b/drivers/gpu/drm/drm_sysfs.c
@@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_sysfs_hotplug_event);
 
+/**
+ * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
+ * @dev: DRM device
+ * @reset_info: The contextual information about the reset (like PID, flags)
+ *
+ * Send a uevent for the DRM device specified by @dev. This informs
+ * user that a GPU reset has occurred, so that an interested client
+ * can take any recovery or profiling measure.
+ */
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info)
+{
+	unsigned char pid_str[13];
+	unsigned char flags_str[15];
+	unsigned char pname_str[TASK_COMM_LEN + 6];
+	unsigned char reset_str[] = "RESET=1";
+	char *envp[] = { reset_str, pid_str, pname_str, flags_str, NULL };
+
+	if (!reset_info) {
+		DRM_WARN("No reset info, not sending the event\n");
+		return;
+	}
+
+	DRM_DEBUG("generating reset event\n");
+
+	snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", reset_info->pid);
+	snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", reset_info->pname);
+	snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", reset_info->flags);
+	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_sysfs_reset_event);
+
 /**
  * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
  * change
diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
index 6273cac44e47..5ba11c760619 100644
--- a/include/drm/drm_sysfs.h
+++ b/include/drm/drm_sysfs.h
@@ -1,16 +1,26 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _DRM_SYSFS_H_
 #define _DRM_SYSFS_H_
+#include <linux/sched.h>
+
+#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)
 
 struct drm_device;
 struct device;
 struct drm_connector;
 struct drm_property;
 
+struct drm_reset_event {
+	uint32_t pid;
+	uint32_t flags;
+	char pname[TASK_COMM_LEN];
+};
+
 int drm_class_device_register(struct device *dev);
 void drm_class_device_unregister(struct device *dev);
 
 void drm_sysfs_hotplug_event(struct drm_device *dev);
+void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info);
 void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
 void drm_sysfs_connector_status_event(struct drm_connector *connector,
 				      struct drm_property *property);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH v2 2/2] drm/amdgpu: add work function for GPU reset event
  2022-03-08 18:04 [PATCH v2 1/2] drm: Add GPU reset sysfs event Shashank Sharma
@ 2022-03-08 18:04 ` Shashank Sharma
  2022-03-09  7:47 ` [PATCH v2 1/2] drm: Add GPU reset sysfs event Simon Ser
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 95+ messages in thread
From: Shashank Sharma @ 2022-03-08 18:04 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander Deucher, Amaranath Somalapuram, Christian Koenig,
	dri-devel, Shashank Sharma

From: Shashank Sharma <shashank.sharma@amd.com>

This patch adds a work function, which sends a GPU reset
uevent and some contextual infomration, like the PID and
some status flags. This work should be scheduled during
a GPU reset.

The userspace can do some recovery and post-processing work
based on this event and information.

V2: Addressed review comments from Christian
- Changed the name of the work to gpu_reset_event_work
- Added a structure to accommodate some additional information
  (like a PID and some flags)
- Do not add new structure in amdgpu.h

Cc: Alexander Deucher <alexander.deucher@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Amaranath Somalapuram <amaranath.somalapuram@amd.com>
Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 +++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index d8b854fcbffa..6a97c585bdfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -61,6 +61,7 @@
 #include <drm/drm_gem.h>
 #include <drm/drm_ioctl.h>
 #include <drm/gpu_scheduler.h>
+#include <drm/drm_sysfs.h>
 
 #include <kgd_kfd_interface.h>
 #include "dm_pp_interface.h"
@@ -813,6 +814,7 @@ struct amd_powerplay {
 #define AMDGPU_RESET_MAGIC_NUM 64
 #define AMDGPU_MAX_DF_PERFMONS 4
 #define AMDGPU_PRODUCT_NAME_LEN 64
+
 struct amdgpu_device {
 	struct device			*dev;
 	struct pci_dev			*pdev;
@@ -1063,6 +1065,7 @@ struct amdgpu_device {
 
 	int asic_reset_res;
 	struct work_struct		xgmi_reset_work;
+	struct work_struct		gpu_reset_event_work;
 	struct list_head		reset_list;
 
 	long				gfx_timeout;
@@ -1097,6 +1100,7 @@ struct amdgpu_device {
 	pci_channel_state_t		pci_channel_state;
 
 	struct amdgpu_reset_control     *reset_cntl;
+	struct drm_reset_event		reset_event_info;
 	uint32_t                        ip_versions[MAX_HWIP][HWIP_MAX_INSTANCE];
 
 	bool				ram_is_direct_mapped;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ed077de426d9..1aef07fd0dff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -73,6 +73,7 @@
 #include <linux/pm_runtime.h>
 
 #include <drm/drm_drv.h>
+#include <drm/drm_sysfs.h>
 
 MODULE_FIRMWARE("amdgpu/vega10_gpu_info.bin");
 MODULE_FIRMWARE("amdgpu/vega12_gpu_info.bin");
@@ -3277,6 +3278,17 @@ bool amdgpu_device_has_dc_support(struct amdgpu_device *adev)
 	return amdgpu_device_asic_has_dc_support(adev->asic_type);
 }
 
+static void amdgpu_device_reset_event_func(struct work_struct *__work)
+{
+	struct amdgpu_device *adev = container_of(__work, struct amdgpu_device,
+						  gpu_reset_event_work);
+	/*
+	 * A GPU reset has happened, inform the userspace and pass the
+	 * reset related information.
+	 */
+	drm_sysfs_reset_event(&adev->ddev, &adev->reset_event_info);
+}
+
 static void amdgpu_device_xgmi_reset_func(struct work_struct *__work)
 {
 	struct amdgpu_device *adev =
@@ -3525,6 +3537,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 			  amdgpu_device_delay_enable_gfx_off);
 
 	INIT_WORK(&adev->xgmi_reset_work, amdgpu_device_xgmi_reset_func);
+	INIT_WORK(&adev->gpu_reset_event_work, amdgpu_device_reset_event_func);
 
 	adev->gfx.gfx_off_req_count = 1;
 	adev->pm.ac_power = power_supply_is_system_supplied() > 0;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-08 18:04 [PATCH v2 1/2] drm: Add GPU reset sysfs event Shashank Sharma
  2022-03-08 18:04 ` [PATCH v2 2/2] drm/amdgpu: add work function for GPU reset event Shashank Sharma
@ 2022-03-09  7:47 ` Simon Ser
  2022-03-09 11:18   ` Sharma, Shashank
  2022-03-09  8:09 ` Christian König
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 95+ messages in thread
From: Simon Ser @ 2022-03-09  7:47 UTC (permalink / raw)
  To: Shashank Sharma
  Cc: Shashank Sharma, Amaranath Somalapuram, dri-devel, amd-gfx,
	Alexandar Deucher, Christian Koenig

Hi,

Maybe it would be a good idea to state the intended use-case in the
commit message? And explain why the current (driver-specific IIRC) APIs
aren't enough?

Since this introduces new uAPI, can you point to a user-space patch
which uses the new uAPI? See this link for more info on DRM user-space
requirements:
https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements

Please use drm_dbg_* and drm_warn_* helpers instead of DRM_DEBUG and
DRM_WARN. This allows identifying on which device the uevent happens.

On Tuesday, March 8th, 2022 at 19:04, Shashank Sharma <contactshashanksharma@gmail.com> wrote:

> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info)

reset_info can be const.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-08 18:04 [PATCH v2 1/2] drm: Add GPU reset sysfs event Shashank Sharma
  2022-03-08 18:04 ` [PATCH v2 2/2] drm/amdgpu: add work function for GPU reset event Shashank Sharma
  2022-03-09  7:47 ` [PATCH v2 1/2] drm: Add GPU reset sysfs event Simon Ser
@ 2022-03-09  8:09 ` Christian König
  2022-03-09  9:56 ` Pierre-Eric Pelloux-Prayer
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-09  8:09 UTC (permalink / raw)
  To: Shashank Sharma, amd-gfx
  Cc: Alexandar Deucher, Amaranath Somalapuram, dri-devel, Shashank Sharma

Am 08.03.22 um 19:04 schrieb Shashank Sharma:
> From: Shashank Sharma <shashank.sharma@amd.com>
>
> This patch adds a new sysfs event, which will indicate
> the userland about a GPU reset, and can also provide
> some information like:
> - process ID of the process involved with the GPU reset
> - process name of the involved process
> - the GPU status info (using flags)
>
> This patch also introduces the first flag of the flags
> bitmap, which can be appended as and when required.
>
> V2: Addressed review comments from Christian and Amar
>     - move the reset information structure to DRM layer
>     - drop _ctx from struct name
>     - make pid 32 bit(than 64)
>     - set flag when VRAM invalid (than valid)
>     - add process name as well (Amar)
>
> Cc: Alexandar Deucher <alexander.deucher@amd.com>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Amaranath Somalapuram <amaranath.somalapuram@amd.com>
> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
> ---
>   drivers/gpu/drm/drm_sysfs.c | 31 +++++++++++++++++++++++++++++++
>   include/drm/drm_sysfs.h     | 10 ++++++++++
>   2 files changed, 41 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
> index 430e00b16eec..840994810910 100644
> --- a/drivers/gpu/drm/drm_sysfs.c
> +++ b/drivers/gpu/drm/drm_sysfs.c
> @@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
>   }
>   EXPORT_SYMBOL(drm_sysfs_hotplug_event);
>   
> +/**
> + * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
> + * @dev: DRM device
> + * @reset_info: The contextual information about the reset (like PID, flags)
> + *
> + * Send a uevent for the DRM device specified by @dev. This informs
> + * user that a GPU reset has occurred, so that an interested client
> + * can take any recovery or profiling measure.
> + */
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info)
> +{
> +	unsigned char pid_str[13];
> +	unsigned char flags_str[15];
> +	unsigned char pname_str[TASK_COMM_LEN + 6];
> +	unsigned char reset_str[] = "RESET=1";
> +	char *envp[] = { reset_str, pid_str, pname_str, flags_str, NULL };

In general we try to sort those longest first and maybe break envp on 
multiple lines.

> +
> +	if (!reset_info) {
> +		DRM_WARN("No reset info, not sending the event\n");
> +		return;
> +	}

Please don't do stuff like that, providing no reset_info is a coding 
error and should really result in a crash.

Regards,
Christian.

> +
> +	DRM_DEBUG("generating reset event\n");
> +
> +	snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", reset_info->pid);
> +	snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", reset_info->pname);
> +	snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", reset_info->flags);
> +	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_sysfs_reset_event);
> +
>   /**
>    * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
>    * change
> diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
> index 6273cac44e47..5ba11c760619 100644
> --- a/include/drm/drm_sysfs.h
> +++ b/include/drm/drm_sysfs.h
> @@ -1,16 +1,26 @@
>   /* SPDX-License-Identifier: GPL-2.0 */
>   #ifndef _DRM_SYSFS_H_
>   #define _DRM_SYSFS_H_
> +#include <linux/sched.h>
> +
> +#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)
>   
>   struct drm_device;
>   struct device;
>   struct drm_connector;
>   struct drm_property;
>   
> +struct drm_reset_event {
> +	uint32_t pid;
> +	uint32_t flags;
> +	char pname[TASK_COMM_LEN];
> +};
> +
>   int drm_class_device_register(struct device *dev);
>   void drm_class_device_unregister(struct device *dev);
>   
>   void drm_sysfs_hotplug_event(struct drm_device *dev);
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info);
>   void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
>   void drm_sysfs_connector_status_event(struct drm_connector *connector,
>   				      struct drm_property *property);


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-08 18:04 [PATCH v2 1/2] drm: Add GPU reset sysfs event Shashank Sharma
                   ` (2 preceding siblings ...)
  2022-03-09  8:09 ` Christian König
@ 2022-03-09  9:56 ` Pierre-Eric Pelloux-Prayer
  2022-03-09 10:10   ` Simon Ser
  2022-03-09 18:12 ` Rob Clark
  2022-03-16 21:50 ` Rob Clark
  5 siblings, 1 reply; 95+ messages in thread
From: Pierre-Eric Pelloux-Prayer @ 2022-03-09  9:56 UTC (permalink / raw)
  To: Shashank Sharma, amd-gfx
  Cc: Alexandar Deucher, Amaranath Somalapuram, Christian Koenig,
	dri-devel, Shashank Sharma

Hi Shashank,

On 08/03/2022 19:04, Shashank Sharma wrote:
> From: Shashank Sharma <shashank.sharma@amd.com>
> 
> This patch adds a new sysfs event, which will indicate
> the userland about a GPU reset, and can also provide
> some information like:
> - process ID of the process involved with the GPU reset
> - process name of the involved process

Would it be possible to include the app parameters as well?

piglit has a shader_runner test application that can cause hangs,
and it's run like this:

   shader_runner 1.shader_test

Knowing that shader_runner caused the hang is not really useful
without the "1.shader_test" part.

Thanks,
Pierre-Eric 

> - the GPU status info (using flags)
> 
> This patch also introduces the first flag of the flags
> bitmap, which can be appended as and when required.
> 
> V2: Addressed review comments from Christian and Amar
>    - move the reset information structure to DRM layer
>    - drop _ctx from struct name
>    - make pid 32 bit(than 64)
>    - set flag when VRAM invalid (than valid)
>    - add process name as well (Amar)
> 
> Cc: Alexandar Deucher <alexander.deucher@amd.com>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Amaranath Somalapuram <amaranath.somalapuram@amd.com>
> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
> ---
>  drivers/gpu/drm/drm_sysfs.c | 31 +++++++++++++++++++++++++++++++
>  include/drm/drm_sysfs.h     | 10 ++++++++++
>  2 files changed, 41 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
> index 430e00b16eec..840994810910 100644
> --- a/drivers/gpu/drm/drm_sysfs.c
> +++ b/drivers/gpu/drm/drm_sysfs.c
> @@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
>  }
>  EXPORT_SYMBOL(drm_sysfs_hotplug_event);
>  
> +/**
> + * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
> + * @dev: DRM device
> + * @reset_info: The contextual information about the reset (like PID, flags)
> + *
> + * Send a uevent for the DRM device specified by @dev. This informs
> + * user that a GPU reset has occurred, so that an interested client
> + * can take any recovery or profiling measure.
> + */
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info)
> +{
> +	unsigned char pid_str[13];
> +	unsigned char flags_str[15];
> +	unsigned char pname_str[TASK_COMM_LEN + 6];
> +	unsigned char reset_str[] = "RESET=1";
> +	char *envp[] = { reset_str, pid_str, pname_str, flags_str, NULL };
> +
> +	if (!reset_info) {
> +		DRM_WARN("No reset info, not sending the event\n");
> +		return;
> +	}
> +
> +	DRM_DEBUG("generating reset event\n");
> +
> +	snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", reset_info->pid);
> +	snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", reset_info->pname);
> +	snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", reset_info->flags);
> +	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_sysfs_reset_event);
> +
>  /**
>   * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
>   * change
> diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
> index 6273cac44e47..5ba11c760619 100644
> --- a/include/drm/drm_sysfs.h
> +++ b/include/drm/drm_sysfs.h
> @@ -1,16 +1,26 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #ifndef _DRM_SYSFS_H_
>  #define _DRM_SYSFS_H_
> +#include <linux/sched.h>
> +
> +#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)
>  
>  struct drm_device;
>  struct device;
>  struct drm_connector;
>  struct drm_property;
>  
> +struct drm_reset_event {
> +	uint32_t pid;
> +	uint32_t flags;
> +	char pname[TASK_COMM_LEN];
> +};
> +
>  int drm_class_device_register(struct device *dev);
>  void drm_class_device_unregister(struct device *dev);
>  
>  void drm_sysfs_hotplug_event(struct drm_device *dev);
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info);
>  void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
>  void drm_sysfs_connector_status_event(struct drm_connector *connector,
>  				      struct drm_property *property);
> 

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-09  9:56 ` Pierre-Eric Pelloux-Prayer
@ 2022-03-09 10:10   ` Simon Ser
  2022-03-09 10:24     ` Christian König
  0 siblings, 1 reply; 95+ messages in thread
From: Simon Ser @ 2022-03-09 10:10 UTC (permalink / raw)
  To: Pierre-Eric Pelloux-Prayer
  Cc: Shashank Sharma, Amaranath Somalapuram, dri-devel, amd-gfx,
	Alexandar Deucher, Shashank Sharma, Christian Koenig

On Wednesday, March 9th, 2022 at 10:56, Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> wrote:

> Would it be possible to include the app parameters as well?

Can all processes read sysfs events?

There might be security implications here. The app parameters might
contain sensitive information, like passwords or tokens.

Can't the uevent receiver query the kernel to get that info from the
PID?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-09 10:10   ` Simon Ser
@ 2022-03-09 10:24     ` Christian König
  2022-03-09 10:28       ` Simon Ser
  2022-03-09 10:28       ` Pierre-Eric Pelloux-Prayer
  0 siblings, 2 replies; 95+ messages in thread
From: Christian König @ 2022-03-09 10:24 UTC (permalink / raw)
  To: Simon Ser, Pierre-Eric Pelloux-Prayer
  Cc: Shashank Sharma, Amaranath Somalapuram, amd-gfx, dri-devel,
	Alexandar Deucher, Shashank Sharma, Christian Koenig

Am 09.03.22 um 11:10 schrieb Simon Ser:
> On Wednesday, March 9th, 2022 at 10:56, Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> wrote:
>
>> Would it be possible to include the app parameters as well?
> Can all processes read sysfs events?

No, but application parameters are usually not secret.

> There might be security implications here. The app parameters might
> contain sensitive information, like passwords or tokens.

It's a well known security vulnerably to provide password etc.. as 
application parameter since everybody can read them through procfs.

> Can't the uevent receiver query the kernel to get that info from the
> PID?

I'm leaning also towards just providing the PID and not the application 
name. The information should be as short as possible.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-09 10:24     ` Christian König
@ 2022-03-09 10:28       ` Simon Ser
  2022-03-09 10:28       ` Pierre-Eric Pelloux-Prayer
  1 sibling, 0 replies; 95+ messages in thread
From: Simon Ser @ 2022-03-09 10:28 UTC (permalink / raw)
  To: Christian König
  Cc: Pierre-Eric Pelloux-Prayer, Shashank Sharma,
	Amaranath Somalapuram, amd-gfx, dri-devel, Alexandar Deucher,
	Shashank Sharma, Christian Koenig

On Wednesday, March 9th, 2022 at 11:24, Christian König <ckoenig.leichtzumerken@gmail.com> wrote:

> Am 09.03.22 um 11:10 schrieb Simon Ser:
> > On Wednesday, March 9th, 2022 at 10:56, Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> wrote:
> >
> >> Would it be possible to include the app parameters as well?
> > Can all processes read sysfs events?
>
> No, but application parameters are usually not secret.

It's a bad practice, yes. But people still do it.

> > There might be security implications here. The app parameters might
> > contain sensitive information, like passwords or tokens.
>
> It's a well known security vulnerably to provide password etc.. as
> application parameter since everybody can read them through procfs.

I was thinking mostly about Flatpak apps here. Flatpak apps are running
in a process namespace so they can't query the CLI parameters of PIDs
outside of the namespace. But they might still receive sysfs uevents.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-09 10:24     ` Christian König
  2022-03-09 10:28       ` Simon Ser
@ 2022-03-09 10:28       ` Pierre-Eric Pelloux-Prayer
  1 sibling, 0 replies; 95+ messages in thread
From: Pierre-Eric Pelloux-Prayer @ 2022-03-09 10:28 UTC (permalink / raw)
  To: Christian König, Simon Ser
  Cc: Shashank Sharma, Amaranath Somalapuram, amd-gfx, dri-devel,
	Alexandar Deucher, Shashank Sharma, Christian Koenig



On 09/03/2022 11:24, Christian König wrote:
> Am 09.03.22 um 11:10 schrieb Simon Ser:
>> On Wednesday, March 9th, 2022 at 10:56, Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> wrote:
>>
>>> Would it be possible to include the app parameters as well?
>> Can all processes read sysfs events?
> 
> No, but application parameters are usually not secret.
> 
>> There might be security implications here. The app parameters might
>> contain sensitive information, like passwords or tokens.
> 
> It's a well known security vulnerably to provide password etc.. as application parameter since everybody can read them through procfs.
> 
>> Can't the uevent receiver query the kernel to get that info from the
>> PID?
> 
> I'm leaning also towards just providing the PID and not the application name. The information should be as short as possible.
> 

Indeed you're both right. The PID + reading procfs should be enough.

Thanks,
Pierre-Eric

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-09  7:47 ` [PATCH v2 1/2] drm: Add GPU reset sysfs event Simon Ser
@ 2022-03-09 11:18   ` Sharma, Shashank
  0 siblings, 0 replies; 95+ messages in thread
From: Sharma, Shashank @ 2022-03-09 11:18 UTC (permalink / raw)
  To: Simon Ser, Shashank Sharma
  Cc: Alexandar Deucher, dri-devel, Amaranath Somalapuram,
	Christian Koenig, amd-gfx



On 3/9/2022 8:47 AM, Simon Ser wrote:
> Hi,
> 
> Maybe it would be a good idea to state the intended use-case in the
> commit message? 

It was added in the second patch, but yeah, it makes more sense to add a 
cover-letter probably.

And explain why the current (driver-specific IIRC) APIs
> aren't enough?
> 
As you mentioned, we also started with a driver specific API, and then 
realized that most of the drivers can benefit from this infrastructure, 
as most of them would be more or less interested in the similar 
information.

> Since this introduces new uAPI, can you point to a user-space patch
> which uses the new uAPI? See this link for more info on DRM user-space
> requirements:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdri.freedesktop.org%2Fdocs%2Fdrm%2Fgpu%2Fdrm-uapi.html%23open-source-userspace-requirements&amp;data=04%7C01%7Cshashank.sharma%40amd.com%7C85e8e40c84524f3272e908da01a11b1c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637824088716047386%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=%2BMTzBIfl3FGHdAv%2BSx%2F83a86x8eksbOhdQVqdrJ2Rq0%3D&amp;reserved=0
> 
The userspace work is in progress, this was to get a general feedback 
from the community and the interest.

> Please use drm_dbg_* and drm_warn_* helpers instead of DRM_DEBUG and
> DRM_WARN. This allows identifying on which device the uevent happens.
> 
Well, I don't want to break the consistency across the file, as other
functions seems to be using DRM_DEBUG family of prints.
> On Tuesday, March 8th, 2022 at 19:04, Shashank Sharma <contactshashanksharma@gmail.com> wrote:
> 
>> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info)
> 
> reset_info can be const.
Sure.

- Shashank

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-08 18:04 [PATCH v2 1/2] drm: Add GPU reset sysfs event Shashank Sharma
                   ` (3 preceding siblings ...)
  2022-03-09  9:56 ` Pierre-Eric Pelloux-Prayer
@ 2022-03-09 18:12 ` Rob Clark
  2022-03-10  9:55   ` Christian König
  2022-03-16 21:50 ` Rob Clark
  5 siblings, 1 reply; 95+ messages in thread
From: Rob Clark @ 2022-03-09 18:12 UTC (permalink / raw)
  To: Shashank Sharma
  Cc: Rob Clark, Shashank Sharma, Amaranath Somalapuram, dri-devel,
	amd-gfx list, Alexandar Deucher, Christian Koenig

On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
<contactshashanksharma@gmail.com> wrote:
>
> From: Shashank Sharma <shashank.sharma@amd.com>
>
> This patch adds a new sysfs event, which will indicate
> the userland about a GPU reset, and can also provide
> some information like:
> - process ID of the process involved with the GPU reset
> - process name of the involved process
> - the GPU status info (using flags)
>
> This patch also introduces the first flag of the flags
> bitmap, which can be appended as and when required.

Why invent something new, rather than using the already existing devcoredump?

I don't think we need (or should encourage/allow) something drm
specific when there is already an existing solution used by both drm
and non-drm drivers.  Userspace should not have to learn to support
yet another mechanism to do the same thing.

BR,
-R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-09 18:12 ` Rob Clark
@ 2022-03-10  9:55   ` Christian König
  2022-03-10 15:24     ` Rob Clark
  0 siblings, 1 reply; 95+ messages in thread
From: Christian König @ 2022-03-10  9:55 UTC (permalink / raw)
  To: Rob Clark, Shashank Sharma
  Cc: Rob Clark, Shashank Sharma, Amaranath Somalapuram, amd-gfx list,
	dri-devel, Alexandar Deucher, Christian Koenig



Am 09.03.22 um 19:12 schrieb Rob Clark:
> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> <contactshashanksharma@gmail.com> wrote:
>> From: Shashank Sharma <shashank.sharma@amd.com>
>>
>> This patch adds a new sysfs event, which will indicate
>> the userland about a GPU reset, and can also provide
>> some information like:
>> - process ID of the process involved with the GPU reset
>> - process name of the involved process
>> - the GPU status info (using flags)
>>
>> This patch also introduces the first flag of the flags
>> bitmap, which can be appended as and when required.
> Why invent something new, rather than using the already existing devcoredump?

Yeah, that's a really valid question.

> I don't think we need (or should encourage/allow) something drm
> specific when there is already an existing solution used by both drm
> and non-drm drivers.  Userspace should not have to learn to support
> yet another mechanism to do the same thing.

Question is how is userspace notified about new available core dumps?

Regards,
Christian.

>
> BR,
> -R


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10  9:55   ` Christian König
@ 2022-03-10 15:24     ` Rob Clark
  2022-03-10 16:21       ` Sharma, Shashank
  0 siblings, 1 reply; 95+ messages in thread
From: Rob Clark @ 2022-03-10 15:24 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Shashank Sharma, Amaranath Somalapuram, amd-gfx list,
	dri-devel, Alexandar Deucher, Shashank Sharma, Christian Koenig

On Thu, Mar 10, 2022 at 1:55 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
>
>
> Am 09.03.22 um 19:12 schrieb Rob Clark:
> > On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> > <contactshashanksharma@gmail.com> wrote:
> >> From: Shashank Sharma <shashank.sharma@amd.com>
> >>
> >> This patch adds a new sysfs event, which will indicate
> >> the userland about a GPU reset, and can also provide
> >> some information like:
> >> - process ID of the process involved with the GPU reset
> >> - process name of the involved process
> >> - the GPU status info (using flags)
> >>
> >> This patch also introduces the first flag of the flags
> >> bitmap, which can be appended as and when required.
> > Why invent something new, rather than using the already existing devcoredump?
>
> Yeah, that's a really valid question.
>
> > I don't think we need (or should encourage/allow) something drm
> > specific when there is already an existing solution used by both drm
> > and non-drm drivers.  Userspace should not have to learn to support
> > yet another mechanism to do the same thing.
>
> Question is how is userspace notified about new available core dumps?

I haven't looked into it too closely, as the CrOS userspace
crash-reporter already had support for devcoredump, so it "just
worked" out of the box[1].  I believe a udev event is what triggers
the crash-reporter to go read the devcore dump out of sysfs.

BR,
-R

[1] One small change to allow-list the drm/msm devcore dumps was
needed after a privacy review/audit of what is contained in the GPU
devcored to ensure it does not contain any PII .. for CrOS on amd
devices I'd be happy to walk whoever deals with amd CrOS devices
through the process offline.

> Regards,
> Christian.
>
> >
> > BR,
> > -R
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10 15:24     ` Rob Clark
@ 2022-03-10 16:21       ` Sharma, Shashank
  2022-03-10 16:27         ` Andrey Grodzovsky
  2022-03-10 17:10         ` Rob Clark
  0 siblings, 2 replies; 95+ messages in thread
From: Sharma, Shashank @ 2022-03-10 16:21 UTC (permalink / raw)
  To: Rob Clark, Christian König
  Cc: Rob Clark, Amaranath Somalapuram, amd-gfx list, dri-devel,
	Alexandar Deucher, Shashank Sharma, Christian Koenig



On 3/10/2022 4:24 PM, Rob Clark wrote:
> On Thu, Mar 10, 2022 at 1:55 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>>
>>
>>
>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>>> <contactshashanksharma@gmail.com> wrote:
>>>> From: Shashank Sharma <shashank.sharma@amd.com>
>>>>
>>>> This patch adds a new sysfs event, which will indicate
>>>> the userland about a GPU reset, and can also provide
>>>> some information like:
>>>> - process ID of the process involved with the GPU reset
>>>> - process name of the involved process
>>>> - the GPU status info (using flags)
>>>>
>>>> This patch also introduces the first flag of the flags
>>>> bitmap, which can be appended as and when required.
>>> Why invent something new, rather than using the already existing devcoredump?
>>
>> Yeah, that's a really valid question.
>>
>>> I don't think we need (or should encourage/allow) something drm
>>> specific when there is already an existing solution used by both drm
>>> and non-drm drivers.  Userspace should not have to learn to support
>>> yet another mechanism to do the same thing.
>>
>> Question is how is userspace notified about new available core dumps?
> 
> I haven't looked into it too closely, as the CrOS userspace
> crash-reporter already had support for devcoredump, so it "just
> worked" out of the box[1].  I believe a udev event is what triggers
> the crash-reporter to go read the devcore dump out of sysfs.

I had a quick look at the devcoredump code, and it doesn't look like 
that is sending an event to the user, so we still need an event to 
indicate a GPU reset.

- Shashank

> 
> BR,
> -R
> 
> [1] One small change to allow-list the drm/msm devcore dumps was
> needed after a privacy review/audit of what is contained in the GPU
> devcored to ensure it does not contain any PII .. for CrOS on amd
> devices I'd be happy to walk whoever deals with amd CrOS devices
> through the process offline.
> 
>> Regards,
>> Christian.
>>
>>>
>>> BR,
>>> -R
>>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10 16:21       ` Sharma, Shashank
@ 2022-03-10 16:27         ` Andrey Grodzovsky
  2022-03-10 17:16           ` Rob Clark
  2022-03-10 17:10         ` Rob Clark
  1 sibling, 1 reply; 95+ messages in thread
From: Andrey Grodzovsky @ 2022-03-10 16:27 UTC (permalink / raw)
  To: Sharma, Shashank, Rob Clark, Christian König
  Cc: Rob Clark, Amaranath Somalapuram, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian Koenig


On 2022-03-10 11:21, Sharma, Shashank wrote:
>
>
> On 3/10/2022 4:24 PM, Rob Clark wrote:
>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>
>>>
>>>
>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>>>> <contactshashanksharma@gmail.com> wrote:
>>>>> From: Shashank Sharma <shashank.sharma@amd.com>
>>>>>
>>>>> This patch adds a new sysfs event, which will indicate
>>>>> the userland about a GPU reset, and can also provide
>>>>> some information like:
>>>>> - process ID of the process involved with the GPU reset
>>>>> - process name of the involved process
>>>>> - the GPU status info (using flags)
>>>>>
>>>>> This patch also introduces the first flag of the flags
>>>>> bitmap, which can be appended as and when required.
>>>> Why invent something new, rather than using the already existing 
>>>> devcoredump?
>>>
>>> Yeah, that's a really valid question.
>>>
>>>> I don't think we need (or should encourage/allow) something drm
>>>> specific when there is already an existing solution used by both drm
>>>> and non-drm drivers.  Userspace should not have to learn to support
>>>> yet another mechanism to do the same thing.
>>>
>>> Question is how is userspace notified about new available core dumps?
>>
>> I haven't looked into it too closely, as the CrOS userspace
>> crash-reporter already had support for devcoredump, so it "just
>> worked" out of the box[1].  I believe a udev event is what triggers
>> the crash-reporter to go read the devcore dump out of sysfs.
>
> I had a quick look at the devcoredump code, and it doesn't look like 
> that is sending an event to the user, so we still need an event to 
> indicate a GPU reset.
>
> - Shashank


Another point I raised in another thread is that it looks like we might 
want to use devcoredump during ASIC reset to dump more HW related data 
which is useful
for debugging. It means the use client will have to extract the pid and 
process name out from a bigger data set - Is that ok ? We can probably 
put it at the beginning
for easiest parsing.

Andrey


>
>>
>> BR,
>> -R
>>
>> [1] One small change to allow-list the drm/msm devcore dumps was
>> needed after a privacy review/audit of what is contained in the GPU
>> devcored to ensure it does not contain any PII .. for CrOS on amd
>> devices I'd be happy to walk whoever deals with amd CrOS devices
>> through the process offline.
>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> BR,
>>>> -R
>>>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10 16:21       ` Sharma, Shashank
  2022-03-10 16:27         ` Andrey Grodzovsky
@ 2022-03-10 17:10         ` Rob Clark
  2022-03-10 17:19           ` Sharma, Shashank
  1 sibling, 1 reply; 95+ messages in thread
From: Rob Clark @ 2022-03-10 17:10 UTC (permalink / raw)
  To: Sharma, Shashank
  Cc: Rob Clark, Christian König, Amaranath Somalapuram,
	amd-gfx list, dri-devel, Alexandar Deucher, Shashank Sharma,
	Christian Koenig

On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
<shashank.sharma@amd.com> wrote:
>
>
>
> On 3/10/2022 4:24 PM, Rob Clark wrote:
> > On Thu, Mar 10, 2022 at 1:55 AM Christian König
> > <ckoenig.leichtzumerken@gmail.com> wrote:
> >>
> >>
> >>
> >> Am 09.03.22 um 19:12 schrieb Rob Clark:
> >>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> >>> <contactshashanksharma@gmail.com> wrote:
> >>>> From: Shashank Sharma <shashank.sharma@amd.com>
> >>>>
> >>>> This patch adds a new sysfs event, which will indicate
> >>>> the userland about a GPU reset, and can also provide
> >>>> some information like:
> >>>> - process ID of the process involved with the GPU reset
> >>>> - process name of the involved process
> >>>> - the GPU status info (using flags)
> >>>>
> >>>> This patch also introduces the first flag of the flags
> >>>> bitmap, which can be appended as and when required.
> >>> Why invent something new, rather than using the already existing devcoredump?
> >>
> >> Yeah, that's a really valid question.
> >>
> >>> I don't think we need (or should encourage/allow) something drm
> >>> specific when there is already an existing solution used by both drm
> >>> and non-drm drivers.  Userspace should not have to learn to support
> >>> yet another mechanism to do the same thing.
> >>
> >> Question is how is userspace notified about new available core dumps?
> >
> > I haven't looked into it too closely, as the CrOS userspace
> > crash-reporter already had support for devcoredump, so it "just
> > worked" out of the box[1].  I believe a udev event is what triggers
> > the crash-reporter to go read the devcore dump out of sysfs.
>
> I had a quick look at the devcoredump code, and it doesn't look like
> that is sending an event to the user, so we still need an event to
> indicate a GPU reset.

There definitely is an event to userspace, I suspect somewhere down
the device_add() path?

BR,
-R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10 16:27         ` Andrey Grodzovsky
@ 2022-03-10 17:16           ` Rob Clark
  0 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-10 17:16 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian Koenig

On Thu, Mar 10, 2022 at 8:27 AM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
>
> On 2022-03-10 11:21, Sharma, Shashank wrote:
> >
> >
> > On 3/10/2022 4:24 PM, Rob Clark wrote:
> >> On Thu, Mar 10, 2022 at 1:55 AM Christian König
> >> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>
> >>>
> >>>
> >>> Am 09.03.22 um 19:12 schrieb Rob Clark:
> >>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> >>>> <contactshashanksharma@gmail.com> wrote:
> >>>>> From: Shashank Sharma <shashank.sharma@amd.com>
> >>>>>
> >>>>> This patch adds a new sysfs event, which will indicate
> >>>>> the userland about a GPU reset, and can also provide
> >>>>> some information like:
> >>>>> - process ID of the process involved with the GPU reset
> >>>>> - process name of the involved process
> >>>>> - the GPU status info (using flags)
> >>>>>
> >>>>> This patch also introduces the first flag of the flags
> >>>>> bitmap, which can be appended as and when required.
> >>>> Why invent something new, rather than using the already existing
> >>>> devcoredump?
> >>>
> >>> Yeah, that's a really valid question.
> >>>
> >>>> I don't think we need (or should encourage/allow) something drm
> >>>> specific when there is already an existing solution used by both drm
> >>>> and non-drm drivers.  Userspace should not have to learn to support
> >>>> yet another mechanism to do the same thing.
> >>>
> >>> Question is how is userspace notified about new available core dumps?
> >>
> >> I haven't looked into it too closely, as the CrOS userspace
> >> crash-reporter already had support for devcoredump, so it "just
> >> worked" out of the box[1].  I believe a udev event is what triggers
> >> the crash-reporter to go read the devcore dump out of sysfs.
> >
> > I had a quick look at the devcoredump code, and it doesn't look like
> > that is sending an event to the user, so we still need an event to
> > indicate a GPU reset.
> >
> > - Shashank
>
>
> Another point I raised in another thread is that it looks like we might
> want to use devcoredump during ASIC reset to dump more HW related data
> which is useful
> for debugging. It means the use client will have to extract the pid and
> process name out from a bigger data set - Is that ok ? We can probably
> put it at the beginning
> for easiest parsing.
>

Yes, this is what we do for drm/msm.. the start of the devcore file
has something like:

----
kernel: 5.14.0-rc3-debug+
module: msm
time: 1632763923.453207637
comm: deqp-gles3:sq0
cmdline: ./deqp-gles31 --deqp-surface-width=256
--deqp-surface-height=256 --deqp-gl-config-name=rgba8888d24s8ms0
--deqp-visibility=hidden
--deqp-caselist-file=/home/robclark/src/deqp/build/modules/gles31/new-run/c33.r1.caselist.txt
--deqp-log-filename=/home/robclark/src/deqp/build/modules/gles31/new-run/c33.r1.qpa
--deqp-log-flush=disable
--deqp-shadercache-filename=/home/robclark/src/deqp/build/modules/gles31/new-run/t499826814672.shader_cache
--deqp-shadercache-truncate=disable
revision: 618 (6.1.8.0)
----

We capture quite a lot of state, cmdstream that triggered the hang,
register/state dumps, microcontroller state, etc.  But we do go out of
our way to not capture textures or caches that might contain texture
data by default (for privacy reasons)

It has been hugely useful for debugging a few issues that happen
rarely enough that they are difficult to reproduce.  I guess that is
crowd-sourced debugging ;-)

BR,
-R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10 17:10         ` Rob Clark
@ 2022-03-10 17:19           ` Sharma, Shashank
  2022-03-10 17:40             ` Rob Clark
  0 siblings, 1 reply; 95+ messages in thread
From: Sharma, Shashank @ 2022-03-10 17:19 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Christian König, Amaranath Somalapuram,
	amd-gfx list, dri-devel, Alexandar Deucher, Shashank Sharma,
	Christian Koenig



On 3/10/2022 6:10 PM, Rob Clark wrote:
> On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
> <shashank.sharma@amd.com> wrote:
>>
>>
>>
>> On 3/10/2022 4:24 PM, Rob Clark wrote:
>>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>>>>> <contactshashanksharma@gmail.com> wrote:
>>>>>> From: Shashank Sharma <shashank.sharma@amd.com>
>>>>>>
>>>>>> This patch adds a new sysfs event, which will indicate
>>>>>> the userland about a GPU reset, and can also provide
>>>>>> some information like:
>>>>>> - process ID of the process involved with the GPU reset
>>>>>> - process name of the involved process
>>>>>> - the GPU status info (using flags)
>>>>>>
>>>>>> This patch also introduces the first flag of the flags
>>>>>> bitmap, which can be appended as and when required.
>>>>> Why invent something new, rather than using the already existing devcoredump?
>>>>
>>>> Yeah, that's a really valid question.
>>>>
>>>>> I don't think we need (or should encourage/allow) something drm
>>>>> specific when there is already an existing solution used by both drm
>>>>> and non-drm drivers.  Userspace should not have to learn to support
>>>>> yet another mechanism to do the same thing.
>>>>
>>>> Question is how is userspace notified about new available core dumps?
>>>
>>> I haven't looked into it too closely, as the CrOS userspace
>>> crash-reporter already had support for devcoredump, so it "just
>>> worked" out of the box[1].  I believe a udev event is what triggers
>>> the crash-reporter to go read the devcore dump out of sysfs.
>>
>> I had a quick look at the devcoredump code, and it doesn't look like
>> that is sending an event to the user, so we still need an event to
>> indicate a GPU reset.
> 
> There definitely is an event to userspace, I suspect somewhere down
> the device_add() path?
> 

Let me check that out as well, hope that is not due to a driver-private 
event for GPU reset, coz I think I have seen some of those in a few DRM 
drivers.

- Shashank

> BR,
> -R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10 17:19           ` Sharma, Shashank
@ 2022-03-10 17:40             ` Rob Clark
  2022-03-10 18:33               ` Abhinav Kumar
  0 siblings, 1 reply; 95+ messages in thread
From: Rob Clark @ 2022-03-10 17:40 UTC (permalink / raw)
  To: Sharma, Shashank
  Cc: Rob Clark, Christian König, Amaranath Somalapuram,
	amd-gfx list, dri-devel, Alexandar Deucher, Shashank Sharma,
	Christian Koenig

On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
<shashank.sharma@amd.com> wrote:
>
>
>
> On 3/10/2022 6:10 PM, Rob Clark wrote:
> > On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
> > <shashank.sharma@amd.com> wrote:
> >>
> >>
> >>
> >> On 3/10/2022 4:24 PM, Rob Clark wrote:
> >>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
> >>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
> >>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> >>>>> <contactshashanksharma@gmail.com> wrote:
> >>>>>> From: Shashank Sharma <shashank.sharma@amd.com>
> >>>>>>
> >>>>>> This patch adds a new sysfs event, which will indicate
> >>>>>> the userland about a GPU reset, and can also provide
> >>>>>> some information like:
> >>>>>> - process ID of the process involved with the GPU reset
> >>>>>> - process name of the involved process
> >>>>>> - the GPU status info (using flags)
> >>>>>>
> >>>>>> This patch also introduces the first flag of the flags
> >>>>>> bitmap, which can be appended as and when required.
> >>>>> Why invent something new, rather than using the already existing devcoredump?
> >>>>
> >>>> Yeah, that's a really valid question.
> >>>>
> >>>>> I don't think we need (or should encourage/allow) something drm
> >>>>> specific when there is already an existing solution used by both drm
> >>>>> and non-drm drivers.  Userspace should not have to learn to support
> >>>>> yet another mechanism to do the same thing.
> >>>>
> >>>> Question is how is userspace notified about new available core dumps?
> >>>
> >>> I haven't looked into it too closely, as the CrOS userspace
> >>> crash-reporter already had support for devcoredump, so it "just
> >>> worked" out of the box[1].  I believe a udev event is what triggers
> >>> the crash-reporter to go read the devcore dump out of sysfs.
> >>
> >> I had a quick look at the devcoredump code, and it doesn't look like
> >> that is sending an event to the user, so we still need an event to
> >> indicate a GPU reset.
> >
> > There definitely is an event to userspace, I suspect somewhere down
> > the device_add() path?
> >
>
> Let me check that out as well, hope that is not due to a driver-private
> event for GPU reset, coz I think I have seen some of those in a few DRM
> drivers.
>

Definitely no driver private event for drm/msm .. I haven't dug
through it all but this is the collector for devcoredump, triggered
somehow via udev.  Most likely from event triggered by device_add()

https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter/udev_collector.cc

BR,
-R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10 17:40             ` Rob Clark
@ 2022-03-10 18:33               ` Abhinav Kumar
  2022-03-10 19:14                 ` Sharma, Shashank
  0 siblings, 1 reply; 95+ messages in thread
From: Abhinav Kumar @ 2022-03-10 18:33 UTC (permalink / raw)
  To: Rob Clark, Sharma, Shashank
  Cc: Rob Clark, Christian König, Amaranath Somalapuram,
	dri-devel, amd-gfx list, Alexandar Deucher, Shashank Sharma,
	Christian Koenig



On 3/10/2022 9:40 AM, Rob Clark wrote:
> On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
> <shashank.sharma@amd.com> wrote:
>>
>>
>>
>> On 3/10/2022 6:10 PM, Rob Clark wrote:
>>> On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
>>> <shashank.sharma@amd.com> wrote:
>>>>
>>>>
>>>>
>>>> On 3/10/2022 4:24 PM, Rob Clark wrote:
>>>>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>>>>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>>>>>>> <contactshashanksharma@gmail.com> wrote:
>>>>>>>> From: Shashank Sharma <shashank.sharma@amd.com>
>>>>>>>>
>>>>>>>> This patch adds a new sysfs event, which will indicate
>>>>>>>> the userland about a GPU reset, and can also provide
>>>>>>>> some information like:
>>>>>>>> - process ID of the process involved with the GPU reset
>>>>>>>> - process name of the involved process
>>>>>>>> - the GPU status info (using flags)
>>>>>>>>
>>>>>>>> This patch also introduces the first flag of the flags
>>>>>>>> bitmap, which can be appended as and when required.
>>>>>>> Why invent something new, rather than using the already existing devcoredump?
>>>>>>
>>>>>> Yeah, that's a really valid question.
>>>>>>
>>>>>>> I don't think we need (or should encourage/allow) something drm
>>>>>>> specific when there is already an existing solution used by both drm
>>>>>>> and non-drm drivers.  Userspace should not have to learn to support
>>>>>>> yet another mechanism to do the same thing.
>>>>>>
>>>>>> Question is how is userspace notified about new available core dumps?
>>>>>
>>>>> I haven't looked into it too closely, as the CrOS userspace
>>>>> crash-reporter already had support for devcoredump, so it "just
>>>>> worked" out of the box[1].  I believe a udev event is what triggers
>>>>> the crash-reporter to go read the devcore dump out of sysfs.
>>>>
>>>> I had a quick look at the devcoredump code, and it doesn't look like
>>>> that is sending an event to the user, so we still need an event to
>>>> indicate a GPU reset.
>>>
>>> There definitely is an event to userspace, I suspect somewhere down
>>> the device_add() path?
>>>
>>
>> Let me check that out as well, hope that is not due to a driver-private
>> event for GPU reset, coz I think I have seen some of those in a few DRM
>> drivers.
>>
> 
> Definitely no driver private event for drm/msm .. I haven't dug
> through it all but this is the collector for devcoredump, triggered
> somehow via udev.  Most likely from event triggered by device_add()
> 
> https://chromium.googlesource.com/chromiumos/platform2/+/HEAD/crash-reporter/udev_collector.cc

Yes, that is correct. the uevent for devcoredump is from device_add()

https://github.com/torvalds/linux/blob/master/drivers/base/core.c#L3340

> 
> BR,
> -R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10 18:33               ` Abhinav Kumar
@ 2022-03-10 19:14                 ` Sharma, Shashank
  2022-03-10 19:35                   ` Rob Clark
  0 siblings, 1 reply; 95+ messages in thread
From: Sharma, Shashank @ 2022-03-10 19:14 UTC (permalink / raw)
  To: Abhinav Kumar, Rob Clark
  Cc: Rob Clark, Christian König, Amaranath Somalapuram,
	dri-devel, amd-gfx list, Alexandar Deucher, Shashank Sharma,
	Christian Koenig



On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
> 
> 
> On 3/10/2022 9:40 AM, Rob Clark wrote:
>> On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
>> <shashank.sharma@amd.com> wrote:
>>>
>>>
>>>
>>> On 3/10/2022 6:10 PM, Rob Clark wrote:
>>>> On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
>>>> <shashank.sharma@amd.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 3/10/2022 4:24 PM, Rob Clark wrote:
>>>>>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
>>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>>>>>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>>>>>>>> <contactshashanksharma@gmail.com> wrote:
>>>>>>>>> From: Shashank Sharma <shashank.sharma@amd.com>
>>>>>>>>>
>>>>>>>>> This patch adds a new sysfs event, which will indicate
>>>>>>>>> the userland about a GPU reset, and can also provide
>>>>>>>>> some information like:
>>>>>>>>> - process ID of the process involved with the GPU reset
>>>>>>>>> - process name of the involved process
>>>>>>>>> - the GPU status info (using flags)
>>>>>>>>>
>>>>>>>>> This patch also introduces the first flag of the flags
>>>>>>>>> bitmap, which can be appended as and when required.
>>>>>>>> Why invent something new, rather than using the already existing 
>>>>>>>> devcoredump?
>>>>>>>
>>>>>>> Yeah, that's a really valid question.
>>>>>>>
>>>>>>>> I don't think we need (or should encourage/allow) something drm
>>>>>>>> specific when there is already an existing solution used by both 
>>>>>>>> drm
>>>>>>>> and non-drm drivers.  Userspace should not have to learn to support
>>>>>>>> yet another mechanism to do the same thing.
>>>>>>>
>>>>>>> Question is how is userspace notified about new available core 
>>>>>>> dumps?
>>>>>>
>>>>>> I haven't looked into it too closely, as the CrOS userspace
>>>>>> crash-reporter already had support for devcoredump, so it "just
>>>>>> worked" out of the box[1].  I believe a udev event is what triggers
>>>>>> the crash-reporter to go read the devcore dump out of sysfs.
>>>>>
>>>>> I had a quick look at the devcoredump code, and it doesn't look like
>>>>> that is sending an event to the user, so we still need an event to
>>>>> indicate a GPU reset.
>>>>
>>>> There definitely is an event to userspace, I suspect somewhere down
>>>> the device_add() path?
>>>>
>>>
>>> Let me check that out as well, hope that is not due to a driver-private
>>> event for GPU reset, coz I think I have seen some of those in a few DRM
>>> drivers.
>>>
>>
>> Definitely no driver private event for drm/msm .. I haven't dug
>> through it all but this is the collector for devcoredump, triggered
>> somehow via udev.  Most likely from event triggered by device_add()
>>
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fchromium.googlesource.com%2Fchromiumos%2Fplatform2%2F%2B%2FHEAD%2Fcrash-reporter%2Fudev_collector.cc&amp;data=04%7C01%7Cshashank.sharma%40amd.com%7C86146416b717420501fc08da02c4785b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637825340130157925%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=LncI%2F5mIpeG1Avj2YXLmbZ5f1ONUfpf6TzJZH3%2Fs8%2Fw%3D&amp;reserved=0 
>>
> 
> Yes, that is correct. the uevent for devcoredump is from device_add()
> 
Yes, I could confirm in the code that device_add() sends a uevent.

kobject_uevent(&dev->kobj, KOBJ_ADD);

I was trying to map the ChromiumOs's udev event rules with the event 
being sent from device_add(), what I could see is there is only one udev 
rule for any DRM subsystem events in ChromiumOs's 99-crash-reporter.rules:

ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \
   RUN+="/sbin/crash_reporter 
--udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"

Can someone confirm that this is the rule which gets triggered when a 
devcoredump is generated ? I could not find an ERROR=1 string in the 
env[] while sending this event from dev_add();

- Shashank

> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ftorvalds%2Flinux%2Fblob%2Fmaster%2Fdrivers%2Fbase%2Fcore.c%23L3340&amp;data=04%7C01%7Cshashank.sharma%40amd.com%7C86146416b717420501fc08da02c4785b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637825340130157925%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=5HyWYZ5ZWYz4mUPWeTW51QFdoY0NlA50Nbj1dAC6os4%3D&amp;reserved=0 
> 
> 
>>
>> BR,
>> -R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10 19:14                 ` Sharma, Shashank
@ 2022-03-10 19:35                   ` Rob Clark
  2022-03-10 19:44                     ` Sharma, Shashank
  0 siblings, 1 reply; 95+ messages in thread
From: Rob Clark @ 2022-03-10 19:35 UTC (permalink / raw)
  To: Sharma, Shashank
  Cc: Rob Clark, Christian König, Amaranath Somalapuram,
	Abhinav Kumar, dri-devel, amd-gfx list, Alexandar Deucher,
	Shashank Sharma, Christian Koenig

On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank
<shashank.sharma@amd.com> wrote:
>
>
>
> On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
> >
> >
> > On 3/10/2022 9:40 AM, Rob Clark wrote:
> >> On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
> >> <shashank.sharma@amd.com> wrote:
> >>>
> >>>
> >>>
> >>> On 3/10/2022 6:10 PM, Rob Clark wrote:
> >>>> On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
> >>>> <shashank.sharma@amd.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 3/10/2022 4:24 PM, Rob Clark wrote:
> >>>>>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
> >>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
> >>>>>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> >>>>>>>> <contactshashanksharma@gmail.com> wrote:
> >>>>>>>>> From: Shashank Sharma <shashank.sharma@amd.com>
> >>>>>>>>>
> >>>>>>>>> This patch adds a new sysfs event, which will indicate
> >>>>>>>>> the userland about a GPU reset, and can also provide
> >>>>>>>>> some information like:
> >>>>>>>>> - process ID of the process involved with the GPU reset
> >>>>>>>>> - process name of the involved process
> >>>>>>>>> - the GPU status info (using flags)
> >>>>>>>>>
> >>>>>>>>> This patch also introduces the first flag of the flags
> >>>>>>>>> bitmap, which can be appended as and when required.
> >>>>>>>> Why invent something new, rather than using the already existing
> >>>>>>>> devcoredump?
> >>>>>>>
> >>>>>>> Yeah, that's a really valid question.
> >>>>>>>
> >>>>>>>> I don't think we need (or should encourage/allow) something drm
> >>>>>>>> specific when there is already an existing solution used by both
> >>>>>>>> drm
> >>>>>>>> and non-drm drivers.  Userspace should not have to learn to support
> >>>>>>>> yet another mechanism to do the same thing.
> >>>>>>>
> >>>>>>> Question is how is userspace notified about new available core
> >>>>>>> dumps?
> >>>>>>
> >>>>>> I haven't looked into it too closely, as the CrOS userspace
> >>>>>> crash-reporter already had support for devcoredump, so it "just
> >>>>>> worked" out of the box[1].  I believe a udev event is what triggers
> >>>>>> the crash-reporter to go read the devcore dump out of sysfs.
> >>>>>
> >>>>> I had a quick look at the devcoredump code, and it doesn't look like
> >>>>> that is sending an event to the user, so we still need an event to
> >>>>> indicate a GPU reset.
> >>>>
> >>>> There definitely is an event to userspace, I suspect somewhere down
> >>>> the device_add() path?
> >>>>
> >>>
> >>> Let me check that out as well, hope that is not due to a driver-private
> >>> event for GPU reset, coz I think I have seen some of those in a few DRM
> >>> drivers.
> >>>
> >>
> >> Definitely no driver private event for drm/msm .. I haven't dug
> >> through it all but this is the collector for devcoredump, triggered
> >> somehow via udev.  Most likely from event triggered by device_add()
> >>
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fchromium.googlesource.com%2Fchromiumos%2Fplatform2%2F%2B%2FHEAD%2Fcrash-reporter%2Fudev_collector.cc&amp;data=04%7C01%7Cshashank.sharma%40amd.com%7C86146416b717420501fc08da02c4785b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637825340130157925%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=LncI%2F5mIpeG1Avj2YXLmbZ5f1ONUfpf6TzJZH3%2Fs8%2Fw%3D&amp;reserved=0
> >>
> >
> > Yes, that is correct. the uevent for devcoredump is from device_add()
> >
> Yes, I could confirm in the code that device_add() sends a uevent.
>
> kobject_uevent(&dev->kobj, KOBJ_ADD);
>
> I was trying to map the ChromiumOs's udev event rules with the event
> being sent from device_add(), what I could see is there is only one udev
> rule for any DRM subsystem events in ChromiumOs's 99-crash-reporter.rules:
>
> ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \
>    RUN+="/sbin/crash_reporter
> --udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"
>
> Can someone confirm that this is the rule which gets triggered when a
> devcoredump is generated ? I could not find an ERROR=1 string in the
> env[] while sending this event from dev_add();

I think it is actually this rule:

ACTION=="add", SUBSYSTEM=="devcoredump", \
  RUN+="/sbin/crash_reporter
--udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"

It is something non-drm specific because it supports devcore dumps
from non drm drivers.  I know at least some of the wifi and remoteproc
drivers use it.

BR,
-R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10 19:35                   ` Rob Clark
@ 2022-03-10 19:44                     ` Sharma, Shashank
  2022-03-10 19:56                       ` Rob Clark
  0 siblings, 1 reply; 95+ messages in thread
From: Sharma, Shashank @ 2022-03-10 19:44 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Christian König, Amaranath Somalapuram,
	Abhinav Kumar, dri-devel, amd-gfx list, Alexandar Deucher,
	Shashank Sharma, Christian Koenig



On 3/10/2022 8:35 PM, Rob Clark wrote:
> On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank
> <shashank.sharma@amd.com> wrote:
>>
>>
>>
>> On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
>>>
>>>
>>> On 3/10/2022 9:40 AM, Rob Clark wrote:
>>>> On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
>>>> <shashank.sharma@amd.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 3/10/2022 6:10 PM, Rob Clark wrote:
>>>>>> On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
>>>>>> <shashank.sharma@amd.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 3/10/2022 4:24 PM, Rob Clark wrote:
>>>>>>>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
>>>>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>>>>>>>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>>>>>>>>>> <contactshashanksharma@gmail.com> wrote:
>>>>>>>>>>> From: Shashank Sharma <shashank.sharma@amd.com>
>>>>>>>>>>>
>>>>>>>>>>> This patch adds a new sysfs event, which will indicate
>>>>>>>>>>> the userland about a GPU reset, and can also provide
>>>>>>>>>>> some information like:
>>>>>>>>>>> - process ID of the process involved with the GPU reset
>>>>>>>>>>> - process name of the involved process
>>>>>>>>>>> - the GPU status info (using flags)
>>>>>>>>>>>
>>>>>>>>>>> This patch also introduces the first flag of the flags
>>>>>>>>>>> bitmap, which can be appended as and when required.
>>>>>>>>>> Why invent something new, rather than using the already existing
>>>>>>>>>> devcoredump?
>>>>>>>>>
>>>>>>>>> Yeah, that's a really valid question.
>>>>>>>>>
>>>>>>>>>> I don't think we need (or should encourage/allow) something drm
>>>>>>>>>> specific when there is already an existing solution used by both
>>>>>>>>>> drm
>>>>>>>>>> and non-drm drivers.  Userspace should not have to learn to support
>>>>>>>>>> yet another mechanism to do the same thing.
>>>>>>>>>
>>>>>>>>> Question is how is userspace notified about new available core
>>>>>>>>> dumps?
>>>>>>>>
>>>>>>>> I haven't looked into it too closely, as the CrOS userspace
>>>>>>>> crash-reporter already had support for devcoredump, so it "just
>>>>>>>> worked" out of the box[1].  I believe a udev event is what triggers
>>>>>>>> the crash-reporter to go read the devcore dump out of sysfs.
>>>>>>>
>>>>>>> I had a quick look at the devcoredump code, and it doesn't look like
>>>>>>> that is sending an event to the user, so we still need an event to
>>>>>>> indicate a GPU reset.
>>>>>>
>>>>>> There definitely is an event to userspace, I suspect somewhere down
>>>>>> the device_add() path?
>>>>>>
>>>>>
>>>>> Let me check that out as well, hope that is not due to a driver-private
>>>>> event for GPU reset, coz I think I have seen some of those in a few DRM
>>>>> drivers.
>>>>>
>>>>
>>>> Definitely no driver private event for drm/msm .. I haven't dug
>>>> through it all but this is the collector for devcoredump, triggered
>>>> somehow via udev.  Most likely from event triggered by device_add()
>>>>
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fchromium.googlesource.com%2Fchromiumos%2Fplatform2%2F%2B%2FHEAD%2Fcrash-reporter%2Fudev_collector.cc&amp;data=04%7C01%7Cshashank.sharma%40amd.com%7Cb4e920f125ae4d7de29708da02cd3112%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637825377562005233%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=M4xHPErex4vn7l3lNPgniiMp%2BKb3SpOHQo2QLAndxDQ%3D&amp;reserved=0
>>>>
>>>
>>> Yes, that is correct. the uevent for devcoredump is from device_add()
>>>
>> Yes, I could confirm in the code that device_add() sends a uevent.
>>
>> kobject_uevent(&dev->kobj, KOBJ_ADD);
>>
>> I was trying to map the ChromiumOs's udev event rules with the event
>> being sent from device_add(), what I could see is there is only one udev
>> rule for any DRM subsystem events in ChromiumOs's 99-crash-reporter.rules:
>>
>> ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \
>>     RUN+="/sbin/crash_reporter
>> --udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"
>>
>> Can someone confirm that this is the rule which gets triggered when a
>> devcoredump is generated ? I could not find an ERROR=1 string in the
>> env[] while sending this event from dev_add();
> 
> I think it is actually this rule:
> 
> ACTION=="add", SUBSYSTEM=="devcoredump", \
>    RUN+="/sbin/crash_reporter
> --udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"
> 
> It is something non-drm specific because it supports devcore dumps
> from non drm drivers.  I know at least some of the wifi and remoteproc
> drivers use it.
> 

Ah, this seems like a problem for me. I understand it will work for a 
reset/recovery app well, but if a DRM client (like a compositor), who 
wants to listen only to DRM events (like a GPU reset), wouldn't this 
create a lot of noise for it ? Like every time any subsystem produces 
this coredump, there will be a change in devcoresump subsystem, and the 
client will have to parse the core file, and then will have to decide if 
it wants to react to this, or ignore.

Wouldn't a GPU reset event, specific to DRM subsystem server better in 
such case ?

- Shashank

> BR,
> -R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10 19:44                     ` Sharma, Shashank
@ 2022-03-10 19:56                       ` Rob Clark
  2022-03-10 20:17                         ` Sharma, Shashank
  2022-03-11  8:30                         ` Pekka Paalanen
  0 siblings, 2 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-10 19:56 UTC (permalink / raw)
  To: Sharma, Shashank
  Cc: Rob Clark, Christian König, Amaranath Somalapuram,
	Abhinav Kumar, dri-devel, amd-gfx list, Alexandar Deucher,
	Shashank Sharma, Christian Koenig

On Thu, Mar 10, 2022 at 11:44 AM Sharma, Shashank
<shashank.sharma@amd.com> wrote:
>
>
>
> On 3/10/2022 8:35 PM, Rob Clark wrote:
> > On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank
> > <shashank.sharma@amd.com> wrote:
> >>
> >>
> >>
> >> On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
> >>>
> >>>
> >>> On 3/10/2022 9:40 AM, Rob Clark wrote:
> >>>> On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
> >>>> <shashank.sharma@amd.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 3/10/2022 6:10 PM, Rob Clark wrote:
> >>>>>> On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
> >>>>>> <shashank.sharma@amd.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 3/10/2022 4:24 PM, Rob Clark wrote:
> >>>>>>>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
> >>>>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
> >>>>>>>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> >>>>>>>>>> <contactshashanksharma@gmail.com> wrote:
> >>>>>>>>>>> From: Shashank Sharma <shashank.sharma@amd.com>
> >>>>>>>>>>>
> >>>>>>>>>>> This patch adds a new sysfs event, which will indicate
> >>>>>>>>>>> the userland about a GPU reset, and can also provide
> >>>>>>>>>>> some information like:
> >>>>>>>>>>> - process ID of the process involved with the GPU reset
> >>>>>>>>>>> - process name of the involved process
> >>>>>>>>>>> - the GPU status info (using flags)
> >>>>>>>>>>>
> >>>>>>>>>>> This patch also introduces the first flag of the flags
> >>>>>>>>>>> bitmap, which can be appended as and when required.
> >>>>>>>>>> Why invent something new, rather than using the already existing
> >>>>>>>>>> devcoredump?
> >>>>>>>>>
> >>>>>>>>> Yeah, that's a really valid question.
> >>>>>>>>>
> >>>>>>>>>> I don't think we need (or should encourage/allow) something drm
> >>>>>>>>>> specific when there is already an existing solution used by both
> >>>>>>>>>> drm
> >>>>>>>>>> and non-drm drivers.  Userspace should not have to learn to support
> >>>>>>>>>> yet another mechanism to do the same thing.
> >>>>>>>>>
> >>>>>>>>> Question is how is userspace notified about new available core
> >>>>>>>>> dumps?
> >>>>>>>>
> >>>>>>>> I haven't looked into it too closely, as the CrOS userspace
> >>>>>>>> crash-reporter already had support for devcoredump, so it "just
> >>>>>>>> worked" out of the box[1].  I believe a udev event is what triggers
> >>>>>>>> the crash-reporter to go read the devcore dump out of sysfs.
> >>>>>>>
> >>>>>>> I had a quick look at the devcoredump code, and it doesn't look like
> >>>>>>> that is sending an event to the user, so we still need an event to
> >>>>>>> indicate a GPU reset.
> >>>>>>
> >>>>>> There definitely is an event to userspace, I suspect somewhere down
> >>>>>> the device_add() path?
> >>>>>>
> >>>>>
> >>>>> Let me check that out as well, hope that is not due to a driver-private
> >>>>> event for GPU reset, coz I think I have seen some of those in a few DRM
> >>>>> drivers.
> >>>>>
> >>>>
> >>>> Definitely no driver private event for drm/msm .. I haven't dug
> >>>> through it all but this is the collector for devcoredump, triggered
> >>>> somehow via udev.  Most likely from event triggered by device_add()
> >>>>
> >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fchromium.googlesource.com%2Fchromiumos%2Fplatform2%2F%2B%2FHEAD%2Fcrash-reporter%2Fudev_collector.cc&amp;data=04%7C01%7Cshashank.sharma%40amd.com%7Cb4e920f125ae4d7de29708da02cd3112%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637825377562005233%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=M4xHPErex4vn7l3lNPgniiMp%2BKb3SpOHQo2QLAndxDQ%3D&amp;reserved=0
> >>>>
> >>>
> >>> Yes, that is correct. the uevent for devcoredump is from device_add()
> >>>
> >> Yes, I could confirm in the code that device_add() sends a uevent.
> >>
> >> kobject_uevent(&dev->kobj, KOBJ_ADD);
> >>
> >> I was trying to map the ChromiumOs's udev event rules with the event
> >> being sent from device_add(), what I could see is there is only one udev
> >> rule for any DRM subsystem events in ChromiumOs's 99-crash-reporter.rules:
> >>
> >> ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \
> >>     RUN+="/sbin/crash_reporter
> >> --udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"
> >>
> >> Can someone confirm that this is the rule which gets triggered when a
> >> devcoredump is generated ? I could not find an ERROR=1 string in the
> >> env[] while sending this event from dev_add();
> >
> > I think it is actually this rule:
> >
> > ACTION=="add", SUBSYSTEM=="devcoredump", \
> >    RUN+="/sbin/crash_reporter
> > --udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"
> >
> > It is something non-drm specific because it supports devcore dumps
> > from non drm drivers.  I know at least some of the wifi and remoteproc
> > drivers use it.
> >
>
> Ah, this seems like a problem for me. I understand it will work for a
> reset/recovery app well, but if a DRM client (like a compositor), who
> wants to listen only to DRM events (like a GPU reset), wouldn't this
> create a lot of noise for it ? Like every time any subsystem produces
> this coredump, there will be a change in devcoresump subsystem, and the
> client will have to parse the core file, and then will have to decide if
> it wants to react to this, or ignore.
>
> Wouldn't a GPU reset event, specific to DRM subsystem server better in
> such case ?
>

So, I suppose there are two different use-cases.. for something like
distro which has generic crash telemetry (ie. when users opt in to
automated crash reporting), and in general for debugging gpu crashes,
you want devcoredump, preferably with plenty of information about gpu
state, etc, so you actually have a chance of debugging problems you
can't necessarily reproduce locally.  Note also that mesa CI has some
limited support for collecting devcore dumps if a CI run triggers a
GPU fault.

For something like just notifying a compositor that a gpu crash
happened, perhaps drm_event is more suitable.  See
virtio_gpu_fence_event_create() for an example of adding new event
types.  Although maybe you want it to be an event which is not device
specific.  This isn't so much of a debugging use-case as simply
notification.

BR,
-R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10 19:56                       ` Rob Clark
@ 2022-03-10 20:17                         ` Sharma, Shashank
  2022-03-11  8:30                         ` Pekka Paalanen
  1 sibling, 0 replies; 95+ messages in thread
From: Sharma, Shashank @ 2022-03-10 20:17 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Christian König, Amaranath Somalapuram,
	Abhinav Kumar, dri-devel, amd-gfx list, Alexandar Deucher,
	Shashank Sharma, Christian Koenig



On 3/10/2022 8:56 PM, Rob Clark wrote:
> On Thu, Mar 10, 2022 at 11:44 AM Sharma, Shashank
> <shashank.sharma@amd.com> wrote:
>>
>>
>>
>> On 3/10/2022 8:35 PM, Rob Clark wrote:
>>> On Thu, Mar 10, 2022 at 11:14 AM Sharma, Shashank
>>> <shashank.sharma@amd.com> wrote:
>>>>
>>>>
>>>>
>>>> On 3/10/2022 7:33 PM, Abhinav Kumar wrote:
>>>>>
>>>>>
>>>>> On 3/10/2022 9:40 AM, Rob Clark wrote:
>>>>>> On Thu, Mar 10, 2022 at 9:19 AM Sharma, Shashank
>>>>>> <shashank.sharma@amd.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 3/10/2022 6:10 PM, Rob Clark wrote:
>>>>>>>> On Thu, Mar 10, 2022 at 8:21 AM Sharma, Shashank
>>>>>>>> <shashank.sharma@amd.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/10/2022 4:24 PM, Rob Clark wrote:
>>>>>>>>>> On Thu, Mar 10, 2022 at 1:55 AM Christian König
>>>>>>>>>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Am 09.03.22 um 19:12 schrieb Rob Clark:
>>>>>>>>>>>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>>>>>>>>>>>> <contactshashanksharma@gmail.com> wrote:
>>>>>>>>>>>>> From: Shashank Sharma <shashank.sharma@amd.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> This patch adds a new sysfs event, which will indicate
>>>>>>>>>>>>> the userland about a GPU reset, and can also provide
>>>>>>>>>>>>> some information like:
>>>>>>>>>>>>> - process ID of the process involved with the GPU reset
>>>>>>>>>>>>> - process name of the involved process
>>>>>>>>>>>>> - the GPU status info (using flags)
>>>>>>>>>>>>>
>>>>>>>>>>>>> This patch also introduces the first flag of the flags
>>>>>>>>>>>>> bitmap, which can be appended as and when required.
>>>>>>>>>>>> Why invent something new, rather than using the already existing
>>>>>>>>>>>> devcoredump?
>>>>>>>>>>>
>>>>>>>>>>> Yeah, that's a really valid question.
>>>>>>>>>>>
>>>>>>>>>>>> I don't think we need (or should encourage/allow) something drm
>>>>>>>>>>>> specific when there is already an existing solution used by both
>>>>>>>>>>>> drm
>>>>>>>>>>>> and non-drm drivers.  Userspace should not have to learn to support
>>>>>>>>>>>> yet another mechanism to do the same thing.
>>>>>>>>>>>
>>>>>>>>>>> Question is how is userspace notified about new available core
>>>>>>>>>>> dumps?
>>>>>>>>>>
>>>>>>>>>> I haven't looked into it too closely, as the CrOS userspace
>>>>>>>>>> crash-reporter already had support for devcoredump, so it "just
>>>>>>>>>> worked" out of the box[1].  I believe a udev event is what triggers
>>>>>>>>>> the crash-reporter to go read the devcore dump out of sysfs.
>>>>>>>>>
>>>>>>>>> I had a quick look at the devcoredump code, and it doesn't look like
>>>>>>>>> that is sending an event to the user, so we still need an event to
>>>>>>>>> indicate a GPU reset.
>>>>>>>>
>>>>>>>> There definitely is an event to userspace, I suspect somewhere down
>>>>>>>> the device_add() path?
>>>>>>>>
>>>>>>>
>>>>>>> Let me check that out as well, hope that is not due to a driver-private
>>>>>>> event for GPU reset, coz I think I have seen some of those in a few DRM
>>>>>>> drivers.
>>>>>>>
>>>>>>
>>>>>> Definitely no driver private event for drm/msm .. I haven't dug
>>>>>> through it all but this is the collector for devcoredump, triggered
>>>>>> somehow via udev.  Most likely from event triggered by device_add()
>>>>>>
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fchromium.googlesource.com%2Fchromiumos%2Fplatform2%2F%2B%2FHEAD%2Fcrash-reporter%2Fudev_collector.cc&amp;data=04%7C01%7Cshashank.sharma%40amd.com%7C3b5c0e8744234962061d08da02d00248%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637825389694363926%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=mWI1Z%2B8eMJApePc5ajRipbGUG9Cw9wXf2FCw6NQxVaM%3D&amp;reserved=0
>>>>>>
>>>>>
>>>>> Yes, that is correct. the uevent for devcoredump is from device_add()
>>>>>
>>>> Yes, I could confirm in the code that device_add() sends a uevent.
>>>>
>>>> kobject_uevent(&dev->kobj, KOBJ_ADD);
>>>>
>>>> I was trying to map the ChromiumOs's udev event rules with the event
>>>> being sent from device_add(), what I could see is there is only one udev
>>>> rule for any DRM subsystem events in ChromiumOs's 99-crash-reporter.rules:
>>>>
>>>> ACTION=="change", SUBSYSTEM=="drm", KERNEL=="card0", ENV{ERROR}=="1", \
>>>>      RUN+="/sbin/crash_reporter
>>>> --udev=KERNEL=card0:SUBSYSTEM=drm:ACTION=change"
>>>>
>>>> Can someone confirm that this is the rule which gets triggered when a
>>>> devcoredump is generated ? I could not find an ERROR=1 string in the
>>>> env[] while sending this event from dev_add();
>>>
>>> I think it is actually this rule:
>>>
>>> ACTION=="add", SUBSYSTEM=="devcoredump", \
>>>     RUN+="/sbin/crash_reporter
>>> --udev=SUBSYSTEM=devcoredump:ACTION=add:KERNEL_NUMBER=%n"
>>>
>>> It is something non-drm specific because it supports devcore dumps
>>> from non drm drivers.  I know at least some of the wifi and remoteproc
>>> drivers use it.
>>>
>>
>> Ah, this seems like a problem for me. I understand it will work for a
>> reset/recovery app well, but if a DRM client (like a compositor), who
>> wants to listen only to DRM events (like a GPU reset), wouldn't this
>> create a lot of noise for it ? Like every time any subsystem produces
>> this coredump, there will be a change in devcoresump subsystem, and the
>> client will have to parse the core file, and then will have to decide if
>> it wants to react to this, or ignore.
>>
>> Wouldn't a GPU reset event, specific to DRM subsystem server better in
>> such case ?
>>
> 
> So, I suppose there are two different use-cases.. for something like
> distro which has generic crash telemetry (ie. when users opt in to
> automated crash reporting), and in general for debugging gpu crashes,
> you want devcoredump, preferably with plenty of information about gpu
> state, etc, so you actually have a chance of debugging problems you
> can't necessarily reproduce locally.  Note also that mesa CI has some
> limited support for collecting devcore dumps if a CI run triggers a
> GPU fault.
> 
> For something like just notifying a compositor that a gpu crash
> happened, perhaps drm_event is more suitable.  See
> virtio_gpu_fence_event_create() for an example of adding new event
> types.  Although maybe you want it to be an event which is not device
> specific.  This isn't so much of a debugging use-case as simply
> notification.

Exactly, in fact that should have been a part of my cover-letter. May be 
I will send a V2 with better doc.

- Shashank

> 
> BR,
> -R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-10 19:56                       ` Rob Clark
  2022-03-10 20:17                         ` Sharma, Shashank
@ 2022-03-11  8:30                         ` Pekka Paalanen
  2022-03-14 14:23                             ` Alex Deucher
  1 sibling, 1 reply; 95+ messages in thread
From: Pekka Paalanen @ 2022-03-11  8:30 UTC (permalink / raw)
  To: Rob Clark, Shashank Sharma
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Christian Koenig

[-- Attachment #1: Type: text/plain, Size: 1688 bytes --]

On Thu, 10 Mar 2022 11:56:41 -0800
Rob Clark <robdclark@gmail.com> wrote:

> For something like just notifying a compositor that a gpu crash
> happened, perhaps drm_event is more suitable.  See
> virtio_gpu_fence_event_create() for an example of adding new event
> types.  Although maybe you want it to be an event which is not device
> specific.  This isn't so much of a debugging use-case as simply
> notification.

Hi,

for this particular use case, are we now talking about the display
device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?

If the former, I wasn't aware that display device crashes are a thing.
How should a userspace display server react to those?

If the latter, don't we have EGL extensions or Vulkan API already to
deliver that?

The above would be about device crashes that directly affect the
display server. Is that the use case in mind here, or is it instead
about notifying the display server that some application has caused a
driver/hardware crash? If the latter, how should a display server react
to that? Disconnect the application?

Shashank, what is the actual use case you are developing this for?

I've read all the emails here so far, and I don't recall seeing it
explained.

Btw. somewhat relatedly, there has been work aiming to allow
graceful hot-unplug of DRM devices. There is a kernel doc outlining how
the various APIs should react towards userspace when a DRM device
suddenly disappears. That seems to have some overlap here IMO.

See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
which also has a couple pointers to EGL and Vulkan APIs.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-11  8:30                         ` Pekka Paalanen
@ 2022-03-14 14:23                             ` Alex Deucher
  0 siblings, 0 replies; 95+ messages in thread
From: Alex Deucher @ 2022-03-14 14:23 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian Koenig

On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> On Thu, 10 Mar 2022 11:56:41 -0800
> Rob Clark <robdclark@gmail.com> wrote:
>
> > For something like just notifying a compositor that a gpu crash
> > happened, perhaps drm_event is more suitable.  See
> > virtio_gpu_fence_event_create() for an example of adding new event
> > types.  Although maybe you want it to be an event which is not device
> > specific.  This isn't so much of a debugging use-case as simply
> > notification.
>
> Hi,
>
> for this particular use case, are we now talking about the display
> device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
>
> If the former, I wasn't aware that display device crashes are a thing.
> How should a userspace display server react to those?
>
> If the latter, don't we have EGL extensions or Vulkan API already to
> deliver that?
>
> The above would be about device crashes that directly affect the
> display server. Is that the use case in mind here, or is it instead
> about notifying the display server that some application has caused a
> driver/hardware crash? If the latter, how should a display server react
> to that? Disconnect the application?
>
> Shashank, what is the actual use case you are developing this for?
>
> I've read all the emails here so far, and I don't recall seeing it
> explained.
>

The idea is that a support daemon or compositor would listen for GPU
reset notifications and do something useful with them (kill the guilty
app, restart the desktop environment, etc.).  Today when the GPU
resets, most applications just continue assuming nothing is wrong,
meanwhile the GPU has stopped accepting work until the apps re-init
their context so all of their command submissions just get rejected.

> Btw. somewhat relatedly, there has been work aiming to allow
> graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> the various APIs should react towards userspace when a DRM device
> suddenly disappears. That seems to have some overlap here IMO.
>
> See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> which also has a couple pointers to EGL and Vulkan APIs.

The problem is most applications don't use the GL or VK robustness
APIs.  You could use something like that in the compositor, but those
APIs tend to be focused more on the application itself rather than the
GPU in general.  E.g., Is my context lost.  Which is fine for
restarting your context, but doesn't really help if you want to try
and do something with another application (i.e., the likely guilty
app).  Also, on dGPU at least, when you reset the GPU, vram is usually
lost (either due to the memory controller being reset, or vram being
zero'd on init due to ECC support), so even if you are not the guilty
process, in that case you'd need to re-init your context anyway.

Alex

>
>
> Thanks,
> pq

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-14 14:23                             ` Alex Deucher
  0 siblings, 0 replies; 95+ messages in thread
From: Alex Deucher @ 2022-03-14 14:23 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, Rob Clark,
	amd-gfx list, Alexandar Deucher, Shashank Sharma,
	Christian Koenig

On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> On Thu, 10 Mar 2022 11:56:41 -0800
> Rob Clark <robdclark@gmail.com> wrote:
>
> > For something like just notifying a compositor that a gpu crash
> > happened, perhaps drm_event is more suitable.  See
> > virtio_gpu_fence_event_create() for an example of adding new event
> > types.  Although maybe you want it to be an event which is not device
> > specific.  This isn't so much of a debugging use-case as simply
> > notification.
>
> Hi,
>
> for this particular use case, are we now talking about the display
> device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
>
> If the former, I wasn't aware that display device crashes are a thing.
> How should a userspace display server react to those?
>
> If the latter, don't we have EGL extensions or Vulkan API already to
> deliver that?
>
> The above would be about device crashes that directly affect the
> display server. Is that the use case in mind here, or is it instead
> about notifying the display server that some application has caused a
> driver/hardware crash? If the latter, how should a display server react
> to that? Disconnect the application?
>
> Shashank, what is the actual use case you are developing this for?
>
> I've read all the emails here so far, and I don't recall seeing it
> explained.
>

The idea is that a support daemon or compositor would listen for GPU
reset notifications and do something useful with them (kill the guilty
app, restart the desktop environment, etc.).  Today when the GPU
resets, most applications just continue assuming nothing is wrong,
meanwhile the GPU has stopped accepting work until the apps re-init
their context so all of their command submissions just get rejected.

> Btw. somewhat relatedly, there has been work aiming to allow
> graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> the various APIs should react towards userspace when a DRM device
> suddenly disappears. That seems to have some overlap here IMO.
>
> See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> which also has a couple pointers to EGL and Vulkan APIs.

The problem is most applications don't use the GL or VK robustness
APIs.  You could use something like that in the compositor, but those
APIs tend to be focused more on the application itself rather than the
GPU in general.  E.g., Is my context lost.  Which is fine for
restarting your context, but doesn't really help if you want to try
and do something with another application (i.e., the likely guilty
app).  Also, on dGPU at least, when you reset the GPU, vram is usually
lost (either due to the memory controller being reset, or vram being
zero'd on init due to ECC support), so even if you are not the guilty
process, in that case you'd need to re-init your context anyway.

Alex

>
>
> Thanks,
> pq

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-14 14:23                             ` Alex Deucher
@ 2022-03-14 15:26                               ` Pekka Paalanen
  -1 siblings, 0 replies; 95+ messages in thread
From: Pekka Paalanen @ 2022-03-14 15:26 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian Koenig

[-- Attachment #1: Type: text/plain, Size: 4081 bytes --]

On Mon, 14 Mar 2022 10:23:27 -0400
Alex Deucher <alexdeucher@gmail.com> wrote:

> On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
> > On Thu, 10 Mar 2022 11:56:41 -0800
> > Rob Clark <robdclark@gmail.com> wrote:
> >  
> > > For something like just notifying a compositor that a gpu crash
> > > happened, perhaps drm_event is more suitable.  See
> > > virtio_gpu_fence_event_create() for an example of adding new event
> > > types.  Although maybe you want it to be an event which is not device
> > > specific.  This isn't so much of a debugging use-case as simply
> > > notification.  
> >
> > Hi,
> >
> > for this particular use case, are we now talking about the display
> > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> >
> > If the former, I wasn't aware that display device crashes are a thing.
> > How should a userspace display server react to those?
> >
> > If the latter, don't we have EGL extensions or Vulkan API already to
> > deliver that?
> >
> > The above would be about device crashes that directly affect the
> > display server. Is that the use case in mind here, or is it instead
> > about notifying the display server that some application has caused a
> > driver/hardware crash? If the latter, how should a display server react
> > to that? Disconnect the application?
> >
> > Shashank, what is the actual use case you are developing this for?
> >
> > I've read all the emails here so far, and I don't recall seeing it
> > explained.
> >  
> 
> The idea is that a support daemon or compositor would listen for GPU
> reset notifications and do something useful with them (kill the guilty
> app, restart the desktop environment, etc.).  Today when the GPU
> resets, most applications just continue assuming nothing is wrong,
> meanwhile the GPU has stopped accepting work until the apps re-init
> their context so all of their command submissions just get rejected.
> 
> > Btw. somewhat relatedly, there has been work aiming to allow
> > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > the various APIs should react towards userspace when a DRM device
> > suddenly disappears. That seems to have some overlap here IMO.
> >
> > See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > which also has a couple pointers to EGL and Vulkan APIs.  
> 
> The problem is most applications don't use the GL or VK robustness
> APIs.

Hi,

how would this new event help with that?

I mean, yeah, there could be a daemon that kills those GPU users, but
then what? You still lose any unsaved work, and may need to manually
restart them.

Is the idea that it is better to have the app crash and disappear than
to look like it froze while it otherwise still runs?

If some daemon or compositor goes killing apps that trigger GPU resets,
then how do we stop that for an app that actually does use the
appropriate EGL or Vulkan APIs to detect and remedy that situation
itself?

>  You could use something like that in the compositor, but those
> APIs tend to be focused more on the application itself rather than the
> GPU in general.  E.g., Is my context lost.  Which is fine for
> restarting your context, but doesn't really help if you want to try
> and do something with another application (i.e., the likely guilty
> app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> lost (either due to the memory controller being reset, or vram being
> zero'd on init due to ECC support), so even if you are not the guilty
> process, in that case you'd need to re-init your context anyway.

Why should something like a compositor listen for this and kill apps
that triggered GPU resets, instead of e.g. Mesa noticing that in the app
and killing itself? Mesa in the app would know if robustness API is
being used.

Would be really nice to have the answers to all these questions to be
collected and reiterated in the next version of this proposal.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-14 15:26                               ` Pekka Paalanen
  0 siblings, 0 replies; 95+ messages in thread
From: Pekka Paalanen @ 2022-03-14 15:26 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, Rob Clark,
	amd-gfx list, Alexandar Deucher, Shashank Sharma,
	Christian Koenig

[-- Attachment #1: Type: text/plain, Size: 4081 bytes --]

On Mon, 14 Mar 2022 10:23:27 -0400
Alex Deucher <alexdeucher@gmail.com> wrote:

> On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
> > On Thu, 10 Mar 2022 11:56:41 -0800
> > Rob Clark <robdclark@gmail.com> wrote:
> >  
> > > For something like just notifying a compositor that a gpu crash
> > > happened, perhaps drm_event is more suitable.  See
> > > virtio_gpu_fence_event_create() for an example of adding new event
> > > types.  Although maybe you want it to be an event which is not device
> > > specific.  This isn't so much of a debugging use-case as simply
> > > notification.  
> >
> > Hi,
> >
> > for this particular use case, are we now talking about the display
> > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> >
> > If the former, I wasn't aware that display device crashes are a thing.
> > How should a userspace display server react to those?
> >
> > If the latter, don't we have EGL extensions or Vulkan API already to
> > deliver that?
> >
> > The above would be about device crashes that directly affect the
> > display server. Is that the use case in mind here, or is it instead
> > about notifying the display server that some application has caused a
> > driver/hardware crash? If the latter, how should a display server react
> > to that? Disconnect the application?
> >
> > Shashank, what is the actual use case you are developing this for?
> >
> > I've read all the emails here so far, and I don't recall seeing it
> > explained.
> >  
> 
> The idea is that a support daemon or compositor would listen for GPU
> reset notifications and do something useful with them (kill the guilty
> app, restart the desktop environment, etc.).  Today when the GPU
> resets, most applications just continue assuming nothing is wrong,
> meanwhile the GPU has stopped accepting work until the apps re-init
> their context so all of their command submissions just get rejected.
> 
> > Btw. somewhat relatedly, there has been work aiming to allow
> > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > the various APIs should react towards userspace when a DRM device
> > suddenly disappears. That seems to have some overlap here IMO.
> >
> > See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > which also has a couple pointers to EGL and Vulkan APIs.  
> 
> The problem is most applications don't use the GL or VK robustness
> APIs.

Hi,

how would this new event help with that?

I mean, yeah, there could be a daemon that kills those GPU users, but
then what? You still lose any unsaved work, and may need to manually
restart them.

Is the idea that it is better to have the app crash and disappear than
to look like it froze while it otherwise still runs?

If some daemon or compositor goes killing apps that trigger GPU resets,
then how do we stop that for an app that actually does use the
appropriate EGL or Vulkan APIs to detect and remedy that situation
itself?

>  You could use something like that in the compositor, but those
> APIs tend to be focused more on the application itself rather than the
> GPU in general.  E.g., Is my context lost.  Which is fine for
> restarting your context, but doesn't really help if you want to try
> and do something with another application (i.e., the likely guilty
> app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> lost (either due to the memory controller being reset, or vram being
> zero'd on init due to ECC support), so even if you are not the guilty
> process, in that case you'd need to re-init your context anyway.

Why should something like a compositor listen for this and kill apps
that triggered GPU resets, instead of e.g. Mesa noticing that in the app
and killing itself? Mesa in the app would know if robustness API is
being used.

Would be really nice to have the answers to all these questions to be
collected and reiterated in the next version of this proposal.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-14 14:23                             ` Alex Deucher
@ 2022-03-15  7:13                               ` Dave Airlie
  -1 siblings, 0 replies; 95+ messages in thread
From: Dave Airlie @ 2022-03-15  7:13 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian Koenig

On Tue, 15 Mar 2022 at 00:23, Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
> > On Thu, 10 Mar 2022 11:56:41 -0800
> > Rob Clark <robdclark@gmail.com> wrote:
> >
> > > For something like just notifying a compositor that a gpu crash
> > > happened, perhaps drm_event is more suitable.  See
> > > virtio_gpu_fence_event_create() for an example of adding new event
> > > types.  Although maybe you want it to be an event which is not device
> > > specific.  This isn't so much of a debugging use-case as simply
> > > notification.
> >
> > Hi,
> >
> > for this particular use case, are we now talking about the display
> > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> >
> > If the former, I wasn't aware that display device crashes are a thing.
> > How should a userspace display server react to those?
> >
> > If the latter, don't we have EGL extensions or Vulkan API already to
> > deliver that?
> >
> > The above would be about device crashes that directly affect the
> > display server. Is that the use case in mind here, or is it instead
> > about notifying the display server that some application has caused a
> > driver/hardware crash? If the latter, how should a display server react
> > to that? Disconnect the application?
> >
> > Shashank, what is the actual use case you are developing this for?
> >
> > I've read all the emails here so far, and I don't recall seeing it
> > explained.
> >
>
> The idea is that a support daemon or compositor would listen for GPU
> reset notifications and do something useful with them (kill the guilty
> app, restart the desktop environment, etc.).  Today when the GPU
> resets, most applications just continue assuming nothing is wrong,
> meanwhile the GPU has stopped accepting work until the apps re-init
> their context so all of their command submissions just get rejected.

Just one thing comes to mind reading this, racy PID reuse.

process 1234 does something bad to GPU.
process 1234 dies in parallel to sysfs notification being sent.
other process 1234 reuses the pid
new process 1234 gets destroyed by receiver of sysfs notification.

Dave.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-15  7:13                               ` Dave Airlie
  0 siblings, 0 replies; 95+ messages in thread
From: Dave Airlie @ 2022-03-15  7:13 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rob Clark, Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, Pekka Paalanen,
	amd-gfx list, Alexandar Deucher, Shashank Sharma,
	Christian Koenig

On Tue, 15 Mar 2022 at 00:23, Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
> > On Thu, 10 Mar 2022 11:56:41 -0800
> > Rob Clark <robdclark@gmail.com> wrote:
> >
> > > For something like just notifying a compositor that a gpu crash
> > > happened, perhaps drm_event is more suitable.  See
> > > virtio_gpu_fence_event_create() for an example of adding new event
> > > types.  Although maybe you want it to be an event which is not device
> > > specific.  This isn't so much of a debugging use-case as simply
> > > notification.
> >
> > Hi,
> >
> > for this particular use case, are we now talking about the display
> > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> >
> > If the former, I wasn't aware that display device crashes are a thing.
> > How should a userspace display server react to those?
> >
> > If the latter, don't we have EGL extensions or Vulkan API already to
> > deliver that?
> >
> > The above would be about device crashes that directly affect the
> > display server. Is that the use case in mind here, or is it instead
> > about notifying the display server that some application has caused a
> > driver/hardware crash? If the latter, how should a display server react
> > to that? Disconnect the application?
> >
> > Shashank, what is the actual use case you are developing this for?
> >
> > I've read all the emails here so far, and I don't recall seeing it
> > explained.
> >
>
> The idea is that a support daemon or compositor would listen for GPU
> reset notifications and do something useful with them (kill the guilty
> app, restart the desktop environment, etc.).  Today when the GPU
> resets, most applications just continue assuming nothing is wrong,
> meanwhile the GPU has stopped accepting work until the apps re-init
> their context so all of their command submissions just get rejected.

Just one thing comes to mind reading this, racy PID reuse.

process 1234 does something bad to GPU.
process 1234 dies in parallel to sysfs notification being sent.
other process 1234 reuses the pid
new process 1234 gets destroyed by receiver of sysfs notification.

Dave.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-15  7:13                               ` Dave Airlie
@ 2022-03-15  7:25                                 ` Simon Ser
  -1 siblings, 0 replies; 95+ messages in thread
From: Simon Ser @ 2022-03-15  7:25 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Shashank Sharma,
	Christian Koenig

On Tuesday, March 15th, 2022 at 08:13, Dave Airlie <airlied@gmail.com> wrote:

> Just one thing comes to mind reading this, racy PID reuse.
>
> process 1234 does something bad to GPU.
> process 1234 dies in parallel to sysfs notification being sent.
> other process 1234 reuses the pid
> new process 1234 gets destroyed by receiver of sysfs notification.

This is a more general problem with PIDs. One solution is pidfd.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-15  7:25                                 ` Simon Ser
  0 siblings, 0 replies; 95+ messages in thread
From: Simon Ser @ 2022-03-15  7:25 UTC (permalink / raw)
  To: Dave Airlie
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Alex Deucher, Shashank Sharma,
	Christian Koenig

On Tuesday, March 15th, 2022 at 08:13, Dave Airlie <airlied@gmail.com> wrote:

> Just one thing comes to mind reading this, racy PID reuse.
>
> process 1234 does something bad to GPU.
> process 1234 dies in parallel to sysfs notification being sent.
> other process 1234 reuses the pid
> new process 1234 gets destroyed by receiver of sysfs notification.

This is a more general problem with PIDs. One solution is pidfd.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-15  7:13                               ` Dave Airlie
@ 2022-03-15  7:25                                 ` Christian König
  -1 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-15  7:25 UTC (permalink / raw)
  To: Dave Airlie, Alex Deucher
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma

Am 15.03.22 um 08:13 schrieb Dave Airlie:
> On Tue, 15 Mar 2022 at 00:23, Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>>> On Thu, 10 Mar 2022 11:56:41 -0800
>>> Rob Clark <robdclark@gmail.com> wrote:
>>>
>>>> For something like just notifying a compositor that a gpu crash
>>>> happened, perhaps drm_event is more suitable.  See
>>>> virtio_gpu_fence_event_create() for an example of adding new event
>>>> types.  Although maybe you want it to be an event which is not device
>>>> specific.  This isn't so much of a debugging use-case as simply
>>>> notification.
>>> Hi,
>>>
>>> for this particular use case, are we now talking about the display
>>> device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
>>>
>>> If the former, I wasn't aware that display device crashes are a thing.
>>> How should a userspace display server react to those?
>>>
>>> If the latter, don't we have EGL extensions or Vulkan API already to
>>> deliver that?
>>>
>>> The above would be about device crashes that directly affect the
>>> display server. Is that the use case in mind here, or is it instead
>>> about notifying the display server that some application has caused a
>>> driver/hardware crash? If the latter, how should a display server react
>>> to that? Disconnect the application?
>>>
>>> Shashank, what is the actual use case you are developing this for?
>>>
>>> I've read all the emails here so far, and I don't recall seeing it
>>> explained.
>>>
>> The idea is that a support daemon or compositor would listen for GPU
>> reset notifications and do something useful with them (kill the guilty
>> app, restart the desktop environment, etc.).  Today when the GPU
>> resets, most applications just continue assuming nothing is wrong,
>> meanwhile the GPU has stopped accepting work until the apps re-init
>> their context so all of their command submissions just get rejected.
> Just one thing comes to mind reading this, racy PID reuse.
>
> process 1234 does something bad to GPU.
> process 1234 dies in parallel to sysfs notification being sent.
> other process 1234 reuses the pid
> new process 1234 gets destroyed by receiver of sysfs notification.

That's a well known problem inherit to the uses of PIDs.

IIRC because of this the kernel only reuses PIDs when 
/proc/sys/kernel/pid_max is reached and then wraps around.

Regards,
Christian.

>
> Dave.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-15  7:25                                 ` Christian König
  0 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-15  7:25 UTC (permalink / raw)
  To: Dave Airlie, Alex Deucher
  Cc: Rob Clark, Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, Pekka Paalanen,
	amd-gfx list, Alexandar Deucher, Shashank Sharma

Am 15.03.22 um 08:13 schrieb Dave Airlie:
> On Tue, 15 Mar 2022 at 00:23, Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>>> On Thu, 10 Mar 2022 11:56:41 -0800
>>> Rob Clark <robdclark@gmail.com> wrote:
>>>
>>>> For something like just notifying a compositor that a gpu crash
>>>> happened, perhaps drm_event is more suitable.  See
>>>> virtio_gpu_fence_event_create() for an example of adding new event
>>>> types.  Although maybe you want it to be an event which is not device
>>>> specific.  This isn't so much of a debugging use-case as simply
>>>> notification.
>>> Hi,
>>>
>>> for this particular use case, are we now talking about the display
>>> device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
>>>
>>> If the former, I wasn't aware that display device crashes are a thing.
>>> How should a userspace display server react to those?
>>>
>>> If the latter, don't we have EGL extensions or Vulkan API already to
>>> deliver that?
>>>
>>> The above would be about device crashes that directly affect the
>>> display server. Is that the use case in mind here, or is it instead
>>> about notifying the display server that some application has caused a
>>> driver/hardware crash? If the latter, how should a display server react
>>> to that? Disconnect the application?
>>>
>>> Shashank, what is the actual use case you are developing this for?
>>>
>>> I've read all the emails here so far, and I don't recall seeing it
>>> explained.
>>>
>> The idea is that a support daemon or compositor would listen for GPU
>> reset notifications and do something useful with them (kill the guilty
>> app, restart the desktop environment, etc.).  Today when the GPU
>> resets, most applications just continue assuming nothing is wrong,
>> meanwhile the GPU has stopped accepting work until the apps re-init
>> their context so all of their command submissions just get rejected.
> Just one thing comes to mind reading this, racy PID reuse.
>
> process 1234 does something bad to GPU.
> process 1234 dies in parallel to sysfs notification being sent.
> other process 1234 reuses the pid
> new process 1234 gets destroyed by receiver of sysfs notification.

That's a well known problem inherit to the uses of PIDs.

IIRC because of this the kernel only reuses PIDs when 
/proc/sys/kernel/pid_max is reached and then wraps around.

Regards,
Christian.

>
> Dave.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-14 15:26                               ` Pekka Paalanen
@ 2022-03-15 14:54                                 ` Alex Deucher
  -1 siblings, 0 replies; 95+ messages in thread
From: Alex Deucher @ 2022-03-15 14:54 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian Koenig

On Mon, Mar 14, 2022 at 11:26 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> On Mon, 14 Mar 2022 10:23:27 -0400
> Alex Deucher <alexdeucher@gmail.com> wrote:
>
> > On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > >
> > > On Thu, 10 Mar 2022 11:56:41 -0800
> > > Rob Clark <robdclark@gmail.com> wrote:
> > >
> > > > For something like just notifying a compositor that a gpu crash
> > > > happened, perhaps drm_event is more suitable.  See
> > > > virtio_gpu_fence_event_create() for an example of adding new event
> > > > types.  Although maybe you want it to be an event which is not device
> > > > specific.  This isn't so much of a debugging use-case as simply
> > > > notification.
> > >
> > > Hi,
> > >
> > > for this particular use case, are we now talking about the display
> > > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> > >
> > > If the former, I wasn't aware that display device crashes are a thing.
> > > How should a userspace display server react to those?
> > >
> > > If the latter, don't we have EGL extensions or Vulkan API already to
> > > deliver that?
> > >
> > > The above would be about device crashes that directly affect the
> > > display server. Is that the use case in mind here, or is it instead
> > > about notifying the display server that some application has caused a
> > > driver/hardware crash? If the latter, how should a display server react
> > > to that? Disconnect the application?
> > >
> > > Shashank, what is the actual use case you are developing this for?
> > >
> > > I've read all the emails here so far, and I don't recall seeing it
> > > explained.
> > >
> >
> > The idea is that a support daemon or compositor would listen for GPU
> > reset notifications and do something useful with them (kill the guilty
> > app, restart the desktop environment, etc.).  Today when the GPU
> > resets, most applications just continue assuming nothing is wrong,
> > meanwhile the GPU has stopped accepting work until the apps re-init
> > their context so all of their command submissions just get rejected.
> >
> > > Btw. somewhat relatedly, there has been work aiming to allow
> > > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > > the various APIs should react towards userspace when a DRM device
> > > suddenly disappears. That seems to have some overlap here IMO.
> > >
> > > See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > > which also has a couple pointers to EGL and Vulkan APIs.
> >
> > The problem is most applications don't use the GL or VK robustness
> > APIs.
>
> Hi,
>
> how would this new event help with that?

This event would provide notification that a GPU reset occurred.

>
> I mean, yeah, there could be a daemon that kills those GPU users, but
> then what? You still lose any unsaved work, and may need to manually
> restart them.
>
> Is the idea that it is better to have the app crash and disappear than
> to look like it froze while it otherwise still runs?

Yes.  The daemon could also send the user some sort of notification
that a GPU reset occurred.

>
> If some daemon or compositor goes killing apps that trigger GPU resets,
> then how do we stop that for an app that actually does use the
> appropriate EGL or Vulkan APIs to detect and remedy that situation
> itself?

I guess the daemon could keep some sort of whitelist.  OTOH, very few
if any applications, especially games actually support these
extensions.

>
> >  You could use something like that in the compositor, but those
> > APIs tend to be focused more on the application itself rather than the
> > GPU in general.  E.g., Is my context lost.  Which is fine for
> > restarting your context, but doesn't really help if you want to try
> > and do something with another application (i.e., the likely guilty
> > app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> > lost (either due to the memory controller being reset, or vram being
> > zero'd on init due to ECC support), so even if you are not the guilty
> > process, in that case you'd need to re-init your context anyway.
>
> Why should something like a compositor listen for this and kill apps
> that triggered GPU resets, instead of e.g. Mesa noticing that in the app
> and killing itself? Mesa in the app would know if robustness API is
> being used.

That's another possibility, but it doesn't handle the case where the
compositor doesn't support any sort of robustness extension so if the
GPU was reset, you'd lose your desktop anyway even if the app kept
running.

>
> Would be really nice to have the answers to all these questions to be
> collected and reiterated in the next version of this proposal.

The idea is to provide the notification of a GPU reset.  What the
various desktop environments or daemons do with it is up to them.  I
still think there is value in a notification even if you don't kill
apps or anything like that.  E.g., you can have a daemon running that
gets notified and logs the error, collects debug info, sends an email,
etc.

Alex

>
>
> Thanks,
> pq

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-15 14:54                                 ` Alex Deucher
  0 siblings, 0 replies; 95+ messages in thread
From: Alex Deucher @ 2022-03-15 14:54 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, Rob Clark,
	amd-gfx list, Alexandar Deucher, Shashank Sharma,
	Christian Koenig

On Mon, Mar 14, 2022 at 11:26 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> On Mon, 14 Mar 2022 10:23:27 -0400
> Alex Deucher <alexdeucher@gmail.com> wrote:
>
> > On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > >
> > > On Thu, 10 Mar 2022 11:56:41 -0800
> > > Rob Clark <robdclark@gmail.com> wrote:
> > >
> > > > For something like just notifying a compositor that a gpu crash
> > > > happened, perhaps drm_event is more suitable.  See
> > > > virtio_gpu_fence_event_create() for an example of adding new event
> > > > types.  Although maybe you want it to be an event which is not device
> > > > specific.  This isn't so much of a debugging use-case as simply
> > > > notification.
> > >
> > > Hi,
> > >
> > > for this particular use case, are we now talking about the display
> > > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> > >
> > > If the former, I wasn't aware that display device crashes are a thing.
> > > How should a userspace display server react to those?
> > >
> > > If the latter, don't we have EGL extensions or Vulkan API already to
> > > deliver that?
> > >
> > > The above would be about device crashes that directly affect the
> > > display server. Is that the use case in mind here, or is it instead
> > > about notifying the display server that some application has caused a
> > > driver/hardware crash? If the latter, how should a display server react
> > > to that? Disconnect the application?
> > >
> > > Shashank, what is the actual use case you are developing this for?
> > >
> > > I've read all the emails here so far, and I don't recall seeing it
> > > explained.
> > >
> >
> > The idea is that a support daemon or compositor would listen for GPU
> > reset notifications and do something useful with them (kill the guilty
> > app, restart the desktop environment, etc.).  Today when the GPU
> > resets, most applications just continue assuming nothing is wrong,
> > meanwhile the GPU has stopped accepting work until the apps re-init
> > their context so all of their command submissions just get rejected.
> >
> > > Btw. somewhat relatedly, there has been work aiming to allow
> > > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > > the various APIs should react towards userspace when a DRM device
> > > suddenly disappears. That seems to have some overlap here IMO.
> > >
> > > See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > > which also has a couple pointers to EGL and Vulkan APIs.
> >
> > The problem is most applications don't use the GL or VK robustness
> > APIs.
>
> Hi,
>
> how would this new event help with that?

This event would provide notification that a GPU reset occurred.

>
> I mean, yeah, there could be a daemon that kills those GPU users, but
> then what? You still lose any unsaved work, and may need to manually
> restart them.
>
> Is the idea that it is better to have the app crash and disappear than
> to look like it froze while it otherwise still runs?

Yes.  The daemon could also send the user some sort of notification
that a GPU reset occurred.

>
> If some daemon or compositor goes killing apps that trigger GPU resets,
> then how do we stop that for an app that actually does use the
> appropriate EGL or Vulkan APIs to detect and remedy that situation
> itself?

I guess the daemon could keep some sort of whitelist.  OTOH, very few
if any applications, especially games actually support these
extensions.

>
> >  You could use something like that in the compositor, but those
> > APIs tend to be focused more on the application itself rather than the
> > GPU in general.  E.g., Is my context lost.  Which is fine for
> > restarting your context, but doesn't really help if you want to try
> > and do something with another application (i.e., the likely guilty
> > app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> > lost (either due to the memory controller being reset, or vram being
> > zero'd on init due to ECC support), so even if you are not the guilty
> > process, in that case you'd need to re-init your context anyway.
>
> Why should something like a compositor listen for this and kill apps
> that triggered GPU resets, instead of e.g. Mesa noticing that in the app
> and killing itself? Mesa in the app would know if robustness API is
> being used.

That's another possibility, but it doesn't handle the case where the
compositor doesn't support any sort of robustness extension so if the
GPU was reset, you'd lose your desktop anyway even if the app kept
running.

>
> Would be really nice to have the answers to all these questions to be
> collected and reiterated in the next version of this proposal.

The idea is to provide the notification of a GPU reset.  What the
various desktop environments or daemons do with it is up to them.  I
still think there is value in a notification even if you don't kill
apps or anything like that.  E.g., you can have a daemon running that
gets notified and logs the error, collects debug info, sends an email,
etc.

Alex

>
>
> Thanks,
> pq

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-15 14:54                                 ` Alex Deucher
@ 2022-03-16  8:48                                   ` Pekka Paalanen
  -1 siblings, 0 replies; 95+ messages in thread
From: Pekka Paalanen @ 2022-03-16  8:48 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian Koenig

[-- Attachment #1: Type: text/plain, Size: 6974 bytes --]

On Tue, 15 Mar 2022 10:54:38 -0400
Alex Deucher <alexdeucher@gmail.com> wrote:

> On Mon, Mar 14, 2022 at 11:26 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
> > On Mon, 14 Mar 2022 10:23:27 -0400
> > Alex Deucher <alexdeucher@gmail.com> wrote:
> >  
> > > On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:  
> > > >
> > > > On Thu, 10 Mar 2022 11:56:41 -0800
> > > > Rob Clark <robdclark@gmail.com> wrote:
> > > >  
> > > > > For something like just notifying a compositor that a gpu crash
> > > > > happened, perhaps drm_event is more suitable.  See
> > > > > virtio_gpu_fence_event_create() for an example of adding new event
> > > > > types.  Although maybe you want it to be an event which is not device
> > > > > specific.  This isn't so much of a debugging use-case as simply
> > > > > notification.  
> > > >
> > > > Hi,
> > > >
> > > > for this particular use case, are we now talking about the display
> > > > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> > > >
> > > > If the former, I wasn't aware that display device crashes are a thing.
> > > > How should a userspace display server react to those?
> > > >
> > > > If the latter, don't we have EGL extensions or Vulkan API already to
> > > > deliver that?
> > > >
> > > > The above would be about device crashes that directly affect the
> > > > display server. Is that the use case in mind here, or is it instead
> > > > about notifying the display server that some application has caused a
> > > > driver/hardware crash? If the latter, how should a display server react
> > > > to that? Disconnect the application?
> > > >
> > > > Shashank, what is the actual use case you are developing this for?
> > > >
> > > > I've read all the emails here so far, and I don't recall seeing it
> > > > explained.
> > > >  
> > >
> > > The idea is that a support daemon or compositor would listen for GPU
> > > reset notifications and do something useful with them (kill the guilty
> > > app, restart the desktop environment, etc.).  Today when the GPU
> > > resets, most applications just continue assuming nothing is wrong,
> > > meanwhile the GPU has stopped accepting work until the apps re-init
> > > their context so all of their command submissions just get rejected.
> > >  
> > > > Btw. somewhat relatedly, there has been work aiming to allow
> > > > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > > > the various APIs should react towards userspace when a DRM device
> > > > suddenly disappears. That seems to have some overlap here IMO.
> > > >
> > > > See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > > > which also has a couple pointers to EGL and Vulkan APIs.  
> > >
> > > The problem is most applications don't use the GL or VK robustness
> > > APIs.  
> >
> > Hi,
> >
> > how would this new event help with that?  
> 
> This event would provide notification that a GPU reset occurred.
> 
> >
> > I mean, yeah, there could be a daemon that kills those GPU users, but
> > then what? You still lose any unsaved work, and may need to manually
> > restart them.
> >
> > Is the idea that it is better to have the app crash and disappear than
> > to look like it froze while it otherwise still runs?  
> 
> Yes.

Ok. That was just a wild guess.

>  The daemon could also send the user some sort of notification
> that a GPU reset occurred.
> 
> >
> > If some daemon or compositor goes killing apps that trigger GPU resets,
> > then how do we stop that for an app that actually does use the
> > appropriate EGL or Vulkan APIs to detect and remedy that situation
> > itself?  
> 
> I guess the daemon could keep some sort of whitelist.  OTOH, very few
> if any applications, especially games actually support these
> extensions.

I would think that a white-list is a non-starter due to the maintenance
nightmare and the "wrong by default" design for well behaving programs.

I am not a fan of optimising for broken software. I understand that
with games this is routine, but we're talking about everything here,
not just games. Games cannot be fixed, but the rest could if the
problem was not sweeped under the rug. It's like the inverse of the
platform problem.

> > >  You could use something like that in the compositor, but those
> > > APIs tend to be focused more on the application itself rather than the
> > > GPU in general.  E.g., Is my context lost.  Which is fine for
> > > restarting your context, but doesn't really help if you want to try
> > > and do something with another application (i.e., the likely guilty
> > > app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> > > lost (either due to the memory controller being reset, or vram being
> > > zero'd on init due to ECC support), so even if you are not the guilty
> > > process, in that case you'd need to re-init your context anyway.  
> >
> > Why should something like a compositor listen for this and kill apps
> > that triggered GPU resets, instead of e.g. Mesa noticing that in the app
> > and killing itself? Mesa in the app would know if robustness API is
> > being used.  
> 
> That's another possibility, but it doesn't handle the case where the
> compositor doesn't support any sort of robustness extension so if the
> GPU was reset, you'd lose your desktop anyway even if the app kept
> running.

Why does that matter?

A GPU reset happens when it happens. If a compositor does not use
robustness extensions, it's as good as dead anyway, right?

Killing a compositor from inside in Mesa if it doesn't use robustness
might be better than leaving the compositor running blind - assuming
the compositor does not quit itself after seeing crucial EGL/Vulkan
calls failing.

> >
> > Would be really nice to have the answers to all these questions to be
> > collected and reiterated in the next version of this proposal.  
> 
> The idea is to provide the notification of a GPU reset.  What the
> various desktop environments or daemons do with it is up to them.  I
> still think there is value in a notification even if you don't kill
> apps or anything like that.  E.g., you can have a daemon running that
> gets notified and logs the error, collects debug info, sends an email,
> etc.

With new UAPI comes the demand of userspace proof, not hand-waving. You
would not be proposing this new interface if you didn't have use cases
in mind, even just one. You have to document what you imagine the new
thing to be used for, so that the appropriateness can be evaluated. If
the use case is deemed inappropriate for the proposed UAPI, you need to
find another use case to justify adding the new UAPI. If there is no
use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping
someone finds use for it seems backwards to me.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-16  8:48                                   ` Pekka Paalanen
  0 siblings, 0 replies; 95+ messages in thread
From: Pekka Paalanen @ 2022-03-16  8:48 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, Rob Clark,
	amd-gfx list, Alexandar Deucher, Shashank Sharma,
	Christian Koenig

[-- Attachment #1: Type: text/plain, Size: 6974 bytes --]

On Tue, 15 Mar 2022 10:54:38 -0400
Alex Deucher <alexdeucher@gmail.com> wrote:

> On Mon, Mar 14, 2022 at 11:26 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
> > On Mon, 14 Mar 2022 10:23:27 -0400
> > Alex Deucher <alexdeucher@gmail.com> wrote:
> >  
> > > On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:  
> > > >
> > > > On Thu, 10 Mar 2022 11:56:41 -0800
> > > > Rob Clark <robdclark@gmail.com> wrote:
> > > >  
> > > > > For something like just notifying a compositor that a gpu crash
> > > > > happened, perhaps drm_event is more suitable.  See
> > > > > virtio_gpu_fence_event_create() for an example of adding new event
> > > > > types.  Although maybe you want it to be an event which is not device
> > > > > specific.  This isn't so much of a debugging use-case as simply
> > > > > notification.  
> > > >
> > > > Hi,
> > > >
> > > > for this particular use case, are we now talking about the display
> > > > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> > > >
> > > > If the former, I wasn't aware that display device crashes are a thing.
> > > > How should a userspace display server react to those?
> > > >
> > > > If the latter, don't we have EGL extensions or Vulkan API already to
> > > > deliver that?
> > > >
> > > > The above would be about device crashes that directly affect the
> > > > display server. Is that the use case in mind here, or is it instead
> > > > about notifying the display server that some application has caused a
> > > > driver/hardware crash? If the latter, how should a display server react
> > > > to that? Disconnect the application?
> > > >
> > > > Shashank, what is the actual use case you are developing this for?
> > > >
> > > > I've read all the emails here so far, and I don't recall seeing it
> > > > explained.
> > > >  
> > >
> > > The idea is that a support daemon or compositor would listen for GPU
> > > reset notifications and do something useful with them (kill the guilty
> > > app, restart the desktop environment, etc.).  Today when the GPU
> > > resets, most applications just continue assuming nothing is wrong,
> > > meanwhile the GPU has stopped accepting work until the apps re-init
> > > their context so all of their command submissions just get rejected.
> > >  
> > > > Btw. somewhat relatedly, there has been work aiming to allow
> > > > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > > > the various APIs should react towards userspace when a DRM device
> > > > suddenly disappears. That seems to have some overlap here IMO.
> > > >
> > > > See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > > > which also has a couple pointers to EGL and Vulkan APIs.  
> > >
> > > The problem is most applications don't use the GL or VK robustness
> > > APIs.  
> >
> > Hi,
> >
> > how would this new event help with that?  
> 
> This event would provide notification that a GPU reset occurred.
> 
> >
> > I mean, yeah, there could be a daemon that kills those GPU users, but
> > then what? You still lose any unsaved work, and may need to manually
> > restart them.
> >
> > Is the idea that it is better to have the app crash and disappear than
> > to look like it froze while it otherwise still runs?  
> 
> Yes.

Ok. That was just a wild guess.

>  The daemon could also send the user some sort of notification
> that a GPU reset occurred.
> 
> >
> > If some daemon or compositor goes killing apps that trigger GPU resets,
> > then how do we stop that for an app that actually does use the
> > appropriate EGL or Vulkan APIs to detect and remedy that situation
> > itself?  
> 
> I guess the daemon could keep some sort of whitelist.  OTOH, very few
> if any applications, especially games actually support these
> extensions.

I would think that a white-list is a non-starter due to the maintenance
nightmare and the "wrong by default" design for well behaving programs.

I am not a fan of optimising for broken software. I understand that
with games this is routine, but we're talking about everything here,
not just games. Games cannot be fixed, but the rest could if the
problem was not sweeped under the rug. It's like the inverse of the
platform problem.

> > >  You could use something like that in the compositor, but those
> > > APIs tend to be focused more on the application itself rather than the
> > > GPU in general.  E.g., Is my context lost.  Which is fine for
> > > restarting your context, but doesn't really help if you want to try
> > > and do something with another application (i.e., the likely guilty
> > > app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> > > lost (either due to the memory controller being reset, or vram being
> > > zero'd on init due to ECC support), so even if you are not the guilty
> > > process, in that case you'd need to re-init your context anyway.  
> >
> > Why should something like a compositor listen for this and kill apps
> > that triggered GPU resets, instead of e.g. Mesa noticing that in the app
> > and killing itself? Mesa in the app would know if robustness API is
> > being used.  
> 
> That's another possibility, but it doesn't handle the case where the
> compositor doesn't support any sort of robustness extension so if the
> GPU was reset, you'd lose your desktop anyway even if the app kept
> running.

Why does that matter?

A GPU reset happens when it happens. If a compositor does not use
robustness extensions, it's as good as dead anyway, right?

Killing a compositor from inside in Mesa if it doesn't use robustness
might be better than leaving the compositor running blind - assuming
the compositor does not quit itself after seeing crucial EGL/Vulkan
calls failing.

> >
> > Would be really nice to have the answers to all these questions to be
> > collected and reiterated in the next version of this proposal.  
> 
> The idea is to provide the notification of a GPU reset.  What the
> various desktop environments or daemons do with it is up to them.  I
> still think there is value in a notification even if you don't kill
> apps or anything like that.  E.g., you can have a daemon running that
> gets notified and logs the error, collects debug info, sends an email,
> etc.

With new UAPI comes the demand of userspace proof, not hand-waving. You
would not be proposing this new interface if you didn't have use cases
in mind, even just one. You have to document what you imagine the new
thing to be used for, so that the appropriateness can be evaluated. If
the use case is deemed inappropriate for the proposed UAPI, you need to
find another use case to justify adding the new UAPI. If there is no
use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping
someone finds use for it seems backwards to me.


Thanks,
pq

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-16  8:48                                   ` Pekka Paalanen
@ 2022-03-16 14:12                                     ` Alex Deucher
  -1 siblings, 0 replies; 95+ messages in thread
From: Alex Deucher @ 2022-03-16 14:12 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian Koenig

On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> On Tue, 15 Mar 2022 10:54:38 -0400
> Alex Deucher <alexdeucher@gmail.com> wrote:
>
> > On Mon, Mar 14, 2022 at 11:26 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > >
> > > On Mon, 14 Mar 2022 10:23:27 -0400
> > > Alex Deucher <alexdeucher@gmail.com> wrote:
> > >
> > > > On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > > > >
> > > > > On Thu, 10 Mar 2022 11:56:41 -0800
> > > > > Rob Clark <robdclark@gmail.com> wrote:
> > > > >
> > > > > > For something like just notifying a compositor that a gpu crash
> > > > > > happened, perhaps drm_event is more suitable.  See
> > > > > > virtio_gpu_fence_event_create() for an example of adding new event
> > > > > > types.  Although maybe you want it to be an event which is not device
> > > > > > specific.  This isn't so much of a debugging use-case as simply
> > > > > > notification.
> > > > >
> > > > > Hi,
> > > > >
> > > > > for this particular use case, are we now talking about the display
> > > > > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> > > > >
> > > > > If the former, I wasn't aware that display device crashes are a thing.
> > > > > How should a userspace display server react to those?
> > > > >
> > > > > If the latter, don't we have EGL extensions or Vulkan API already to
> > > > > deliver that?
> > > > >
> > > > > The above would be about device crashes that directly affect the
> > > > > display server. Is that the use case in mind here, or is it instead
> > > > > about notifying the display server that some application has caused a
> > > > > driver/hardware crash? If the latter, how should a display server react
> > > > > to that? Disconnect the application?
> > > > >
> > > > > Shashank, what is the actual use case you are developing this for?
> > > > >
> > > > > I've read all the emails here so far, and I don't recall seeing it
> > > > > explained.
> > > > >
> > > >
> > > > The idea is that a support daemon or compositor would listen for GPU
> > > > reset notifications and do something useful with them (kill the guilty
> > > > app, restart the desktop environment, etc.).  Today when the GPU
> > > > resets, most applications just continue assuming nothing is wrong,
> > > > meanwhile the GPU has stopped accepting work until the apps re-init
> > > > their context so all of their command submissions just get rejected.
> > > >
> > > > > Btw. somewhat relatedly, there has been work aiming to allow
> > > > > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > > > > the various APIs should react towards userspace when a DRM device
> > > > > suddenly disappears. That seems to have some overlap here IMO.
> > > > >
> > > > > See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > > > > which also has a couple pointers to EGL and Vulkan APIs.
> > > >
> > > > The problem is most applications don't use the GL or VK robustness
> > > > APIs.
> > >
> > > Hi,
> > >
> > > how would this new event help with that?
> >
> > This event would provide notification that a GPU reset occurred.
> >
> > >
> > > I mean, yeah, there could be a daemon that kills those GPU users, but
> > > then what? You still lose any unsaved work, and may need to manually
> > > restart them.
> > >
> > > Is the idea that it is better to have the app crash and disappear than
> > > to look like it froze while it otherwise still runs?
> >
> > Yes.
>
> Ok. That was just a wild guess.
>
> >  The daemon could also send the user some sort of notification
> > that a GPU reset occurred.
> >
> > >
> > > If some daemon or compositor goes killing apps that trigger GPU resets,
> > > then how do we stop that for an app that actually does use the
> > > appropriate EGL or Vulkan APIs to detect and remedy that situation
> > > itself?
> >
> > I guess the daemon could keep some sort of whitelist.  OTOH, very few
> > if any applications, especially games actually support these
> > extensions.
>
> I would think that a white-list is a non-starter due to the maintenance
> nightmare and the "wrong by default" design for well behaving programs.
>
> I am not a fan of optimising for broken software. I understand that
> with games this is routine, but we're talking about everything here,
> not just games. Games cannot be fixed, but the rest could if the
> problem was not sweeped under the rug. It's like the inverse of the
> platform problem.

Fair enough, but it hasn't happened in the last 15 years or so.

>
> > > >  You could use something like that in the compositor, but those
> > > > APIs tend to be focused more on the application itself rather than the
> > > > GPU in general.  E.g., Is my context lost.  Which is fine for
> > > > restarting your context, but doesn't really help if you want to try
> > > > and do something with another application (i.e., the likely guilty
> > > > app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> > > > lost (either due to the memory controller being reset, or vram being
> > > > zero'd on init due to ECC support), so even if you are not the guilty
> > > > process, in that case you'd need to re-init your context anyway.
> > >
> > > Why should something like a compositor listen for this and kill apps
> > > that triggered GPU resets, instead of e.g. Mesa noticing that in the app
> > > and killing itself? Mesa in the app would know if robustness API is
> > > being used.
> >
> > That's another possibility, but it doesn't handle the case where the
> > compositor doesn't support any sort of robustness extension so if the
> > GPU was reset, you'd lose your desktop anyway even if the app kept
> > running.
>
> Why does that matter?
>
> A GPU reset happens when it happens. If a compositor does not use
> robustness extensions, it's as good as dead anyway, right?

Right.  Same with the application that supports robustness.  If the
app supports robustness, but the compositor does not, the app is dead
anyway if the compositor dies.  So while there may be an application
today that supports robustness, it is kind of pointless because
nothing else does and so if it manages to recover, nothing else it
relies on does.

>
> Killing a compositor from inside in Mesa if it doesn't use robustness
> might be better than leaving the compositor running blind - assuming
> the compositor does not quit itself after seeing crucial EGL/Vulkan
> calls failing.

I'm not sure I follow the second part of this statement.  I guess
having mesa kill applications that don't support robustness is fine,
but I don't really see it as much better than the status quo.  Today
most apps just continue to try and run until, possibly dying
eventually due to undefined state along the way.  If mesa kills them,
they are killed.  The end result is the same, the user loses their
desktop.

>
> > >
> > > Would be really nice to have the answers to all these questions to be
> > > collected and reiterated in the next version of this proposal.
> >
> > The idea is to provide the notification of a GPU reset.  What the
> > various desktop environments or daemons do with it is up to them.  I
> > still think there is value in a notification even if you don't kill
> > apps or anything like that.  E.g., you can have a daemon running that
> > gets notified and logs the error, collects debug info, sends an email,
> > etc.
>
> With new UAPI comes the demand of userspace proof, not hand-waving. You
> would not be proposing this new interface if you didn't have use cases
> in mind, even just one. You have to document what you imagine the new
> thing to be used for, so that the appropriateness can be evaluated. If
> the use case is deemed inappropriate for the proposed UAPI, you need to
> find another use case to justify adding the new UAPI. If there is no
> use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping
> someone finds use for it seems backwards to me.

We do have a use case.  It's what I described originally.  There is a
user space daemon (could be a compositor, could be something else)
that runs and listens for GPU reset notifications.  When it receives a
notification, it takes action and kills the guilty app and restarts
the compositer and gathers any relevant data related to the GPU hang
(if possible).  We can revisit this discussion once we have the whole
implementation complete.  Other drivers seem to do similar things
already today via different means (msm using devcoredump, i915 seems
to have its own GPU reset notification mechanism, etc.).  It just
seemed like there was value in having a generic drm GPU reset
notification, but maybe not yet.

Alex

>
>
> Thanks,
> pq

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-16 14:12                                     ` Alex Deucher
  0 siblings, 0 replies; 95+ messages in thread
From: Alex Deucher @ 2022-03-16 14:12 UTC (permalink / raw)
  To: Pekka Paalanen
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, Rob Clark,
	amd-gfx list, Alexandar Deucher, Shashank Sharma,
	Christian Koenig

On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> On Tue, 15 Mar 2022 10:54:38 -0400
> Alex Deucher <alexdeucher@gmail.com> wrote:
>
> > On Mon, Mar 14, 2022 at 11:26 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > >
> > > On Mon, 14 Mar 2022 10:23:27 -0400
> > > Alex Deucher <alexdeucher@gmail.com> wrote:
> > >
> > > > On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > > > >
> > > > > On Thu, 10 Mar 2022 11:56:41 -0800
> > > > > Rob Clark <robdclark@gmail.com> wrote:
> > > > >
> > > > > > For something like just notifying a compositor that a gpu crash
> > > > > > happened, perhaps drm_event is more suitable.  See
> > > > > > virtio_gpu_fence_event_create() for an example of adding new event
> > > > > > types.  Although maybe you want it to be an event which is not device
> > > > > > specific.  This isn't so much of a debugging use-case as simply
> > > > > > notification.
> > > > >
> > > > > Hi,
> > > > >
> > > > > for this particular use case, are we now talking about the display
> > > > > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> > > > >
> > > > > If the former, I wasn't aware that display device crashes are a thing.
> > > > > How should a userspace display server react to those?
> > > > >
> > > > > If the latter, don't we have EGL extensions or Vulkan API already to
> > > > > deliver that?
> > > > >
> > > > > The above would be about device crashes that directly affect the
> > > > > display server. Is that the use case in mind here, or is it instead
> > > > > about notifying the display server that some application has caused a
> > > > > driver/hardware crash? If the latter, how should a display server react
> > > > > to that? Disconnect the application?
> > > > >
> > > > > Shashank, what is the actual use case you are developing this for?
> > > > >
> > > > > I've read all the emails here so far, and I don't recall seeing it
> > > > > explained.
> > > > >
> > > >
> > > > The idea is that a support daemon or compositor would listen for GPU
> > > > reset notifications and do something useful with them (kill the guilty
> > > > app, restart the desktop environment, etc.).  Today when the GPU
> > > > resets, most applications just continue assuming nothing is wrong,
> > > > meanwhile the GPU has stopped accepting work until the apps re-init
> > > > their context so all of their command submissions just get rejected.
> > > >
> > > > > Btw. somewhat relatedly, there has been work aiming to allow
> > > > > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > > > > the various APIs should react towards userspace when a DRM device
> > > > > suddenly disappears. That seems to have some overlap here IMO.
> > > > >
> > > > > See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > > > > which also has a couple pointers to EGL and Vulkan APIs.
> > > >
> > > > The problem is most applications don't use the GL or VK robustness
> > > > APIs.
> > >
> > > Hi,
> > >
> > > how would this new event help with that?
> >
> > This event would provide notification that a GPU reset occurred.
> >
> > >
> > > I mean, yeah, there could be a daemon that kills those GPU users, but
> > > then what? You still lose any unsaved work, and may need to manually
> > > restart them.
> > >
> > > Is the idea that it is better to have the app crash and disappear than
> > > to look like it froze while it otherwise still runs?
> >
> > Yes.
>
> Ok. That was just a wild guess.
>
> >  The daemon could also send the user some sort of notification
> > that a GPU reset occurred.
> >
> > >
> > > If some daemon or compositor goes killing apps that trigger GPU resets,
> > > then how do we stop that for an app that actually does use the
> > > appropriate EGL or Vulkan APIs to detect and remedy that situation
> > > itself?
> >
> > I guess the daemon could keep some sort of whitelist.  OTOH, very few
> > if any applications, especially games actually support these
> > extensions.
>
> I would think that a white-list is a non-starter due to the maintenance
> nightmare and the "wrong by default" design for well behaving programs.
>
> I am not a fan of optimising for broken software. I understand that
> with games this is routine, but we're talking about everything here,
> not just games. Games cannot be fixed, but the rest could if the
> problem was not sweeped under the rug. It's like the inverse of the
> platform problem.

Fair enough, but it hasn't happened in the last 15 years or so.

>
> > > >  You could use something like that in the compositor, but those
> > > > APIs tend to be focused more on the application itself rather than the
> > > > GPU in general.  E.g., Is my context lost.  Which is fine for
> > > > restarting your context, but doesn't really help if you want to try
> > > > and do something with another application (i.e., the likely guilty
> > > > app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> > > > lost (either due to the memory controller being reset, or vram being
> > > > zero'd on init due to ECC support), so even if you are not the guilty
> > > > process, in that case you'd need to re-init your context anyway.
> > >
> > > Why should something like a compositor listen for this and kill apps
> > > that triggered GPU resets, instead of e.g. Mesa noticing that in the app
> > > and killing itself? Mesa in the app would know if robustness API is
> > > being used.
> >
> > That's another possibility, but it doesn't handle the case where the
> > compositor doesn't support any sort of robustness extension so if the
> > GPU was reset, you'd lose your desktop anyway even if the app kept
> > running.
>
> Why does that matter?
>
> A GPU reset happens when it happens. If a compositor does not use
> robustness extensions, it's as good as dead anyway, right?

Right.  Same with the application that supports robustness.  If the
app supports robustness, but the compositor does not, the app is dead
anyway if the compositor dies.  So while there may be an application
today that supports robustness, it is kind of pointless because
nothing else does and so if it manages to recover, nothing else it
relies on does.

>
> Killing a compositor from inside in Mesa if it doesn't use robustness
> might be better than leaving the compositor running blind - assuming
> the compositor does not quit itself after seeing crucial EGL/Vulkan
> calls failing.

I'm not sure I follow the second part of this statement.  I guess
having mesa kill applications that don't support robustness is fine,
but I don't really see it as much better than the status quo.  Today
most apps just continue to try and run until, possibly dying
eventually due to undefined state along the way.  If mesa kills them,
they are killed.  The end result is the same, the user loses their
desktop.

>
> > >
> > > Would be really nice to have the answers to all these questions to be
> > > collected and reiterated in the next version of this proposal.
> >
> > The idea is to provide the notification of a GPU reset.  What the
> > various desktop environments or daemons do with it is up to them.  I
> > still think there is value in a notification even if you don't kill
> > apps or anything like that.  E.g., you can have a daemon running that
> > gets notified and logs the error, collects debug info, sends an email,
> > etc.
>
> With new UAPI comes the demand of userspace proof, not hand-waving. You
> would not be proposing this new interface if you didn't have use cases
> in mind, even just one. You have to document what you imagine the new
> thing to be used for, so that the appropriateness can be evaluated. If
> the use case is deemed inappropriate for the proposed UAPI, you need to
> find another use case to justify adding the new UAPI. If there is no
> use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping
> someone finds use for it seems backwards to me.

We do have a use case.  It's what I described originally.  There is a
user space daemon (could be a compositor, could be something else)
that runs and listens for GPU reset notifications.  When it receives a
notification, it takes action and kills the guilty app and restarts
the compositer and gathers any relevant data related to the GPU hang
(if possible).  We can revisit this discussion once we have the whole
implementation complete.  Other drivers seem to do similar things
already today via different means (msm using devcoredump, i915 seems
to have its own GPU reset notification mechanism, etc.).  It just
seemed like there was value in having a generic drm GPU reset
notification, but maybe not yet.

Alex

>
>
> Thanks,
> pq

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-16 14:12                                     ` Alex Deucher
@ 2022-03-16 15:36                                       ` Rob Clark
  -1 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-16 15:36 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian Koenig

On Wed, Mar 16, 2022 at 7:12 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
[snip]
> > With new UAPI comes the demand of userspace proof, not hand-waving. You
> > would not be proposing this new interface if you didn't have use cases
> > in mind, even just one. You have to document what you imagine the new
> > thing to be used for, so that the appropriateness can be evaluated. If
> > the use case is deemed inappropriate for the proposed UAPI, you need to
> > find another use case to justify adding the new UAPI. If there is no
> > use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping
> > someone finds use for it seems backwards to me.
>
> We do have a use case.  It's what I described originally.  There is a
> user space daemon (could be a compositor, could be something else)
> that runs and listens for GPU reset notifications.  When it receives a
> notification, it takes action and kills the guilty app and restarts
> the compositer and gathers any relevant data related to the GPU hang
> (if possible).  We can revisit this discussion once we have the whole
> implementation complete.  Other drivers seem to do similar things
> already today via different means (msm using devcoredump, i915 seems
> to have its own GPU reset notification mechanism, etc.).  It just
> seemed like there was value in having a generic drm GPU reset
> notification, but maybe not yet.

just one point of clarification.. in the msm and i915 case it is
purely for debugging and telemetry (ie. sending crash logs back to
distro for analysis if user has crash reporting enabled).. it isn't
used for triggering any action like killing app or compositor.

I would however *strongly* recommend devcoredump support in other GPU
drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
to debug and fix a couple obscure issues that I was not able to
reproduce by myself.

BR,
-R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-16 15:36                                       ` Rob Clark
  0 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-16 15:36 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, Pekka Paalanen,
	amd-gfx list, Alexandar Deucher, Shashank Sharma,
	Christian Koenig

On Wed, Mar 16, 2022 at 7:12 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
[snip]
> > With new UAPI comes the demand of userspace proof, not hand-waving. You
> > would not be proposing this new interface if you didn't have use cases
> > in mind, even just one. You have to document what you imagine the new
> > thing to be used for, so that the appropriateness can be evaluated. If
> > the use case is deemed inappropriate for the proposed UAPI, you need to
> > find another use case to justify adding the new UAPI. If there is no
> > use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping
> > someone finds use for it seems backwards to me.
>
> We do have a use case.  It's what I described originally.  There is a
> user space daemon (could be a compositor, could be something else)
> that runs and listens for GPU reset notifications.  When it receives a
> notification, it takes action and kills the guilty app and restarts
> the compositer and gathers any relevant data related to the GPU hang
> (if possible).  We can revisit this discussion once we have the whole
> implementation complete.  Other drivers seem to do similar things
> already today via different means (msm using devcoredump, i915 seems
> to have its own GPU reset notification mechanism, etc.).  It just
> seemed like there was value in having a generic drm GPU reset
> notification, but maybe not yet.

just one point of clarification.. in the msm and i915 case it is
purely for debugging and telemetry (ie. sending crash logs back to
distro for analysis if user has crash reporting enabled).. it isn't
used for triggering any action like killing app or compositor.

I would however *strongly* recommend devcoredump support in other GPU
drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
to debug and fix a couple obscure issues that I was not able to
reproduce by myself.

BR,
-R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-16 15:36                                       ` Rob Clark
@ 2022-03-16 15:48                                         ` Alex Deucher
  -1 siblings, 0 replies; 95+ messages in thread
From: Alex Deucher @ 2022-03-16 15:48 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian Koenig

On Wed, Mar 16, 2022 at 11:35 AM Rob Clark <robdclark@gmail.com> wrote:
>
> On Wed, Mar 16, 2022 at 7:12 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > >
> [snip]
> > > With new UAPI comes the demand of userspace proof, not hand-waving. You
> > > would not be proposing this new interface if you didn't have use cases
> > > in mind, even just one. You have to document what you imagine the new
> > > thing to be used for, so that the appropriateness can be evaluated. If
> > > the use case is deemed inappropriate for the proposed UAPI, you need to
> > > find another use case to justify adding the new UAPI. If there is no
> > > use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping
> > > someone finds use for it seems backwards to me.
> >
> > We do have a use case.  It's what I described originally.  There is a
> > user space daemon (could be a compositor, could be something else)
> > that runs and listens for GPU reset notifications.  When it receives a
> > notification, it takes action and kills the guilty app and restarts
> > the compositer and gathers any relevant data related to the GPU hang
> > (if possible).  We can revisit this discussion once we have the whole
> > implementation complete.  Other drivers seem to do similar things
> > already today via different means (msm using devcoredump, i915 seems
> > to have its own GPU reset notification mechanism, etc.).  It just
> > seemed like there was value in having a generic drm GPU reset
> > notification, but maybe not yet.
>
> just one point of clarification.. in the msm and i915 case it is
> purely for debugging and telemetry (ie. sending crash logs back to
> distro for analysis if user has crash reporting enabled).. it isn't
> used for triggering any action like killing app or compositor.

Sure.  you could use this proposed event for telemetry gathering as
well.  The important part is the event.  What you do with it is up to
the user.

>
> I would however *strongly* recommend devcoredump support in other GPU
> drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> to debug and fix a couple obscure issues that I was not able to
> reproduce by myself.

Agreed.  I think devcoredump makes sense and we'll ultimately enable
support for that in amdgpu.  I think there is value in a GPU specific
reset event as well for the use case I outlined, but maybe the
devcoredump one is enough.  We'd just need to rely on the userspace
side to filter as appropriate for events it cares about.  I'm not sure
how many other devcore dump events are commonly seen.

Alex

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-16 15:48                                         ` Alex Deucher
  0 siblings, 0 replies; 95+ messages in thread
From: Alex Deucher @ 2022-03-16 15:48 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, Pekka Paalanen,
	amd-gfx list, Alexandar Deucher, Shashank Sharma,
	Christian Koenig

On Wed, Mar 16, 2022 at 11:35 AM Rob Clark <robdclark@gmail.com> wrote:
>
> On Wed, Mar 16, 2022 at 7:12 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > >
> [snip]
> > > With new UAPI comes the demand of userspace proof, not hand-waving. You
> > > would not be proposing this new interface if you didn't have use cases
> > > in mind, even just one. You have to document what you imagine the new
> > > thing to be used for, so that the appropriateness can be evaluated. If
> > > the use case is deemed inappropriate for the proposed UAPI, you need to
> > > find another use case to justify adding the new UAPI. If there is no
> > > use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping
> > > someone finds use for it seems backwards to me.
> >
> > We do have a use case.  It's what I described originally.  There is a
> > user space daemon (could be a compositor, could be something else)
> > that runs and listens for GPU reset notifications.  When it receives a
> > notification, it takes action and kills the guilty app and restarts
> > the compositer and gathers any relevant data related to the GPU hang
> > (if possible).  We can revisit this discussion once we have the whole
> > implementation complete.  Other drivers seem to do similar things
> > already today via different means (msm using devcoredump, i915 seems
> > to have its own GPU reset notification mechanism, etc.).  It just
> > seemed like there was value in having a generic drm GPU reset
> > notification, but maybe not yet.
>
> just one point of clarification.. in the msm and i915 case it is
> purely for debugging and telemetry (ie. sending crash logs back to
> distro for analysis if user has crash reporting enabled).. it isn't
> used for triggering any action like killing app or compositor.

Sure.  you could use this proposed event for telemetry gathering as
well.  The important part is the event.  What you do with it is up to
the user.

>
> I would however *strongly* recommend devcoredump support in other GPU
> drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> to debug and fix a couple obscure issues that I was not able to
> reproduce by myself.

Agreed.  I think devcoredump makes sense and we'll ultimately enable
support for that in amdgpu.  I think there is value in a GPU specific
reset event as well for the use case I outlined, but maybe the
devcoredump one is enough.  We'd just need to rely on the userspace
side to filter as appropriate for events it cares about.  I'm not sure
how many other devcore dump events are commonly seen.

Alex

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-16 15:48                                         ` Alex Deucher
@ 2022-03-16 16:30                                           ` Rob Clark
  -1 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-16 16:30 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian Koenig

On Wed, Mar 16, 2022 at 8:48 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Wed, Mar 16, 2022 at 11:35 AM Rob Clark <robdclark@gmail.com> wrote:
> >
> > On Wed, Mar 16, 2022 at 7:12 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> > >
> > > On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > > >
> > [snip]
> > > > With new UAPI comes the demand of userspace proof, not hand-waving. You
> > > > would not be proposing this new interface if you didn't have use cases
> > > > in mind, even just one. You have to document what you imagine the new
> > > > thing to be used for, so that the appropriateness can be evaluated. If
> > > > the use case is deemed inappropriate for the proposed UAPI, you need to
> > > > find another use case to justify adding the new UAPI. If there is no
> > > > use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping
> > > > someone finds use for it seems backwards to me.
> > >
> > > We do have a use case.  It's what I described originally.  There is a
> > > user space daemon (could be a compositor, could be something else)
> > > that runs and listens for GPU reset notifications.  When it receives a
> > > notification, it takes action and kills the guilty app and restarts
> > > the compositer and gathers any relevant data related to the GPU hang
> > > (if possible).  We can revisit this discussion once we have the whole
> > > implementation complete.  Other drivers seem to do similar things
> > > already today via different means (msm using devcoredump, i915 seems
> > > to have its own GPU reset notification mechanism, etc.).  It just
> > > seemed like there was value in having a generic drm GPU reset
> > > notification, but maybe not yet.
> >
> > just one point of clarification.. in the msm and i915 case it is
> > purely for debugging and telemetry (ie. sending crash logs back to
> > distro for analysis if user has crash reporting enabled).. it isn't
> > used for triggering any action like killing app or compositor.
>
> Sure.  you could use this proposed event for telemetry gathering as
> well.  The important part is the event.  What you do with it is up to
> the user.

True, but for that purpose (telemetry) let's leverage something that
userspace already has support for

> > I would however *strongly* recommend devcoredump support in other GPU
> > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > to debug and fix a couple obscure issues that I was not able to
> > reproduce by myself.
>
> Agreed.  I think devcoredump makes sense and we'll ultimately enable
> support for that in amdgpu.  I think there is value in a GPU specific
> reset event as well for the use case I outlined, but maybe the
> devcoredump one is enough.  We'd just need to rely on the userspace
> side to filter as appropriate for events it cares about.  I'm not sure
> how many other devcore dump events are commonly seen.

They are, like CPU coredumps, not expected to be a frequent thing.
There are a number of other non-GPU drivers which also use devcoredump
(mostly wifi and remoteproc, ie. things like DSPs).  We did also
recently add msm devcoredump support for the display side of things,
in addition to gpu, to dump display state if we hit underruns, link
training fail, etc.  Each "crash" creates a new transient devcore
device so crashes from different devices can be reported in parallel.
But we do rate-limit the GPU crash reports by not reporting another
one until the previous one is read out and cleared by userspace.  (You
don't strictly need to do that, but it is probably a good idea..
usually the first crash is the interesting one.)

The ChromeOS crash-reporter does have an allow-list for which devcore
dumps are sent back, mainly because in the case of wifi drivers it is
hard to ensure that the dump does not include PII (like network names,
etc).  It would be pretty trivial to add amdgpu to the allow-list and
setup a "magic signature" so that queries could be run on amdgpu
devcore dumps.  I can help out with that, since I went through the
same process already for msm.

I'm not entirely sure what other distros do in this area.

BR,
-R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-16 16:30                                           ` Rob Clark
  0 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-16 16:30 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, Pekka Paalanen,
	amd-gfx list, Alexandar Deucher, Shashank Sharma,
	Christian Koenig

On Wed, Mar 16, 2022 at 8:48 AM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Wed, Mar 16, 2022 at 11:35 AM Rob Clark <robdclark@gmail.com> wrote:
> >
> > On Wed, Mar 16, 2022 at 7:12 AM Alex Deucher <alexdeucher@gmail.com> wrote:
> > >
> > > On Wed, Mar 16, 2022 at 4:48 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > > >
> > [snip]
> > > > With new UAPI comes the demand of userspace proof, not hand-waving. You
> > > > would not be proposing this new interface if you didn't have use cases
> > > > in mind, even just one. You have to document what you imagine the new
> > > > thing to be used for, so that the appropriateness can be evaluated. If
> > > > the use case is deemed inappropriate for the proposed UAPI, you need to
> > > > find another use case to justify adding the new UAPI. If there is no
> > > > use for the UAPI, it shouldn't be added, right? Adding UAPI and hoping
> > > > someone finds use for it seems backwards to me.
> > >
> > > We do have a use case.  It's what I described originally.  There is a
> > > user space daemon (could be a compositor, could be something else)
> > > that runs and listens for GPU reset notifications.  When it receives a
> > > notification, it takes action and kills the guilty app and restarts
> > > the compositer and gathers any relevant data related to the GPU hang
> > > (if possible).  We can revisit this discussion once we have the whole
> > > implementation complete.  Other drivers seem to do similar things
> > > already today via different means (msm using devcoredump, i915 seems
> > > to have its own GPU reset notification mechanism, etc.).  It just
> > > seemed like there was value in having a generic drm GPU reset
> > > notification, but maybe not yet.
> >
> > just one point of clarification.. in the msm and i915 case it is
> > purely for debugging and telemetry (ie. sending crash logs back to
> > distro for analysis if user has crash reporting enabled).. it isn't
> > used for triggering any action like killing app or compositor.
>
> Sure.  you could use this proposed event for telemetry gathering as
> well.  The important part is the event.  What you do with it is up to
> the user.

True, but for that purpose (telemetry) let's leverage something that
userspace already has support for

> > I would however *strongly* recommend devcoredump support in other GPU
> > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > to debug and fix a couple obscure issues that I was not able to
> > reproduce by myself.
>
> Agreed.  I think devcoredump makes sense and we'll ultimately enable
> support for that in amdgpu.  I think there is value in a GPU specific
> reset event as well for the use case I outlined, but maybe the
> devcoredump one is enough.  We'd just need to rely on the userspace
> side to filter as appropriate for events it cares about.  I'm not sure
> how many other devcore dump events are commonly seen.

They are, like CPU coredumps, not expected to be a frequent thing.
There are a number of other non-GPU drivers which also use devcoredump
(mostly wifi and remoteproc, ie. things like DSPs).  We did also
recently add msm devcoredump support for the display side of things,
in addition to gpu, to dump display state if we hit underruns, link
training fail, etc.  Each "crash" creates a new transient devcore
device so crashes from different devices can be reported in parallel.
But we do rate-limit the GPU crash reports by not reporting another
one until the previous one is read out and cleared by userspace.  (You
don't strictly need to do that, but it is probably a good idea..
usually the first crash is the interesting one.)

The ChromeOS crash-reporter does have an allow-list for which devcore
dumps are sent back, mainly because in the case of wifi drivers it is
hard to ensure that the dump does not include PII (like network names,
etc).  It would be pretty trivial to add amdgpu to the allow-list and
setup a "magic signature" so that queries could be run on amdgpu
devcore dumps.  I can help out with that, since I went through the
same process already for msm.

I'm not entirely sure what other distros do in this area.

BR,
-R

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-08 18:04 [PATCH v2 1/2] drm: Add GPU reset sysfs event Shashank Sharma
                   ` (4 preceding siblings ...)
  2022-03-09 18:12 ` Rob Clark
@ 2022-03-16 21:50 ` Rob Clark
  2022-03-17  8:42   ` Sharma, Shashank
  5 siblings, 1 reply; 95+ messages in thread
From: Rob Clark @ 2022-03-16 21:50 UTC (permalink / raw)
  To: Shashank Sharma
  Cc: Rob Clark, Shashank Sharma, Amaranath Somalapuram, dri-devel,
	amd-gfx list, Alexandar Deucher, Christian Koenig

On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
<contactshashanksharma@gmail.com> wrote:
>
> From: Shashank Sharma <shashank.sharma@amd.com>
>
> This patch adds a new sysfs event, which will indicate
> the userland about a GPU reset, and can also provide
> some information like:
> - process ID of the process involved with the GPU reset
> - process name of the involved process
> - the GPU status info (using flags)
>
> This patch also introduces the first flag of the flags
> bitmap, which can be appended as and when required.
>
> V2: Addressed review comments from Christian and Amar
>    - move the reset information structure to DRM layer
>    - drop _ctx from struct name
>    - make pid 32 bit(than 64)
>    - set flag when VRAM invalid (than valid)
>    - add process name as well (Amar)
>
> Cc: Alexandar Deucher <alexander.deucher@amd.com>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Amaranath Somalapuram <amaranath.somalapuram@amd.com>
> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
> ---
>  drivers/gpu/drm/drm_sysfs.c | 31 +++++++++++++++++++++++++++++++
>  include/drm/drm_sysfs.h     | 10 ++++++++++
>  2 files changed, 41 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
> index 430e00b16eec..840994810910 100644
> --- a/drivers/gpu/drm/drm_sysfs.c
> +++ b/drivers/gpu/drm/drm_sysfs.c
> @@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
>  }
>  EXPORT_SYMBOL(drm_sysfs_hotplug_event);
>
> +/**
> + * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
> + * @dev: DRM device
> + * @reset_info: The contextual information about the reset (like PID, flags)
> + *
> + * Send a uevent for the DRM device specified by @dev. This informs
> + * user that a GPU reset has occurred, so that an interested client
> + * can take any recovery or profiling measure.
> + */
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info)
> +{
> +       unsigned char pid_str[13];
> +       unsigned char flags_str[15];
> +       unsigned char pname_str[TASK_COMM_LEN + 6];
> +       unsigned char reset_str[] = "RESET=1";
> +       char *envp[] = { reset_str, pid_str, pname_str, flags_str, NULL };
> +
> +       if (!reset_info) {
> +               DRM_WARN("No reset info, not sending the event\n");
> +               return;
> +       }
> +
> +       DRM_DEBUG("generating reset event\n");
> +
> +       snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", reset_info->pid);
> +       snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", reset_info->pname);
> +       snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", reset_info->flags);
> +       kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_sysfs_reset_event);
> +
>  /**
>   * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
>   * change
> diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
> index 6273cac44e47..5ba11c760619 100644
> --- a/include/drm/drm_sysfs.h
> +++ b/include/drm/drm_sysfs.h
> @@ -1,16 +1,26 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #ifndef _DRM_SYSFS_H_
>  #define _DRM_SYSFS_H_
> +#include <linux/sched.h>
> +
> +#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)
>
>  struct drm_device;
>  struct device;
>  struct drm_connector;
>  struct drm_property;
>
> +struct drm_reset_event {
> +       uint32_t pid;

One side note, unrelated to devcoredump vs this..

AFAIU you probably want to be passing around a `struct pid *`, and
then somehow use pid_vnr() in the context of the process reading the
event to get the numeric pid.  Otherwise things will not do what you
expect if the process triggering the crash is in a different pid
namespace from the compositor.

BR,
-R

> +       uint32_t flags;
> +       char pname[TASK_COMM_LEN];
> +};
> +
>  int drm_class_device_register(struct device *dev);
>  void drm_class_device_unregister(struct device *dev);
>
>  void drm_sysfs_hotplug_event(struct drm_device *dev);
> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info);
>  void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
>  void drm_sysfs_connector_status_event(struct drm_connector *connector,
>                                       struct drm_property *property);
> --
> 2.32.0
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-16 15:36                                       ` Rob Clark
@ 2022-03-17  7:03                                         ` Christian König
  -1 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-17  7:03 UTC (permalink / raw)
  To: Rob Clark, Alex Deucher
  Cc: Rob Clark, Sharma, Shashank, Amaranath Somalapuram,
	Abhinav Kumar, dri-devel, amd-gfx list, Alexandar Deucher,
	Shashank Sharma, Christian Koenig

Am 16.03.22 um 16:36 schrieb Rob Clark:
> [SNIP]
> just one point of clarification.. in the msm and i915 case it is
> purely for debugging and telemetry (ie. sending crash logs back to
> distro for analysis if user has crash reporting enabled).. it isn't
> used for triggering any action like killing app or compositor.

By the way, how does msm it's memory management for the devcoredumps?

I mean it is strictly forbidden to allocate any memory in the GPU reset 
path.

> I would however *strongly* recommend devcoredump support in other GPU
> drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> to debug and fix a couple obscure issues that I was not able to
> reproduce by myself.

Yes, completely agree as well.

Thanks,
Christian.

>
> BR,
> -R


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-17  7:03                                         ` Christian König
  0 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-17  7:03 UTC (permalink / raw)
  To: Rob Clark, Alex Deucher
  Cc: Rob Clark, Sharma, Shashank, Amaranath Somalapuram,
	Abhinav Kumar, dri-devel, Pekka Paalanen, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian Koenig

Am 16.03.22 um 16:36 schrieb Rob Clark:
> [SNIP]
> just one point of clarification.. in the msm and i915 case it is
> purely for debugging and telemetry (ie. sending crash logs back to
> distro for analysis if user has crash reporting enabled).. it isn't
> used for triggering any action like killing app or compositor.

By the way, how does msm it's memory management for the devcoredumps?

I mean it is strictly forbidden to allocate any memory in the GPU reset 
path.

> I would however *strongly* recommend devcoredump support in other GPU
> drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> to debug and fix a couple obscure issues that I was not able to
> reproduce by myself.

Yes, completely agree as well.

Thanks,
Christian.

>
> BR,
> -R


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-16 21:50 ` Rob Clark
@ 2022-03-17  8:42   ` Sharma, Shashank
  2022-03-17  9:21     ` Christian König
  0 siblings, 1 reply; 95+ messages in thread
From: Sharma, Shashank @ 2022-03-17  8:42 UTC (permalink / raw)
  To: Rob Clark, Shashank Sharma
  Cc: Rob Clark, Amaranath Somalapuram, dri-devel, amd-gfx list,
	Alexandar Deucher, Christian Koenig



On 3/16/2022 10:50 PM, Rob Clark wrote:
> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
> <contactshashanksharma@gmail.com> wrote:
>>
>> From: Shashank Sharma <shashank.sharma@amd.com>
>>
>> This patch adds a new sysfs event, which will indicate
>> the userland about a GPU reset, and can also provide
>> some information like:
>> - process ID of the process involved with the GPU reset
>> - process name of the involved process
>> - the GPU status info (using flags)
>>
>> This patch also introduces the first flag of the flags
>> bitmap, which can be appended as and when required.
>>
>> V2: Addressed review comments from Christian and Amar
>>     - move the reset information structure to DRM layer
>>     - drop _ctx from struct name
>>     - make pid 32 bit(than 64)
>>     - set flag when VRAM invalid (than valid)
>>     - add process name as well (Amar)
>>
>> Cc: Alexandar Deucher <alexander.deucher@amd.com>
>> Cc: Christian Koenig <christian.koenig@amd.com>
>> Cc: Amaranath Somalapuram <amaranath.somalapuram@amd.com>
>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
>> ---
>>   drivers/gpu/drm/drm_sysfs.c | 31 +++++++++++++++++++++++++++++++
>>   include/drm/drm_sysfs.h     | 10 ++++++++++
>>   2 files changed, 41 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
>> index 430e00b16eec..840994810910 100644
>> --- a/drivers/gpu/drm/drm_sysfs.c
>> +++ b/drivers/gpu/drm/drm_sysfs.c
>> @@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device *dev)
>>   }
>>   EXPORT_SYMBOL(drm_sysfs_hotplug_event);
>>
>> +/**
>> + * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
>> + * @dev: DRM device
>> + * @reset_info: The contextual information about the reset (like PID, flags)
>> + *
>> + * Send a uevent for the DRM device specified by @dev. This informs
>> + * user that a GPU reset has occurred, so that an interested client
>> + * can take any recovery or profiling measure.
>> + */
>> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info)
>> +{
>> +       unsigned char pid_str[13];
>> +       unsigned char flags_str[15];
>> +       unsigned char pname_str[TASK_COMM_LEN + 6];
>> +       unsigned char reset_str[] = "RESET=1";
>> +       char *envp[] = { reset_str, pid_str, pname_str, flags_str, NULL };
>> +
>> +       if (!reset_info) {
>> +               DRM_WARN("No reset info, not sending the event\n");
>> +               return;
>> +       }
>> +
>> +       DRM_DEBUG("generating reset event\n");
>> +
>> +       snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", reset_info->pid);
>> +       snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", reset_info->pname);
>> +       snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", reset_info->flags);
>> +       kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
>> +}
>> +EXPORT_SYMBOL(drm_sysfs_reset_event);
>> +
>>   /**
>>    * drm_sysfs_connector_hotplug_event - generate a DRM uevent for any connector
>>    * change
>> diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
>> index 6273cac44e47..5ba11c760619 100644
>> --- a/include/drm/drm_sysfs.h
>> +++ b/include/drm/drm_sysfs.h
>> @@ -1,16 +1,26 @@
>>   /* SPDX-License-Identifier: GPL-2.0 */
>>   #ifndef _DRM_SYSFS_H_
>>   #define _DRM_SYSFS_H_
>> +#include <linux/sched.h>
>> +
>> +#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)
>>
>>   struct drm_device;
>>   struct device;
>>   struct drm_connector;
>>   struct drm_property;
>>
>> +struct drm_reset_event {
>> +       uint32_t pid;
> 
> One side note, unrelated to devcoredump vs this..
> 
> AFAIU you probably want to be passing around a `struct pid *`, and
> then somehow use pid_vnr() in the context of the process reading the
> event to get the numeric pid.  Otherwise things will not do what you
> expect if the process triggering the crash is in a different pid
> namespace from the compositor.
> 

I am not sure if it is a good idea to add the pid extraction complexity 
in here, it is left upto the driver to extract this information and pass 
it to the work queue. In case of AMDGPU, its extracted from GPU VM. It 
would be then more flexible for the drivers as well.

- Shashank

> BR,
> -R
> 
>> +       uint32_t flags;
>> +       char pname[TASK_COMM_LEN];
>> +};
>> +
>>   int drm_class_device_register(struct device *dev);
>>   void drm_class_device_unregister(struct device *dev);
>>
>>   void drm_sysfs_hotplug_event(struct drm_device *dev);
>> +void drm_sysfs_reset_event(struct drm_device *dev, struct drm_reset_event *reset_info);
>>   void drm_sysfs_connector_hotplug_event(struct drm_connector *connector);
>>   void drm_sysfs_connector_status_event(struct drm_connector *connector,
>>                                        struct drm_property *property);
>> --
>> 2.32.0
>>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-17  8:42   ` Sharma, Shashank
@ 2022-03-17  9:21     ` Christian König
  2022-03-17 10:31         ` Daniel Stone
  0 siblings, 1 reply; 95+ messages in thread
From: Christian König @ 2022-03-17  9:21 UTC (permalink / raw)
  To: Sharma, Shashank, Rob Clark, Shashank Sharma
  Cc: Alexandar Deucher, Rob Clark, Amaranath Somalapuram, dri-devel,
	amd-gfx list

Am 17.03.22 um 09:42 schrieb Sharma, Shashank:
> On 3/16/2022 10:50 PM, Rob Clark wrote:
>> On Tue, Mar 8, 2022 at 11:40 PM Shashank Sharma
>> <contactshashanksharma@gmail.com> wrote:
>>>
>>> From: Shashank Sharma <shashank.sharma@amd.com>
>>>
>>> This patch adds a new sysfs event, which will indicate
>>> the userland about a GPU reset, and can also provide
>>> some information like:
>>> - process ID of the process involved with the GPU reset
>>> - process name of the involved process
>>> - the GPU status info (using flags)
>>>
>>> This patch also introduces the first flag of the flags
>>> bitmap, which can be appended as and when required.
>>>
>>> V2: Addressed review comments from Christian and Amar
>>>     - move the reset information structure to DRM layer
>>>     - drop _ctx from struct name
>>>     - make pid 32 bit(than 64)
>>>     - set flag when VRAM invalid (than valid)
>>>     - add process name as well (Amar)
>>>
>>> Cc: Alexandar Deucher <alexander.deucher@amd.com>
>>> Cc: Christian Koenig <christian.koenig@amd.com>
>>> Cc: Amaranath Somalapuram <amaranath.somalapuram@amd.com>
>>> Signed-off-by: Shashank Sharma <shashank.sharma@amd.com>
>>> ---
>>>   drivers/gpu/drm/drm_sysfs.c | 31 +++++++++++++++++++++++++++++++
>>>   include/drm/drm_sysfs.h     | 10 ++++++++++
>>>   2 files changed, 41 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/drm_sysfs.c b/drivers/gpu/drm/drm_sysfs.c
>>> index 430e00b16eec..840994810910 100644
>>> --- a/drivers/gpu/drm/drm_sysfs.c
>>> +++ b/drivers/gpu/drm/drm_sysfs.c
>>> @@ -409,6 +409,37 @@ void drm_sysfs_hotplug_event(struct drm_device 
>>> *dev)
>>>   }
>>>   EXPORT_SYMBOL(drm_sysfs_hotplug_event);
>>>
>>> +/**
>>> + * drm_sysfs_reset_event - generate a DRM uevent to indicate GPU reset
>>> + * @dev: DRM device
>>> + * @reset_info: The contextual information about the reset (like 
>>> PID, flags)
>>> + *
>>> + * Send a uevent for the DRM device specified by @dev. This informs
>>> + * user that a GPU reset has occurred, so that an interested client
>>> + * can take any recovery or profiling measure.
>>> + */
>>> +void drm_sysfs_reset_event(struct drm_device *dev, struct 
>>> drm_reset_event *reset_info)
>>> +{
>>> +       unsigned char pid_str[13];
>>> +       unsigned char flags_str[15];
>>> +       unsigned char pname_str[TASK_COMM_LEN + 6];
>>> +       unsigned char reset_str[] = "RESET=1";
>>> +       char *envp[] = { reset_str, pid_str, pname_str, flags_str, 
>>> NULL };
>>> +
>>> +       if (!reset_info) {
>>> +               DRM_WARN("No reset info, not sending the event\n");
>>> +               return;
>>> +       }
>>> +
>>> +       DRM_DEBUG("generating reset event\n");
>>> +
>>> +       snprintf(pid_str, ARRAY_SIZE(pid_str), "PID=%u", 
>>> reset_info->pid);
>>> +       snprintf(pname_str, ARRAY_SIZE(pname_str), "NAME=%s", 
>>> reset_info->pname);
>>> +       snprintf(flags_str, ARRAY_SIZE(flags_str), "FLAGS=%u", 
>>> reset_info->flags);
>>> + kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
>>> +}
>>> +EXPORT_SYMBOL(drm_sysfs_reset_event);
>>> +
>>>   /**
>>>    * drm_sysfs_connector_hotplug_event - generate a DRM uevent for 
>>> any connector
>>>    * change
>>> diff --git a/include/drm/drm_sysfs.h b/include/drm/drm_sysfs.h
>>> index 6273cac44e47..5ba11c760619 100644
>>> --- a/include/drm/drm_sysfs.h
>>> +++ b/include/drm/drm_sysfs.h
>>> @@ -1,16 +1,26 @@
>>>   /* SPDX-License-Identifier: GPL-2.0 */
>>>   #ifndef _DRM_SYSFS_H_
>>>   #define _DRM_SYSFS_H_
>>> +#include <linux/sched.h>
>>> +
>>> +#define DRM_GPU_RESET_FLAG_VRAM_INVALID (1 << 0)
>>>
>>>   struct drm_device;
>>>   struct device;
>>>   struct drm_connector;
>>>   struct drm_property;
>>>
>>> +struct drm_reset_event {
>>> +       uint32_t pid;
>>
>> One side note, unrelated to devcoredump vs this..
>>
>> AFAIU you probably want to be passing around a `struct pid *`, and
>> then somehow use pid_vnr() in the context of the process reading the
>> event to get the numeric pid.  Otherwise things will not do what you
>> expect if the process triggering the crash is in a different pid
>> namespace from the compositor.
>>
>
> I am not sure if it is a good idea to add the pid extraction 
> complexity in here, it is left upto the driver to extract this 
> information and pass it to the work queue. In case of AMDGPU, its 
> extracted from GPU VM. It would be then more flexible for the drivers 
> as well.

Yeah, but that is just used for debugging.

If we want to use the pid for housekeeping, like for a daemon which 
kills/restarts processes, we absolutely need that or otherwise won't be 
able to work with containers.

Regards,
Christian.

>
> - Shashank
>
>> BR,
>> -R
>>
>>> +       uint32_t flags;
>>> +       char pname[TASK_COMM_LEN];
>>> +};
>>> +
>>>   int drm_class_device_register(struct device *dev);
>>>   void drm_class_device_unregister(struct device *dev);
>>>
>>>   void drm_sysfs_hotplug_event(struct drm_device *dev);
>>> +void drm_sysfs_reset_event(struct drm_device *dev, struct 
>>> drm_reset_event *reset_info);
>>>   void drm_sysfs_connector_hotplug_event(struct drm_connector 
>>> *connector);
>>>   void drm_sysfs_connector_status_event(struct drm_connector 
>>> *connector,
>>>                                        struct drm_property *property);
>>> -- 
>>> 2.32.0
>>>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-14 14:23                             ` Alex Deucher
                                               ` (2 preceding siblings ...)
  (?)
@ 2022-03-17  9:25                             ` Daniel Vetter
  -1 siblings, 0 replies; 95+ messages in thread
From: Daniel Vetter @ 2022-03-17  9:25 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, Pekka Paalanen,
	amd-gfx list, Alexandar Deucher, Shashank Sharma,
	Christian Koenig

On Mon, Mar 14, 2022 at 10:23:27AM -0400, Alex Deucher wrote:
> On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> >
> > On Thu, 10 Mar 2022 11:56:41 -0800
> > Rob Clark <robdclark@gmail.com> wrote:
> >
> > > For something like just notifying a compositor that a gpu crash
> > > happened, perhaps drm_event is more suitable.  See
> > > virtio_gpu_fence_event_create() for an example of adding new event
> > > types.  Although maybe you want it to be an event which is not device
> > > specific.  This isn't so much of a debugging use-case as simply
> > > notification.
> >
> > Hi,
> >
> > for this particular use case, are we now talking about the display
> > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> >
> > If the former, I wasn't aware that display device crashes are a thing.
> > How should a userspace display server react to those?
> >
> > If the latter, don't we have EGL extensions or Vulkan API already to
> > deliver that?
> >
> > The above would be about device crashes that directly affect the
> > display server. Is that the use case in mind here, or is it instead
> > about notifying the display server that some application has caused a
> > driver/hardware crash? If the latter, how should a display server react
> > to that? Disconnect the application?
> >
> > Shashank, what is the actual use case you are developing this for?
> >
> > I've read all the emails here so far, and I don't recall seeing it
> > explained.
> >
> 
> The idea is that a support daemon or compositor would listen for GPU
> reset notifications and do something useful with them (kill the guilty
> app, restart the desktop environment, etc.).  Today when the GPU
> resets, most applications just continue assuming nothing is wrong,
> meanwhile the GPU has stopped accepting work until the apps re-init
> their context so all of their command submissions just get rejected.
> 
> > Btw. somewhat relatedly, there has been work aiming to allow
> > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > the various APIs should react towards userspace when a DRM device
> > suddenly disappears. That seems to have some overlap here IMO.
> >
> > See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > which also has a couple pointers to EGL and Vulkan APIs.
> 
> The problem is most applications don't use the GL or VK robustness
> APIs.  You could use something like that in the compositor, but those
> APIs tend to be focused more on the application itself rather than the
> GPU in general.  E.g., Is my context lost.  Which is fine for
> restarting your context, but doesn't really help if you want to try
> and do something with another application (i.e., the likely guilty
> app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> lost (either due to the memory controller being reset, or vram being
> zero'd on init due to ECC support), so even if you are not the guilty
> process, in that case you'd need to re-init your context anyway.

Isn't that what arb robustness and all that stuff is for? Doing that
through sysfs event sounds very wrong, since in general apps just don't
have access to that. Also vk equivalent is vk_error_device_lost. Iirc both
have information like whether the app was the guilty one causing the hang,
or whether it was just victimized because the gpu can't do anything else
than a full gpu reset which nukes everything (like amdgpu currently has,
aside from the thread unblock trick in the first attempt).

And if your app/compositor doesn't use robust contexts then the userspace
driver gets to do a best effort attempt at recovery, or exit(). Whatever
you can do really.

Also note that you don't actually want an event, but a query ioctl (plus
maybe a specific errno on your CS ioctl). Neither of the above flows
supports events for gpu resets. RESET_STATS ioctl is the i915
implementation of this stuff.

For the core dump aspect yes pls devcoredump and not reinvented wheels
(and i915 is a bad example here, but in defence the i915 sysfs hang event
predates devcoredump).

Cheers, Daniel


> 
> Alex
> 
> >
> >
> > Thanks,
> > pq

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-17  7:03                                         ` Christian König
@ 2022-03-17  9:29                                           ` Daniel Vetter
  -1 siblings, 0 replies; 95+ messages in thread
From: Daniel Vetter @ 2022-03-17  9:29 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Sharma, Shashank, Amaranath Somalapuram,
	Abhinav Kumar, dri-devel, Alexandar Deucher, amd-gfx list,
	Shashank Sharma, Christian Koenig

On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> Am 16.03.22 um 16:36 schrieb Rob Clark:
> > [SNIP]
> > just one point of clarification.. in the msm and i915 case it is
> > purely for debugging and telemetry (ie. sending crash logs back to
> > distro for analysis if user has crash reporting enabled).. it isn't
> > used for triggering any action like killing app or compositor.
> 
> By the way, how does msm it's memory management for the devcoredumps?

GFP_NORECLAIM all the way. It's purely best effort.

Note that the fancy new plan for i915 discrete gpu is to only support gpu
crash dumps on non-recoverable gpu contexts, i.e. those that do not
continue to the next batch when something bad happens. This is what vk
wants and also what iris now uses (we do context recovery in userspace in
all cases), and non-recoverable contexts greatly simplify the crash dump
gather: Only thing you need to gather is the register state from hw
(before you reset it), all the batchbuffer bo and indirect state bo (in
i915 you can mark which bo to capture in the CS ioctl) can be captured in
a worker later on. Which for non-recoverable context is no issue, since
subsequent batchbuffers won't trample over any of these things.

And that way you can record the crashdump (or at least the big pieces like
all the indirect state stuff) with GFP_KERNEL.

msm probably gets it wrong since embedded drivers have much less shrinker
and generally no mmu notifiers going on :-)

> I mean it is strictly forbidden to allocate any memory in the GPU reset
> path.
> 
> > I would however *strongly* recommend devcoredump support in other GPU
> > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > to debug and fix a couple obscure issues that I was not able to
> > reproduce by myself.
> 
> Yes, completely agree as well.

+1

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-17  9:29                                           ` Daniel Vetter
  0 siblings, 0 replies; 95+ messages in thread
From: Daniel Vetter @ 2022-03-17  9:29 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Sharma, Shashank, Amaranath Somalapuram,
	Abhinav Kumar, dri-devel, Alexandar Deucher, Rob Clark,
	amd-gfx list, Alex Deucher, Shashank Sharma, Christian Koenig

On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> Am 16.03.22 um 16:36 schrieb Rob Clark:
> > [SNIP]
> > just one point of clarification.. in the msm and i915 case it is
> > purely for debugging and telemetry (ie. sending crash logs back to
> > distro for analysis if user has crash reporting enabled).. it isn't
> > used for triggering any action like killing app or compositor.
> 
> By the way, how does msm it's memory management for the devcoredumps?

GFP_NORECLAIM all the way. It's purely best effort.

Note that the fancy new plan for i915 discrete gpu is to only support gpu
crash dumps on non-recoverable gpu contexts, i.e. those that do not
continue to the next batch when something bad happens. This is what vk
wants and also what iris now uses (we do context recovery in userspace in
all cases), and non-recoverable contexts greatly simplify the crash dump
gather: Only thing you need to gather is the register state from hw
(before you reset it), all the batchbuffer bo and indirect state bo (in
i915 you can mark which bo to capture in the CS ioctl) can be captured in
a worker later on. Which for non-recoverable context is no issue, since
subsequent batchbuffers won't trample over any of these things.

And that way you can record the crashdump (or at least the big pieces like
all the indirect state stuff) with GFP_KERNEL.

msm probably gets it wrong since embedded drivers have much less shrinker
and generally no mmu notifiers going on :-)

> I mean it is strictly forbidden to allocate any memory in the GPU reset
> path.
> 
> > I would however *strongly* recommend devcoredump support in other GPU
> > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > to debug and fix a couple obscure issues that I was not able to
> > reproduce by myself.
> 
> Yes, completely agree as well.

+1

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-17  9:29                                           ` Daniel Vetter
@ 2022-03-17  9:46                                             ` Christian König
  -1 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-17  9:46 UTC (permalink / raw)
  To: Daniel Vetter, Christian König
  Cc: Rob Clark, Sharma, Shashank, Amaranath Somalapuram,
	Abhinav Kumar, dri-devel, Alexandar Deucher, amd-gfx list,
	Shashank Sharma

Am 17.03.22 um 10:29 schrieb Daniel Vetter:
> On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
>> Am 16.03.22 um 16:36 schrieb Rob Clark:
>>> [SNIP]
>>> just one point of clarification.. in the msm and i915 case it is
>>> purely for debugging and telemetry (ie. sending crash logs back to
>>> distro for analysis if user has crash reporting enabled).. it isn't
>>> used for triggering any action like killing app or compositor.
>> By the way, how does msm it's memory management for the devcoredumps?
> GFP_NORECLAIM all the way. It's purely best effort.

Ok, good to know that it's as simple as that.

> Note that the fancy new plan for i915 discrete gpu is to only support gpu
> crash dumps on non-recoverable gpu contexts, i.e. those that do not
> continue to the next batch when something bad happens.

> This is what vk wants

That's exactly what I'm telling an internal team for a couple of years 
now as well. Good to know that this is not that totally crazy.

>   and also what iris now uses (we do context recovery in userspace in
> all cases), and non-recoverable contexts greatly simplify the crash dump
> gather: Only thing you need to gather is the register state from hw
> (before you reset it), all the batchbuffer bo and indirect state bo (in
> i915 you can mark which bo to capture in the CS ioctl) can be captured in
> a worker later on. Which for non-recoverable context is no issue, since
> subsequent batchbuffers won't trample over any of these things.
>
> And that way you can record the crashdump (or at least the big pieces like
> all the indirect state stuff) with GFP_KERNEL.

Interesting idea, so basically we only do the state we need to reset 
initially and grab a reference on the killed application to gather the 
rest before we clean them up.

Going to keep that in mind as well.

Thanks,
Christian.

>
> msm probably gets it wrong since embedded drivers have much less shrinker
> and generally no mmu notifiers going on :-)
>
>> I mean it is strictly forbidden to allocate any memory in the GPU reset
>> path.
>>
>>> I would however *strongly* recommend devcoredump support in other GPU
>>> drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
>>> to debug and fix a couple obscure issues that I was not able to
>>> reproduce by myself.
>> Yes, completely agree as well.
> +1
>
> Cheers, Daniel


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-17  9:46                                             ` Christian König
  0 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-17  9:46 UTC (permalink / raw)
  To: Daniel Vetter, Christian König
  Cc: Rob Clark, Sharma, Shashank, Amaranath Somalapuram,
	Abhinav Kumar, dri-devel, Alexandar Deucher, Rob Clark,
	amd-gfx list, Alex Deucher, Shashank Sharma

Am 17.03.22 um 10:29 schrieb Daniel Vetter:
> On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
>> Am 16.03.22 um 16:36 schrieb Rob Clark:
>>> [SNIP]
>>> just one point of clarification.. in the msm and i915 case it is
>>> purely for debugging and telemetry (ie. sending crash logs back to
>>> distro for analysis if user has crash reporting enabled).. it isn't
>>> used for triggering any action like killing app or compositor.
>> By the way, how does msm it's memory management for the devcoredumps?
> GFP_NORECLAIM all the way. It's purely best effort.

Ok, good to know that it's as simple as that.

> Note that the fancy new plan for i915 discrete gpu is to only support gpu
> crash dumps on non-recoverable gpu contexts, i.e. those that do not
> continue to the next batch when something bad happens.

> This is what vk wants

That's exactly what I'm telling an internal team for a couple of years 
now as well. Good to know that this is not that totally crazy.

>   and also what iris now uses (we do context recovery in userspace in
> all cases), and non-recoverable contexts greatly simplify the crash dump
> gather: Only thing you need to gather is the register state from hw
> (before you reset it), all the batchbuffer bo and indirect state bo (in
> i915 you can mark which bo to capture in the CS ioctl) can be captured in
> a worker later on. Which for non-recoverable context is no issue, since
> subsequent batchbuffers won't trample over any of these things.
>
> And that way you can record the crashdump (or at least the big pieces like
> all the indirect state stuff) with GFP_KERNEL.

Interesting idea, so basically we only do the state we need to reset 
initially and grab a reference on the killed application to gather the 
rest before we clean them up.

Going to keep that in mind as well.

Thanks,
Christian.

>
> msm probably gets it wrong since embedded drivers have much less shrinker
> and generally no mmu notifiers going on :-)
>
>> I mean it is strictly forbidden to allocate any memory in the GPU reset
>> path.
>>
>>> I would however *strongly* recommend devcoredump support in other GPU
>>> drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
>>> to debug and fix a couple obscure issues that I was not able to
>>> reproduce by myself.
>> Yes, completely agree as well.
> +1
>
> Cheers, Daniel


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-17  9:21     ` Christian König
@ 2022-03-17 10:31         ` Daniel Stone
  0 siblings, 0 replies; 95+ messages in thread
From: Daniel Stone @ 2022-03-17 10:31 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Sharma, Shashank, Amaranath Somalapuram, dri-devel,
	amd-gfx list, Alexandar Deucher, Shashank Sharma

Hi,

On Thu, 17 Mar 2022 at 09:21, Christian König <christian.koenig@amd.com> wrote:
> Am 17.03.22 um 09:42 schrieb Sharma, Shashank:
> >> AFAIU you probably want to be passing around a `struct pid *`, and
> >> then somehow use pid_vnr() in the context of the process reading the
> >> event to get the numeric pid.  Otherwise things will not do what you
> >> expect if the process triggering the crash is in a different pid
> >> namespace from the compositor.
> >
> > I am not sure if it is a good idea to add the pid extraction
> > complexity in here, it is left upto the driver to extract this
> > information and pass it to the work queue. In case of AMDGPU, its
> > extracted from GPU VM. It would be then more flexible for the drivers
> > as well.
>
> Yeah, but that is just used for debugging.
>
> If we want to use the pid for housekeeping, like for a daemon which
> kills/restarts processes, we absolutely need that or otherwise won't be
> able to work with containers.

100% this.

Pushing back to the compositor is a red herring. The compositor is
just a service which tries to handle window management and input. If
you're looking to kill the offending process or whatever, then that
should go through the session manager - be it systemd or something
container-centric or whatever. At least that way it can deal with
cgroups at the same time, unlike the compositor which is not really
aware of what the thing on the other end of the socket is doing. This
ties in with the support they already have for things like coredump
analysis, and would also be useful for other devices.

Some environments combine compositor and session manager, and a lot of
them have them strongly related, but they're very definitely not the
same thing ...

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-17 10:31         ` Daniel Stone
  0 siblings, 0 replies; 95+ messages in thread
From: Daniel Stone @ 2022-03-17 10:31 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Sharma, Shashank, Amaranath Somalapuram, dri-devel,
	Rob Clark, amd-gfx list, Alexandar Deucher, Shashank Sharma

Hi,

On Thu, 17 Mar 2022 at 09:21, Christian König <christian.koenig@amd.com> wrote:
> Am 17.03.22 um 09:42 schrieb Sharma, Shashank:
> >> AFAIU you probably want to be passing around a `struct pid *`, and
> >> then somehow use pid_vnr() in the context of the process reading the
> >> event to get the numeric pid.  Otherwise things will not do what you
> >> expect if the process triggering the crash is in a different pid
> >> namespace from the compositor.
> >
> > I am not sure if it is a good idea to add the pid extraction
> > complexity in here, it is left upto the driver to extract this
> > information and pass it to the work queue. In case of AMDGPU, its
> > extracted from GPU VM. It would be then more flexible for the drivers
> > as well.
>
> Yeah, but that is just used for debugging.
>
> If we want to use the pid for housekeeping, like for a daemon which
> kills/restarts processes, we absolutely need that or otherwise won't be
> able to work with containers.

100% this.

Pushing back to the compositor is a red herring. The compositor is
just a service which tries to handle window management and input. If
you're looking to kill the offending process or whatever, then that
should go through the session manager - be it systemd or something
container-centric or whatever. At least that way it can deal with
cgroups at the same time, unlike the compositor which is not really
aware of what the thing on the other end of the socket is doing. This
ties in with the support they already have for things like coredump
analysis, and would also be useful for other devices.

Some environments combine compositor and session manager, and a lot of
them have them strongly related, but they're very definitely not the
same thing ...

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-17  9:29                                           ` Daniel Vetter
@ 2022-03-17 15:34                                             ` Rob Clark
  -1 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-17 15:34 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Shashank Sharma,
	Christian Koenig

On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > [SNIP]
> > > just one point of clarification.. in the msm and i915 case it is
> > > purely for debugging and telemetry (ie. sending crash logs back to
> > > distro for analysis if user has crash reporting enabled).. it isn't
> > > used for triggering any action like killing app or compositor.
> >
> > By the way, how does msm it's memory management for the devcoredumps?
>
> GFP_NORECLAIM all the way. It's purely best effort.

We do one GEM obj allocation in the snapshot path (the hw has a
mechanism to snapshot it's own state into a gpu buffer.. not sure if
nice debugging functionality like that is a commentary on the blob
driver quality, but I'm not complaining)

I suppose we could pre-allocate this buffer up-front.. but it doesn't
seem like a problem, ie. if allocation fails we just skip snapshotting
stuff that needs the hw crashdumper.  I guess since vram is not
involved, perhaps that makes the situation a bit more straightforward.

> Note that the fancy new plan for i915 discrete gpu is to only support gpu
> crash dumps on non-recoverable gpu contexts, i.e. those that do not
> continue to the next batch when something bad happens. This is what vk
> wants and also what iris now uses (we do context recovery in userspace in
> all cases), and non-recoverable contexts greatly simplify the crash dump
> gather: Only thing you need to gather is the register state from hw
> (before you reset it), all the batchbuffer bo and indirect state bo (in
> i915 you can mark which bo to capture in the CS ioctl) can be captured in
> a worker later on. Which for non-recoverable context is no issue, since
> subsequent batchbuffers won't trample over any of these things.
>
> And that way you can record the crashdump (or at least the big pieces like
> all the indirect state stuff) with GFP_KERNEL.
>
> msm probably gets it wrong since embedded drivers have much less shrinker
> and generally no mmu notifiers going on :-)

Note that the bo's associated with the batch are still pinned at this
point, from the bo lifecycle the batch is still active.  So from the
point of view of shrinker, there should be no interaction.  We aren't
doing anything with mmu notifiers (yet), so not entirely sure offhand
the concern there.

Currently we just use GFP_KERNEL and bail if allocation fails.

BR,
-R

> > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > path.
> >
> > > I would however *strongly* recommend devcoredump support in other GPU
> > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > to debug and fix a couple obscure issues that I was not able to
> > > reproduce by myself.
> >
> > Yes, completely agree as well.
>
> +1
>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-17 15:34                                             ` Rob Clark
  0 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-17 15:34 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Alex Deucher, Shashank Sharma,
	Christian Koenig

On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > [SNIP]
> > > just one point of clarification.. in the msm and i915 case it is
> > > purely for debugging and telemetry (ie. sending crash logs back to
> > > distro for analysis if user has crash reporting enabled).. it isn't
> > > used for triggering any action like killing app or compositor.
> >
> > By the way, how does msm it's memory management for the devcoredumps?
>
> GFP_NORECLAIM all the way. It's purely best effort.

We do one GEM obj allocation in the snapshot path (the hw has a
mechanism to snapshot it's own state into a gpu buffer.. not sure if
nice debugging functionality like that is a commentary on the blob
driver quality, but I'm not complaining)

I suppose we could pre-allocate this buffer up-front.. but it doesn't
seem like a problem, ie. if allocation fails we just skip snapshotting
stuff that needs the hw crashdumper.  I guess since vram is not
involved, perhaps that makes the situation a bit more straightforward.

> Note that the fancy new plan for i915 discrete gpu is to only support gpu
> crash dumps on non-recoverable gpu contexts, i.e. those that do not
> continue to the next batch when something bad happens. This is what vk
> wants and also what iris now uses (we do context recovery in userspace in
> all cases), and non-recoverable contexts greatly simplify the crash dump
> gather: Only thing you need to gather is the register state from hw
> (before you reset it), all the batchbuffer bo and indirect state bo (in
> i915 you can mark which bo to capture in the CS ioctl) can be captured in
> a worker later on. Which for non-recoverable context is no issue, since
> subsequent batchbuffers won't trample over any of these things.
>
> And that way you can record the crashdump (or at least the big pieces like
> all the indirect state stuff) with GFP_KERNEL.
>
> msm probably gets it wrong since embedded drivers have much less shrinker
> and generally no mmu notifiers going on :-)

Note that the bo's associated with the batch are still pinned at this
point, from the bo lifecycle the batch is still active.  So from the
point of view of shrinker, there should be no interaction.  We aren't
doing anything with mmu notifiers (yet), so not entirely sure offhand
the concern there.

Currently we just use GFP_KERNEL and bail if allocation fails.

BR,
-R

> > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > path.
> >
> > > I would however *strongly* recommend devcoredump support in other GPU
> > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > to debug and fix a couple obscure issues that I was not able to
> > > reproduce by myself.
> >
> > Yes, completely agree as well.
>
> +1
>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-17  9:29                                           ` Daniel Vetter
@ 2022-03-17 15:40                                             ` Rob Clark
  -1 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-17 15:40 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Shashank Sharma,
	Christian Koenig

On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > [SNIP]
> > > just one point of clarification.. in the msm and i915 case it is
> > > purely for debugging and telemetry (ie. sending crash logs back to
> > > distro for analysis if user has crash reporting enabled).. it isn't
> > > used for triggering any action like killing app or compositor.
> >
> > By the way, how does msm it's memory management for the devcoredumps?
>
> GFP_NORECLAIM all the way. It's purely best effort.
>
> Note that the fancy new plan for i915 discrete gpu is to only support gpu
> crash dumps on non-recoverable gpu contexts, i.e. those that do not
> continue to the next batch when something bad happens. This is what vk
> wants and also what iris now uses (we do context recovery in userspace in
> all cases), and non-recoverable contexts greatly simplify the crash dump
> gather: Only thing you need to gather is the register state from hw
> (before you reset it), all the batchbuffer bo and indirect state bo (in
> i915 you can mark which bo to capture in the CS ioctl) can be captured in
> a worker later on. Which for non-recoverable context is no issue, since
> subsequent batchbuffers won't trample over any of these things.

fwiw, we snapshot everything (cmdstream and bo's marked with dump
flag, in addition to hw state) before resuming the GPU, so there is no
danger of things being trampled.  After state is captured and GPU
reset, we "replay" the submits that were written into the ringbuffer
after the faulting submit.  GPU crashes should be a thing you don't
need to try to optimize.

(At some point, I'd like to use scheduler for the replay, and actually
use drm_sched_stop()/etc.. but last time I looked there were still
some sched bugs in that area which prevented me from deleting a bunch
of code ;-))

BR,
-R

>
> And that way you can record the crashdump (or at least the big pieces like
> all the indirect state stuff) with GFP_KERNEL.
>
> msm probably gets it wrong since embedded drivers have much less shrinker
> and generally no mmu notifiers going on :-)
>
> > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > path.
> >
> > > I would however *strongly* recommend devcoredump support in other GPU
> > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > to debug and fix a couple obscure issues that I was not able to
> > > reproduce by myself.
> >
> > Yes, completely agree as well.
>
> +1
>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-17 15:40                                             ` Rob Clark
  0 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-17 15:40 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Alex Deucher, Shashank Sharma,
	Christian Koenig

On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > [SNIP]
> > > just one point of clarification.. in the msm and i915 case it is
> > > purely for debugging and telemetry (ie. sending crash logs back to
> > > distro for analysis if user has crash reporting enabled).. it isn't
> > > used for triggering any action like killing app or compositor.
> >
> > By the way, how does msm it's memory management for the devcoredumps?
>
> GFP_NORECLAIM all the way. It's purely best effort.
>
> Note that the fancy new plan for i915 discrete gpu is to only support gpu
> crash dumps on non-recoverable gpu contexts, i.e. those that do not
> continue to the next batch when something bad happens. This is what vk
> wants and also what iris now uses (we do context recovery in userspace in
> all cases), and non-recoverable contexts greatly simplify the crash dump
> gather: Only thing you need to gather is the register state from hw
> (before you reset it), all the batchbuffer bo and indirect state bo (in
> i915 you can mark which bo to capture in the CS ioctl) can be captured in
> a worker later on. Which for non-recoverable context is no issue, since
> subsequent batchbuffers won't trample over any of these things.

fwiw, we snapshot everything (cmdstream and bo's marked with dump
flag, in addition to hw state) before resuming the GPU, so there is no
danger of things being trampled.  After state is captured and GPU
reset, we "replay" the submits that were written into the ringbuffer
after the faulting submit.  GPU crashes should be a thing you don't
need to try to optimize.

(At some point, I'd like to use scheduler for the replay, and actually
use drm_sched_stop()/etc.. but last time I looked there were still
some sched bugs in that area which prevented me from deleting a bunch
of code ;-))

BR,
-R

>
> And that way you can record the crashdump (or at least the big pieces like
> all the indirect state stuff) with GFP_KERNEL.
>
> msm probably gets it wrong since embedded drivers have much less shrinker
> and generally no mmu notifiers going on :-)
>
> > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > path.
> >
> > > I would however *strongly* recommend devcoredump support in other GPU
> > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > to debug and fix a couple obscure issues that I was not able to
> > > reproduce by myself.
> >
> > Yes, completely agree as well.
>
> +1
>
> Cheers, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-17 15:34                                             ` Rob Clark
@ 2022-03-17 17:23                                               ` Daniel Vetter
  -1 siblings, 0 replies; 95+ messages in thread
From: Daniel Vetter @ 2022-03-17 17:23 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Shashank Sharma,
	Christian Koenig

On Thu, Mar 17, 2022 at 08:34:21AM -0700, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > > [SNIP]
> > > > just one point of clarification.. in the msm and i915 case it is
> > > > purely for debugging and telemetry (ie. sending crash logs back to
> > > > distro for analysis if user has crash reporting enabled).. it isn't
> > > > used for triggering any action like killing app or compositor.
> > >
> > > By the way, how does msm it's memory management for the devcoredumps?
> >
> > GFP_NORECLAIM all the way. It's purely best effort.
> 
> We do one GEM obj allocation in the snapshot path (the hw has a
> mechanism to snapshot it's own state into a gpu buffer.. not sure if
> nice debugging functionality like that is a commentary on the blob
> driver quality, but I'm not complaining)
> 
> I suppose we could pre-allocate this buffer up-front.. but it doesn't
> seem like a problem, ie. if allocation fails we just skip snapshotting
> stuff that needs the hw crashdumper.  I guess since vram is not
> involved, perhaps that makes the situation a bit more straightforward.

The problem is that you need to allocate with GFP_ATOMIC, instead of
GFP_KERNEL, or things go very bad.

The scheduler dma-fence annotations I've had (well still have them here)
would catch this stuff, but thus far they got nowhere.

> > Note that the fancy new plan for i915 discrete gpu is to only support gpu
> > crash dumps on non-recoverable gpu contexts, i.e. those that do not
> > continue to the next batch when something bad happens. This is what vk
> > wants and also what iris now uses (we do context recovery in userspace in
> > all cases), and non-recoverable contexts greatly simplify the crash dump
> > gather: Only thing you need to gather is the register state from hw
> > (before you reset it), all the batchbuffer bo and indirect state bo (in
> > i915 you can mark which bo to capture in the CS ioctl) can be captured in
> > a worker later on. Which for non-recoverable context is no issue, since
> > subsequent batchbuffers won't trample over any of these things.
> >
> > And that way you can record the crashdump (or at least the big pieces like
> > all the indirect state stuff) with GFP_KERNEL.
> >
> > msm probably gets it wrong since embedded drivers have much less shrinker
> > and generally no mmu notifiers going on :-)
> 
> Note that the bo's associated with the batch are still pinned at this
> point, from the bo lifecycle the batch is still active.  So from the
> point of view of shrinker, there should be no interaction.  We aren't
> doing anything with mmu notifiers (yet), so not entirely sure offhand
> the concern there.
> 
> Currently we just use GFP_KERNEL and bail if allocation fails.

Yeah you have a simple enough shrinker for this not to be a problem. The
issue is that sooner or later things tend to not stay like that, and we're
trying to have common rules for dma_fence to make sure everyone follows
the same rules.
-Daniel

> 
> BR,
> -R
> 
> > > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > > path.
> > >
> > > > I would however *strongly* recommend devcoredump support in other GPU
> > > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > > to debug and fix a couple obscure issues that I was not able to
> > > > reproduce by myself.
> > >
> > > Yes, completely agree as well.
> >
> > +1
> >
> > Cheers, Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-17 17:23                                               ` Daniel Vetter
  0 siblings, 0 replies; 95+ messages in thread
From: Daniel Vetter @ 2022-03-17 17:23 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Daniel Vetter, Alex Deucher,
	Shashank Sharma, Christian Koenig

On Thu, Mar 17, 2022 at 08:34:21AM -0700, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > > [SNIP]
> > > > just one point of clarification.. in the msm and i915 case it is
> > > > purely for debugging and telemetry (ie. sending crash logs back to
> > > > distro for analysis if user has crash reporting enabled).. it isn't
> > > > used for triggering any action like killing app or compositor.
> > >
> > > By the way, how does msm it's memory management for the devcoredumps?
> >
> > GFP_NORECLAIM all the way. It's purely best effort.
> 
> We do one GEM obj allocation in the snapshot path (the hw has a
> mechanism to snapshot it's own state into a gpu buffer.. not sure if
> nice debugging functionality like that is a commentary on the blob
> driver quality, but I'm not complaining)
> 
> I suppose we could pre-allocate this buffer up-front.. but it doesn't
> seem like a problem, ie. if allocation fails we just skip snapshotting
> stuff that needs the hw crashdumper.  I guess since vram is not
> involved, perhaps that makes the situation a bit more straightforward.

The problem is that you need to allocate with GFP_ATOMIC, instead of
GFP_KERNEL, or things go very bad.

The scheduler dma-fence annotations I've had (well still have them here)
would catch this stuff, but thus far they got nowhere.

> > Note that the fancy new plan for i915 discrete gpu is to only support gpu
> > crash dumps on non-recoverable gpu contexts, i.e. those that do not
> > continue to the next batch when something bad happens. This is what vk
> > wants and also what iris now uses (we do context recovery in userspace in
> > all cases), and non-recoverable contexts greatly simplify the crash dump
> > gather: Only thing you need to gather is the register state from hw
> > (before you reset it), all the batchbuffer bo and indirect state bo (in
> > i915 you can mark which bo to capture in the CS ioctl) can be captured in
> > a worker later on. Which for non-recoverable context is no issue, since
> > subsequent batchbuffers won't trample over any of these things.
> >
> > And that way you can record the crashdump (or at least the big pieces like
> > all the indirect state stuff) with GFP_KERNEL.
> >
> > msm probably gets it wrong since embedded drivers have much less shrinker
> > and generally no mmu notifiers going on :-)
> 
> Note that the bo's associated with the batch are still pinned at this
> point, from the bo lifecycle the batch is still active.  So from the
> point of view of shrinker, there should be no interaction.  We aren't
> doing anything with mmu notifiers (yet), so not entirely sure offhand
> the concern there.
> 
> Currently we just use GFP_KERNEL and bail if allocation fails.

Yeah you have a simple enough shrinker for this not to be a problem. The
issue is that sooner or later things tend to not stay like that, and we're
trying to have common rules for dma_fence to make sure everyone follows
the same rules.
-Daniel

> 
> BR,
> -R
> 
> > > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > > path.
> > >
> > > > I would however *strongly* recommend devcoredump support in other GPU
> > > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > > to debug and fix a couple obscure issues that I was not able to
> > > > reproduce by myself.
> > >
> > > Yes, completely agree as well.
> >
> > +1
> >
> > Cheers, Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-17 15:40                                             ` Rob Clark
@ 2022-03-17 17:26                                               ` Daniel Vetter
  -1 siblings, 0 replies; 95+ messages in thread
From: Daniel Vetter @ 2022-03-17 17:26 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Shashank Sharma,
	Christian Koenig

On Thu, Mar 17, 2022 at 08:40:51AM -0700, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > > [SNIP]
> > > > just one point of clarification.. in the msm and i915 case it is
> > > > purely for debugging and telemetry (ie. sending crash logs back to
> > > > distro for analysis if user has crash reporting enabled).. it isn't
> > > > used for triggering any action like killing app or compositor.
> > >
> > > By the way, how does msm it's memory management for the devcoredumps?
> >
> > GFP_NORECLAIM all the way. It's purely best effort.
> >
> > Note that the fancy new plan for i915 discrete gpu is to only support gpu
> > crash dumps on non-recoverable gpu contexts, i.e. those that do not
> > continue to the next batch when something bad happens. This is what vk
> > wants and also what iris now uses (we do context recovery in userspace in
> > all cases), and non-recoverable contexts greatly simplify the crash dump
> > gather: Only thing you need to gather is the register state from hw
> > (before you reset it), all the batchbuffer bo and indirect state bo (in
> > i915 you can mark which bo to capture in the CS ioctl) can be captured in
> > a worker later on. Which for non-recoverable context is no issue, since
> > subsequent batchbuffers won't trample over any of these things.
> 
> fwiw, we snapshot everything (cmdstream and bo's marked with dump
> flag, in addition to hw state) before resuming the GPU, so there is no
> danger of things being trampled.  After state is captured and GPU
> reset, we "replay" the submits that were written into the ringbuffer
> after the faulting submit.  GPU crashes should be a thing you don't
> need to try to optimize.

Not sure why you think we optimize anything here?

> (At some point, I'd like to use scheduler for the replay, and actually
> use drm_sched_stop()/etc.. but last time I looked there were still
> some sched bugs in that area which prevented me from deleting a bunch
> of code ;-))

Not sure about your hw, but at least on intel replaying tends to just
result in follow-on fun. And that holds even more so the more complex a
workload is. This is why vk just dies immediately and does not try to
replay anything, offloading it to the app. Same with arb robusteness.
Afaik it's really only media and classic gl which insist that the driver
stack somehow recover.

And recovering from a mess in userspace is a lot simpler than trying to
pull of the same magic in the kernel. Plus it also helps with a few of the
dma_fence rules, which is a nice bonus.
-Daniel

> 
> BR,
> -R
> 
> >
> > And that way you can record the crashdump (or at least the big pieces like
> > all the indirect state stuff) with GFP_KERNEL.
> >
> > msm probably gets it wrong since embedded drivers have much less shrinker
> > and generally no mmu notifiers going on :-)
> >
> > > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > > path.
> > >
> > > > I would however *strongly* recommend devcoredump support in other GPU
> > > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > > to debug and fix a couple obscure issues that I was not able to
> > > > reproduce by myself.
> > >
> > > Yes, completely agree as well.
> >
> > +1
> >
> > Cheers, Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-17 17:26                                               ` Daniel Vetter
  0 siblings, 0 replies; 95+ messages in thread
From: Daniel Vetter @ 2022-03-17 17:26 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Daniel Vetter, Alex Deucher,
	Shashank Sharma, Christian Koenig

On Thu, Mar 17, 2022 at 08:40:51AM -0700, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > > [SNIP]
> > > > just one point of clarification.. in the msm and i915 case it is
> > > > purely for debugging and telemetry (ie. sending crash logs back to
> > > > distro for analysis if user has crash reporting enabled).. it isn't
> > > > used for triggering any action like killing app or compositor.
> > >
> > > By the way, how does msm it's memory management for the devcoredumps?
> >
> > GFP_NORECLAIM all the way. It's purely best effort.
> >
> > Note that the fancy new plan for i915 discrete gpu is to only support gpu
> > crash dumps on non-recoverable gpu contexts, i.e. those that do not
> > continue to the next batch when something bad happens. This is what vk
> > wants and also what iris now uses (we do context recovery in userspace in
> > all cases), and non-recoverable contexts greatly simplify the crash dump
> > gather: Only thing you need to gather is the register state from hw
> > (before you reset it), all the batchbuffer bo and indirect state bo (in
> > i915 you can mark which bo to capture in the CS ioctl) can be captured in
> > a worker later on. Which for non-recoverable context is no issue, since
> > subsequent batchbuffers won't trample over any of these things.
> 
> fwiw, we snapshot everything (cmdstream and bo's marked with dump
> flag, in addition to hw state) before resuming the GPU, so there is no
> danger of things being trampled.  After state is captured and GPU
> reset, we "replay" the submits that were written into the ringbuffer
> after the faulting submit.  GPU crashes should be a thing you don't
> need to try to optimize.

Not sure why you think we optimize anything here?

> (At some point, I'd like to use scheduler for the replay, and actually
> use drm_sched_stop()/etc.. but last time I looked there were still
> some sched bugs in that area which prevented me from deleting a bunch
> of code ;-))

Not sure about your hw, but at least on intel replaying tends to just
result in follow-on fun. And that holds even more so the more complex a
workload is. This is why vk just dies immediately and does not try to
replay anything, offloading it to the app. Same with arb robusteness.
Afaik it's really only media and classic gl which insist that the driver
stack somehow recover.

And recovering from a mess in userspace is a lot simpler than trying to
pull of the same magic in the kernel. Plus it also helps with a few of the
dma_fence rules, which is a nice bonus.
-Daniel

> 
> BR,
> -R
> 
> >
> > And that way you can record the crashdump (or at least the big pieces like
> > all the indirect state stuff) with GFP_KERNEL.
> >
> > msm probably gets it wrong since embedded drivers have much less shrinker
> > and generally no mmu notifiers going on :-)
> >
> > > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > > path.
> > >
> > > > I would however *strongly* recommend devcoredump support in other GPU
> > > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > > to debug and fix a couple obscure issues that I was not able to
> > > > reproduce by myself.
> > >
> > > Yes, completely agree as well.
> >
> > +1
> >
> > Cheers, Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-17 17:26                                               ` Daniel Vetter
@ 2022-03-17 17:31                                                 ` Rob Clark
  -1 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-17 17:31 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Shashank Sharma,
	Christian Koenig

On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Thu, Mar 17, 2022 at 08:40:51AM -0700, Rob Clark wrote:
> > On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > > > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > > > [SNIP]
> > > > > just one point of clarification.. in the msm and i915 case it is
> > > > > purely for debugging and telemetry (ie. sending crash logs back to
> > > > > distro for analysis if user has crash reporting enabled).. it isn't
> > > > > used for triggering any action like killing app or compositor.
> > > >
> > > > By the way, how does msm it's memory management for the devcoredumps?
> > >
> > > GFP_NORECLAIM all the way. It's purely best effort.
> > >
> > > Note that the fancy new plan for i915 discrete gpu is to only support gpu
> > > crash dumps on non-recoverable gpu contexts, i.e. those that do not
> > > continue to the next batch when something bad happens. This is what vk
> > > wants and also what iris now uses (we do context recovery in userspace in
> > > all cases), and non-recoverable contexts greatly simplify the crash dump
> > > gather: Only thing you need to gather is the register state from hw
> > > (before you reset it), all the batchbuffer bo and indirect state bo (in
> > > i915 you can mark which bo to capture in the CS ioctl) can be captured in
> > > a worker later on. Which for non-recoverable context is no issue, since
> > > subsequent batchbuffers won't trample over any of these things.
> >
> > fwiw, we snapshot everything (cmdstream and bo's marked with dump
> > flag, in addition to hw state) before resuming the GPU, so there is no
> > danger of things being trampled.  After state is captured and GPU
> > reset, we "replay" the submits that were written into the ringbuffer
> > after the faulting submit.  GPU crashes should be a thing you don't
> > need to try to optimize.
>
> Not sure why you think we optimize anything here?
>
> > (At some point, I'd like to use scheduler for the replay, and actually
> > use drm_sched_stop()/etc.. but last time I looked there were still
> > some sched bugs in that area which prevented me from deleting a bunch
> > of code ;-))
>
> Not sure about your hw, but at least on intel replaying tends to just
> result in follow-on fun. And that holds even more so the more complex a
> workload is. This is why vk just dies immediately and does not try to
> replay anything, offloading it to the app. Same with arb robusteness.
> Afaik it's really only media and classic gl which insist that the driver
> stack somehow recover.

At least for us, each submit must be self-contained (ie. not rely on
previous GPU hw state), so in practice replay works out pretty well.
The worst case is subsequent submits from same process fail as well
(if they depended on something that crashing submit failed to write
back to memory.. but in that case they just crash as well and we move
on to the next one.. the recent gens (a5xx+ at least) are pretty good
about quickly detecting problems and giving us an error irq.

BR,
-R

> And recovering from a mess in userspace is a lot simpler than trying to
> pull of the same magic in the kernel. Plus it also helps with a few of the
> dma_fence rules, which is a nice bonus.
> -Daniel
>
> >
> > BR,
> > -R
> >
> > >
> > > And that way you can record the crashdump (or at least the big pieces like
> > > all the indirect state stuff) with GFP_KERNEL.
> > >
> > > msm probably gets it wrong since embedded drivers have much less shrinker
> > > and generally no mmu notifiers going on :-)
> > >
> > > > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > > > path.
> > > >
> > > > > I would however *strongly* recommend devcoredump support in other GPU
> > > > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > > > to debug and fix a couple obscure issues that I was not able to
> > > > > reproduce by myself.
> > > >
> > > > Yes, completely agree as well.
> > >
> > > +1
> > >
> > > Cheers, Daniel
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-17 17:31                                                 ` Rob Clark
  0 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-17 17:31 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Alex Deucher, Shashank Sharma,
	Christian Koenig

On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Thu, Mar 17, 2022 at 08:40:51AM -0700, Rob Clark wrote:
> > On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > > > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > > > [SNIP]
> > > > > just one point of clarification.. in the msm and i915 case it is
> > > > > purely for debugging and telemetry (ie. sending crash logs back to
> > > > > distro for analysis if user has crash reporting enabled).. it isn't
> > > > > used for triggering any action like killing app or compositor.
> > > >
> > > > By the way, how does msm it's memory management for the devcoredumps?
> > >
> > > GFP_NORECLAIM all the way. It's purely best effort.
> > >
> > > Note that the fancy new plan for i915 discrete gpu is to only support gpu
> > > crash dumps on non-recoverable gpu contexts, i.e. those that do not
> > > continue to the next batch when something bad happens. This is what vk
> > > wants and also what iris now uses (we do context recovery in userspace in
> > > all cases), and non-recoverable contexts greatly simplify the crash dump
> > > gather: Only thing you need to gather is the register state from hw
> > > (before you reset it), all the batchbuffer bo and indirect state bo (in
> > > i915 you can mark which bo to capture in the CS ioctl) can be captured in
> > > a worker later on. Which for non-recoverable context is no issue, since
> > > subsequent batchbuffers won't trample over any of these things.
> >
> > fwiw, we snapshot everything (cmdstream and bo's marked with dump
> > flag, in addition to hw state) before resuming the GPU, so there is no
> > danger of things being trampled.  After state is captured and GPU
> > reset, we "replay" the submits that were written into the ringbuffer
> > after the faulting submit.  GPU crashes should be a thing you don't
> > need to try to optimize.
>
> Not sure why you think we optimize anything here?
>
> > (At some point, I'd like to use scheduler for the replay, and actually
> > use drm_sched_stop()/etc.. but last time I looked there were still
> > some sched bugs in that area which prevented me from deleting a bunch
> > of code ;-))
>
> Not sure about your hw, but at least on intel replaying tends to just
> result in follow-on fun. And that holds even more so the more complex a
> workload is. This is why vk just dies immediately and does not try to
> replay anything, offloading it to the app. Same with arb robusteness.
> Afaik it's really only media and classic gl which insist that the driver
> stack somehow recover.

At least for us, each submit must be self-contained (ie. not rely on
previous GPU hw state), so in practice replay works out pretty well.
The worst case is subsequent submits from same process fail as well
(if they depended on something that crashing submit failed to write
back to memory.. but in that case they just crash as well and we move
on to the next one.. the recent gens (a5xx+ at least) are pretty good
about quickly detecting problems and giving us an error irq.

BR,
-R

> And recovering from a mess in userspace is a lot simpler than trying to
> pull of the same magic in the kernel. Plus it also helps with a few of the
> dma_fence rules, which is a nice bonus.
> -Daniel
>
> >
> > BR,
> > -R
> >
> > >
> > > And that way you can record the crashdump (or at least the big pieces like
> > > all the indirect state stuff) with GFP_KERNEL.
> > >
> > > msm probably gets it wrong since embedded drivers have much less shrinker
> > > and generally no mmu notifiers going on :-)
> > >
> > > > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > > > path.
> > > >
> > > > > I would however *strongly* recommend devcoredump support in other GPU
> > > > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > > > to debug and fix a couple obscure issues that I was not able to
> > > > > reproduce by myself.
> > > >
> > > > Yes, completely agree as well.
> > >
> > > +1
> > >
> > > Cheers, Daniel
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-17 17:31                                                 ` Rob Clark
@ 2022-03-18  7:42                                                   ` Christian König
  -1 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-18  7:42 UTC (permalink / raw)
  To: Rob Clark, Daniel Vetter
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Shashank Sharma

Am 17.03.22 um 18:31 schrieb Rob Clark:
> On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>> [SNIP]
>>> (At some point, I'd like to use scheduler for the replay, and actually
>>> use drm_sched_stop()/etc.. but last time I looked there were still
>>> some sched bugs in that area which prevented me from deleting a bunch
>>> of code ;-))
>> Not sure about your hw, but at least on intel replaying tends to just
>> result in follow-on fun. And that holds even more so the more complex a
>> workload is. This is why vk just dies immediately and does not try to
>> replay anything, offloading it to the app. Same with arb robusteness.
>> Afaik it's really only media and classic gl which insist that the driver
>> stack somehow recover.
> At least for us, each submit must be self-contained (ie. not rely on
> previous GPU hw state), so in practice replay works out pretty well.
> The worst case is subsequent submits from same process fail as well
> (if they depended on something that crashing submit failed to write
> back to memory.. but in that case they just crash as well and we move
> on to the next one.. the recent gens (a5xx+ at least) are pretty good
> about quickly detecting problems and giving us an error irq.

Well I absolutely agree with Daniel.

The whole replay thing AMD did in the scheduler is an absolutely mess 
and should probably be killed with fire.

I strongly recommend not to do the same mistake in other drivers.

If you want to have some replay feature then please make it driver 
specific and don't use anything from the infrastructure in the DRM 
scheduler.

Thanks,
Christian.

>
> BR,
> -R
>
>> And recovering from a mess in userspace is a lot simpler than trying to
>> pull of the same magic in the kernel. Plus it also helps with a few of the
>> dma_fence rules, which is a nice bonus.
>> -Daniel
>>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-18  7:42                                                   ` Christian König
  0 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-18  7:42 UTC (permalink / raw)
  To: Rob Clark, Daniel Vetter
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Alex Deucher, Shashank Sharma

Am 17.03.22 um 18:31 schrieb Rob Clark:
> On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>> [SNIP]
>>> (At some point, I'd like to use scheduler for the replay, and actually
>>> use drm_sched_stop()/etc.. but last time I looked there were still
>>> some sched bugs in that area which prevented me from deleting a bunch
>>> of code ;-))
>> Not sure about your hw, but at least on intel replaying tends to just
>> result in follow-on fun. And that holds even more so the more complex a
>> workload is. This is why vk just dies immediately and does not try to
>> replay anything, offloading it to the app. Same with arb robusteness.
>> Afaik it's really only media and classic gl which insist that the driver
>> stack somehow recover.
> At least for us, each submit must be self-contained (ie. not rely on
> previous GPU hw state), so in practice replay works out pretty well.
> The worst case is subsequent submits from same process fail as well
> (if they depended on something that crashing submit failed to write
> back to memory.. but in that case they just crash as well and we move
> on to the next one.. the recent gens (a5xx+ at least) are pretty good
> about quickly detecting problems and giving us an error irq.

Well I absolutely agree with Daniel.

The whole replay thing AMD did in the scheduler is an absolutely mess 
and should probably be killed with fire.

I strongly recommend not to do the same mistake in other drivers.

If you want to have some replay feature then please make it driver 
specific and don't use anything from the infrastructure in the DRM 
scheduler.

Thanks,
Christian.

>
> BR,
> -R
>
>> And recovering from a mess in userspace is a lot simpler than trying to
>> pull of the same magic in the kernel. Plus it also helps with a few of the
>> dma_fence rules, which is a nice bonus.
>> -Daniel
>>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-18  7:42                                                   ` Christian König
@ 2022-03-18 15:12                                                     ` Rob Clark
  -1 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-18 15:12 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Shashank Sharma

On Fri, Mar 18, 2022 at 12:42 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 17.03.22 um 18:31 schrieb Rob Clark:
> > On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >> [SNIP]
> >>> (At some point, I'd like to use scheduler for the replay, and actually
> >>> use drm_sched_stop()/etc.. but last time I looked there were still
> >>> some sched bugs in that area which prevented me from deleting a bunch
> >>> of code ;-))
> >> Not sure about your hw, but at least on intel replaying tends to just
> >> result in follow-on fun. And that holds even more so the more complex a
> >> workload is. This is why vk just dies immediately and does not try to
> >> replay anything, offloading it to the app. Same with arb robusteness.
> >> Afaik it's really only media and classic gl which insist that the driver
> >> stack somehow recover.
> > At least for us, each submit must be self-contained (ie. not rely on
> > previous GPU hw state), so in practice replay works out pretty well.
> > The worst case is subsequent submits from same process fail as well
> > (if they depended on something that crashing submit failed to write
> > back to memory.. but in that case they just crash as well and we move
> > on to the next one.. the recent gens (a5xx+ at least) are pretty good
> > about quickly detecting problems and giving us an error irq.
>
> Well I absolutely agree with Daniel.
>
> The whole replay thing AMD did in the scheduler is an absolutely mess
> and should probably be killed with fire.
>
> I strongly recommend not to do the same mistake in other drivers.
>
> If you want to have some replay feature then please make it driver
> specific and don't use anything from the infrastructure in the DRM
> scheduler.

hmm, perhaps I was not clear, but I'm only talking about re-emitting
jobs *following* the faulting one (which could be from other contexts,
etc).. not trying to restart the faulting job.

You *absolutely* need to replay jobs following the faulting one, they
could be from unrelated contexts/processes.  You can't just drop them
on the floor.

Currently it is all driver specific, but I wanted to delete a lot of
code and move to using scheduler to handle faults/timeouts (but
blocked on that until [1] is resolved)

[1] https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-send-email-Monk.Liu@amd.com/

BR,
-R

> Thanks,
> Christian.
>
> >
> > BR,
> > -R
> >
> >> And recovering from a mess in userspace is a lot simpler than trying to
> >> pull of the same magic in the kernel. Plus it also helps with a few of the
> >> dma_fence rules, which is a nice bonus.
> >> -Daniel
> >>
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-18 15:12                                                     ` Rob Clark
  0 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-18 15:12 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Daniel Vetter, Alex Deucher,
	Shashank Sharma

On Fri, Mar 18, 2022 at 12:42 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 17.03.22 um 18:31 schrieb Rob Clark:
> > On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >> [SNIP]
> >>> (At some point, I'd like to use scheduler for the replay, and actually
> >>> use drm_sched_stop()/etc.. but last time I looked there were still
> >>> some sched bugs in that area which prevented me from deleting a bunch
> >>> of code ;-))
> >> Not sure about your hw, but at least on intel replaying tends to just
> >> result in follow-on fun. And that holds even more so the more complex a
> >> workload is. This is why vk just dies immediately and does not try to
> >> replay anything, offloading it to the app. Same with arb robusteness.
> >> Afaik it's really only media and classic gl which insist that the driver
> >> stack somehow recover.
> > At least for us, each submit must be self-contained (ie. not rely on
> > previous GPU hw state), so in practice replay works out pretty well.
> > The worst case is subsequent submits from same process fail as well
> > (if they depended on something that crashing submit failed to write
> > back to memory.. but in that case they just crash as well and we move
> > on to the next one.. the recent gens (a5xx+ at least) are pretty good
> > about quickly detecting problems and giving us an error irq.
>
> Well I absolutely agree with Daniel.
>
> The whole replay thing AMD did in the scheduler is an absolutely mess
> and should probably be killed with fire.
>
> I strongly recommend not to do the same mistake in other drivers.
>
> If you want to have some replay feature then please make it driver
> specific and don't use anything from the infrastructure in the DRM
> scheduler.

hmm, perhaps I was not clear, but I'm only talking about re-emitting
jobs *following* the faulting one (which could be from other contexts,
etc).. not trying to restart the faulting job.

You *absolutely* need to replay jobs following the faulting one, they
could be from unrelated contexts/processes.  You can't just drop them
on the floor.

Currently it is all driver specific, but I wanted to delete a lot of
code and move to using scheduler to handle faults/timeouts (but
blocked on that until [1] is resolved)

[1] https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-send-email-Monk.Liu@amd.com/

BR,
-R

> Thanks,
> Christian.
>
> >
> > BR,
> > -R
> >
> >> And recovering from a mess in userspace is a lot simpler than trying to
> >> pull of the same magic in the kernel. Plus it also helps with a few of the
> >> dma_fence rules, which is a nice bonus.
> >> -Daniel
> >>
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-18 15:12                                                     ` Rob Clark
@ 2022-03-21  9:30                                                       ` Christian König
  -1 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-21  9:30 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Shashank Sharma

Am 18.03.22 um 16:12 schrieb Rob Clark:
> On Fri, Mar 18, 2022 at 12:42 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 17.03.22 um 18:31 schrieb Rob Clark:
>>> On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>>>> [SNIP]
>>>>> (At some point, I'd like to use scheduler for the replay, and actually
>>>>> use drm_sched_stop()/etc.. but last time I looked there were still
>>>>> some sched bugs in that area which prevented me from deleting a bunch
>>>>> of code ;-))
>>>> Not sure about your hw, but at least on intel replaying tends to just
>>>> result in follow-on fun. And that holds even more so the more complex a
>>>> workload is. This is why vk just dies immediately and does not try to
>>>> replay anything, offloading it to the app. Same with arb robusteness.
>>>> Afaik it's really only media and classic gl which insist that the driver
>>>> stack somehow recover.
>>> At least for us, each submit must be self-contained (ie. not rely on
>>> previous GPU hw state), so in practice replay works out pretty well.
>>> The worst case is subsequent submits from same process fail as well
>>> (if they depended on something that crashing submit failed to write
>>> back to memory.. but in that case they just crash as well and we move
>>> on to the next one.. the recent gens (a5xx+ at least) are pretty good
>>> about quickly detecting problems and giving us an error irq.
>> Well I absolutely agree with Daniel.
>>
>> The whole replay thing AMD did in the scheduler is an absolutely mess
>> and should probably be killed with fire.
>>
>> I strongly recommend not to do the same mistake in other drivers.
>>
>> If you want to have some replay feature then please make it driver
>> specific and don't use anything from the infrastructure in the DRM
>> scheduler.
> hmm, perhaps I was not clear, but I'm only talking about re-emitting
> jobs *following* the faulting one (which could be from other contexts,
> etc).. not trying to restart the faulting job.
>
> You *absolutely* need to replay jobs following the faulting one, they
> could be from unrelated contexts/processes.  You can't just drop them
> on the floor.

Well you can, it just means that their contexts are lost as well.

If you re-submit jobs which were already pushed to the hardware you 
absolutely need to make a couple of things sure:

1. Don't race with your hardware. E.g. you need a way to stop processing 
in case of a timeout and then double check once more if things haven't 
finished in the meantime.

2. Make absolutely sure you never re-submit an operation when it's 
dma-fence is already signaled. Otherwise you run into memory corruption.

3. When you have multiple engines it becomes really tricky because then 
even innocent jobs might have already been started on different queues 
which now hang.

> Currently it is all driver specific, but I wanted to delete a lot of
> code and move to using scheduler to handle faults/timeouts (but
> blocked on that until [1] is resolved)

Please don't.

Especially don't use the pending_list or any of the scheduler 
infrastructure for GPU reset. We need to get rid of that again sooner or 
later.

This is extremely hardware dependent and pushing the amdgpu specific 
handling into the GPU scheduler was a mistake we shouldn't repeat for 
other drivers.

Regards,
Christian.

>
> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F1630457207-13107-2-git-send-email-Monk.Liu%40amd.com%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C1f6ddc253f9341231fa108da08f1afa9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637832131381866493%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=e%2F1tOh3nxH3QfzKQKiJKjCU7Z5S6haX07F8rzwZhRVY%3D&amp;reserved=0
>
> BR,
> -R
>
>> Thanks,
>> Christian.
>>
>>> BR,
>>> -R
>>>
>>>> And recovering from a mess in userspace is a lot simpler than trying to
>>>> pull of the same magic in the kernel. Plus it also helps with a few of the
>>>> dma_fence rules, which is a nice bonus.
>>>> -Daniel
>>>>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-21  9:30                                                       ` Christian König
  0 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-21  9:30 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Daniel Vetter, Alex Deucher,
	Shashank Sharma

Am 18.03.22 um 16:12 schrieb Rob Clark:
> On Fri, Mar 18, 2022 at 12:42 AM Christian König
> <christian.koenig@amd.com> wrote:
>> Am 17.03.22 um 18:31 schrieb Rob Clark:
>>> On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>>>> [SNIP]
>>>>> (At some point, I'd like to use scheduler for the replay, and actually
>>>>> use drm_sched_stop()/etc.. but last time I looked there were still
>>>>> some sched bugs in that area which prevented me from deleting a bunch
>>>>> of code ;-))
>>>> Not sure about your hw, but at least on intel replaying tends to just
>>>> result in follow-on fun. And that holds even more so the more complex a
>>>> workload is. This is why vk just dies immediately and does not try to
>>>> replay anything, offloading it to the app. Same with arb robusteness.
>>>> Afaik it's really only media and classic gl which insist that the driver
>>>> stack somehow recover.
>>> At least for us, each submit must be self-contained (ie. not rely on
>>> previous GPU hw state), so in practice replay works out pretty well.
>>> The worst case is subsequent submits from same process fail as well
>>> (if they depended on something that crashing submit failed to write
>>> back to memory.. but in that case they just crash as well and we move
>>> on to the next one.. the recent gens (a5xx+ at least) are pretty good
>>> about quickly detecting problems and giving us an error irq.
>> Well I absolutely agree with Daniel.
>>
>> The whole replay thing AMD did in the scheduler is an absolutely mess
>> and should probably be killed with fire.
>>
>> I strongly recommend not to do the same mistake in other drivers.
>>
>> If you want to have some replay feature then please make it driver
>> specific and don't use anything from the infrastructure in the DRM
>> scheduler.
> hmm, perhaps I was not clear, but I'm only talking about re-emitting
> jobs *following* the faulting one (which could be from other contexts,
> etc).. not trying to restart the faulting job.
>
> You *absolutely* need to replay jobs following the faulting one, they
> could be from unrelated contexts/processes.  You can't just drop them
> on the floor.

Well you can, it just means that their contexts are lost as well.

If you re-submit jobs which were already pushed to the hardware you 
absolutely need to make a couple of things sure:

1. Don't race with your hardware. E.g. you need a way to stop processing 
in case of a timeout and then double check once more if things haven't 
finished in the meantime.

2. Make absolutely sure you never re-submit an operation when it's 
dma-fence is already signaled. Otherwise you run into memory corruption.

3. When you have multiple engines it becomes really tricky because then 
even innocent jobs might have already been started on different queues 
which now hang.

> Currently it is all driver specific, but I wanted to delete a lot of
> code and move to using scheduler to handle faults/timeouts (but
> blocked on that until [1] is resolved)

Please don't.

Especially don't use the pending_list or any of the scheduler 
infrastructure for GPU reset. We need to get rid of that again sooner or 
later.

This is extremely hardware dependent and pushing the amdgpu specific 
handling into the GPU scheduler was a mistake we shouldn't repeat for 
other drivers.

Regards,
Christian.

>
> [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F1630457207-13107-2-git-send-email-Monk.Liu%40amd.com%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C1f6ddc253f9341231fa108da08f1afa9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637832131381866493%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=e%2F1tOh3nxH3QfzKQKiJKjCU7Z5S6haX07F8rzwZhRVY%3D&amp;reserved=0
>
> BR,
> -R
>
>> Thanks,
>> Christian.
>>
>>> BR,
>>> -R
>>>
>>>> And recovering from a mess in userspace is a lot simpler than trying to
>>>> pull of the same magic in the kernel. Plus it also helps with a few of the
>>>> dma_fence rules, which is a nice bonus.
>>>> -Daniel
>>>>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-18 15:12                                                     ` Rob Clark
@ 2022-03-21 14:15                                                       ` Daniel Vetter
  -1 siblings, 0 replies; 95+ messages in thread
From: Daniel Vetter @ 2022-03-21 14:15 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Shashank Sharma,
	Christian König

On Fri, Mar 18, 2022 at 08:12:54AM -0700, Rob Clark wrote:
> On Fri, Mar 18, 2022 at 12:42 AM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > Am 17.03.22 um 18:31 schrieb Rob Clark:
> > > On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >> [SNIP]
> > >>> (At some point, I'd like to use scheduler for the replay, and actually
> > >>> use drm_sched_stop()/etc.. but last time I looked there were still
> > >>> some sched bugs in that area which prevented me from deleting a bunch
> > >>> of code ;-))
> > >> Not sure about your hw, but at least on intel replaying tends to just
> > >> result in follow-on fun. And that holds even more so the more complex a
> > >> workload is. This is why vk just dies immediately and does not try to
> > >> replay anything, offloading it to the app. Same with arb robusteness.
> > >> Afaik it's really only media and classic gl which insist that the driver
> > >> stack somehow recover.
> > > At least for us, each submit must be self-contained (ie. not rely on
> > > previous GPU hw state), so in practice replay works out pretty well.
> > > The worst case is subsequent submits from same process fail as well
> > > (if they depended on something that crashing submit failed to write
> > > back to memory.. but in that case they just crash as well and we move
> > > on to the next one.. the recent gens (a5xx+ at least) are pretty good
> > > about quickly detecting problems and giving us an error irq.
> >
> > Well I absolutely agree with Daniel.
> >
> > The whole replay thing AMD did in the scheduler is an absolutely mess
> > and should probably be killed with fire.
> >
> > I strongly recommend not to do the same mistake in other drivers.
> >
> > If you want to have some replay feature then please make it driver
> > specific and don't use anything from the infrastructure in the DRM
> > scheduler.
> 
> hmm, perhaps I was not clear, but I'm only talking about re-emitting
> jobs *following* the faulting one (which could be from other contexts,
> etc).. not trying to restart the faulting job.

You absolutely can drop jobs on the floor, this is what both anv and iris
expect. They use what we call non-recoverable context, meaning when any
gpu hang happens and the context is affect (whether as the guilty on, or
because it was a multi-engine reset and it was victimized) we kill it
entirely. No replaying, and any further execbuf ioctl fails with -EIO.

Userspace then gets to sort out the mess, which for vk is
VK_ERROR_DEVICE_LOST, for robust gl it's the same, and for non-robust gl
iris re-creates a pile of things.

Anything in-between _is_ dropped on the floor completely.

Also note that this is obviously uapi, if you have an userspace which
expect contexts to survive, then replaying makes some sense.

> You *absolutely* need to replay jobs following the faulting one, they
> could be from unrelated contexts/processes.  You can't just drop them
> on the floor.
> 
> Currently it is all driver specific, but I wanted to delete a lot of
> code and move to using scheduler to handle faults/timeouts (but
> blocked on that until [1] is resolved)

Yeah for the drivers where the uapi is "you can safely replay after a
hang, and you're supposed to", then sharing the code is ofc a good idea.

Just wanted to make it clear that this is only one of many uapi flavours
you can pick from, dropping it all on the floor is a perfectly legit
approach :-) And imo it's the more robust one, and also better fits with
latest apis like gl_arb_robustness or vk.

Cheers, Daniel


> 
> [1] https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-send-email-Monk.Liu@amd.com/
> 
> BR,
> -R
> 
> > Thanks,
> > Christian.
> >
> > >
> > > BR,
> > > -R
> > >
> > >> And recovering from a mess in userspace is a lot simpler than trying to
> > >> pull of the same magic in the kernel. Plus it also helps with a few of the
> > >> dma_fence rules, which is a nice bonus.
> > >> -Daniel
> > >>
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-21 14:15                                                       ` Daniel Vetter
  0 siblings, 0 replies; 95+ messages in thread
From: Daniel Vetter @ 2022-03-21 14:15 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Daniel Vetter, Alex Deucher,
	Shashank Sharma, Christian König

On Fri, Mar 18, 2022 at 08:12:54AM -0700, Rob Clark wrote:
> On Fri, Mar 18, 2022 at 12:42 AM Christian König
> <christian.koenig@amd.com> wrote:
> >
> > Am 17.03.22 um 18:31 schrieb Rob Clark:
> > > On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >> [SNIP]
> > >>> (At some point, I'd like to use scheduler for the replay, and actually
> > >>> use drm_sched_stop()/etc.. but last time I looked there were still
> > >>> some sched bugs in that area which prevented me from deleting a bunch
> > >>> of code ;-))
> > >> Not sure about your hw, but at least on intel replaying tends to just
> > >> result in follow-on fun. And that holds even more so the more complex a
> > >> workload is. This is why vk just dies immediately and does not try to
> > >> replay anything, offloading it to the app. Same with arb robusteness.
> > >> Afaik it's really only media and classic gl which insist that the driver
> > >> stack somehow recover.
> > > At least for us, each submit must be self-contained (ie. not rely on
> > > previous GPU hw state), so in practice replay works out pretty well.
> > > The worst case is subsequent submits from same process fail as well
> > > (if they depended on something that crashing submit failed to write
> > > back to memory.. but in that case they just crash as well and we move
> > > on to the next one.. the recent gens (a5xx+ at least) are pretty good
> > > about quickly detecting problems and giving us an error irq.
> >
> > Well I absolutely agree with Daniel.
> >
> > The whole replay thing AMD did in the scheduler is an absolutely mess
> > and should probably be killed with fire.
> >
> > I strongly recommend not to do the same mistake in other drivers.
> >
> > If you want to have some replay feature then please make it driver
> > specific and don't use anything from the infrastructure in the DRM
> > scheduler.
> 
> hmm, perhaps I was not clear, but I'm only talking about re-emitting
> jobs *following* the faulting one (which could be from other contexts,
> etc).. not trying to restart the faulting job.

You absolutely can drop jobs on the floor, this is what both anv and iris
expect. They use what we call non-recoverable context, meaning when any
gpu hang happens and the context is affect (whether as the guilty on, or
because it was a multi-engine reset and it was victimized) we kill it
entirely. No replaying, and any further execbuf ioctl fails with -EIO.

Userspace then gets to sort out the mess, which for vk is
VK_ERROR_DEVICE_LOST, for robust gl it's the same, and for non-robust gl
iris re-creates a pile of things.

Anything in-between _is_ dropped on the floor completely.

Also note that this is obviously uapi, if you have an userspace which
expect contexts to survive, then replaying makes some sense.

> You *absolutely* need to replay jobs following the faulting one, they
> could be from unrelated contexts/processes.  You can't just drop them
> on the floor.
> 
> Currently it is all driver specific, but I wanted to delete a lot of
> code and move to using scheduler to handle faults/timeouts (but
> blocked on that until [1] is resolved)

Yeah for the drivers where the uapi is "you can safely replay after a
hang, and you're supposed to", then sharing the code is ofc a good idea.

Just wanted to make it clear that this is only one of many uapi flavours
you can pick from, dropping it all on the floor is a perfectly legit
approach :-) And imo it's the more robust one, and also better fits with
latest apis like gl_arb_robustness or vk.

Cheers, Daniel


> 
> [1] https://patchwork.kernel.org/project/dri-devel/patch/1630457207-13107-2-git-send-email-Monk.Liu@amd.com/
> 
> BR,
> -R
> 
> > Thanks,
> > Christian.
> >
> > >
> > > BR,
> > > -R
> > >
> > >> And recovering from a mess in userspace is a lot simpler than trying to
> > >> pull of the same magic in the kernel. Plus it also helps with a few of the
> > >> dma_fence rules, which is a nice bonus.
> > >> -Daniel
> > >>
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-21  9:30                                                       ` Christian König
@ 2022-03-21 16:03                                                         ` Rob Clark
  -1 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-21 16:03 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Shashank Sharma

On Mon, Mar 21, 2022 at 2:30 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 18.03.22 um 16:12 schrieb Rob Clark:
> > On Fri, Mar 18, 2022 at 12:42 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 17.03.22 um 18:31 schrieb Rob Clark:
> >>> On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>> [SNIP]
> >>>>> (At some point, I'd like to use scheduler for the replay, and actually
> >>>>> use drm_sched_stop()/etc.. but last time I looked there were still
> >>>>> some sched bugs in that area which prevented me from deleting a bunch
> >>>>> of code ;-))
> >>>> Not sure about your hw, but at least on intel replaying tends to just
> >>>> result in follow-on fun. And that holds even more so the more complex a
> >>>> workload is. This is why vk just dies immediately and does not try to
> >>>> replay anything, offloading it to the app. Same with arb robusteness.
> >>>> Afaik it's really only media and classic gl which insist that the driver
> >>>> stack somehow recover.
> >>> At least for us, each submit must be self-contained (ie. not rely on
> >>> previous GPU hw state), so in practice replay works out pretty well.
> >>> The worst case is subsequent submits from same process fail as well
> >>> (if they depended on something that crashing submit failed to write
> >>> back to memory.. but in that case they just crash as well and we move
> >>> on to the next one.. the recent gens (a5xx+ at least) are pretty good
> >>> about quickly detecting problems and giving us an error irq.
> >> Well I absolutely agree with Daniel.
> >>
> >> The whole replay thing AMD did in the scheduler is an absolutely mess
> >> and should probably be killed with fire.
> >>
> >> I strongly recommend not to do the same mistake in other drivers.
> >>
> >> If you want to have some replay feature then please make it driver
> >> specific and don't use anything from the infrastructure in the DRM
> >> scheduler.
> > hmm, perhaps I was not clear, but I'm only talking about re-emitting
> > jobs *following* the faulting one (which could be from other contexts,
> > etc).. not trying to restart the faulting job.
> >
> > You *absolutely* need to replay jobs following the faulting one, they
> > could be from unrelated contexts/processes.  You can't just drop them
> > on the floor.
>
> Well you can, it just means that their contexts are lost as well.

Which is rather inconvenient when deqp-egl reset tests, for example,
take down your compositor ;-)

(Which for even more lolz, in CrOS restarts the android container or
vm.. which makes running android-cts deqp kinda funny)

> If you re-submit jobs which were already pushed to the hardware you
> absolutely need to make a couple of things sure:
>
> 1. Don't race with your hardware. E.g. you need a way to stop processing
> in case of a timeout and then double check once more if things haven't
> finished in the meantime.
>
> 2. Make absolutely sure you never re-submit an operation when it's
> dma-fence is already signaled. Otherwise you run into memory corruption.
>
> 3. When you have multiple engines it becomes really tricky because then
> even innocent jobs might have already been started on different queues
> which now hang.

We force power-off/on the GPU to reset it which is a pretty good way
to make sure we aren't racing with the GPU.

It's worked like this since pretty much the beginning, and in the
early days of bringing up mesa support for a new gen we tend to
exercise the gpu hang/recovery path quite a lot.. so it at least seems
pretty robust ;-)

BR,
-R

>
> > Currently it is all driver specific, but I wanted to delete a lot of
> > code and move to using scheduler to handle faults/timeouts (but
> > blocked on that until [1] is resolved)
>
> Please don't.
>
> Especially don't use the pending_list or any of the scheduler
> infrastructure for GPU reset. We need to get rid of that again sooner or
> later.
>
> This is extremely hardware dependent and pushing the amdgpu specific
> handling into the GPU scheduler was a mistake we shouldn't repeat for
> other drivers.
>
> Regards,
> Christian.
>
> >
> > [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F1630457207-13107-2-git-send-email-Monk.Liu%40amd.com%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C1f6ddc253f9341231fa108da08f1afa9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637832131381866493%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=e%2F1tOh3nxH3QfzKQKiJKjCU7Z5S6haX07F8rzwZhRVY%3D&amp;reserved=0
> >
> > BR,
> > -R
> >
> >> Thanks,
> >> Christian.
> >>
> >>> BR,
> >>> -R
> >>>
> >>>> And recovering from a mess in userspace is a lot simpler than trying to
> >>>> pull of the same magic in the kernel. Plus it also helps with a few of the
> >>>> dma_fence rules, which is a nice bonus.
> >>>> -Daniel
> >>>>
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-21 16:03                                                         ` Rob Clark
  0 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-21 16:03 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, amd-gfx list, Daniel Vetter, Alex Deucher,
	Shashank Sharma

On Mon, Mar 21, 2022 at 2:30 AM Christian König
<christian.koenig@amd.com> wrote:
>
> Am 18.03.22 um 16:12 schrieb Rob Clark:
> > On Fri, Mar 18, 2022 at 12:42 AM Christian König
> > <christian.koenig@amd.com> wrote:
> >> Am 17.03.22 um 18:31 schrieb Rob Clark:
> >>> On Thu, Mar 17, 2022 at 10:27 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >>>> [SNIP]
> >>>>> (At some point, I'd like to use scheduler for the replay, and actually
> >>>>> use drm_sched_stop()/etc.. but last time I looked there were still
> >>>>> some sched bugs in that area which prevented me from deleting a bunch
> >>>>> of code ;-))
> >>>> Not sure about your hw, but at least on intel replaying tends to just
> >>>> result in follow-on fun. And that holds even more so the more complex a
> >>>> workload is. This is why vk just dies immediately and does not try to
> >>>> replay anything, offloading it to the app. Same with arb robusteness.
> >>>> Afaik it's really only media and classic gl which insist that the driver
> >>>> stack somehow recover.
> >>> At least for us, each submit must be self-contained (ie. not rely on
> >>> previous GPU hw state), so in practice replay works out pretty well.
> >>> The worst case is subsequent submits from same process fail as well
> >>> (if they depended on something that crashing submit failed to write
> >>> back to memory.. but in that case they just crash as well and we move
> >>> on to the next one.. the recent gens (a5xx+ at least) are pretty good
> >>> about quickly detecting problems and giving us an error irq.
> >> Well I absolutely agree with Daniel.
> >>
> >> The whole replay thing AMD did in the scheduler is an absolutely mess
> >> and should probably be killed with fire.
> >>
> >> I strongly recommend not to do the same mistake in other drivers.
> >>
> >> If you want to have some replay feature then please make it driver
> >> specific and don't use anything from the infrastructure in the DRM
> >> scheduler.
> > hmm, perhaps I was not clear, but I'm only talking about re-emitting
> > jobs *following* the faulting one (which could be from other contexts,
> > etc).. not trying to restart the faulting job.
> >
> > You *absolutely* need to replay jobs following the faulting one, they
> > could be from unrelated contexts/processes.  You can't just drop them
> > on the floor.
>
> Well you can, it just means that their contexts are lost as well.

Which is rather inconvenient when deqp-egl reset tests, for example,
take down your compositor ;-)

(Which for even more lolz, in CrOS restarts the android container or
vm.. which makes running android-cts deqp kinda funny)

> If you re-submit jobs which were already pushed to the hardware you
> absolutely need to make a couple of things sure:
>
> 1. Don't race with your hardware. E.g. you need a way to stop processing
> in case of a timeout and then double check once more if things haven't
> finished in the meantime.
>
> 2. Make absolutely sure you never re-submit an operation when it's
> dma-fence is already signaled. Otherwise you run into memory corruption.
>
> 3. When you have multiple engines it becomes really tricky because then
> even innocent jobs might have already been started on different queues
> which now hang.

We force power-off/on the GPU to reset it which is a pretty good way
to make sure we aren't racing with the GPU.

It's worked like this since pretty much the beginning, and in the
early days of bringing up mesa support for a new gen we tend to
exercise the gpu hang/recovery path quite a lot.. so it at least seems
pretty robust ;-)

BR,
-R

>
> > Currently it is all driver specific, but I wanted to delete a lot of
> > code and move to using scheduler to handle faults/timeouts (but
> > blocked on that until [1] is resolved)
>
> Please don't.
>
> Especially don't use the pending_list or any of the scheduler
> infrastructure for GPU reset. We need to get rid of that again sooner or
> later.
>
> This is extremely hardware dependent and pushing the amdgpu specific
> handling into the GPU scheduler was a mistake we shouldn't repeat for
> other drivers.
>
> Regards,
> Christian.
>
> >
> > [1] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F1630457207-13107-2-git-send-email-Monk.Liu%40amd.com%2F&amp;data=04%7C01%7Cchristian.koenig%40amd.com%7C1f6ddc253f9341231fa108da08f1afa9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637832131381866493%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=e%2F1tOh3nxH3QfzKQKiJKjCU7Z5S6haX07F8rzwZhRVY%3D&amp;reserved=0
> >
> > BR,
> > -R
> >
> >> Thanks,
> >> Christian.
> >>
> >>> BR,
> >>> -R
> >>>
> >>>> And recovering from a mess in userspace is a lot simpler than trying to
> >>>> pull of the same magic in the kernel. Plus it also helps with a few of the
> >>>> dma_fence rules, which is a nice bonus.
> >>>> -Daniel
> >>>>
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-21 16:03                                                         ` Rob Clark
  (?)
@ 2022-03-23 14:07                                                         ` Daniel Stone
  2022-03-23 15:14                                                             ` Daniel Vetter
  -1 siblings, 1 reply; 95+ messages in thread
From: Daniel Stone @ 2022-03-23 14:07 UTC (permalink / raw)
  To: Rob Clark
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian König

Hi,

On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> <christian.koenig@amd.com> wrote:
> > Well you can, it just means that their contexts are lost as well.
>
> Which is rather inconvenient when deqp-egl reset tests, for example,
> take down your compositor ;-)

Yeah. Or anything WebGL.

System-wide collateral damage is definitely a non-starter. If that
means that the userspace driver has to do what iris does and ensure
everything's recreated and resubmitted, that works too, just as long
as the response to 'my adblocker didn't detect a crypto miner ad'  is
something better than 'shoot the entire user session'.

Cheers,
Daniel

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-23 14:07                                                         ` Daniel Stone
@ 2022-03-23 15:14                                                             ` Daniel Vetter
  0 siblings, 0 replies; 95+ messages in thread
From: Daniel Vetter @ 2022-03-23 15:14 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma, Christian König

On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
>
> Hi,
>
> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
> > On Mon, Mar 21, 2022 at 2:30 AM Christian König
> > <christian.koenig@amd.com> wrote:
> > > Well you can, it just means that their contexts are lost as well.
> >
> > Which is rather inconvenient when deqp-egl reset tests, for example,
> > take down your compositor ;-)
>
> Yeah. Or anything WebGL.
>
> System-wide collateral damage is definitely a non-starter. If that
> means that the userspace driver has to do what iris does and ensure
> everything's recreated and resubmitted, that works too, just as long
> as the response to 'my adblocker didn't detect a crypto miner ad'  is
> something better than 'shoot the entire user session'.

Not sure where that idea came from, I thought at least I made it clear
that legacy gl _has_ to recover. It's only vk and arb_robustness gl
which should die without recovery attempt.

The entire discussion here is who should be responsible for replay and
at least if you can decide the uapi, then punting that entirely to
userspace is a good approach.

Ofc it'd be nice if the collateral damage is limited, i.e. requests
not currently on the gpu, or on different engines and all that
shouldn't be nuked, if possible.

Also ofc since msm uapi is that the kernel tries to recover there's
not much we can do there, contexts cannot be shot. But still trying to
replay them as much as possible feels a bit like overkill.
-Daniel

> Cheers,
> Daniel



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-23 15:14                                                             ` Daniel Vetter
  0 siblings, 0 replies; 95+ messages in thread
From: Daniel Vetter @ 2022-03-23 15:14 UTC (permalink / raw)
  To: Daniel Stone
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, Rob Clark,
	amd-gfx list, Alexandar Deucher, Shashank Sharma,
	Christian König

On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
>
> Hi,
>
> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
> > On Mon, Mar 21, 2022 at 2:30 AM Christian König
> > <christian.koenig@amd.com> wrote:
> > > Well you can, it just means that their contexts are lost as well.
> >
> > Which is rather inconvenient when deqp-egl reset tests, for example,
> > take down your compositor ;-)
>
> Yeah. Or anything WebGL.
>
> System-wide collateral damage is definitely a non-starter. If that
> means that the userspace driver has to do what iris does and ensure
> everything's recreated and resubmitted, that works too, just as long
> as the response to 'my adblocker didn't detect a crypto miner ad'  is
> something better than 'shoot the entire user session'.

Not sure where that idea came from, I thought at least I made it clear
that legacy gl _has_ to recover. It's only vk and arb_robustness gl
which should die without recovery attempt.

The entire discussion here is who should be responsible for replay and
at least if you can decide the uapi, then punting that entirely to
userspace is a good approach.

Ofc it'd be nice if the collateral damage is limited, i.e. requests
not currently on the gpu, or on different engines and all that
shouldn't be nuked, if possible.

Also ofc since msm uapi is that the kernel tries to recover there's
not much we can do there, contexts cannot be shot. But still trying to
replay them as much as possible feels a bit like overkill.
-Daniel

> Cheers,
> Daniel



-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-23 15:14                                                             ` Daniel Vetter
@ 2022-03-23 15:25                                                               ` Christian König
  -1 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-23 15:25 UTC (permalink / raw)
  To: Daniel Vetter, Daniel Stone, Olsak, Marek, Andrey Grodzovsky
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, amd-gfx list,
	Alexandar Deucher, Shashank Sharma

[Adding Marek and Andrey as well]

Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
>> Hi,
>>
>> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
>>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Well you can, it just means that their contexts are lost as well.
>>> Which is rather inconvenient when deqp-egl reset tests, for example,
>>> take down your compositor ;-)
>> Yeah. Or anything WebGL.
>>
>> System-wide collateral damage is definitely a non-starter. If that
>> means that the userspace driver has to do what iris does and ensure
>> everything's recreated and resubmitted, that works too, just as long
>> as the response to 'my adblocker didn't detect a crypto miner ad'  is
>> something better than 'shoot the entire user session'.
> Not sure where that idea came from, I thought at least I made it clear
> that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> which should die without recovery attempt.
>
> The entire discussion here is who should be responsible for replay and
> at least if you can decide the uapi, then punting that entirely to
> userspace is a good approach.

Yes, completely agree. We have the approach of re-submitting things in 
the kernel and that failed quite miserable.

In other words currently a GPU reset has something like a 99% chance to 
get down your whole desktop.

Daniel can you briefly explain what exactly iris does when a lost 
context is detected without gl robustness?

It sounds like you guys got that working quite well.

Thanks,
Christian.

>
> Ofc it'd be nice if the collateral damage is limited, i.e. requests
> not currently on the gpu, or on different engines and all that
> shouldn't be nuked, if possible.
>
> Also ofc since msm uapi is that the kernel tries to recover there's
> not much we can do there, contexts cannot be shot. But still trying to
> replay them as much as possible feels a bit like overkill.
> -Daniel
>
>> Cheers,
>> Daniel
>
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-23 15:25                                                               ` Christian König
  0 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-23 15:25 UTC (permalink / raw)
  To: Daniel Vetter, Daniel Stone, Olsak, Marek, Andrey Grodzovsky
  Cc: Rob Clark, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel, Rob Clark,
	amd-gfx list, Alexandar Deucher, Shashank Sharma

[Adding Marek and Andrey as well]

Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
>> Hi,
>>
>> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
>>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Well you can, it just means that their contexts are lost as well.
>>> Which is rather inconvenient when deqp-egl reset tests, for example,
>>> take down your compositor ;-)
>> Yeah. Or anything WebGL.
>>
>> System-wide collateral damage is definitely a non-starter. If that
>> means that the userspace driver has to do what iris does and ensure
>> everything's recreated and resubmitted, that works too, just as long
>> as the response to 'my adblocker didn't detect a crypto miner ad'  is
>> something better than 'shoot the entire user session'.
> Not sure where that idea came from, I thought at least I made it clear
> that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> which should die without recovery attempt.
>
> The entire discussion here is who should be responsible for replay and
> at least if you can decide the uapi, then punting that entirely to
> userspace is a good approach.

Yes, completely agree. We have the approach of re-submitting things in 
the kernel and that failed quite miserable.

In other words currently a GPU reset has something like a 99% chance to 
get down your whole desktop.

Daniel can you briefly explain what exactly iris does when a lost 
context is detected without gl robustness?

It sounds like you guys got that working quite well.

Thanks,
Christian.

>
> Ofc it'd be nice if the collateral damage is limited, i.e. requests
> not currently on the gpu, or on different engines and all that
> shouldn't be nuked, if possible.
>
> Also ofc since msm uapi is that the kernel tries to recover there's
> not much we can do there, contexts cannot be shot. But still trying to
> replay them as much as possible feels a bit like overkill.
> -Daniel
>
>> Cheers,
>> Daniel
>
>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-23 15:14                                                             ` Daniel Vetter
@ 2022-03-23 17:30                                                               ` Rob Clark
  -1 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-23 17:30 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Rob Clark, amd-gfx list, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, dri-devel,
	Alexandar Deucher, Shashank Sharma, Christian König

On Wed, Mar 23, 2022 at 8:14 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
> >
> > Hi,
> >
> > On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
> > > On Mon, Mar 21, 2022 at 2:30 AM Christian König
> > > <christian.koenig@amd.com> wrote:
> > > > Well you can, it just means that their contexts are lost as well.
> > >
> > > Which is rather inconvenient when deqp-egl reset tests, for example,
> > > take down your compositor ;-)
> >
> > Yeah. Or anything WebGL.
> >
> > System-wide collateral damage is definitely a non-starter. If that
> > means that the userspace driver has to do what iris does and ensure
> > everything's recreated and resubmitted, that works too, just as long
> > as the response to 'my adblocker didn't detect a crypto miner ad'  is
> > something better than 'shoot the entire user session'.
>
> Not sure where that idea came from, I thought at least I made it clear
> that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> which should die without recovery attempt.
>
> The entire discussion here is who should be responsible for replay and
> at least if you can decide the uapi, then punting that entirely to
> userspace is a good approach.
>
> Ofc it'd be nice if the collateral damage is limited, i.e. requests
> not currently on the gpu, or on different engines and all that
> shouldn't be nuked, if possible.
>
> Also ofc since msm uapi is that the kernel tries to recover there's
> not much we can do there, contexts cannot be shot. But still trying to
> replay them as much as possible feels a bit like overkill.

It would perhaps be nice if older gens which don't (yet) have
per-process pgtables to have gone with the userspace-replays (although
that would require a lot more tracking in userspace than what is done
currently).. but fortunately those older gens don't use "state
objects" which could potentially be corrupted, but instead re-emit
state in cmdstream, so there is a lot less possibility for bad
collateral damage.  (On all the gens we also use gpu read-only buffers
whenever the gpu does not need to be able to write them.)

For newer stuff, the process isolation works pretty well.  In fact we
recently changed MSM_PARAM_FAULTS to only report faults/hangs in the
same address space, so the compositor is not even aware (and doesn't
need to be aware).

BR,
-R

> -Daniel
>
> > Cheers,
> > Daniel
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-23 17:30                                                               ` Rob Clark
  0 siblings, 0 replies; 95+ messages in thread
From: Rob Clark @ 2022-03-23 17:30 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Rob Clark, amd-gfx list, Sharma, Shashank, Christian König,
	Amaranath Somalapuram, Abhinav Kumar, Daniel Stone, dri-devel,
	Alexandar Deucher, Shashank Sharma, Christian König

On Wed, Mar 23, 2022 at 8:14 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
> >
> > Hi,
> >
> > On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
> > > On Mon, Mar 21, 2022 at 2:30 AM Christian König
> > > <christian.koenig@amd.com> wrote:
> > > > Well you can, it just means that their contexts are lost as well.
> > >
> > > Which is rather inconvenient when deqp-egl reset tests, for example,
> > > take down your compositor ;-)
> >
> > Yeah. Or anything WebGL.
> >
> > System-wide collateral damage is definitely a non-starter. If that
> > means that the userspace driver has to do what iris does and ensure
> > everything's recreated and resubmitted, that works too, just as long
> > as the response to 'my adblocker didn't detect a crypto miner ad'  is
> > something better than 'shoot the entire user session'.
>
> Not sure where that idea came from, I thought at least I made it clear
> that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> which should die without recovery attempt.
>
> The entire discussion here is who should be responsible for replay and
> at least if you can decide the uapi, then punting that entirely to
> userspace is a good approach.
>
> Ofc it'd be nice if the collateral damage is limited, i.e. requests
> not currently on the gpu, or on different engines and all that
> shouldn't be nuked, if possible.
>
> Also ofc since msm uapi is that the kernel tries to recover there's
> not much we can do there, contexts cannot be shot. But still trying to
> replay them as much as possible feels a bit like overkill.

It would perhaps be nice if older gens which don't (yet) have
per-process pgtables to have gone with the userspace-replays (although
that would require a lot more tracking in userspace than what is done
currently).. but fortunately those older gens don't use "state
objects" which could potentially be corrupted, but instead re-emit
state in cmdstream, so there is a lot less possibility for bad
collateral damage.  (On all the gens we also use gpu read-only buffers
whenever the gpu does not need to be able to write them.)

For newer stuff, the process isolation works pretty well.  In fact we
recently changed MSM_PARAM_FAULTS to only report faults/hangs in the
same address space, so the compositor is not even aware (and doesn't
need to be aware).

BR,
-R

> -Daniel
>
> > Cheers,
> > Daniel
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-23 15:25                                                               ` Christian König
@ 2022-03-26  0:53                                                                 ` Olsak, Marek
  -1 siblings, 0 replies; 95+ messages in thread
From: Olsak, Marek @ 2022-03-26  0:53 UTC (permalink / raw)
  To: Koenig, Christian, Daniel Vetter, Daniel Stone, Grodzovsky, Andrey
  Cc: Rob Clark, Sharma, Shashank, Christian König, Somalapuram,
	Amaranath, Abhinav Kumar, dri-devel, amd-gfx list, Deucher,
	Alexander, Shashank Sharma

[-- Attachment #1: Type: text/plain, Size: 4829 bytes --]

[AMD Official Use Only]

amdgpu has 2 resets: soft reset and hard reset.

The soft reset is able to recover from an infinite loop and even some GPU hangs due to bad shaders or bad states. The soft reset uses a signal that kills all currently-running shaders of a certain process (VM context), which unblocks the graphics pipeline, so draws and command buffers finish but are not correctly. This can then cause a hard hang if the shader was supposed to signal work completion through a shader store instruction and a non-shader consumer is waiting for it (skipping the store instruction by killing the shader won't signal the work, and thus the consumer will be stuck, requiring a hard reset).

The hard reset can recover from other hangs, which is great, but it may use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory contents, but we should assume that any process that had running jobs on the GPU during a GPU reset has its memory resources in an inconsistent state, and thus following command buffers can cause another GPU hang. The shader store example above is enough to cause another hard hang due to incorrect content in memory resources, which can contain synchronization primitives that are used internally by the hardware.

Asking the driver to replay a command buffer that caused a hang is a sure way to hang it again. Unrelated processes can be affected due to lost VRAM or the misfortune of using the GPU while the GPU hang occurred. The window system should recreate GPU resources and redraw everything without affecting applications. If apps use GL, they should do the same. Processes that can't recover by redrawing content can be terminated or left alone, but they shouldn't be allowed to submit work to the GPU anymore.

dEQP only exercises the soft reset. I think WebGL is only able to trigger a soft reset at this point, but Vulkan can also trigger a hard reset.

Marek
________________________________
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: March 23, 2022 11:25
To: Daniel Vetter <daniel@ffwll.ch>; Daniel Stone <daniel@fooishbar.org>; Olsak, Marek <Marek.Olsak@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Cc: Rob Clark <robdclark@gmail.com>; Rob Clark <robdclark@chromium.org>; Sharma, Shashank <Shashank.Sharma@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Abhinav Kumar <quic_abhinavk@quicinc.com>; dri-devel <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Deucher, Alexander <Alexander.Deucher@amd.com>; Shashank Sharma <contactshashanksharma@gmail.com>
Subject: Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

[Adding Marek and Andrey as well]

Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
>> Hi,
>>
>> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
>>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Well you can, it just means that their contexts are lost as well.
>>> Which is rather inconvenient when deqp-egl reset tests, for example,
>>> take down your compositor ;-)
>> Yeah. Or anything WebGL.
>>
>> System-wide collateral damage is definitely a non-starter. If that
>> means that the userspace driver has to do what iris does and ensure
>> everything's recreated and resubmitted, that works too, just as long
>> as the response to 'my adblocker didn't detect a crypto miner ad'  is
>> something better than 'shoot the entire user session'.
> Not sure where that idea came from, I thought at least I made it clear
> that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> which should die without recovery attempt.
>
> The entire discussion here is who should be responsible for replay and
> at least if you can decide the uapi, then punting that entirely to
> userspace is a good approach.

Yes, completely agree. We have the approach of re-submitting things in
the kernel and that failed quite miserable.

In other words currently a GPU reset has something like a 99% chance to
get down your whole desktop.

Daniel can you briefly explain what exactly iris does when a lost
context is detected without gl robustness?

It sounds like you guys got that working quite well.

Thanks,
Christian.

>
> Ofc it'd be nice if the collateral damage is limited, i.e. requests
> not currently on the gpu, or on different engines and all that
> shouldn't be nuked, if possible.
>
> Also ofc since msm uapi is that the kernel tries to recover there's
> not much we can do there, contexts cannot be shot. But still trying to
> replay them as much as possible feels a bit like overkill.
> -Daniel
>
>> Cheers,
>> Daniel
>
>


[-- Attachment #2: Type: text/html, Size: 7391 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-26  0:53                                                                 ` Olsak, Marek
  0 siblings, 0 replies; 95+ messages in thread
From: Olsak, Marek @ 2022-03-26  0:53 UTC (permalink / raw)
  To: Koenig, Christian, Daniel Vetter, Daniel Stone, Grodzovsky, Andrey
  Cc: Rob Clark, Sharma, Shashank, Christian König, Somalapuram,
	Amaranath, Abhinav Kumar, dri-devel, Rob Clark, amd-gfx list,
	Deucher, Alexander, Shashank Sharma

[-- Attachment #1: Type: text/plain, Size: 4829 bytes --]

[AMD Official Use Only]

amdgpu has 2 resets: soft reset and hard reset.

The soft reset is able to recover from an infinite loop and even some GPU hangs due to bad shaders or bad states. The soft reset uses a signal that kills all currently-running shaders of a certain process (VM context), which unblocks the graphics pipeline, so draws and command buffers finish but are not correctly. This can then cause a hard hang if the shader was supposed to signal work completion through a shader store instruction and a non-shader consumer is waiting for it (skipping the store instruction by killing the shader won't signal the work, and thus the consumer will be stuck, requiring a hard reset).

The hard reset can recover from other hangs, which is great, but it may use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory contents, but we should assume that any process that had running jobs on the GPU during a GPU reset has its memory resources in an inconsistent state, and thus following command buffers can cause another GPU hang. The shader store example above is enough to cause another hard hang due to incorrect content in memory resources, which can contain synchronization primitives that are used internally by the hardware.

Asking the driver to replay a command buffer that caused a hang is a sure way to hang it again. Unrelated processes can be affected due to lost VRAM or the misfortune of using the GPU while the GPU hang occurred. The window system should recreate GPU resources and redraw everything without affecting applications. If apps use GL, they should do the same. Processes that can't recover by redrawing content can be terminated or left alone, but they shouldn't be allowed to submit work to the GPU anymore.

dEQP only exercises the soft reset. I think WebGL is only able to trigger a soft reset at this point, but Vulkan can also trigger a hard reset.

Marek
________________________________
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: March 23, 2022 11:25
To: Daniel Vetter <daniel@ffwll.ch>; Daniel Stone <daniel@fooishbar.org>; Olsak, Marek <Marek.Olsak@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Cc: Rob Clark <robdclark@gmail.com>; Rob Clark <robdclark@chromium.org>; Sharma, Shashank <Shashank.Sharma@amd.com>; Christian König <ckoenig.leichtzumerken@gmail.com>; Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>; Abhinav Kumar <quic_abhinavk@quicinc.com>; dri-devel <dri-devel@lists.freedesktop.org>; amd-gfx list <amd-gfx@lists.freedesktop.org>; Deucher, Alexander <Alexander.Deucher@amd.com>; Shashank Sharma <contactshashanksharma@gmail.com>
Subject: Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event

[Adding Marek and Andrey as well]

Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
>> Hi,
>>
>> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
>>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
>>> <christian.koenig@amd.com> wrote:
>>>> Well you can, it just means that their contexts are lost as well.
>>> Which is rather inconvenient when deqp-egl reset tests, for example,
>>> take down your compositor ;-)
>> Yeah. Or anything WebGL.
>>
>> System-wide collateral damage is definitely a non-starter. If that
>> means that the userspace driver has to do what iris does and ensure
>> everything's recreated and resubmitted, that works too, just as long
>> as the response to 'my adblocker didn't detect a crypto miner ad'  is
>> something better than 'shoot the entire user session'.
> Not sure where that idea came from, I thought at least I made it clear
> that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> which should die without recovery attempt.
>
> The entire discussion here is who should be responsible for replay and
> at least if you can decide the uapi, then punting that entirely to
> userspace is a good approach.

Yes, completely agree. We have the approach of re-submitting things in
the kernel and that failed quite miserable.

In other words currently a GPU reset has something like a 99% chance to
get down your whole desktop.

Daniel can you briefly explain what exactly iris does when a lost
context is detected without gl robustness?

It sounds like you guys got that working quite well.

Thanks,
Christian.

>
> Ofc it'd be nice if the collateral damage is limited, i.e. requests
> not currently on the gpu, or on different engines and all that
> shouldn't be nuked, if possible.
>
> Also ofc since msm uapi is that the kernel tries to recover there's
> not much we can do there, contexts cannot be shot. But still trying to
> replay them as much as possible feels a bit like overkill.
> -Daniel
>
>> Cheers,
>> Daniel
>
>


[-- Attachment #2: Type: text/html, Size: 7391 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-26  0:53                                                                 ` Olsak, Marek
@ 2022-03-29 12:14                                                                   ` Christian König
  -1 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-29 12:14 UTC (permalink / raw)
  To: Olsak, Marek, Daniel Vetter, Daniel Stone, Grodzovsky, Andrey
  Cc: Rob Clark, Sharma, Shashank, Christian König, Somalapuram,
	Amaranath, Abhinav Kumar, dri-devel, amd-gfx list, Deucher,
	Alexander, Shashank Sharma

[-- Attachment #1: Type: text/plain, Size: 5393 bytes --]

My main question is what does the iris driver better than radeonsi when 
the client doesn't support the robustness extension?

 From Daniels description it sounds like they have at least a partial 
recovery mechanism in place.

Apart from that I completely agree to what you said below.

Christian.

Am 26.03.22 um 01:53 schrieb Olsak, Marek:
>
> [AMD Official Use Only]
>
>
> amdgpu has 2 resets: soft reset and hard reset.
>
> The soft reset is able to recover from an infinite loop and even some 
> GPU hangs due to bad shaders or bad states. The soft reset uses a 
> signal that kills all currently-running shaders of a certain process 
> (VM context), which unblocks the graphics pipeline, so draws and 
> command buffers finish but are not correctly. This can then cause a 
> hard hang if the shader was supposed to signal work completion through 
> a shader store instruction and a non-shader consumer is waiting for it 
> (skipping the store instruction by killing the shader won't signal the 
> work, and thus the consumer will be stuck, requiring a hard reset).
>
> The hard reset can recover from other hangs, which is great, but it 
> may use a PCI reset, which erases VRAM on dGPUs. APUs don't lose 
> memory contents, but we should assume that any process that had 
> running jobs on the GPU during a GPU reset has its memory resources in 
> an inconsistent state, and thus following command buffers can cause 
> another GPU hang. The shader store example above is enough to cause 
> another hard hang due to incorrect content in memory resources, which 
> can contain synchronization primitives that are used internally by the 
> hardware.
>
> Asking the driver to replay a command buffer that caused a hang is a 
> sure way to hang it again. Unrelated processes can be affected due to 
> lost VRAM or the misfortune of using the GPU while the GPU hang 
> occurred. The window system should recreate GPU resources and redraw 
> everything without affecting applications. If apps use GL, they should 
> do the same. Processes that can't recover by redrawing content can be 
> terminated or left alone, but they shouldn't be allowed to submit work 
> to the GPU anymore.
>
> dEQP only exercises the soft reset. I think WebGL is only able to 
> trigger a soft reset at this point, but Vulkan can also trigger a hard 
> reset.
>
> Marek
> ------------------------------------------------------------------------
> *From:* Koenig, Christian <Christian.Koenig@amd.com>
> *Sent:* March 23, 2022 11:25
> *To:* Daniel Vetter <daniel@ffwll.ch>; Daniel Stone 
> <daniel@fooishbar.org>; Olsak, Marek <Marek.Olsak@amd.com>; 
> Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> *Cc:* Rob Clark <robdclark@gmail.com>; Rob Clark 
> <robdclark@chromium.org>; Sharma, Shashank <Shashank.Sharma@amd.com>; 
> Christian König <ckoenig.leichtzumerken@gmail.com>; Somalapuram, 
> Amaranath <Amaranath.Somalapuram@amd.com>; Abhinav Kumar 
> <quic_abhinavk@quicinc.com>; dri-devel 
> <dri-devel@lists.freedesktop.org>; amd-gfx list 
> <amd-gfx@lists.freedesktop.org>; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; Shashank Sharma 
> <contactshashanksharma@gmail.com>
> *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
> [Adding Marek and Andrey as well]
>
> Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> > On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
> >> Hi,
> >>
> >> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
> >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> Well you can, it just means that their contexts are lost as well.
> >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> >>> take down your compositor ;-)
> >> Yeah. Or anything WebGL.
> >>
> >> System-wide collateral damage is definitely a non-starter. If that
> >> means that the userspace driver has to do what iris does and ensure
> >> everything's recreated and resubmitted, that works too, just as long
> >> as the response to 'my adblocker didn't detect a crypto miner ad'  is
> >> something better than 'shoot the entire user session'.
> > Not sure where that idea came from, I thought at least I made it clear
> > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > which should die without recovery attempt.
> >
> > The entire discussion here is who should be responsible for replay and
> > at least if you can decide the uapi, then punting that entirely to
> > userspace is a good approach.
>
> Yes, completely agree. We have the approach of re-submitting things in
> the kernel and that failed quite miserable.
>
> In other words currently a GPU reset has something like a 99% chance to
> get down your whole desktop.
>
> Daniel can you briefly explain what exactly iris does when a lost
> context is detected without gl robustness?
>
> It sounds like you guys got that working quite well.
>
> Thanks,
> Christian.
>
> >
> > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > not currently on the gpu, or on different engines and all that
> > shouldn't be nuked, if possible.
> >
> > Also ofc since msm uapi is that the kernel tries to recover there's
> > not much we can do there, contexts cannot be shot. But still trying to
> > replay them as much as possible feels a bit like overkill.
> > -Daniel
> >
> >> Cheers,
> >> Daniel
> >
> >
>

[-- Attachment #2: Type: text/html, Size: 11493 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-29 12:14                                                                   ` Christian König
  0 siblings, 0 replies; 95+ messages in thread
From: Christian König @ 2022-03-29 12:14 UTC (permalink / raw)
  To: Olsak, Marek, Daniel Vetter, Daniel Stone, Grodzovsky, Andrey
  Cc: Rob Clark, Sharma, Shashank, Christian König, Somalapuram,
	Amaranath, Abhinav Kumar, dri-devel, Rob Clark, amd-gfx list,
	Deucher, Alexander, Shashank Sharma

[-- Attachment #1: Type: text/plain, Size: 5393 bytes --]

My main question is what does the iris driver better than radeonsi when 
the client doesn't support the robustness extension?

 From Daniels description it sounds like they have at least a partial 
recovery mechanism in place.

Apart from that I completely agree to what you said below.

Christian.

Am 26.03.22 um 01:53 schrieb Olsak, Marek:
>
> [AMD Official Use Only]
>
>
> amdgpu has 2 resets: soft reset and hard reset.
>
> The soft reset is able to recover from an infinite loop and even some 
> GPU hangs due to bad shaders or bad states. The soft reset uses a 
> signal that kills all currently-running shaders of a certain process 
> (VM context), which unblocks the graphics pipeline, so draws and 
> command buffers finish but are not correctly. This can then cause a 
> hard hang if the shader was supposed to signal work completion through 
> a shader store instruction and a non-shader consumer is waiting for it 
> (skipping the store instruction by killing the shader won't signal the 
> work, and thus the consumer will be stuck, requiring a hard reset).
>
> The hard reset can recover from other hangs, which is great, but it 
> may use a PCI reset, which erases VRAM on dGPUs. APUs don't lose 
> memory contents, but we should assume that any process that had 
> running jobs on the GPU during a GPU reset has its memory resources in 
> an inconsistent state, and thus following command buffers can cause 
> another GPU hang. The shader store example above is enough to cause 
> another hard hang due to incorrect content in memory resources, which 
> can contain synchronization primitives that are used internally by the 
> hardware.
>
> Asking the driver to replay a command buffer that caused a hang is a 
> sure way to hang it again. Unrelated processes can be affected due to 
> lost VRAM or the misfortune of using the GPU while the GPU hang 
> occurred. The window system should recreate GPU resources and redraw 
> everything without affecting applications. If apps use GL, they should 
> do the same. Processes that can't recover by redrawing content can be 
> terminated or left alone, but they shouldn't be allowed to submit work 
> to the GPU anymore.
>
> dEQP only exercises the soft reset. I think WebGL is only able to 
> trigger a soft reset at this point, but Vulkan can also trigger a hard 
> reset.
>
> Marek
> ------------------------------------------------------------------------
> *From:* Koenig, Christian <Christian.Koenig@amd.com>
> *Sent:* March 23, 2022 11:25
> *To:* Daniel Vetter <daniel@ffwll.ch>; Daniel Stone 
> <daniel@fooishbar.org>; Olsak, Marek <Marek.Olsak@amd.com>; 
> Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> *Cc:* Rob Clark <robdclark@gmail.com>; Rob Clark 
> <robdclark@chromium.org>; Sharma, Shashank <Shashank.Sharma@amd.com>; 
> Christian König <ckoenig.leichtzumerken@gmail.com>; Somalapuram, 
> Amaranath <Amaranath.Somalapuram@amd.com>; Abhinav Kumar 
> <quic_abhinavk@quicinc.com>; dri-devel 
> <dri-devel@lists.freedesktop.org>; amd-gfx list 
> <amd-gfx@lists.freedesktop.org>; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; Shashank Sharma 
> <contactshashanksharma@gmail.com>
> *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
> [Adding Marek and Andrey as well]
>
> Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> > On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
> >> Hi,
> >>
> >> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
> >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> Well you can, it just means that their contexts are lost as well.
> >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> >>> take down your compositor ;-)
> >> Yeah. Or anything WebGL.
> >>
> >> System-wide collateral damage is definitely a non-starter. If that
> >> means that the userspace driver has to do what iris does and ensure
> >> everything's recreated and resubmitted, that works too, just as long
> >> as the response to 'my adblocker didn't detect a crypto miner ad'  is
> >> something better than 'shoot the entire user session'.
> > Not sure where that idea came from, I thought at least I made it clear
> > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > which should die without recovery attempt.
> >
> > The entire discussion here is who should be responsible for replay and
> > at least if you can decide the uapi, then punting that entirely to
> > userspace is a good approach.
>
> Yes, completely agree. We have the approach of re-submitting things in
> the kernel and that failed quite miserable.
>
> In other words currently a GPU reset has something like a 99% chance to
> get down your whole desktop.
>
> Daniel can you briefly explain what exactly iris does when a lost
> context is detected without gl robustness?
>
> It sounds like you guys got that working quite well.
>
> Thanks,
> Christian.
>
> >
> > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > not currently on the gpu, or on different engines and all that
> > shouldn't be nuked, if possible.
> >
> > Also ofc since msm uapi is that the kernel tries to recover there's
> > not much we can do there, contexts cannot be shot. But still trying to
> > replay them as much as possible feels a bit like overkill.
> > -Daniel
> >
> >> Cheers,
> >> Daniel
> >
> >
>

[-- Attachment #2: Type: text/html, Size: 11493 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-29 12:14                                                                   ` Christian König
@ 2022-03-29 16:25                                                                     ` Marek Olšák
  -1 siblings, 0 replies; 95+ messages in thread
From: Marek Olšák @ 2022-03-29 16:25 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, amd-gfx list, Sharma, Shashank, Christian König,
	Olsak, Marek, Somalapuram, Amaranath, Abhinav Kumar, dri-devel,
	Deucher, Alexander, Shashank Sharma

[-- Attachment #1: Type: text/plain, Size: 6183 bytes --]

I don't know what iris does, but I would guess that the same problems as
with AMD GPUs apply, making GPUs resets very fragile.

Marek

On Tue., Mar. 29, 2022, 08:14 Christian König, <christian.koenig@amd.com>
wrote:

> My main question is what does the iris driver better than radeonsi when
> the client doesn't support the robustness extension?
>
> From Daniels description it sounds like they have at least a partial
> recovery mechanism in place.
>
> Apart from that I completely agree to what you said below.
>
> Christian.
>
> Am 26.03.22 um 01:53 schrieb Olsak, Marek:
>
> [AMD Official Use Only]
>
> amdgpu has 2 resets: soft reset and hard reset.
>
> The soft reset is able to recover from an infinite loop and even some GPU
> hangs due to bad shaders or bad states. The soft reset uses a signal that
> kills all currently-running shaders of a certain process (VM context),
> which unblocks the graphics pipeline, so draws and command buffers finish
> but are not correctly. This can then cause a hard hang if the shader was
> supposed to signal work completion through a shader store instruction and a
> non-shader consumer is waiting for it (skipping the store instruction by
> killing the shader won't signal the work, and thus the consumer will be
> stuck, requiring a hard reset).
>
> The hard reset can recover from other hangs, which is great, but it may
> use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory
> contents, but we should assume that any process that had running jobs on
> the GPU during a GPU reset has its memory resources in an inconsistent
> state, and thus following command buffers can cause another GPU hang. The
> shader store example above is enough to cause another hard hang due to
> incorrect content in memory resources, which can contain synchronization
> primitives that are used internally by the hardware.
>
> Asking the driver to replay a command buffer that caused a hang is a sure
> way to hang it again. Unrelated processes can be affected due to lost VRAM
> or the misfortune of using the GPU while the GPU hang occurred. The window
> system should recreate GPU resources and redraw everything without
> affecting applications. If apps use GL, they should do the same. Processes
> that can't recover by redrawing content can be terminated or left alone,
> but they shouldn't be allowed to submit work to the GPU anymore.
>
> dEQP only exercises the soft reset. I think WebGL is only able to trigger
> a soft reset at this point, but Vulkan can also trigger a hard reset.
>
> Marek
> ------------------------------
> *From:* Koenig, Christian <Christian.Koenig@amd.com>
> <Christian.Koenig@amd.com>
> *Sent:* March 23, 2022 11:25
> *To:* Daniel Vetter <daniel@ffwll.ch> <daniel@ffwll.ch>; Daniel Stone
> <daniel@fooishbar.org> <daniel@fooishbar.org>; Olsak, Marek
> <Marek.Olsak@amd.com> <Marek.Olsak@amd.com>; Grodzovsky, Andrey
> <Andrey.Grodzovsky@amd.com> <Andrey.Grodzovsky@amd.com>
> *Cc:* Rob Clark <robdclark@gmail.com> <robdclark@gmail.com>; Rob Clark
> <robdclark@chromium.org> <robdclark@chromium.org>; Sharma, Shashank
> <Shashank.Sharma@amd.com> <Shashank.Sharma@amd.com>; Christian König
> <ckoenig.leichtzumerken@gmail.com> <ckoenig.leichtzumerken@gmail.com>;
> Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>
> <Amaranath.Somalapuram@amd.com>; Abhinav Kumar <quic_abhinavk@quicinc.com>
> <quic_abhinavk@quicinc.com>; dri-devel <dri-devel@lists.freedesktop.org>
> <dri-devel@lists.freedesktop.org>; amd-gfx list
> <amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org>; Deucher,
> Alexander <Alexander.Deucher@amd.com> <Alexander.Deucher@amd.com>;
> Shashank Sharma <contactshashanksharma@gmail.com>
> <contactshashanksharma@gmail.com>
> *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
>
> [Adding Marek and Andrey as well]
>
> Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> > On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org>
> <daniel@fooishbar.org> wrote:
> >> Hi,
> >>
> >> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com>
> <robdclark@gmail.com> wrote:
> >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> >>> <christian.koenig@amd.com> <christian.koenig@amd.com> wrote:
> >>>> Well you can, it just means that their contexts are lost as well.
> >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> >>> take down your compositor ;-)
> >> Yeah. Or anything WebGL.
> >>
> >> System-wide collateral damage is definitely a non-starter. If that
> >> means that the userspace driver has to do what iris does and ensure
> >> everything's recreated and resubmitted, that works too, just as long
> >> as the response to 'my adblocker didn't detect a crypto miner ad'  is
> >> something better than 'shoot the entire user session'.
> > Not sure where that idea came from, I thought at least I made it clear
> > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > which should die without recovery attempt.
> >
> > The entire discussion here is who should be responsible for replay and
> > at least if you can decide the uapi, then punting that entirely to
> > userspace is a good approach.
>
> Yes, completely agree. We have the approach of re-submitting things in
> the kernel and that failed quite miserable.
>
> In other words currently a GPU reset has something like a 99% chance to
> get down your whole desktop.
>
> Daniel can you briefly explain what exactly iris does when a lost
> context is detected without gl robustness?
>
> It sounds like you guys got that working quite well.
>
> Thanks,
> Christian.
>
> >
> > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > not currently on the gpu, or on different engines and all that
> > shouldn't be nuked, if possible.
> >
> > Also ofc since msm uapi is that the kernel tries to recover there's
> > not much we can do there, contexts cannot be shot. But still trying to
> > replay them as much as possible feels a bit like overkill.
> > -Daniel
> >
> >> Cheers,
> >> Daniel
> >
> >
>
>
>

[-- Attachment #2: Type: text/html, Size: 11788 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-29 16:25                                                                     ` Marek Olšák
  0 siblings, 0 replies; 95+ messages in thread
From: Marek Olšák @ 2022-03-29 16:25 UTC (permalink / raw)
  To: Christian König
  Cc: Rob Clark, Grodzovsky, Andrey, amd-gfx list, Sharma, Shashank,
	Christian König, Olsak, Marek, Somalapuram, Amaranath,
	Abhinav Kumar, Daniel Stone, Rob Clark, dri-devel, Daniel Vetter,
	Deucher, Alexander, Shashank Sharma

[-- Attachment #1: Type: text/plain, Size: 6183 bytes --]

I don't know what iris does, but I would guess that the same problems as
with AMD GPUs apply, making GPUs resets very fragile.

Marek

On Tue., Mar. 29, 2022, 08:14 Christian König, <christian.koenig@amd.com>
wrote:

> My main question is what does the iris driver better than radeonsi when
> the client doesn't support the robustness extension?
>
> From Daniels description it sounds like they have at least a partial
> recovery mechanism in place.
>
> Apart from that I completely agree to what you said below.
>
> Christian.
>
> Am 26.03.22 um 01:53 schrieb Olsak, Marek:
>
> [AMD Official Use Only]
>
> amdgpu has 2 resets: soft reset and hard reset.
>
> The soft reset is able to recover from an infinite loop and even some GPU
> hangs due to bad shaders or bad states. The soft reset uses a signal that
> kills all currently-running shaders of a certain process (VM context),
> which unblocks the graphics pipeline, so draws and command buffers finish
> but are not correctly. This can then cause a hard hang if the shader was
> supposed to signal work completion through a shader store instruction and a
> non-shader consumer is waiting for it (skipping the store instruction by
> killing the shader won't signal the work, and thus the consumer will be
> stuck, requiring a hard reset).
>
> The hard reset can recover from other hangs, which is great, but it may
> use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory
> contents, but we should assume that any process that had running jobs on
> the GPU during a GPU reset has its memory resources in an inconsistent
> state, and thus following command buffers can cause another GPU hang. The
> shader store example above is enough to cause another hard hang due to
> incorrect content in memory resources, which can contain synchronization
> primitives that are used internally by the hardware.
>
> Asking the driver to replay a command buffer that caused a hang is a sure
> way to hang it again. Unrelated processes can be affected due to lost VRAM
> or the misfortune of using the GPU while the GPU hang occurred. The window
> system should recreate GPU resources and redraw everything without
> affecting applications. If apps use GL, they should do the same. Processes
> that can't recover by redrawing content can be terminated or left alone,
> but they shouldn't be allowed to submit work to the GPU anymore.
>
> dEQP only exercises the soft reset. I think WebGL is only able to trigger
> a soft reset at this point, but Vulkan can also trigger a hard reset.
>
> Marek
> ------------------------------
> *From:* Koenig, Christian <Christian.Koenig@amd.com>
> <Christian.Koenig@amd.com>
> *Sent:* March 23, 2022 11:25
> *To:* Daniel Vetter <daniel@ffwll.ch> <daniel@ffwll.ch>; Daniel Stone
> <daniel@fooishbar.org> <daniel@fooishbar.org>; Olsak, Marek
> <Marek.Olsak@amd.com> <Marek.Olsak@amd.com>; Grodzovsky, Andrey
> <Andrey.Grodzovsky@amd.com> <Andrey.Grodzovsky@amd.com>
> *Cc:* Rob Clark <robdclark@gmail.com> <robdclark@gmail.com>; Rob Clark
> <robdclark@chromium.org> <robdclark@chromium.org>; Sharma, Shashank
> <Shashank.Sharma@amd.com> <Shashank.Sharma@amd.com>; Christian König
> <ckoenig.leichtzumerken@gmail.com> <ckoenig.leichtzumerken@gmail.com>;
> Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>
> <Amaranath.Somalapuram@amd.com>; Abhinav Kumar <quic_abhinavk@quicinc.com>
> <quic_abhinavk@quicinc.com>; dri-devel <dri-devel@lists.freedesktop.org>
> <dri-devel@lists.freedesktop.org>; amd-gfx list
> <amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org>; Deucher,
> Alexander <Alexander.Deucher@amd.com> <Alexander.Deucher@amd.com>;
> Shashank Sharma <contactshashanksharma@gmail.com>
> <contactshashanksharma@gmail.com>
> *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
>
> [Adding Marek and Andrey as well]
>
> Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> > On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org>
> <daniel@fooishbar.org> wrote:
> >> Hi,
> >>
> >> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com>
> <robdclark@gmail.com> wrote:
> >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> >>> <christian.koenig@amd.com> <christian.koenig@amd.com> wrote:
> >>>> Well you can, it just means that their contexts are lost as well.
> >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> >>> take down your compositor ;-)
> >> Yeah. Or anything WebGL.
> >>
> >> System-wide collateral damage is definitely a non-starter. If that
> >> means that the userspace driver has to do what iris does and ensure
> >> everything's recreated and resubmitted, that works too, just as long
> >> as the response to 'my adblocker didn't detect a crypto miner ad'  is
> >> something better than 'shoot the entire user session'.
> > Not sure where that idea came from, I thought at least I made it clear
> > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > which should die without recovery attempt.
> >
> > The entire discussion here is who should be responsible for replay and
> > at least if you can decide the uapi, then punting that entirely to
> > userspace is a good approach.
>
> Yes, completely agree. We have the approach of re-submitting things in
> the kernel and that failed quite miserable.
>
> In other words currently a GPU reset has something like a 99% chance to
> get down your whole desktop.
>
> Daniel can you briefly explain what exactly iris does when a lost
> context is detected without gl robustness?
>
> It sounds like you guys got that working quite well.
>
> Thanks,
> Christian.
>
> >
> > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > not currently on the gpu, or on different engines and all that
> > shouldn't be nuked, if possible.
> >
> > Also ofc since msm uapi is that the kernel tries to recover there's
> > not much we can do there, contexts cannot be shot. But still trying to
> > replay them as much as possible feels a bit like overkill.
> > -Daniel
> >
> >> Cheers,
> >> Daniel
> >
> >
>
>
>

[-- Attachment #2: Type: text/html, Size: 11788 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
  2022-03-29 16:25                                                                     ` Marek Olšák
@ 2022-03-30  9:49                                                                       ` Daniel Vetter
  -1 siblings, 0 replies; 95+ messages in thread
From: Daniel Vetter @ 2022-03-30  9:49 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Rob Clark, amd-gfx list, Sharma, Shashank, Christian König,
	Olsak, Marek, Somalapuram, Amaranath, Abhinav Kumar, dri-devel,
	Deucher, Alexander, Shashank Sharma, Christian König

On Tue, Mar 29, 2022 at 12:25:55PM -0400, Marek Olšák wrote:
> I don't know what iris does, but I would guess that the same problems as
> with AMD GPUs apply, making GPUs resets very fragile.

iris_batch_check_for_reset -> replace_kernel_ctx -> iris_lost_context_state

is I think the main call chain of how this is handled/detected. There's
also a side-chain which handles -EIO from execbuf.

Also this is using non-recoverable contexts, i.e. any time they suffer
from a gpu reset (either because guilty themselves, or collateral damage
of a reset that shot more than just the guilty context) the context stops
entirely and refuses any further execbuf with -EIO.

Cheers, Daniel

> 
> Marek
> 
> On Tue., Mar. 29, 2022, 08:14 Christian König, <christian.koenig@amd.com>
> wrote:
> 
> > My main question is what does the iris driver better than radeonsi when
> > the client doesn't support the robustness extension?
> >
> > From Daniels description it sounds like they have at least a partial
> > recovery mechanism in place.
> >
> > Apart from that I completely agree to what you said below.
> >
> > Christian.
> >
> > Am 26.03.22 um 01:53 schrieb Olsak, Marek:
> >
> > [AMD Official Use Only]
> >
> > amdgpu has 2 resets: soft reset and hard reset.
> >
> > The soft reset is able to recover from an infinite loop and even some GPU
> > hangs due to bad shaders or bad states. The soft reset uses a signal that
> > kills all currently-running shaders of a certain process (VM context),
> > which unblocks the graphics pipeline, so draws and command buffers finish
> > but are not correctly. This can then cause a hard hang if the shader was
> > supposed to signal work completion through a shader store instruction and a
> > non-shader consumer is waiting for it (skipping the store instruction by
> > killing the shader won't signal the work, and thus the consumer will be
> > stuck, requiring a hard reset).
> >
> > The hard reset can recover from other hangs, which is great, but it may
> > use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory
> > contents, but we should assume that any process that had running jobs on
> > the GPU during a GPU reset has its memory resources in an inconsistent
> > state, and thus following command buffers can cause another GPU hang. The
> > shader store example above is enough to cause another hard hang due to
> > incorrect content in memory resources, which can contain synchronization
> > primitives that are used internally by the hardware.
> >
> > Asking the driver to replay a command buffer that caused a hang is a sure
> > way to hang it again. Unrelated processes can be affected due to lost VRAM
> > or the misfortune of using the GPU while the GPU hang occurred. The window
> > system should recreate GPU resources and redraw everything without
> > affecting applications. If apps use GL, they should do the same. Processes
> > that can't recover by redrawing content can be terminated or left alone,
> > but they shouldn't be allowed to submit work to the GPU anymore.
> >
> > dEQP only exercises the soft reset. I think WebGL is only able to trigger
> > a soft reset at this point, but Vulkan can also trigger a hard reset.
> >
> > Marek
> > ------------------------------
> > *From:* Koenig, Christian <Christian.Koenig@amd.com>
> > <Christian.Koenig@amd.com>
> > *Sent:* March 23, 2022 11:25
> > *To:* Daniel Vetter <daniel@ffwll.ch> <daniel@ffwll.ch>; Daniel Stone
> > <daniel@fooishbar.org> <daniel@fooishbar.org>; Olsak, Marek
> > <Marek.Olsak@amd.com> <Marek.Olsak@amd.com>; Grodzovsky, Andrey
> > <Andrey.Grodzovsky@amd.com> <Andrey.Grodzovsky@amd.com>
> > *Cc:* Rob Clark <robdclark@gmail.com> <robdclark@gmail.com>; Rob Clark
> > <robdclark@chromium.org> <robdclark@chromium.org>; Sharma, Shashank
> > <Shashank.Sharma@amd.com> <Shashank.Sharma@amd.com>; Christian König
> > <ckoenig.leichtzumerken@gmail.com> <ckoenig.leichtzumerken@gmail.com>;
> > Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>
> > <Amaranath.Somalapuram@amd.com>; Abhinav Kumar <quic_abhinavk@quicinc.com>
> > <quic_abhinavk@quicinc.com>; dri-devel <dri-devel@lists.freedesktop.org>
> > <dri-devel@lists.freedesktop.org>; amd-gfx list
> > <amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org>; Deucher,
> > Alexander <Alexander.Deucher@amd.com> <Alexander.Deucher@amd.com>;
> > Shashank Sharma <contactshashanksharma@gmail.com>
> > <contactshashanksharma@gmail.com>
> > *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
> >
> > [Adding Marek and Andrey as well]
> >
> > Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> > > On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org>
> > <daniel@fooishbar.org> wrote:
> > >> Hi,
> > >>
> > >> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com>
> > <robdclark@gmail.com> wrote:
> > >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> > >>> <christian.koenig@amd.com> <christian.koenig@amd.com> wrote:
> > >>>> Well you can, it just means that their contexts are lost as well.
> > >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> > >>> take down your compositor ;-)
> > >> Yeah. Or anything WebGL.
> > >>
> > >> System-wide collateral damage is definitely a non-starter. If that
> > >> means that the userspace driver has to do what iris does and ensure
> > >> everything's recreated and resubmitted, that works too, just as long
> > >> as the response to 'my adblocker didn't detect a crypto miner ad'  is
> > >> something better than 'shoot the entire user session'.
> > > Not sure where that idea came from, I thought at least I made it clear
> > > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > > which should die without recovery attempt.
> > >
> > > The entire discussion here is who should be responsible for replay and
> > > at least if you can decide the uapi, then punting that entirely to
> > > userspace is a good approach.
> >
> > Yes, completely agree. We have the approach of re-submitting things in
> > the kernel and that failed quite miserable.
> >
> > In other words currently a GPU reset has something like a 99% chance to
> > get down your whole desktop.
> >
> > Daniel can you briefly explain what exactly iris does when a lost
> > context is detected without gl robustness?
> >
> > It sounds like you guys got that working quite well.
> >
> > Thanks,
> > Christian.
> >
> > >
> > > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > > not currently on the gpu, or on different engines and all that
> > > shouldn't be nuked, if possible.
> > >
> > > Also ofc since msm uapi is that the kernel tries to recover there's
> > > not much we can do there, contexts cannot be shot. But still trying to
> > > replay them as much as possible feels a bit like overkill.
> > > -Daniel
> > >
> > >> Cheers,
> > >> Daniel
> > >
> > >
> >
> >
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
@ 2022-03-30  9:49                                                                       ` Daniel Vetter
  0 siblings, 0 replies; 95+ messages in thread
From: Daniel Vetter @ 2022-03-30  9:49 UTC (permalink / raw)
  To: Marek Olšák
  Cc: Rob Clark, Grodzovsky, Andrey, amd-gfx list, Sharma, Shashank,
	Christian König, Olsak, Marek, Somalapuram, Amaranath,
	Abhinav Kumar, Daniel Stone, Rob Clark, dri-devel, Daniel Vetter,
	Deucher, Alexander, Shashank Sharma, Christian König

On Tue, Mar 29, 2022 at 12:25:55PM -0400, Marek Olšák wrote:
> I don't know what iris does, but I would guess that the same problems as
> with AMD GPUs apply, making GPUs resets very fragile.

iris_batch_check_for_reset -> replace_kernel_ctx -> iris_lost_context_state

is I think the main call chain of how this is handled/detected. There's
also a side-chain which handles -EIO from execbuf.

Also this is using non-recoverable contexts, i.e. any time they suffer
from a gpu reset (either because guilty themselves, or collateral damage
of a reset that shot more than just the guilty context) the context stops
entirely and refuses any further execbuf with -EIO.

Cheers, Daniel

> 
> Marek
> 
> On Tue., Mar. 29, 2022, 08:14 Christian König, <christian.koenig@amd.com>
> wrote:
> 
> > My main question is what does the iris driver better than radeonsi when
> > the client doesn't support the robustness extension?
> >
> > From Daniels description it sounds like they have at least a partial
> > recovery mechanism in place.
> >
> > Apart from that I completely agree to what you said below.
> >
> > Christian.
> >
> > Am 26.03.22 um 01:53 schrieb Olsak, Marek:
> >
> > [AMD Official Use Only]
> >
> > amdgpu has 2 resets: soft reset and hard reset.
> >
> > The soft reset is able to recover from an infinite loop and even some GPU
> > hangs due to bad shaders or bad states. The soft reset uses a signal that
> > kills all currently-running shaders of a certain process (VM context),
> > which unblocks the graphics pipeline, so draws and command buffers finish
> > but are not correctly. This can then cause a hard hang if the shader was
> > supposed to signal work completion through a shader store instruction and a
> > non-shader consumer is waiting for it (skipping the store instruction by
> > killing the shader won't signal the work, and thus the consumer will be
> > stuck, requiring a hard reset).
> >
> > The hard reset can recover from other hangs, which is great, but it may
> > use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory
> > contents, but we should assume that any process that had running jobs on
> > the GPU during a GPU reset has its memory resources in an inconsistent
> > state, and thus following command buffers can cause another GPU hang. The
> > shader store example above is enough to cause another hard hang due to
> > incorrect content in memory resources, which can contain synchronization
> > primitives that are used internally by the hardware.
> >
> > Asking the driver to replay a command buffer that caused a hang is a sure
> > way to hang it again. Unrelated processes can be affected due to lost VRAM
> > or the misfortune of using the GPU while the GPU hang occurred. The window
> > system should recreate GPU resources and redraw everything without
> > affecting applications. If apps use GL, they should do the same. Processes
> > that can't recover by redrawing content can be terminated or left alone,
> > but they shouldn't be allowed to submit work to the GPU anymore.
> >
> > dEQP only exercises the soft reset. I think WebGL is only able to trigger
> > a soft reset at this point, but Vulkan can also trigger a hard reset.
> >
> > Marek
> > ------------------------------
> > *From:* Koenig, Christian <Christian.Koenig@amd.com>
> > <Christian.Koenig@amd.com>
> > *Sent:* March 23, 2022 11:25
> > *To:* Daniel Vetter <daniel@ffwll.ch> <daniel@ffwll.ch>; Daniel Stone
> > <daniel@fooishbar.org> <daniel@fooishbar.org>; Olsak, Marek
> > <Marek.Olsak@amd.com> <Marek.Olsak@amd.com>; Grodzovsky, Andrey
> > <Andrey.Grodzovsky@amd.com> <Andrey.Grodzovsky@amd.com>
> > *Cc:* Rob Clark <robdclark@gmail.com> <robdclark@gmail.com>; Rob Clark
> > <robdclark@chromium.org> <robdclark@chromium.org>; Sharma, Shashank
> > <Shashank.Sharma@amd.com> <Shashank.Sharma@amd.com>; Christian König
> > <ckoenig.leichtzumerken@gmail.com> <ckoenig.leichtzumerken@gmail.com>;
> > Somalapuram, Amaranath <Amaranath.Somalapuram@amd.com>
> > <Amaranath.Somalapuram@amd.com>; Abhinav Kumar <quic_abhinavk@quicinc.com>
> > <quic_abhinavk@quicinc.com>; dri-devel <dri-devel@lists.freedesktop.org>
> > <dri-devel@lists.freedesktop.org>; amd-gfx list
> > <amd-gfx@lists.freedesktop.org> <amd-gfx@lists.freedesktop.org>; Deucher,
> > Alexander <Alexander.Deucher@amd.com> <Alexander.Deucher@amd.com>;
> > Shashank Sharma <contactshashanksharma@gmail.com>
> > <contactshashanksharma@gmail.com>
> > *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
> >
> > [Adding Marek and Andrey as well]
> >
> > Am 23.03.22 um 16:14 schrieb Daniel Vetter:
> > > On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org>
> > <daniel@fooishbar.org> wrote:
> > >> Hi,
> > >>
> > >> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com>
> > <robdclark@gmail.com> wrote:
> > >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> > >>> <christian.koenig@amd.com> <christian.koenig@amd.com> wrote:
> > >>>> Well you can, it just means that their contexts are lost as well.
> > >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> > >>> take down your compositor ;-)
> > >> Yeah. Or anything WebGL.
> > >>
> > >> System-wide collateral damage is definitely a non-starter. If that
> > >> means that the userspace driver has to do what iris does and ensure
> > >> everything's recreated and resubmitted, that works too, just as long
> > >> as the response to 'my adblocker didn't detect a crypto miner ad'  is
> > >> something better than 'shoot the entire user session'.
> > > Not sure where that idea came from, I thought at least I made it clear
> > > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > > which should die without recovery attempt.
> > >
> > > The entire discussion here is who should be responsible for replay and
> > > at least if you can decide the uapi, then punting that entirely to
> > > userspace is a good approach.
> >
> > Yes, completely agree. We have the approach of re-submitting things in
> > the kernel and that failed quite miserable.
> >
> > In other words currently a GPU reset has something like a 99% chance to
> > get down your whole desktop.
> >
> > Daniel can you briefly explain what exactly iris does when a lost
> > context is detected without gl robustness?
> >
> > It sounds like you guys got that working quite well.
> >
> > Thanks,
> > Christian.
> >
> > >
> > > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > > not currently on the gpu, or on different engines and all that
> > > shouldn't be nuked, if possible.
> > >
> > > Also ofc since msm uapi is that the kernel tries to recover there's
> > > not much we can do there, contexts cannot be shot. But still trying to
> > > replay them as much as possible feels a bit like overkill.
> > > -Daniel
> > >
> > >> Cheers,
> > >> Daniel
> > >
> > >
> >
> >
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 95+ messages in thread

end of thread, other threads:[~2022-03-30  9:49 UTC | newest]

Thread overview: 95+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-08 18:04 [PATCH v2 1/2] drm: Add GPU reset sysfs event Shashank Sharma
2022-03-08 18:04 ` [PATCH v2 2/2] drm/amdgpu: add work function for GPU reset event Shashank Sharma
2022-03-09  7:47 ` [PATCH v2 1/2] drm: Add GPU reset sysfs event Simon Ser
2022-03-09 11:18   ` Sharma, Shashank
2022-03-09  8:09 ` Christian König
2022-03-09  9:56 ` Pierre-Eric Pelloux-Prayer
2022-03-09 10:10   ` Simon Ser
2022-03-09 10:24     ` Christian König
2022-03-09 10:28       ` Simon Ser
2022-03-09 10:28       ` Pierre-Eric Pelloux-Prayer
2022-03-09 18:12 ` Rob Clark
2022-03-10  9:55   ` Christian König
2022-03-10 15:24     ` Rob Clark
2022-03-10 16:21       ` Sharma, Shashank
2022-03-10 16:27         ` Andrey Grodzovsky
2022-03-10 17:16           ` Rob Clark
2022-03-10 17:10         ` Rob Clark
2022-03-10 17:19           ` Sharma, Shashank
2022-03-10 17:40             ` Rob Clark
2022-03-10 18:33               ` Abhinav Kumar
2022-03-10 19:14                 ` Sharma, Shashank
2022-03-10 19:35                   ` Rob Clark
2022-03-10 19:44                     ` Sharma, Shashank
2022-03-10 19:56                       ` Rob Clark
2022-03-10 20:17                         ` Sharma, Shashank
2022-03-11  8:30                         ` Pekka Paalanen
2022-03-14 14:23                           ` Alex Deucher
2022-03-14 14:23                             ` Alex Deucher
2022-03-14 15:26                             ` Pekka Paalanen
2022-03-14 15:26                               ` Pekka Paalanen
2022-03-15 14:54                               ` Alex Deucher
2022-03-15 14:54                                 ` Alex Deucher
2022-03-16  8:48                                 ` Pekka Paalanen
2022-03-16  8:48                                   ` Pekka Paalanen
2022-03-16 14:12                                   ` Alex Deucher
2022-03-16 14:12                                     ` Alex Deucher
2022-03-16 15:36                                     ` Rob Clark
2022-03-16 15:36                                       ` Rob Clark
2022-03-16 15:48                                       ` Alex Deucher
2022-03-16 15:48                                         ` Alex Deucher
2022-03-16 16:30                                         ` Rob Clark
2022-03-16 16:30                                           ` Rob Clark
2022-03-17  7:03                                       ` Christian König
2022-03-17  7:03                                         ` Christian König
2022-03-17  9:29                                         ` Daniel Vetter
2022-03-17  9:29                                           ` Daniel Vetter
2022-03-17  9:46                                           ` Christian König
2022-03-17  9:46                                             ` Christian König
2022-03-17 15:34                                           ` Rob Clark
2022-03-17 15:34                                             ` Rob Clark
2022-03-17 17:23                                             ` Daniel Vetter
2022-03-17 17:23                                               ` Daniel Vetter
2022-03-17 15:40                                           ` Rob Clark
2022-03-17 15:40                                             ` Rob Clark
2022-03-17 17:26                                             ` Daniel Vetter
2022-03-17 17:26                                               ` Daniel Vetter
2022-03-17 17:31                                               ` Rob Clark
2022-03-17 17:31                                                 ` Rob Clark
2022-03-18  7:42                                                 ` Christian König
2022-03-18  7:42                                                   ` Christian König
2022-03-18 15:12                                                   ` Rob Clark
2022-03-18 15:12                                                     ` Rob Clark
2022-03-21  9:30                                                     ` Christian König
2022-03-21  9:30                                                       ` Christian König
2022-03-21 16:03                                                       ` Rob Clark
2022-03-21 16:03                                                         ` Rob Clark
2022-03-23 14:07                                                         ` Daniel Stone
2022-03-23 15:14                                                           ` Daniel Vetter
2022-03-23 15:14                                                             ` Daniel Vetter
2022-03-23 15:25                                                             ` Christian König
2022-03-23 15:25                                                               ` Christian König
2022-03-26  0:53                                                               ` Olsak, Marek
2022-03-26  0:53                                                                 ` Olsak, Marek
2022-03-29 12:14                                                                 ` Christian König
2022-03-29 12:14                                                                   ` Christian König
2022-03-29 16:25                                                                   ` Marek Olšák
2022-03-29 16:25                                                                     ` Marek Olšák
2022-03-30  9:49                                                                     ` Daniel Vetter
2022-03-30  9:49                                                                       ` Daniel Vetter
2022-03-23 17:30                                                             ` Rob Clark
2022-03-23 17:30                                                               ` Rob Clark
2022-03-21 14:15                                                     ` Daniel Vetter
2022-03-21 14:15                                                       ` Daniel Vetter
2022-03-15  7:13                             ` Dave Airlie
2022-03-15  7:13                               ` Dave Airlie
2022-03-15  7:25                               ` Simon Ser
2022-03-15  7:25                                 ` Simon Ser
2022-03-15  7:25                               ` Christian König
2022-03-15  7:25                                 ` Christian König
2022-03-17  9:25                             ` Daniel Vetter
2022-03-16 21:50 ` Rob Clark
2022-03-17  8:42   ` Sharma, Shashank
2022-03-17  9:21     ` Christian König
2022-03-17 10:31       ` Daniel Stone
2022-03-17 10:31         ` Daniel Stone

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.