* [PATCH 0/6] drm/amdgpu: Add new reset option and rework coredump
@ 2023-07-11 21:34 ` André Almeida
  0 siblings, 0 replies; 37+ messages in thread
From: André Almeida @ 2023-07-11 21:34 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, André Almeida,
	'Marek Olšák',
	Timur Kristóf, michel.daenzer, Samuel Pitoiset, kernel-dev,
	alexander.deucher, christian.koenig

Hi,

The goal of this patchset is to improve debugging device resets on amdgpu.

The first patch creates a new module parameter to disable soft recoveries,
ensuring every recovery goes through the full device reset. This makes it
easier to generate resets from userspace tools like [0] and [1], which is
important for validating, end to end, how the stack behaves on resets.

The second patch is a small addition that marks jobs causing soft
recoveries as guilty, for API consistency.

The last patches rework devcoredump to store more information in the
coredump files, making them more useful as bug report attachments.

The new coredump content looks like this:

   **** AMDGPU Device Coredump ****
   version: 1
   kernel: 6.4.0-rc7-tony+
   module: amdgpu
   time: 702.743534320
   process_name: vulkan-triangle PID: 4561
   IBs:
   	[0] 0xffff800100545000
   	[1] 0xffff800100001000
   ring name: gfx_0.0.0

Due to nested IBs, the logged IB may not be the one that actually caused
the hang, but it gives some direction.
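Since the dump is line oriented, tooling that attaches these files to bug
reports can parse it with a few lines of code. A minimal sketch (field names
taken from the sample above; the layout is only assumed stable within a given
version):

```python
# Minimal parser sketch for the coredump format shown above.
# Field names come from the sample dump; this is illustrative, not
# part of the patchset itself.
def parse_amdgpu_coredump(text):
    dump = {"IBs": []}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("****") or line == "IBs:":
            continue
        if line.startswith("["):
            # IB entries look like "[0] 0xffff800100545000"
            _, addr = line.split()
            dump["IBs"].append(addr)
        elif ": " in line:
            key, _, value = line.partition(": ")
            dump[key] = value
    return dump

sample = """\
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.4.0-rc7-tony+
module: amdgpu
time: 702.743534320
process_name: vulkan-triangle PID: 4561
IBs:
\t[0] 0xffff800100545000
\t[1] 0xffff800100001000
ring name: gfx_0.0.0
"""

print(parse_amdgpu_coredump(sample))
```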

Thanks,
	André

[0] https://gitlab.freedesktop.org/andrealmeid/gpu-timeout
[1] https://github.com/andrealmeid/vulkan-triangle-v1

André Almeida (6):
  drm/amdgpu: Create a module param to disable soft recovery
  drm/amdgpu: Mark contexts guilty for causing soft recoveries
  drm/amdgpu: Rework coredump to use memory dynamically
  drm/amdgpu: Limit info in coredump for kernel threads
  drm/amdgpu: Log IBs and ring name at coredump
  drm/amdgpu: Create version number for coredumps

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 21 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |  6 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 99 +++++++++++++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  9 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   |  6 +-
 5 files changed, 112 insertions(+), 29 deletions(-)

-- 
2.41.0


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/6] drm/amdgpu: Create a module param to disable soft recovery
  2023-07-11 21:34 ` André Almeida
@ 2023-07-11 21:34   ` André Almeida
  -1 siblings, 0 replies; 37+ messages in thread
From: André Almeida @ 2023-07-11 21:34 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, André Almeida,
	'Marek Olšák',
	Timur Kristóf, michel.daenzer, Samuel Pitoiset, kernel-dev,
	alexander.deucher, christian.koenig

Create a module parameter to disable soft recoveries on amdgpu, making
every recovery go through the device reset path. This option makes it
easier to force device resets for testing and debugging purposes.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
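For reference, a usage sketch for the new parameter. The sysfs path follows
the usual /sys/module/<module>/parameters/<name> convention (an assumption
here, implied by module_param_named() but not spelled out in the patch):

```python
# Read the gpu_soft_recovery parameter via sysfs. Kernel bool module
# parameters read back as "Y"/"N". Path is assumed from the standard
# /sys/module/<module>/parameters/<name> layout.
from pathlib import Path

PARAM = Path("/sys/module/amdgpu/parameters/gpu_soft_recovery")

def soft_recovery_enabled():
    """Return True/False, or None if amdgpu isn't loaded."""
    if not PARAM.exists():
        return None
    return PARAM.read_text().strip() == "Y"

print(soft_recovery_enabled())

# To force every hang through a full device reset (as root), the
# 0644 permission above allows a runtime toggle:
#   echo 0 > /sys/module/amdgpu/parameters/gpu_soft_recovery
# or at module load time:
#   modprobe amdgpu gpu_soft_recovery=0
```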
 drivers/gpu/drm/amd/amdgpu/amdgpu.h      | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  | 9 +++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 6 +++++-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index a84bd4a0c421..dbe062a087c5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -189,6 +189,7 @@ extern uint amdgpu_force_long_training;
 extern int amdgpu_lbpw;
 extern int amdgpu_compute_multipipe;
 extern int amdgpu_gpu_recovery;
+extern bool amdgpu_soft_recovery;
 extern int amdgpu_emu_mode;
 extern uint amdgpu_smu_memory_pool_size;
 extern int amdgpu_smu_pptable_id;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 3b711babd4e2..7c69f3169aa6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -163,6 +163,7 @@ uint amdgpu_force_long_training;
 int amdgpu_lbpw = -1;
 int amdgpu_compute_multipipe = -1;
 int amdgpu_gpu_recovery = -1; /* auto */
+bool amdgpu_soft_recovery = true;
 int amdgpu_emu_mode;
 uint amdgpu_smu_memory_pool_size;
 int amdgpu_smu_pptable_id = -1;
@@ -540,6 +541,14 @@ module_param_named(compute_multipipe, amdgpu_compute_multipipe, int, 0444);
 MODULE_PARM_DESC(gpu_recovery, "Enable GPU recovery mechanism, (1 = enable, 0 = disable, -1 = auto)");
 module_param_named(gpu_recovery, amdgpu_gpu_recovery, int, 0444);
 
+/**
+ * DOC: gpu_soft_recovery (bool)
+ * Set true to allow the driver to try soft recoveries if a job gets stuck. Set
+ * to false to always force a GPU reset during recovery.
+ */
+MODULE_PARM_DESC(gpu_soft_recovery, "Enable GPU soft recovery mechanism (default: true)");
+module_param_named(gpu_soft_recovery, amdgpu_soft_recovery, bool, 0644);
+
 /**
  * DOC: emu_mode (int)
  * Set value 1 to enable emulation mode. This is only needed when running on an emulator. The default is 0 (disabled).
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 80d6e132e409..40678d9fb17e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -434,8 +434,12 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
 			       struct dma_fence *fence)
 {
 	unsigned long flags;
+	ktime_t deadline;
 
-	ktime_t deadline = ktime_add_us(ktime_get(), 10000);
+	if (!amdgpu_soft_recovery)
+		return false;
+
+	deadline = ktime_add_us(ktime_get(), 10000);
 
 	if (amdgpu_sriov_vf(ring->adev) || !ring->funcs->soft_recovery || !fence)
 		return false;
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 2/6] drm/amdgpu: Mark contexts guilty for causing soft recoveries
  2023-07-11 21:34 ` André Almeida
@ 2023-07-11 21:34   ` André Almeida
  -1 siblings, 0 replies; 37+ messages in thread
From: André Almeida @ 2023-07-11 21:34 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, André Almeida,
	'Marek Olšák',
	Timur Kristóf, michel.daenzer, Samuel Pitoiset, kernel-dev,
	alexander.deucher, christian.koenig

If a DRM fence error is set to -ENODATA, it means that this context
caused a soft reset but was never marked as guilty. Flag it as guilty and
log to the user that this context won't accept new submissions.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
index 0dc9c655c4fb..fe8e47d063da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
@@ -459,6 +459,12 @@ int amdgpu_ctx_get_entity(struct amdgpu_ctx *ctx, u32 hw_ip, u32 instance,
 	ctx_entity = &ctx->entities[hw_ip][ring]->entity;
 	r = drm_sched_entity_error(ctx_entity);
 	if (r) {
+		if (r == -ENODATA) {
+		DRM_ERROR("%s (%d) context caused a reset, "
+				  "marking it guilty and refusing new submissions.\n",
+				  current->comm, current->pid);
+			atomic_set(&ctx->guilty, 1);
+		}
 		DRM_DEBUG("error entity %p\n", ctx_entity);
 		return r;
 	}
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 3/6] drm/amdgpu: Rework coredump to use memory dynamically
  2023-07-11 21:34 ` André Almeida
@ 2023-07-11 21:34   ` André Almeida
  -1 siblings, 0 replies; 37+ messages in thread
From: André Almeida @ 2023-07-11 21:34 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, André Almeida,
	'Marek Olšák',
	Timur Kristóf, michel.daenzer, Samuel Pitoiset, kernel-dev,
	alexander.deucher, christian.koenig

Instead of storing coredump information inside the amdgpu_device struct,
move it to a separate struct and allocate it dynamically. This will make
it easier to further expand the logged information.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 14 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 65 ++++++++++++++--------
 2 files changed, 51 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index dbe062a087c5..e1cc83a89d46 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1068,11 +1068,6 @@ struct amdgpu_device {
 	uint32_t                        *reset_dump_reg_list;
 	uint32_t			*reset_dump_reg_value;
 	int                             num_regs;
-#ifdef CONFIG_DEV_COREDUMP
-	struct amdgpu_task_info         reset_task_info;
-	bool                            reset_vram_lost;
-	struct timespec64               reset_time;
-#endif
 
 	bool                            scpm_enabled;
 	uint32_t                        scpm_status;
@@ -1085,6 +1080,15 @@ struct amdgpu_device {
 	uint32_t			aid_mask;
 };
 
+#ifdef CONFIG_DEV_COREDUMP
+struct amdgpu_coredump_info {
+	struct amdgpu_device		*adev;
+	struct amdgpu_task_info         reset_task_info;
+	struct timespec64               reset_time;
+	bool                            reset_vram_lost;
+};
+#endif
+
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
 {
 	return container_of(ddev, struct amdgpu_device, ddev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e25f085ee886..23b9784e9787 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4963,12 +4963,17 @@ static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
 	return 0;
 }
 
-#ifdef CONFIG_DEV_COREDUMP
+#ifndef CONFIG_DEV_COREDUMP
+static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
+			    struct amdgpu_reset_context *reset_context)
+{
+}
+#else
 static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
 		size_t count, void *data, size_t datalen)
 {
 	struct drm_printer p;
-	struct amdgpu_device *adev = data;
+	struct amdgpu_coredump_info *coredump = data;
 	struct drm_print_iterator iter;
 	int i;
 
@@ -4982,21 +4987,21 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
 	drm_printf(&p, "**** AMDGPU Device Coredump ****\n");
 	drm_printf(&p, "kernel: " UTS_RELEASE "\n");
 	drm_printf(&p, "module: " KBUILD_MODNAME "\n");
-	drm_printf(&p, "time: %lld.%09ld\n", adev->reset_time.tv_sec, adev->reset_time.tv_nsec);
-	if (adev->reset_task_info.pid)
+	drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredump->reset_time.tv_nsec);
+	if (coredump->reset_task_info.pid)
 		drm_printf(&p, "process_name: %s PID: %d\n",
-			   adev->reset_task_info.process_name,
-			   adev->reset_task_info.pid);
+			   coredump->reset_task_info.process_name,
+			   coredump->reset_task_info.pid);
 
-	if (adev->reset_vram_lost)
+	if (coredump->reset_vram_lost)
 		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
-	if (adev->num_regs) {
+	if (coredump->adev->num_regs) {
 		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
 
-		for (i = 0; i < adev->num_regs; i++)
+		for (i = 0; i < coredump->adev->num_regs; i++)
 			drm_printf(&p, "0x%08x: 0x%08x\n",
-				   adev->reset_dump_reg_list[i],
-				   adev->reset_dump_reg_value[i]);
+				   coredump->adev->reset_dump_reg_list[i],
+				   coredump->adev->reset_dump_reg_value[i]);
 	}
 
 	return count - iter.remain;
@@ -5004,14 +5009,34 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
 
 static void amdgpu_devcoredump_free(void *data)
 {
+	kfree(data);
 }
 
-static void amdgpu_reset_capture_coredumpm(struct amdgpu_device *adev)
+static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
+			    struct amdgpu_reset_context *reset_context)
 {
+	struct amdgpu_coredump_info *coredump;
 	struct drm_device *dev = adev_to_drm(adev);
 
-	ktime_get_ts64(&adev->reset_time);
-	dev_coredumpm(dev->dev, THIS_MODULE, adev, 0, GFP_KERNEL,
+	coredump = kmalloc(sizeof(*coredump), GFP_KERNEL);
+
+	if (!coredump) {
+		DRM_ERROR("%s: failed to allocate memory for coredump\n", __func__);
+		return;
+	}
+
+	memset(coredump, 0, sizeof(*coredump));
+
+	coredump->reset_vram_lost = vram_lost;
+
+	if (reset_context->job && reset_context->job->vm)
+		coredump->reset_task_info = reset_context->job->vm->task_info;
+
+	coredump->adev = adev;
+
+	ktime_get_ts64(&coredump->reset_time);
+
+	dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
 		      amdgpu_devcoredump_read, amdgpu_devcoredump_free);
 }
 #endif
@@ -5119,15 +5144,9 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
 					goto out;
 
 				vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
-#ifdef CONFIG_DEV_COREDUMP
-				tmp_adev->reset_vram_lost = vram_lost;
-				memset(&tmp_adev->reset_task_info, 0,
-						sizeof(tmp_adev->reset_task_info));
-				if (reset_context->job && reset_context->job->vm)
-					tmp_adev->reset_task_info =
-						reset_context->job->vm->task_info;
-				amdgpu_reset_capture_coredumpm(tmp_adev);
-#endif
+
+				amdgpu_coredump(tmp_adev, vram_lost, reset_context);
+
 				if (vram_lost) {
 					DRM_INFO("VRAM is lost due to GPU reset!\n");
 					amdgpu_inc_vram_lost(tmp_adev);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

-- 
2.41.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 4/6] drm/amdgpu: Limit info in coredump for kernel threads
  2023-07-11 21:34 ` André Almeida
  (?)
@ 2023-07-11 21:34   ` André Almeida
  -1 siblings, 0 replies; 37+ messages in thread
From: André Almeida @ 2023-07-11 21:34 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, André Almeida,
	'Marek Olšák',
	Timur Kristóf, michel.daenzer, Samuel Pitoiset, kernel-dev,
	alexander.deucher, christian.koenig

If a kernel thread caused the reset, the information available to be
logged will be limited, so return early in the dump function.
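
As an illustration of the two output shapes this change produces, here is a small userspace sketch (not part of the patch; the helper name `coredump_pid` is invented) that reads a dump's text and returns the offending PID, or 0 for the kernel-thread case:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical consumer-side helper: returns the PID reported in an
 * amdgpu devcoredump, or 0 when the dump was triggered by a kernel
 * thread (the truncated form this patch emits). The line formats
 * mirror the drm_printf() calls in amdgpu_devcoredump_read(). */
static int coredump_pid(const char *dump)
{
	const char *line;
	int pid = 0;

	if (strstr(dump, "GPU reset caused by a kernel thread"))
		return 0;
	line = strstr(dump, "process_name: ");
	if (line)
		sscanf(line, "process_name: %*s PID: %d", &pid);
	return pid;
}
```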

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 23b9784e9787..7449aead1e13 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4988,10 +4988,14 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
 	drm_printf(&p, "kernel: " UTS_RELEASE "\n");
 	drm_printf(&p, "module: " KBUILD_MODNAME "\n");
 	drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredump->reset_time.tv_nsec);
-	if (coredump->reset_task_info.pid)
+	if (coredump->reset_task_info.pid) {
 		drm_printf(&p, "process_name: %s PID: %d\n",
 			   coredump->reset_task_info.process_name,
 			   coredump->reset_task_info.pid);
+	} else {
+		drm_printf(&p, "GPU reset caused by a kernel thread\n");
+		return count - iter.remain;
+	}
 
 	if (coredump->reset_vram_lost)
 		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 5/6] drm/amdgpu: Log IBs and ring name at coredump
  2023-07-11 21:34 ` André Almeida
  (?)
@ 2023-07-11 21:35   ` André Almeida
  -1 siblings, 0 replies; 37+ messages in thread
From: André Almeida @ 2023-07-11 21:35 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, André Almeida,
	'Marek Olšák',
	Timur Kristóf, michel.daenzer, Samuel Pitoiset, kernel-dev,
	alexander.deucher, christian.koenig

Log the IB addresses used by the hung job along with the stuck ring
name. Note that due to nested IBs, the one that actually caused the
reset may not be among the listed addresses.
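
For reference, the new code prints each IB as `\t[i] 0xADDR` under an `IBs:` header; a dump consumer could recover the list with a sketch like this (the function name `parse_ibs` is invented, not part of the patch):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parser for the "IBs:" section this patch emits.
 * Stores up to max IB addresses in addrs[] and returns how many
 * were found. */
static int parse_ibs(const char *dump, unsigned long long *addrs, int max)
{
	const char *p = strstr(dump, "IBs:\n");
	int n = 0;

	if (!p)
		return 0;
	p += 5; /* skip "IBs:\n" */
	while (n < max && sscanf(p, "\t[%*d] 0x%llx", &addrs[n]) == 1) {
		n++;
		p = strchr(p, '\n');
		if (!p)
			break;
		p++; /* advance to the next line */
	}
	return n;
}
```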

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 31 +++++++++++++++++++++-
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index e1cc83a89d46..cfeaf93934fd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1086,6 +1086,9 @@ struct amdgpu_coredump_info {
 	struct amdgpu_task_info         reset_task_info;
 	struct timespec64               reset_time;
 	bool                            reset_vram_lost;
+	u64				*ibs;
+	u32				num_ibs;
+	char				ring_name[16];
 };
 #endif
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7449aead1e13..38d03ca7a9fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5008,12 +5008,24 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
 				   coredump->adev->reset_dump_reg_value[i]);
 	}
 
+	if (coredump->num_ibs) {
+		drm_printf(&p, "IBs:\n");
+		for (i = 0; i < coredump->num_ibs; i++)
+			drm_printf(&p, "\t[%d] 0x%llx\n", i, coredump->ibs[i]);
+	}
+
+	if (coredump->ring_name[0] != '\0')
+		drm_printf(&p, "ring name: %s\n", coredump->ring_name);
+
 	return count - iter.remain;
 }
 
 static void amdgpu_devcoredump_free(void *data)
 {
-	kfree(data);
+	struct amdgpu_coredump_info *coredump = data;
+
+	kfree(coredump->ibs);
+	kfree(coredump);
 }
 
 static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
@@ -5021,6 +5033,8 @@ static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
 {
 	struct amdgpu_coredump_info *coredump;
 	struct drm_device *dev = adev_to_drm(adev);
+	struct amdgpu_job *job = reset_context->job;
+	int i;
 
 	coredump = kmalloc(sizeof(*coredump), GFP_KERNEL);
 
@@ -5038,6 +5052,21 @@ static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
 
 	coredump->adev = adev;
 
+	if (job && job->num_ibs) {
+		struct amdgpu_ring *ring = to_amdgpu_ring(job->base.sched);
+		u32 num_ibs = job->num_ibs;
+
+		coredump->ibs = kmalloc_array(num_ibs, sizeof(coredump->ibs), GFP_KERNEL);
+		if (coredump->ibs)
+			coredump->num_ibs = num_ibs;
+
+		for (i = 0; i < coredump->num_ibs; i++)
+			coredump->ibs[i] = job->ibs[i].gpu_addr;
+
+		if (ring)
+			strncpy(coredump->ring_name, ring->name, 16);
+	}
+
 	ktime_get_ts64(&coredump->reset_time);
 
 	dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 6/6] drm/amdgpu: Create version number for coredumps
  2023-07-11 21:34 ` André Almeida
  (?)
@ 2023-07-11 21:35   ` André Almeida
  -1 siblings, 0 replies; 37+ messages in thread
From: André Almeida @ 2023-07-11 21:35 UTC (permalink / raw)
  To: dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, André Almeida,
	'Marek Olšák',
	Timur Kristóf, michel.daenzer, Samuel Pitoiset, kernel-dev,
	alexander.deucher, christian.koenig

Even though nothing currently parses amdgpu's coredump files, any future
tool will be glad to find a version field when reading them.

Create a version number to be displayed at the top of the coredump file,
to be incremented whenever the file format or content changes.
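
A future parser would then start by checking this field before trusting the rest of the layout; a minimal sketch (the helper name `coredump_version` is invented for illustration):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical consumer-side check: returns the version number from a
 * coredump's "version:" line, or -1 if the line is absent (i.e. a dump
 * produced before this patch). */
static int coredump_version(const char *dump)
{
	const char *line = strstr(dump, "version: ");
	int v;

	if (!line || sscanf(line, "version: %d", &v) != 1)
		return -1;
	return v;
}
```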

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index cfeaf93934fd..905574acf3a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1081,6 +1081,9 @@ struct amdgpu_device {
 };
 
 #ifdef CONFIG_DEV_COREDUMP
+
+#define AMDGPU_COREDUMP_VERSION "1"
+
 struct amdgpu_coredump_info {
 	struct amdgpu_device		*adev;
 	struct amdgpu_task_info         reset_task_info;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 38d03ca7a9fc..7b448e189717 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4985,6 +4985,7 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
 	p = drm_coredump_printer(&iter);
 
 	drm_printf(&p, "**** AMDGPU Device Coredump ****\n");
+	drm_printf(&p, "version: " AMDGPU_COREDUMP_VERSION "\n");
 	drm_printf(&p, "kernel: " UTS_RELEASE "\n");
 	drm_printf(&p, "module: " KBUILD_MODNAME "\n");
 	drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredump->reset_time.tv_nsec);
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/6] drm/amdgpu: Rework coredump to use memory dynamically
  2023-07-11 21:34   ` André Almeida
  (?)
@ 2023-07-12  8:37     ` Christian König
  -1 siblings, 0 replies; 37+ messages in thread
From: Christian König @ 2023-07-12  8:37 UTC (permalink / raw)
  To: André Almeida, dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, 'Marek Olšák',
	Timur Kristóf, michel.daenzer, Samuel Pitoiset, kernel-dev,
	alexander.deucher

On 11.07.23 23:34, André Almeida wrote:
> Instead of storing coredump information inside amdgpu_device struct,
> move it to a separate struct and allocate it dynamically. This
> will make it easier to further expand the logged information.

Very big NAK to this. The problem is that memory allocation isn't 
allowed during a GPU reset.

What you could do is to allocate the memory with GFP_ATOMIC or similar, 
but for a large structure that might not be possible.
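
The constraint and the suggested fallback can be modeled in userspace terms like this (a sketch only; the mock allocator and all names are invented, with GFP_ATOMIC exhaustion reduced to a boolean):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative model of the constraint above: in the GPU reset path only
 * non-sleeping (GFP_ATOMIC-style) allocations are permitted, and those
 * can fail, so coredump capture must be best-effort. */

struct coredump_info { int vram_lost; };

/* Stand-in for kzalloc(..., GFP_ATOMIC): may return NULL when the
 * atomic reserves are exhausted. */
static struct coredump_info *alloc_atomic(int reserves_exhausted)
{
	return reserves_exhausted ? NULL :
		calloc(1, sizeof(struct coredump_info));
}

/* Best-effort capture: on allocation failure, skip the dump rather than
 * sleep or retry inside the reset path. Returns 0 if a dump was made. */
static int capture_coredump(int reserves_exhausted, int vram_lost)
{
	struct coredump_info *c = alloc_atomic(reserves_exhausted);

	if (!c)
		return -1; /* reset continues without a dump */
	c->vram_lost = vram_lost;
	free(c); /* real code would hand this to dev_coredumpm() */
	return 0;
}
```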

Regards,
Christian.

>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 14 +++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 65 ++++++++++++++--------
>   2 files changed, 51 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index dbe062a087c5..e1cc83a89d46 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1068,11 +1068,6 @@ struct amdgpu_device {
>   	uint32_t                        *reset_dump_reg_list;
>   	uint32_t			*reset_dump_reg_value;
>   	int                             num_regs;
> -#ifdef CONFIG_DEV_COREDUMP
> -	struct amdgpu_task_info         reset_task_info;
> -	bool                            reset_vram_lost;
> -	struct timespec64               reset_time;
> -#endif
>   
>   	bool                            scpm_enabled;
>   	uint32_t                        scpm_status;
> @@ -1085,6 +1080,15 @@ struct amdgpu_device {
>   	uint32_t			aid_mask;
>   };
>   
> +#ifdef CONFIG_DEV_COREDUMP
> +struct amdgpu_coredump_info {
> +	struct amdgpu_device		*adev;
> +	struct amdgpu_task_info         reset_task_info;
> +	struct timespec64               reset_time;
> +	bool                            reset_vram_lost;
> +};
> +#endif
> +
>   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
>   {
>   	return container_of(ddev, struct amdgpu_device, ddev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e25f085ee886..23b9784e9787 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4963,12 +4963,17 @@ static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>   	return 0;
>   }
>   
> -#ifdef CONFIG_DEV_COREDUMP
> +#ifndef CONFIG_DEV_COREDUMP
> +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> +			    struct amdgpu_reset_context *reset_context)
> +{
> +}
> +#else
>   static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
>   		size_t count, void *data, size_t datalen)
>   {
>   	struct drm_printer p;
> -	struct amdgpu_device *adev = data;
> +	struct amdgpu_coredump_info *coredump = data;
>   	struct drm_print_iterator iter;
>   	int i;
>   
> @@ -4982,21 +4987,21 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
>   	drm_printf(&p, "**** AMDGPU Device Coredump ****\n");
>   	drm_printf(&p, "kernel: " UTS_RELEASE "\n");
>   	drm_printf(&p, "module: " KBUILD_MODNAME "\n");
> -	drm_printf(&p, "time: %lld.%09ld\n", adev->reset_time.tv_sec, adev->reset_time.tv_nsec);
> -	if (adev->reset_task_info.pid)
> +	drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredump->reset_time.tv_nsec);
> +	if (coredump->reset_task_info.pid)
>   		drm_printf(&p, "process_name: %s PID: %d\n",
> -			   adev->reset_task_info.process_name,
> -			   adev->reset_task_info.pid);
> +			   coredump->reset_task_info.process_name,
> +			   coredump->reset_task_info.pid);
>   
> -	if (adev->reset_vram_lost)
> +	if (coredump->reset_vram_lost)
>   		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
> -	if (adev->num_regs) {
> +	if (coredump->adev->num_regs) {
>   		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
>   
> -		for (i = 0; i < adev->num_regs; i++)
> +		for (i = 0; i < coredump->adev->num_regs; i++)
>   			drm_printf(&p, "0x%08x: 0x%08x\n",
> -				   adev->reset_dump_reg_list[i],
> -				   adev->reset_dump_reg_value[i]);
> +				   coredump->adev->reset_dump_reg_list[i],
> +				   coredump->adev->reset_dump_reg_value[i]);
>   	}
>   
>   	return count - iter.remain;
> @@ -5004,14 +5009,34 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
>   
>   static void amdgpu_devcoredump_free(void *data)
>   {
> +	kfree(data);
>   }
>   
> -static void amdgpu_reset_capture_coredumpm(struct amdgpu_device *adev)
> +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> +			    struct amdgpu_reset_context *reset_context)
>   {
> +	struct amdgpu_coredump_info *coredump;
>   	struct drm_device *dev = adev_to_drm(adev);
>   
> -	ktime_get_ts64(&adev->reset_time);
> -	dev_coredumpm(dev->dev, THIS_MODULE, adev, 0, GFP_KERNEL,
> +	coredump = kmalloc(sizeof(*coredump), GFP_KERNEL);
> +
> +	if (!coredump) {
> +		DRM_ERROR("%s: failed to allocate memory for coredump\n", __func__);
> +		return;
> +	}
> +
> +	memset(coredump, 0, sizeof(*coredump));
> +
> +	coredump->reset_vram_lost = vram_lost;
> +
> +	if (reset_context->job && reset_context->job->vm)
> +		coredump->reset_task_info = reset_context->job->vm->task_info;
> +
> +	coredump->adev = adev;
> +
> +	ktime_get_ts64(&coredump->reset_time);
> +
> +	dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
>   		      amdgpu_devcoredump_read, amdgpu_devcoredump_free);
>   }
>   #endif
> @@ -5119,15 +5144,9 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>   					goto out;
>   
>   				vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
> -#ifdef CONFIG_DEV_COREDUMP
> -				tmp_adev->reset_vram_lost = vram_lost;
> -				memset(&tmp_adev->reset_task_info, 0,
> -						sizeof(tmp_adev->reset_task_info));
> -				if (reset_context->job && reset_context->job->vm)
> -					tmp_adev->reset_task_info =
> -						reset_context->job->vm->task_info;
> -				amdgpu_reset_capture_coredumpm(tmp_adev);
> -#endif
> +
> +				amdgpu_coredump(tmp_adev, vram_lost, reset_context);
> +
>   				if (vram_lost) {
>   					DRM_INFO("VRAM is lost due to GPU reset!\n");
>   					amdgpu_inc_vram_lost(tmp_adev);


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/6] drm/amdgpu: Mark contexts guilty for causing soft recoveries
  2023-07-11 21:34   ` André Almeida
  (?)
@ 2023-07-12  8:43     ` Christian König
  -1 siblings, 0 replies; 37+ messages in thread
From: Christian König @ 2023-07-12  8:43 UTC (permalink / raw)
  To: André Almeida, dri-devel, amd-gfx, linux-kernel
  Cc: kernel-dev, alexander.deucher, pierre-eric.pelloux-prayer,
	'Marek Olšák',
	Samuel Pitoiset, Bas Nieuwenhuizen, Timur Kristóf,
	michel.daenzer, Michel Dänzer



On 11.07.23 at 23:34, André Almeida wrote:
> If a DRM fence is set to -ENODATA, it means this context caused a
> soft reset but was never marked as guilty. Flag it as guilty and log
> to the user that this context won't accept more submissions.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> index 0dc9c655c4fb..fe8e47d063da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c
> @@ -459,6 +459,12 @@ int amdgpu_ctx_get_entity(struct amdgpu_ctx *ctx, u32 hw_ip, u32 instance,
>   	ctx_entity = &ctx->entities[hw_ip][ring]->entity;
>   	r = drm_sched_entity_error(ctx_entity);
>   	if (r) {
> +		if (r == -ENODATA) {
> +			DRM_ERROR("%s (%d) context caused a reset, "
> +				  "marking it guilty and refusing new submissions.\n",
> +				  current->comm, current->pid);
> +			atomic_set(&ctx->guilty, 1);
> +		}

I'm going back and forth with that as well.

Michel has a very good point that it often is sufficient to cancel just 
one rogue shader to keep going.

But Marek has a very good point as well that when that happens multiple 
times we probably want to block the application from making further 
submissions.

Christian.

>   		DRM_DEBUG("error entity %p\n", ctx_entity);
>   		return r;
>   	}


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/6] drm/amdgpu: Rework coredump to use memory dynamically
  2023-07-12  8:37     ` Christian König
@ 2023-07-12  8:59       ` Lucas Stach
  -1 siblings, 0 replies; 37+ messages in thread
From: Lucas Stach @ 2023-07-12  8:59 UTC (permalink / raw)
  To: Christian König, André Almeida, dri-devel, amd-gfx,
	linux-kernel
  Cc: pierre-eric.pelloux-prayer, 'Marek Olšák',
	michel.daenzer, Timur Kristóf, Samuel Pitoiset, kernel-dev,
	alexander.deucher

On Wednesday, 12.07.2023 at 10:37 +0200, Christian König wrote:
> On 11.07.23 at 23:34, André Almeida wrote:
> > Instead of storing coredump information inside the amdgpu_device struct,
> > move it to a separate struct and allocate it dynamically. This
> > will make it easier to further expand the logged information.
> 
> Very big NAK to this. The problem is that memory allocation isn't
> allowed during a GPU reset.
> 
> What you could do is allocate the memory with GFP_ATOMIC or similar,
> but for a large structure that might not be possible.
> 
I'm still not fully clear on what the rules are here. In etnaviv we do
devcoredump allocation in the GPU reset path with __GFP_NOWARN |
__GFP_NORETRY, which means the allocation will kick memory reclaim if
necessary, but will just give up if no memory could be made available
easily. This satisfies the forward progress guarantee in the absence of
successful memory allocation, which is the most important property in
this path, I think.

However, I'm not sure if the reclaim could lead to locking issues or
something like that with the more complex use-cases with MMU notifiers
and stuff like that. Christian, do you have any experience or
information that would be good to share in this regard?

Regards,
Lucas

> Regards,
> Christian.
> 
> > 
> > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 14 +++--
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 65 ++++++++++++++--------
> >   2 files changed, 51 insertions(+), 28 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > index dbe062a087c5..e1cc83a89d46 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > @@ -1068,11 +1068,6 @@ struct amdgpu_device {
> >   	uint32_t                        *reset_dump_reg_list;
> >   	uint32_t			*reset_dump_reg_value;
> >   	int                             num_regs;
> > -#ifdef CONFIG_DEV_COREDUMP
> > -	struct amdgpu_task_info         reset_task_info;
> > -	bool                            reset_vram_lost;
> > -	struct timespec64               reset_time;
> > -#endif
> >   
> >   	bool                            scpm_enabled;
> >   	uint32_t                        scpm_status;
> > @@ -1085,6 +1080,15 @@ struct amdgpu_device {
> >   	uint32_t			aid_mask;
> >   };
> >   
> > +#ifdef CONFIG_DEV_COREDUMP
> > +struct amdgpu_coredump_info {
> > +	struct amdgpu_device		*adev;
> > +	struct amdgpu_task_info         reset_task_info;
> > +	struct timespec64               reset_time;
> > +	bool                            reset_vram_lost;
> > +};
> > +#endif
> > +
> >   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> >   {
> >   	return container_of(ddev, struct amdgpu_device, ddev);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index e25f085ee886..23b9784e9787 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4963,12 +4963,17 @@ static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
> >   	return 0;
> >   }
> >   
> > -#ifdef CONFIG_DEV_COREDUMP
> > +#ifndef CONFIG_DEV_COREDUMP
> > +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> > +			    struct amdgpu_reset_context *reset_context)
> > +{
> > +}
> > +#else
> >   static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
> >   		size_t count, void *data, size_t datalen)
> >   {
> >   	struct drm_printer p;
> > -	struct amdgpu_device *adev = data;
> > +	struct amdgpu_coredump_info *coredump = data;
> >   	struct drm_print_iterator iter;
> >   	int i;
> >   
> > @@ -4982,21 +4987,21 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
> >   	drm_printf(&p, "**** AMDGPU Device Coredump ****\n");
> >   	drm_printf(&p, "kernel: " UTS_RELEASE "\n");
> >   	drm_printf(&p, "module: " KBUILD_MODNAME "\n");
> > -	drm_printf(&p, "time: %lld.%09ld\n", adev->reset_time.tv_sec, adev->reset_time.tv_nsec);
> > -	if (adev->reset_task_info.pid)
> > +	drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredump->reset_time.tv_nsec);
> > +	if (coredump->reset_task_info.pid)
> >   		drm_printf(&p, "process_name: %s PID: %d\n",
> > -			   adev->reset_task_info.process_name,
> > -			   adev->reset_task_info.pid);
> > +			   coredump->reset_task_info.process_name,
> > +			   coredump->reset_task_info.pid);
> >   
> > -	if (adev->reset_vram_lost)
> > +	if (coredump->reset_vram_lost)
> >   		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
> > -	if (adev->num_regs) {
> > +	if (coredump->adev->num_regs) {
> >   		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
> >   
> > -		for (i = 0; i < adev->num_regs; i++)
> > +		for (i = 0; i < coredump->adev->num_regs; i++)
> >   			drm_printf(&p, "0x%08x: 0x%08x\n",
> > -				   adev->reset_dump_reg_list[i],
> > -				   adev->reset_dump_reg_value[i]);
> > +				   coredump->adev->reset_dump_reg_list[i],
> > +				   coredump->adev->reset_dump_reg_value[i]);
> >   	}
> >   
> >   	return count - iter.remain;
> > @@ -5004,14 +5009,34 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
> >   
> >   static void amdgpu_devcoredump_free(void *data)
> >   {
> > +	kfree(data);
> >   }
> >   
> > -static void amdgpu_reset_capture_coredumpm(struct amdgpu_device *adev)
> > +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> > +			    struct amdgpu_reset_context *reset_context)
> >   {
> > +	struct amdgpu_coredump_info *coredump;
> >   	struct drm_device *dev = adev_to_drm(adev);
> >   
> > -	ktime_get_ts64(&adev->reset_time);
> > -	dev_coredumpm(dev->dev, THIS_MODULE, adev, 0, GFP_KERNEL,
> > +	coredump = kmalloc(sizeof(*coredump), GFP_KERNEL);
> > +
> > +	if (!coredump) {
> > +		DRM_ERROR("%s: failed to allocate memory for coredump\n", __func__);
> > +		return;
> > +	}
> > +
> > +	memset(coredump, 0, sizeof(*coredump));
> > +
> > +	coredump->reset_vram_lost = vram_lost;
> > +
> > +	if (reset_context->job && reset_context->job->vm)
> > +		coredump->reset_task_info = reset_context->job->vm->task_info;
> > +
> > +	coredump->adev = adev;
> > +
> > +	ktime_get_ts64(&coredump->reset_time);
> > +
> > +	dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
> >   		      amdgpu_devcoredump_read, amdgpu_devcoredump_free);
> >   }
> >   #endif
> > @@ -5119,15 +5144,9 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
> >   					goto out;
> >   
> >   				vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
> > -#ifdef CONFIG_DEV_COREDUMP
> > -				tmp_adev->reset_vram_lost = vram_lost;
> > -				memset(&tmp_adev->reset_task_info, 0,
> > -						sizeof(tmp_adev->reset_task_info));
> > -				if (reset_context->job && reset_context->job->vm)
> > -					tmp_adev->reset_task_info =
> > -						reset_context->job->vm->task_info;
> > -				amdgpu_reset_capture_coredumpm(tmp_adev);
> > -#endif
> > +
> > +				amdgpu_coredump(tmp_adev, vram_lost, reset_context);
> > +
> >   				if (vram_lost) {
> >   					DRM_INFO("VRAM is lost due to GPU reset!\n");
> >   					amdgpu_inc_vram_lost(tmp_adev);
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread


* Re: [PATCH 3/6] drm/amdgpu: Rework coredump to use memory dynamically
  2023-07-12  8:59       ` Lucas Stach
@ 2023-07-12 10:39         ` Christian König
  -1 siblings, 0 replies; 37+ messages in thread
From: Christian König @ 2023-07-12 10:39 UTC (permalink / raw)
  To: Lucas Stach, André Almeida, dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, 'Marek Olšák',
	michel.daenzer, Timur Kristóf, Samuel Pitoiset, kernel-dev,
	alexander.deucher

On 12.07.23 at 10:59, Lucas Stach wrote:
> On Wednesday, 12.07.2023 at 10:37 +0200, Christian König wrote:
>> On 11.07.23 at 23:34, André Almeida wrote:
>>> Instead of storing coredump information inside amdgpu_device struct,
>>> move it to a proper, separate struct and allocate it dynamically. This
>>> will make it easier to further expand the logged information.
>> Very big NAK to this. The problem is that memory allocation isn't
>> allowed during a GPU reset.
>>
>> What you could do is to allocate the memory with GFP_ATOMIC or similar,
>> but for a large structure that might not be possible.
>>
> I'm still not fully clear on what the rules are here. In etnaviv we do
> devcoredump allocation in the GPU reset path with __GFP_NOWARN |
> __GFP_NORETRY, which means the allocation will kick memory reclaim if
> necessary, but will just give up if no memory could be made available
> easily. This satisfies the forward progress guarantee in the absence of
> successful memory allocation, which is the most important property in
> this path, I think.
>
> However, I'm not sure if the reclaim could lead to locking issues or
> something like that with the more complex use-cases with MMU notifiers
> and stuff like that. Christian, do you have any experience or
> information that would be good to share in this regard?

Yeah, very good question.

__GFP_NORETRY isn't sufficient as far as I know. Reclaim must be 
completely suppressed to be able to allocate in a GPU reset handler.

Daniel added lockdep annotation to some of the dma-fence signaling paths 
and this yielded quite a bunch of potential deadlocks.

It's not even that reclaim itself waits for dma_fences (that can happen, 
but is quite unlikely), but rather that reclaim needs locks and under 
those locks we then wait for dma_fences.

We should probably add a define somewhere which documents that 
(GFP_ATOMIC | __GFP_NOWARN) should be used in the GPU reset handlers when 
allocating memory for coredumps.

Regards,
Christian.

>
> Regards,
> Lucas
>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 14 +++--
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 65 ++++++++++++++--------
>>>    2 files changed, 51 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> index dbe062a087c5..e1cc83a89d46 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>> @@ -1068,11 +1068,6 @@ struct amdgpu_device {
>>>    	uint32_t                        *reset_dump_reg_list;
>>>    	uint32_t			*reset_dump_reg_value;
>>>    	int                             num_regs;
>>> -#ifdef CONFIG_DEV_COREDUMP
>>> -	struct amdgpu_task_info         reset_task_info;
>>> -	bool                            reset_vram_lost;
>>> -	struct timespec64               reset_time;
>>> -#endif
>>>    
>>>    	bool                            scpm_enabled;
>>>    	uint32_t                        scpm_status;
>>> @@ -1085,6 +1080,15 @@ struct amdgpu_device {
>>>    	uint32_t			aid_mask;
>>>    };
>>>    
>>> +#ifdef CONFIG_DEV_COREDUMP
>>> +struct amdgpu_coredump_info {
>>> +	struct amdgpu_device		*adev;
>>> +	struct amdgpu_task_info         reset_task_info;
>>> +	struct timespec64               reset_time;
>>> +	bool                            reset_vram_lost;
>>> +};
>>> +#endif
>>> +
>>>    static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
>>>    {
>>>    	return container_of(ddev, struct amdgpu_device, ddev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index e25f085ee886..23b9784e9787 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4963,12 +4963,17 @@ static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>    	return 0;
>>>    }
>>>    
>>> -#ifdef CONFIG_DEV_COREDUMP
>>> +#ifndef CONFIG_DEV_COREDUMP
>>> +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
>>> +			    struct amdgpu_reset_context *reset_context)
>>> +{
>>> +}
>>> +#else
>>>    static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
>>>    		size_t count, void *data, size_t datalen)
>>>    {
>>>    	struct drm_printer p;
>>> -	struct amdgpu_device *adev = data;
>>> +	struct amdgpu_coredump_info *coredump = data;
>>>    	struct drm_print_iterator iter;
>>>    	int i;
>>>    
>>> @@ -4982,21 +4987,21 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
>>>    	drm_printf(&p, "**** AMDGPU Device Coredump ****\n");
>>>    	drm_printf(&p, "kernel: " UTS_RELEASE "\n");
>>>    	drm_printf(&p, "module: " KBUILD_MODNAME "\n");
>>> -	drm_printf(&p, "time: %lld.%09ld\n", adev->reset_time.tv_sec, adev->reset_time.tv_nsec);
>>> -	if (adev->reset_task_info.pid)
>>> +	drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredump->reset_time.tv_nsec);
>>> +	if (coredump->reset_task_info.pid)
>>>    		drm_printf(&p, "process_name: %s PID: %d\n",
>>> -			   adev->reset_task_info.process_name,
>>> -			   adev->reset_task_info.pid);
>>> +			   coredump->reset_task_info.process_name,
>>> +			   coredump->reset_task_info.pid);
>>>    
>>> -	if (adev->reset_vram_lost)
>>> +	if (coredump->reset_vram_lost)
>>>    		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>>> -	if (adev->num_regs) {
>>> +	if (coredump->adev->num_regs) {
>>>    		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
>>>    
>>> -		for (i = 0; i < adev->num_regs; i++)
>>> +		for (i = 0; i < coredump->adev->num_regs; i++)
>>>    			drm_printf(&p, "0x%08x: 0x%08x\n",
>>> -				   adev->reset_dump_reg_list[i],
>>> -				   adev->reset_dump_reg_value[i]);
>>> +				   coredump->adev->reset_dump_reg_list[i],
>>> +				   coredump->adev->reset_dump_reg_value[i]);
>>>    	}
>>>    
>>>    	return count - iter.remain;
>>> @@ -5004,14 +5009,34 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
>>>    
>>>    static void amdgpu_devcoredump_free(void *data)
>>>    {
>>> +	kfree(data);
>>>    }
>>>    
>>> -static void amdgpu_reset_capture_coredumpm(struct amdgpu_device *adev)
>>> +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
>>> +			    struct amdgpu_reset_context *reset_context)
>>>    {
>>> +	struct amdgpu_coredump_info *coredump;
>>>    	struct drm_device *dev = adev_to_drm(adev);
>>>    
>>> -	ktime_get_ts64(&adev->reset_time);
>>> -	dev_coredumpm(dev->dev, THIS_MODULE, adev, 0, GFP_KERNEL,
>>> +	coredump = kmalloc(sizeof(*coredump), GFP_KERNEL);
>>> +
>>> +	if (!coredump) {
>>> +		DRM_ERROR("%s: failed to allocate memory for coredump\n", __func__);
>>> +		return;
>>> +	}
>>> +
>>> +	memset(coredump, 0, sizeof(*coredump));
>>> +
>>> +	coredump->reset_vram_lost = vram_lost;
>>> +
>>> +	if (reset_context->job && reset_context->job->vm)
>>> +		coredump->reset_task_info = reset_context->job->vm->task_info;
>>> +
>>> +	coredump->adev = adev;
>>> +
>>> +	ktime_get_ts64(&coredump->reset_time);
>>> +
>>> +	dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
>>>    		      amdgpu_devcoredump_read, amdgpu_devcoredump_free);
>>>    }
>>>    #endif
>>> @@ -5119,15 +5144,9 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>    					goto out;
>>>    
>>>    				vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
>>> -#ifdef CONFIG_DEV_COREDUMP
>>> -				tmp_adev->reset_vram_lost = vram_lost;
>>> -				memset(&tmp_adev->reset_task_info, 0,
>>> -						sizeof(tmp_adev->reset_task_info));
>>> -				if (reset_context->job && reset_context->job->vm)
>>> -					tmp_adev->reset_task_info =
>>> -						reset_context->job->vm->task_info;
>>> -				amdgpu_reset_capture_coredumpm(tmp_adev);
>>> -#endif
>>> +
>>> +				amdgpu_coredump(tmp_adev, vram_lost, reset_context);
>>> +
>>>    				if (vram_lost) {
>>>    					DRM_INFO("VRAM is lost due to GPU reset!\n");
>>>    					amdgpu_inc_vram_lost(tmp_adev);




* Re: [PATCH 3/6] drm/amdgpu: Rework coredump to use memory dynamically
  2023-07-12 10:39         ` Christian König
@ 2023-07-12 10:46           ` Lucas Stach
  -1 siblings, 0 replies; 37+ messages in thread
From: Lucas Stach @ 2023-07-12 10:46 UTC (permalink / raw)
  To: Christian König, André Almeida, dri-devel, amd-gfx,
	linux-kernel
  Cc: pierre-eric.pelloux-prayer, 'Marek Olšák',
	Timur Kristóf, michel.daenzer, Samuel Pitoiset, kernel-dev,
	alexander.deucher

On Wednesday, 12.07.2023 at 12:39 +0200, Christian König wrote:
> On 12.07.23 at 10:59, Lucas Stach wrote:
> > On Wednesday, 12.07.2023 at 10:37 +0200, Christian König wrote:
> > > On 11.07.23 at 23:34, André Almeida wrote:
> > > > Instead of storing coredump information inside amdgpu_device struct,
> > > > move it to a proper, separate struct and allocate it dynamically. This
> > > > will make it easier to further expand the logged information.
> > > Very big NAK to this. The problem is that memory allocation isn't
> > > allowed during a GPU reset.
> > > 
> > > What you could do is to allocate the memory with GFP_ATOMIC or similar,
> > > but for a large structure that might not be possible.
> > > 
> > I'm still not fully clear on what the rules are here. In etnaviv we do
> > devcoredump allocation in the GPU reset path with __GFP_NOWARN |
> > __GFP_NORETRY, which means the allocation will kick memory reclaim if
> > necessary, but will just give up if no memory could be made available
> > easily. This satisfies the forward progress guarantee in the absence of
> > successful memory allocation, which is the most important property in
> > this path, I think.
> > 
> > However, I'm not sure if the reclaim could lead to locking issues or
> > something like that with the more complex use-cases with MMU notifiers
> > and stuff like that. Christian, do you have any experience or
> > information that would be good to share in this regard?
> 
> Yeah, very good question.
> 
> __GFP_NORETRY isn't sufficient as far as I know. Reclaim must be 
> completely suppressed to be able to allocate in a GPU reset handler.
> 
> Daniel added lockdep annotation to some of the dma-fence signaling paths 
> and this yielded quite a bunch of potential deadlocks.
> 
> It's not even that reclaim itself waits for dma_fences (that can happen, 
> but is quite unlikely), but rather that reclaim needs locks and under 
> those locks we then wait for dma_fences.
> 
> We should probably add a define somewhere which documents that 
> (GFP_ATOMIC | __GFP_NOWARN) should be used in the GPU reset handlers when 
> allocating memory for coredumps.
> 
> Regards,
> Christian.
> 
> > 
> > Regards,
> > Lucas
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > > ---
> > > >    drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 14 +++--
> > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 65 ++++++++++++++--------
> > > >    2 files changed, 51 insertions(+), 28 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > index dbe062a087c5..e1cc83a89d46 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > @@ -1068,11 +1068,6 @@ struct amdgpu_device {
> > > >    	uint32_t                        *reset_dump_reg_list;
> > > >    	uint32_t			*reset_dump_reg_value;
> > > >    	int                             num_regs;
> > > > -#ifdef CONFIG_DEV_COREDUMP
> > > > -	struct amdgpu_task_info         reset_task_info;
> > > > -	bool                            reset_vram_lost;
> > > > -	struct timespec64               reset_time;
> > > > -#endif
> > > >    
> > > >    	bool                            scpm_enabled;
> > > >    	uint32_t                        scpm_status;
> > > > @@ -1085,6 +1080,15 @@ struct amdgpu_device {
> > > >    	uint32_t			aid_mask;
> > > >    };
> > > >    
> > > > +#ifdef CONFIG_DEV_COREDUMP
> > > > +struct amdgpu_coredump_info {
> > > > +	struct amdgpu_device		*adev;
> > > > +	struct amdgpu_task_info         reset_task_info;
> > > > +	struct timespec64               reset_time;
> > > > +	bool                            reset_vram_lost;
> > > > +};
> > > > +#endif
> > > > +
> > > >    static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> > > >    {
> > > >    	return container_of(ddev, struct amdgpu_device, ddev);
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index e25f085ee886..23b9784e9787 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -4963,12 +4963,17 @@ static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
> > > >    	return 0;
> > > >    }
> > > >    
> > > > -#ifdef CONFIG_DEV_COREDUMP
> > > > +#ifndef CONFIG_DEV_COREDUMP
> > > > +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> > > > +			    struct amdgpu_reset_context *reset_context)
> > > > +{
> > > > +}
> > > > +#else
> > > >    static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
> > > >    		size_t count, void *data, size_t datalen)
> > > >    {
> > > >    	struct drm_printer p;
> > > > -	struct amdgpu_device *adev = data;
> > > > +	struct amdgpu_coredump_info *coredump = data;
> > > >    	struct drm_print_iterator iter;
> > > >    	int i;
> > > >    
> > > > @@ -4982,21 +4987,21 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
> > > >    	drm_printf(&p, "**** AMDGPU Device Coredump ****\n");
> > > >    	drm_printf(&p, "kernel: " UTS_RELEASE "\n");
> > > >    	drm_printf(&p, "module: " KBUILD_MODNAME "\n");
> > > > -	drm_printf(&p, "time: %lld.%09ld\n", adev->reset_time.tv_sec, adev->reset_time.tv_nsec);
> > > > -	if (adev->reset_task_info.pid)
> > > > +	drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredump->reset_time.tv_nsec);
> > > > +	if (coredump->reset_task_info.pid)
> > > >    		drm_printf(&p, "process_name: %s PID: %d\n",
> > > > -			   adev->reset_task_info.process_name,
> > > > -			   adev->reset_task_info.pid);
> > > > +			   coredump->reset_task_info.process_name,
> > > > +			   coredump->reset_task_info.pid);
> > > >    
> > > > -	if (adev->reset_vram_lost)
> > > > +	if (coredump->reset_vram_lost)
> > > >    		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
> > > > -	if (adev->num_regs) {
> > > > +	if (coredump->adev->num_regs) {
> > > >    		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
> > > >    
> > > > -		for (i = 0; i < adev->num_regs; i++)
> > > > +		for (i = 0; i < coredump->adev->num_regs; i++)
> > > >    			drm_printf(&p, "0x%08x: 0x%08x\n",
> > > > -				   adev->reset_dump_reg_list[i],
> > > > -				   adev->reset_dump_reg_value[i]);
> > > > +				   coredump->adev->reset_dump_reg_list[i],
> > > > +				   coredump->adev->reset_dump_reg_value[i]);
> > > >    	}
> > > >    
> > > >    	return count - iter.remain;
> > > > @@ -5004,14 +5009,34 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
> > > >    
> > > >    static void amdgpu_devcoredump_free(void *data)
> > > >    {
> > > > +	kfree(data);
> > > >    }
> > > >    
> > > > -static void amdgpu_reset_capture_coredumpm(struct amdgpu_device *adev)
> > > > +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> > > > +			    struct amdgpu_reset_context *reset_context)
> > > >    {
> > > > +	struct amdgpu_coredump_info *coredump;
> > > >    	struct drm_device *dev = adev_to_drm(adev);
> > > >    
> > > > -	ktime_get_ts64(&adev->reset_time);
> > > > -	dev_coredumpm(dev->dev, THIS_MODULE, adev, 0, GFP_KERNEL,
> > > > +	coredump = kmalloc(sizeof(*coredump), GFP_KERNEL);
> > > > +
> > > > +	if (!coredump) {
> > > > +		DRM_ERROR("%s: failed to allocate memory for coredump\n", __func__);
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	memset(coredump, 0, sizeof(*coredump));
> > > > +
> > > > +	coredump->reset_vram_lost = vram_lost;
> > > > +
> > > > +	if (reset_context->job && reset_context->job->vm)
> > > > +		coredump->reset_task_info = reset_context->job->vm->task_info;
> > > > +
> > > > +	coredump->adev = adev;
> > > > +
> > > > +	ktime_get_ts64(&coredump->reset_time);
> > > > +
> > > > +	dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
> > > >    		      amdgpu_devcoredump_read, amdgpu_devcoredump_free);
> > > >    }
> > > >    #endif
> > > > @@ -5119,15 +5144,9 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
> > > >    					goto out;
> > > >    
> > > >    				vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
> > > > -#ifdef CONFIG_DEV_COREDUMP
> > > > -				tmp_adev->reset_vram_lost = vram_lost;
> > > > -				memset(&tmp_adev->reset_task_info, 0,
> > > > -						sizeof(tmp_adev->reset_task_info));
> > > > -				if (reset_context->job && reset_context->job->vm)
> > > > -					tmp_adev->reset_task_info =
> > > > -						reset_context->job->vm->task_info;
> > > > -				amdgpu_reset_capture_coredumpm(tmp_adev);
> > > > -#endif
> > > > +
> > > > +				amdgpu_coredump(tmp_adev, vram_lost, reset_context);
> > > > +
> > > >    				if (vram_lost) {
> > > >    					DRM_INFO("VRAM is lost due to GPU reset!\n");
> > > >    					amdgpu_inc_vram_lost(tmp_adev);
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread


* Re: [PATCH 3/6] drm/amdgpu: Rework coredump to use memory dynamically
  2023-07-12 10:39         ` Christian König
@ 2023-07-12 10:56           ` Lucas Stach
  -1 siblings, 0 replies; 37+ messages in thread
From: Lucas Stach @ 2023-07-12 10:56 UTC (permalink / raw)
  To: Christian König, André Almeida, dri-devel, amd-gfx,
	linux-kernel
  Cc: pierre-eric.pelloux-prayer, 'Marek Olšák',
	Timur Kristóf, michel.daenzer, Samuel Pitoiset, kernel-dev,
	alexander.deucher

Sorry, accidentally hit send on the previous mail.

On Wednesday, 2023-07-12 at 12:39 +0200, Christian König wrote:
> On 2023-07-12 at 10:59, Lucas Stach wrote:
> > On Wednesday, 2023-07-12 at 10:37 +0200, Christian König wrote:
> > > On 2023-07-11 at 23:34, André Almeida wrote:
> > > > Instead of storing coredump information inside amdgpu_device struct,
> > > > move it to a proper, separate struct and allocate it dynamically. This
> > > > will make it easier to further expand the logged information.
> > > Very big NAK to this. The problem is that memory allocation isn't
> > > allowed during a GPU reset.
> > > 
> > > What you could do is to allocate the memory with GFP_ATOMIC or similar,
> > > but for a large structure that might not be possible.
> > > 
> > I'm still not fully clear on what the rules are here. In etnaviv we do
> > devcoredump allocation in the GPU reset path with __GFP_NOWARN |
> > __GFP_NORETRY, which means the allocation will kick memory reclaim if
> > necessary, but will just give up if no memory could be made available
> > easily. This satisfies the forward progress guarantee in the absence of
> > successful memory allocation, which is the most important property in
> > this path, I think.
> > 
> > However, I'm not sure if the reclaim could lead to locking issues or
> > something like that with the more complex use-cases with MMU notifiers
> > and stuff like that. Christian, do you have any experience or
> > information that would be good to share in this regard?
> 
> Yeah, very good question.
> 
> __GFP_NORETRY isn't sufficient as far as I know. Reclaim must be 
> completely suppressed to be able to allocate in a GPU reset handler.
> 
> Daniel added lockdep annotation to some of the dma-fence signaling paths 
> and this yielded quite a bunch of potential deadlocks.
> 
> It's not even that reclaim itself waits for dma_fences (that can happen, 
> but is quite unlikely), but rather that reclaim needs locks and under 
> those locks we then wait for dma_fences.
> 
> We should probably add a define somewhere which documents that 
> (GFP_ATOMIC | __GFP_NOWARN) should be used in the GPU reset handlers when 
> allocating memory for coredumps.
> 
Hm, if the problem is the direct reclaim path where we might recurse on
a lock through those indirect dependencies then we should document this
somewhere. kswapd reclaim should be fine as far as I can see, as we'll
guarantee progress without waiting for the background reclaim.

I don't think it's appropriate to dip into the atomic allocation
reserves for a best-effort thing like writing the devcoredump, so I
think this should be GFP_NOWAIT, which will also avoid the direct
reclaim path.
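
Applied to the amdgpu_coredump() from the patch under review, the suggestion above would look roughly like the following sketch. The flag choice was still being debated at this point in the thread, and the kzalloc use and error handling here are illustrative, not taken from a posted revision:

```c
/* Sketch: best-effort coredump allocation that never enters direct
 * reclaim. GFP_NOWAIT fails fast instead of blocking, and
 * __GFP_NOWARN suppresses the allocation-failure warning, since a
 * missing coredump must not interfere with the reset itself.
 */
static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
			    struct amdgpu_reset_context *reset_context)
{
	struct amdgpu_coredump_info *coredump;
	struct drm_device *dev = adev_to_drm(adev);

	coredump = kzalloc(sizeof(*coredump), GFP_NOWAIT | __GFP_NOWARN);
	if (!coredump)
		return;	/* best effort: skip the dump, keep resetting */

	coredump->reset_vram_lost = vram_lost;
	if (reset_context->job && reset_context->job->vm)
		coredump->reset_task_info = reset_context->job->vm->task_info;
	coredump->adev = adev;
	ktime_get_ts64(&coredump->reset_time);

	/* dev_coredumpm() also takes a gfp_t for its own allocations */
	dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0,
		      GFP_NOWAIT | __GFP_NOWARN,
		      amdgpu_devcoredump_read, amdgpu_devcoredump_free);
}
```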

Regards,
Lucas

> Regards,
> Christian.
> 
> > 
> > Regards,
> > Lucas
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > > ---
> > > >    drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 14 +++--
> > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 65 ++++++++++++++--------
> > > >    2 files changed, 51 insertions(+), 28 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > index dbe062a087c5..e1cc83a89d46 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> > > > @@ -1068,11 +1068,6 @@ struct amdgpu_device {
> > > >    	uint32_t                        *reset_dump_reg_list;
> > > >    	uint32_t			*reset_dump_reg_value;
> > > >    	int                             num_regs;
> > > > -#ifdef CONFIG_DEV_COREDUMP
> > > > -	struct amdgpu_task_info         reset_task_info;
> > > > -	bool                            reset_vram_lost;
> > > > -	struct timespec64               reset_time;
> > > > -#endif
> > > >    
> > > >    	bool                            scpm_enabled;
> > > >    	uint32_t                        scpm_status;
> > > > @@ -1085,6 +1080,15 @@ struct amdgpu_device {
> > > >    	uint32_t			aid_mask;
> > > >    };
> > > >    
> > > > +#ifdef CONFIG_DEV_COREDUMP
> > > > +struct amdgpu_coredump_info {
> > > > +	struct amdgpu_device		*adev;
> > > > +	struct amdgpu_task_info         reset_task_info;
> > > > +	struct timespec64               reset_time;
> > > > +	bool                            reset_vram_lost;
> > > > +};
> > > > +#endif
> > > > +
> > > >    static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> > > >    {
> > > >    	return container_of(ddev, struct amdgpu_device, ddev);
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index e25f085ee886..23b9784e9787 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -4963,12 +4963,17 @@ static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
> > > >    	return 0;
> > > >    }
> > > >    
> > > > -#ifdef CONFIG_DEV_COREDUMP
> > > > +#ifndef CONFIG_DEV_COREDUMP
> > > > +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> > > > +			    struct amdgpu_reset_context *reset_context)
> > > > +{
> > > > +}
> > > > +#else
> > > >    static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
> > > >    		size_t count, void *data, size_t datalen)
> > > >    {
> > > >    	struct drm_printer p;
> > > > -	struct amdgpu_device *adev = data;
> > > > +	struct amdgpu_coredump_info *coredump = data;
> > > >    	struct drm_print_iterator iter;
> > > >    	int i;
> > > >    
> > > > @@ -4982,21 +4987,21 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
> > > >    	drm_printf(&p, "**** AMDGPU Device Coredump ****\n");
> > > >    	drm_printf(&p, "kernel: " UTS_RELEASE "\n");
> > > >    	drm_printf(&p, "module: " KBUILD_MODNAME "\n");
> > > > -	drm_printf(&p, "time: %lld.%09ld\n", adev->reset_time.tv_sec, adev->reset_time.tv_nsec);
> > > > -	if (adev->reset_task_info.pid)
> > > > +	drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredump->reset_time.tv_nsec);
> > > > +	if (coredump->reset_task_info.pid)
> > > >    		drm_printf(&p, "process_name: %s PID: %d\n",
> > > > -			   adev->reset_task_info.process_name,
> > > > -			   adev->reset_task_info.pid);
> > > > +			   coredump->reset_task_info.process_name,
> > > > +			   coredump->reset_task_info.pid);
> > > >    
> > > > -	if (adev->reset_vram_lost)
> > > > +	if (coredump->reset_vram_lost)
> > > >    		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
> > > > -	if (adev->num_regs) {
> > > > +	if (coredump->adev->num_regs) {
> > > >    		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
> > > >    
> > > > -		for (i = 0; i < adev->num_regs; i++)
> > > > +		for (i = 0; i < coredump->adev->num_regs; i++)
> > > >    			drm_printf(&p, "0x%08x: 0x%08x\n",
> > > > -				   adev->reset_dump_reg_list[i],
> > > > -				   adev->reset_dump_reg_value[i]);
> > > > +				   coredump->adev->reset_dump_reg_list[i],
> > > > +				   coredump->adev->reset_dump_reg_value[i]);
> > > >    	}
> > > >    
> > > >    	return count - iter.remain;
> > > > @@ -5004,14 +5009,34 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
> > > >    
> > > >    static void amdgpu_devcoredump_free(void *data)
> > > >    {
> > > > +	kfree(data);
> > > >    }
> > > >    
> > > > -static void amdgpu_reset_capture_coredumpm(struct amdgpu_device *adev)
> > > > +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
> > > > +			    struct amdgpu_reset_context *reset_context)
> > > >    {
> > > > +	struct amdgpu_coredump_info *coredump;
> > > >    	struct drm_device *dev = adev_to_drm(adev);
> > > >    
> > > > -	ktime_get_ts64(&adev->reset_time);
> > > > -	dev_coredumpm(dev->dev, THIS_MODULE, adev, 0, GFP_KERNEL,
> > > > +	coredump = kmalloc(sizeof(*coredump), GFP_KERNEL);
> > > > +
> > > > +	if (!coredump) {
> > > > +		DRM_ERROR("%s: failed to allocate memory for coredump\n", __func__);
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	memset(coredump, 0, sizeof(*coredump));
> > > > +
> > > > +	coredump->reset_vram_lost = vram_lost;
> > > > +
> > > > +	if (reset_context->job && reset_context->job->vm)
> > > > +		coredump->reset_task_info = reset_context->job->vm->task_info;
> > > > +
> > > > +	coredump->adev = adev;
> > > > +
> > > > +	ktime_get_ts64(&coredump->reset_time);
> > > > +
> > > > +	dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
> > > >    		      amdgpu_devcoredump_read, amdgpu_devcoredump_free);
> > > >    }
> > > >    #endif
> > > > @@ -5119,15 +5144,9 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
> > > >    					goto out;
> > > >    
> > > >    				vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
> > > > -#ifdef CONFIG_DEV_COREDUMP
> > > > -				tmp_adev->reset_vram_lost = vram_lost;
> > > > -				memset(&tmp_adev->reset_task_info, 0,
> > > > -						sizeof(tmp_adev->reset_task_info));
> > > > -				if (reset_context->job && reset_context->job->vm)
> > > > -					tmp_adev->reset_task_info =
> > > > -						reset_context->job->vm->task_info;
> > > > -				amdgpu_reset_capture_coredumpm(tmp_adev);
> > > > -#endif
> > > > +
> > > > +				amdgpu_coredump(tmp_adev, vram_lost, reset_context);
> > > > +
> > > >    				if (vram_lost) {
> > > >    					DRM_INFO("VRAM is lost due to GPU reset!\n");
> > > >    					amdgpu_inc_vram_lost(tmp_adev);
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread


* Re: [PATCH 3/6] drm/amdgpu: Rework coredump to use memory dynamically
  2023-07-12 10:56           ` Lucas Stach
@ 2023-07-12 11:36             ` Christian König
  -1 siblings, 0 replies; 37+ messages in thread
From: Christian König @ 2023-07-12 11:36 UTC (permalink / raw)
  To: Lucas Stach, André Almeida, dri-devel, amd-gfx, linux-kernel
  Cc: pierre-eric.pelloux-prayer, 'Marek Olšák',
	Timur Kristóf, michel.daenzer, Samuel Pitoiset, kernel-dev,
	alexander.deucher

On 2023-07-12 at 12:56, Lucas Stach wrote:
> Sorry, accidentally hit send on the previous mail.
>
> On Wednesday, 2023-07-12 at 12:39 +0200, Christian König wrote:
>> On 2023-07-12 at 10:59, Lucas Stach wrote:
>>> On Wednesday, 2023-07-12 at 10:37 +0200, Christian König wrote:
>>>> On 2023-07-11 at 23:34, André Almeida wrote:
>>>>> Instead of storing coredump information inside amdgpu_device struct,
>>>>> move it to a proper, separate struct and allocate it dynamically. This
>>>>> will make it easier to further expand the logged information.
>>>> Very big NAK to this. The problem is that memory allocation isn't
>>>> allowed during a GPU reset.
>>>>
>>>> What you could do is to allocate the memory with GFP_ATOMIC or similar,
>>>> but for a large structure that might not be possible.
>>>>
>>> I'm still not fully clear on what the rules are here. In etnaviv we do
>>> devcoredump allocation in the GPU reset path with __GFP_NOWARN |
>>> __GFP_NORETRY, which means the allocation will kick memory reclaim if
>>> necessary, but will just give up if no memory could be made available
>>> easily. This satisfies the forward progress guarantee in the absence of
>>> successful memory allocation, which is the most important property in
>>> this path, I think.
>>>
>>> However, I'm not sure if the reclaim could lead to locking issues or
>>> something like that with the more complex use-cases with MMU notifiers
>>> and stuff like that. Christian, do you have any experience or
>>> information that would be good to share in this regard?
>> Yeah, very good question.
>>
>> __GFP_NORETRY isn't sufficient as far as I know. Reclaim must be
>> completely suppressed to be able to allocate in a GPU reset handler.
>>
>> Daniel added lockdep annotation to some of the dma-fence signaling paths
>> and this yielded quite a bunch of potential deadlocks.
>>
>> It's not even that reclaim itself waits for dma_fences (that can happen,
>> but is quite unlikely), but rather that reclaim needs locks and under
>> those locks we then wait for dma_fences.
>>
>> We should probably add a define somewhere which documents that
>> (GFP_ATOMIC | __GFP_NOWARN) should be used in the GPU reset handlers when
>> allocating memory for coredumps.
>>
> Hm, if the problem is the direct reclaim path where we might recurse on
> a lock through those indirect dependencies then we should document this
> somewhere. kswapd reclaim should be fine as far as I can see, as we'll
> guarantee progress without waiting for the background reclaim.
>
> I don't think it's appropriate to dip into the atomic allocation
> reserves for a best-effort thing like writing the devcoredump,

Yeah, that was also my concern the last time we discussed this.

> so I think this should be GFP_NOWAIT, which will also avoid the direct
> reclaim path.

Yeah, good idea. I wasn't aware that this existed.

Regards,
Christian.

>
> Regards,
> Lucas
>
>> Regards,
>> Christian.
>>
>>> Regards,
>>> Lucas
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 14 +++--
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 65 ++++++++++++++--------
>>>>>     2 files changed, 51 insertions(+), 28 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> index dbe062a087c5..e1cc83a89d46 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>>>>> @@ -1068,11 +1068,6 @@ struct amdgpu_device {
>>>>>     	uint32_t                        *reset_dump_reg_list;
>>>>>     	uint32_t			*reset_dump_reg_value;
>>>>>     	int                             num_regs;
>>>>> -#ifdef CONFIG_DEV_COREDUMP
>>>>> -	struct amdgpu_task_info         reset_task_info;
>>>>> -	bool                            reset_vram_lost;
>>>>> -	struct timespec64               reset_time;
>>>>> -#endif
>>>>>     
>>>>>     	bool                            scpm_enabled;
>>>>>     	uint32_t                        scpm_status;
>>>>> @@ -1085,6 +1080,15 @@ struct amdgpu_device {
>>>>>     	uint32_t			aid_mask;
>>>>>     };
>>>>>     
>>>>> +#ifdef CONFIG_DEV_COREDUMP
>>>>> +struct amdgpu_coredump_info {
>>>>> +	struct amdgpu_device		*adev;
>>>>> +	struct amdgpu_task_info         reset_task_info;
>>>>> +	struct timespec64               reset_time;
>>>>> +	bool                            reset_vram_lost;
>>>>> +};
>>>>> +#endif
>>>>> +
>>>>>     static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
>>>>>     {
>>>>>     	return container_of(ddev, struct amdgpu_device, ddev);
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index e25f085ee886..23b9784e9787 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4963,12 +4963,17 @@ static int amdgpu_reset_reg_dumps(struct amdgpu_device *adev)
>>>>>     	return 0;
>>>>>     }
>>>>>     
>>>>> -#ifdef CONFIG_DEV_COREDUMP
>>>>> +#ifndef CONFIG_DEV_COREDUMP
>>>>> +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
>>>>> +			    struct amdgpu_reset_context *reset_context)
>>>>> +{
>>>>> +}
>>>>> +#else
>>>>>     static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
>>>>>     		size_t count, void *data, size_t datalen)
>>>>>     {
>>>>>     	struct drm_printer p;
>>>>> -	struct amdgpu_device *adev = data;
>>>>> +	struct amdgpu_coredump_info *coredump = data;
>>>>>     	struct drm_print_iterator iter;
>>>>>     	int i;
>>>>>     
>>>>> @@ -4982,21 +4987,21 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
>>>>>     	drm_printf(&p, "**** AMDGPU Device Coredump ****\n");
>>>>>     	drm_printf(&p, "kernel: " UTS_RELEASE "\n");
>>>>>     	drm_printf(&p, "module: " KBUILD_MODNAME "\n");
>>>>> -	drm_printf(&p, "time: %lld.%09ld\n", adev->reset_time.tv_sec, adev->reset_time.tv_nsec);
>>>>> -	if (adev->reset_task_info.pid)
>>>>> +	drm_printf(&p, "time: %lld.%09ld\n", coredump->reset_time.tv_sec, coredump->reset_time.tv_nsec);
>>>>> +	if (coredump->reset_task_info.pid)
>>>>>     		drm_printf(&p, "process_name: %s PID: %d\n",
>>>>> -			   adev->reset_task_info.process_name,
>>>>> -			   adev->reset_task_info.pid);
>>>>> +			   coredump->reset_task_info.process_name,
>>>>> +			   coredump->reset_task_info.pid);
>>>>>     
>>>>> -	if (adev->reset_vram_lost)
>>>>> +	if (coredump->reset_vram_lost)
>>>>>     		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>>>>> -	if (adev->num_regs) {
>>>>> +	if (coredump->adev->num_regs) {
>>>>>     		drm_printf(&p, "AMDGPU register dumps:\nOffset:     Value:\n");
>>>>>     
>>>>> -		for (i = 0; i < adev->num_regs; i++)
>>>>> +		for (i = 0; i < coredump->adev->num_regs; i++)
>>>>>     			drm_printf(&p, "0x%08x: 0x%08x\n",
>>>>> -				   adev->reset_dump_reg_list[i],
>>>>> -				   adev->reset_dump_reg_value[i]);
>>>>> +				   coredump->adev->reset_dump_reg_list[i],
>>>>> +				   coredump->adev->reset_dump_reg_value[i]);
>>>>>     	}
>>>>>     
>>>>>     	return count - iter.remain;
>>>>> @@ -5004,14 +5009,34 @@ static ssize_t amdgpu_devcoredump_read(char *buffer, loff_t offset,
>>>>>     
>>>>>     static void amdgpu_devcoredump_free(void *data)
>>>>>     {
>>>>> +	kfree(data);
>>>>>     }
>>>>>     
>>>>> -static void amdgpu_reset_capture_coredumpm(struct amdgpu_device *adev)
>>>>> +static void amdgpu_coredump(struct amdgpu_device *adev, bool vram_lost,
>>>>> +			    struct amdgpu_reset_context *reset_context)
>>>>>     {
>>>>> +	struct amdgpu_coredump_info *coredump;
>>>>>     	struct drm_device *dev = adev_to_drm(adev);
>>>>>     
>>>>> -	ktime_get_ts64(&adev->reset_time);
>>>>> -	dev_coredumpm(dev->dev, THIS_MODULE, adev, 0, GFP_KERNEL,
>>>>> +	coredump = kmalloc(sizeof(*coredump), GFP_KERNEL);
>>>>> +
>>>>> +	if (!coredump) {
>>>>> +		DRM_ERROR("%s: failed to allocate memory for coredump\n", __func__);
>>>>> +		return;
>>>>> +	}
>>>>> +
>>>>> +	memset(coredump, 0, sizeof(*coredump));
>>>>> +
>>>>> +	coredump->reset_vram_lost = vram_lost;
>>>>> +
>>>>> +	if (reset_context->job && reset_context->job->vm)
>>>>> +		coredump->reset_task_info = reset_context->job->vm->task_info;
>>>>> +
>>>>> +	coredump->adev = adev;
>>>>> +
>>>>> +	ktime_get_ts64(&coredump->reset_time);
>>>>> +
>>>>> +	dev_coredumpm(dev->dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
>>>>>     		      amdgpu_devcoredump_read, amdgpu_devcoredump_free);
>>>>>     }
>>>>>     #endif
>>>>> @@ -5119,15 +5144,9 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>>>>     					goto out;
>>>>>     
>>>>>     				vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
>>>>> -#ifdef CONFIG_DEV_COREDUMP
>>>>> -				tmp_adev->reset_vram_lost = vram_lost;
>>>>> -				memset(&tmp_adev->reset_task_info, 0,
>>>>> -						sizeof(tmp_adev->reset_task_info));
>>>>> -				if (reset_context->job && reset_context->job->vm)
>>>>> -					tmp_adev->reset_task_info =
>>>>> -						reset_context->job->vm->task_info;
>>>>> -				amdgpu_reset_capture_coredumpm(tmp_adev);
>>>>> -#endif
>>>>> +
>>>>> +				amdgpu_coredump(tmp_adev, vram_lost, reset_context);
>>>>> +
>>>>>     				if (vram_lost) {
>>>>>     					DRM_INFO("VRAM is lost due to GPU reset!\n");
>>>>>     					amdgpu_inc_vram_lost(tmp_adev);


^ permalink raw reply	[flat|nested] 37+ messages in thread


end of thread, other threads:[~2023-07-12 11:37 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
2023-07-11 21:34 [PATCH 0/6] drm/amdgpu: Add new reset option and rework coredump André Almeida
2023-07-11 21:34 ` [PATCH 1/6] drm/amdgpu: Create a module param to disable soft recovery André Almeida
2023-07-11 21:34 ` [PATCH 2/6] drm/amdgpu: Mark contexts guilty for causing soft recoveries André Almeida
2023-07-12  8:43   ` Christian König
2023-07-11 21:34 ` [PATCH 3/6] drm/amdgpu: Rework coredump to use memory dynamically André Almeida
2023-07-12  8:37   ` Christian König
2023-07-12  8:59     ` Lucas Stach
2023-07-12 10:39       ` Christian König
2023-07-12 10:46         ` Lucas Stach
2023-07-12 10:56         ` Lucas Stach
2023-07-12 11:36           ` Christian König
2023-07-11 21:34 ` [PATCH 4/6] drm/amdgpu: Limit info in coredump for kernel threads André Almeida
2023-07-11 21:35 ` [PATCH 5/6] drm/amdgpu: Log IBs and ring name at coredump André Almeida
2023-07-11 21:35 ` [PATCH 6/6] drm/amdgpu: Create version number for coredumps André Almeida
