All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/7] Fix multiple GPU resets in XGMI hive.
@ 2022-05-25 19:04 Andrey Grodzovsky
  2022-05-25 19:04 ` [PATCH v3 1/7] Revert "workqueue: remove unused cancel_work()" Andrey Grodzovsky
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Andrey Grodzovsky @ 2022-05-25 19:04 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

Problem:
During hive reset caused by command timing out on a ring
extra resets are generated by triggered by KFD which is
unable to accesses registers on the resetting ASIC.

Fix: Rework GPU reset to actively stop any pending reset
works while another in progress. 

v2: Switch from generic list as was in v1[1] to eplicit 
stopping of each reset request from each reset source
per each request submitter. 

v3: Switch back to work_struct from delayed_work (Christian)

[1] - https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/

Andrey Grodzovsky (7):
  Revert "workqueue: remove unused cancel_work()"
  drm/amdgpu: Cache result of last reset at reset domain level.
  drm/admgpu: Serialize RAS recovery work directly into reset domain
    queue.
  drm/amdgpu: Add work_struct for GPU reset from debugfs
  drm/amdgpu: Add work_struct for GPU reset from kfd.
  drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
    amdgpu_device_gpu_recover
  drm/amdgpu: Stop any pending reset if another in progress.

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  2 +-
 include/linux/workqueue.h                  |  1 +
 kernel/workqueue.c                         |  9 ++++
 14 files changed, 84 insertions(+), 41 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v3 1/7] Revert "workqueue: remove unused cancel_work()"
  2022-05-25 19:04 [PATCH v3 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
@ 2022-05-25 19:04 ` Andrey Grodzovsky
  2022-05-25 19:04 ` [PATCH v3 2/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Andrey Grodzovsky @ 2022-05-25 19:04 UTC (permalink / raw)
  To: amd-gfx
  Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Lai Jiangshan,
	Christian König

This reverts commit 6417250d3f894e66a68ba1cd93676143f2376a6f.

amdpgu need this function in order to prematurly stop pending
reset works when another reset work already in progress.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Lai Jiangshan<jiangshanlai@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
---
 include/linux/workqueue.h | 1 +
 kernel/workqueue.c        | 9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 7fee9b6cfede..9e41e1226193 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -453,6 +453,7 @@ extern int schedule_on_each_cpu(work_func_t func);
 int execute_in_process_context(work_func_t fn, struct execute_work *);
 
 extern bool flush_work(struct work_struct *work);
+extern bool cancel_work(struct work_struct *work);
 extern bool cancel_work_sync(struct work_struct *work);
 
 extern bool flush_delayed_work(struct delayed_work *dwork);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 613917bbc4e7..f94b596ebffd 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3267,6 +3267,15 @@ static bool __cancel_work(struct work_struct *work, bool is_dwork)
 	return ret;
 }
 
+/*
+ * See cancel_delayed_work()
+ */
+bool cancel_work(struct work_struct *work)
+{
+	return __cancel_work(work, false);
+}
+EXPORT_SYMBOL(cancel_work);
+
 /**
  * cancel_delayed_work - cancel a delayed work
  * @dwork: delayed_work to cancel
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 2/7] drm/amdgpu: Cache result of last reset at reset domain level.
  2022-05-25 19:04 [PATCH v3 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
  2022-05-25 19:04 ` [PATCH v3 1/7] Revert "workqueue: remove unused cancel_work()" Andrey Grodzovsky
@ 2022-05-25 19:04 ` Andrey Grodzovsky
  2022-05-30  7:47   ` Christian König
  2022-05-25 19:04 ` [PATCH v3 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue Andrey Grodzovsky
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Andrey Grodzovsky @ 2022-05-25 19:04 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

Will be read by executors of async reset like debugfs.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 4daa0e893965..bfdd8883089a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5307,6 +5307,8 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
 
 	if (r)
 		dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
+
+	atomic_set(&adev->reset_domain->reset_res, r);
 	return r;
 }
 
@@ -5321,7 +5323,7 @@ static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
 {
 	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
 
-	recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
+	amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
 }
 /*
  * Serialize gpu recover into reset domain single threaded wq
@@ -5338,7 +5340,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 	flush_work(&work.base);
 
-	return work.ret;
+	return atomic_read(&adev->reset_domain->reset_res);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index c80af0889773..32c86a0b145c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -132,6 +132,7 @@ struct amdgpu_reset_domain *amdgpu_reset_create_reset_domain(enum amdgpu_reset_d
 	}
 
 	atomic_set(&reset_domain->in_gpu_reset, 0);
+	atomic_set(&reset_domain->reset_res, 0);
 	init_rwsem(&reset_domain->sem);
 
 	return reset_domain;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 1949dbe28a86..9e55a5d7a825 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -82,6 +82,7 @@ struct amdgpu_reset_domain {
 	enum amdgpu_reset_domain_type type;
 	struct rw_semaphore sem;
 	atomic_t in_gpu_reset;
+	atomic_t reset_res;
 };
 
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue.
  2022-05-25 19:04 [PATCH v3 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
  2022-05-25 19:04 ` [PATCH v3 1/7] Revert "workqueue: remove unused cancel_work()" Andrey Grodzovsky
  2022-05-25 19:04 ` [PATCH v3 2/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
@ 2022-05-25 19:04 ` Andrey Grodzovsky
  2022-05-30  7:49   ` Christian König
  2022-05-25 19:04 ` [PATCH v3 4/7] drm/amdgpu: Add work_struct for GPU reset from debugfs Andrey Grodzovsky
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Andrey Grodzovsky @ 2022-05-25 19:04 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

Save the extra usless work schedule.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 31207f7eec02..a439c04223b5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -35,6 +35,8 @@
 #include "amdgpu_xgmi.h"
 #include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
 #include "atom.h"
+#include "amdgpu_reset.h"
+
 #ifdef CONFIG_X86_MCE_AMD
 #include <asm/mce.h>
 
@@ -1920,7 +1922,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
 	}
 
 	if (amdgpu_device_should_recover_gpu(ras->adev))
-		amdgpu_device_gpu_recover(ras->adev, NULL);
+		amdgpu_device_gpu_recover_imp(ras->adev, NULL);
 	atomic_set(&ras->in_recovery, 0);
 }
 
@@ -2928,7 +2930,7 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
 	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
 
 	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
-		schedule_work(&ras->recovery_work);
+		amdgpu_reset_domain_schedule(ras->adev->reset_domain, &ras->recovery_work);
 	return 0;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 4/7] drm/amdgpu: Add work_struct for GPU reset from debugfs
  2022-05-25 19:04 [PATCH v3 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
                   ` (2 preceding siblings ...)
  2022-05-25 19:04 ` [PATCH v3 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue Andrey Grodzovsky
@ 2022-05-25 19:04 ` Andrey Grodzovsky
  2022-05-30  7:52   ` Christian König
  2022-05-25 19:04 ` [PATCH v3 5/7] drm/amdgpu: Add work_struct for GPU reset from kfd Andrey Grodzovsky
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Andrey Grodzovsky @ 2022-05-25 19:04 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

We need to have a work_struct to cancel this reset if another
already in progress.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h       |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 +++++++++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 76df583663c7..8165ee5b0457 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1048,6 +1048,8 @@ struct amdgpu_device {
 
 	bool                            scpm_enabled;
 	uint32_t                        scpm_status;
+
+	struct work_struct		reset_work;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index d16c8c1f72db..b0498ffcf7c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -39,6 +39,7 @@
 #include <drm/drm_drv.h>
 #include "amdgpu.h"
 #include "amdgpu_trace.h"
+#include "amdgpu_reset.h"
 
 /*
  * Fences
@@ -798,7 +799,10 @@ static int gpu_recover_get(void *data, u64 *val)
 		return 0;
 	}
 
-	*val = amdgpu_device_gpu_recover(adev, NULL);
+	if (amdgpu_reset_domain_schedule(adev->reset_domain, &adev->reset_work))
+		flush_work(&adev->reset_work);
+
+	*val = atomic_read(&adev->reset_domain->reset_res);
 
 	pm_runtime_mark_last_busy(dev->dev);
 	pm_runtime_put_autosuspend(dev->dev);
@@ -810,6 +814,14 @@ DEFINE_SHOW_ATTRIBUTE(amdgpu_debugfs_fence_info);
 DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_gpu_recover_fops, gpu_recover_get, NULL,
 			 "%lld\n");
 
+static void amdgpu_debugfs_reset_work(struct work_struct *work)
+{
+	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
+						  reset_work);
+
+	amdgpu_device_gpu_recover_imp(adev, NULL);
+}
+
 #endif
 
 void amdgpu_debugfs_fence_init(struct amdgpu_device *adev)
@@ -821,9 +833,12 @@ void amdgpu_debugfs_fence_init(struct amdgpu_device *adev)
 	debugfs_create_file("amdgpu_fence_info", 0444, root, adev,
 			    &amdgpu_debugfs_fence_info_fops);
 
-	if (!amdgpu_sriov_vf(adev))
+	if (!amdgpu_sriov_vf(adev)) {
+
+		INIT_WORK(&adev->reset_work, amdgpu_debugfs_reset_work);
 		debugfs_create_file("amdgpu_gpu_recover", 0444, root, adev,
 				    &amdgpu_debugfs_gpu_recover_fops);
+	}
 #endif
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 5/7] drm/amdgpu: Add work_struct for GPU reset from kfd.
  2022-05-25 19:04 [PATCH v3 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
                   ` (3 preceding siblings ...)
  2022-05-25 19:04 ` [PATCH v3 4/7] drm/amdgpu: Add work_struct for GPU reset from debugfs Andrey Grodzovsky
@ 2022-05-25 19:04 ` Andrey Grodzovsky
  2022-05-30  7:54   ` Christian König
  2022-05-31 15:31   ` Felix Kuehling
  2022-05-25 19:04 ` [PATCH v3 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover Andrey Grodzovsky
  2022-05-25 19:04 ` [PATCH v3 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
  6 siblings, 2 replies; 19+ messages in thread
From: Andrey Grodzovsky @ 2022-05-25 19:04 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

We need to have a work_struct to cancel this reset if another
already in progress.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 ++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 31 ----------------------
 3 files changed, 15 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 1f8161cd507f..a23abc0e86e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -33,6 +33,7 @@
 #include <uapi/linux/kfd_ioctl.h>
 #include "amdgpu_ras.h"
 #include "amdgpu_umc.h"
+#include "amdgpu_reset.h"
 
 /* Total memory size in system memory and all GPU VRAM. Used to
  * estimate worst case amount of memory to reserve for page tables
@@ -122,6 +123,15 @@ static void amdgpu_doorbell_get_kfd_info(struct amdgpu_device *adev,
 	}
 }
 
+
+static void amdgpu_amdkfd_reset_work(struct work_struct *work)
+{
+	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
+						  kfd.reset_work);
+
+	amdgpu_device_gpu_recover_imp(adev, NULL);
+}
+
 void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 {
 	int i;
@@ -180,6 +190,8 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 
 		adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
 						adev_to_drm(adev), &gpu_resources);
+
+		INIT_WORK(&adev->kfd.reset_work, amdgpu_amdkfd_reset_work);
 	}
 }
 
@@ -247,7 +259,8 @@ int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev)
 void amdgpu_amdkfd_gpu_reset(struct amdgpu_device *adev)
 {
 	if (amdgpu_device_should_recover_gpu(adev))
-		amdgpu_device_gpu_recover(adev, NULL);
+		amdgpu_reset_domain_schedule(adev->reset_domain,
+					     &adev->kfd.reset_work);
 }
 
 int amdgpu_amdkfd_alloc_gtt_mem(struct amdgpu_device *adev, size_t size,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..e0709af5a326 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -96,6 +96,7 @@ struct amdgpu_kfd_dev {
 	struct kfd_dev *dev;
 	uint64_t vram_used;
 	bool init_complete;
+	struct work_struct reset_work;
 };
 
 enum kgd_engine_type {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index bfdd8883089a..e3e2a5d17cc2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5312,37 +5312,6 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
 	return r;
 }
 
-struct amdgpu_recover_work_struct {
-	struct work_struct base;
-	struct amdgpu_device *adev;
-	struct amdgpu_job *job;
-	int ret;
-};
-
-static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
-{
-	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
-
-	amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
-}
-/*
- * Serialize gpu recover into reset domain single threaded wq
- */
-int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
-				    struct amdgpu_job *job)
-{
-	struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
-
-	INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
-
-	if (!amdgpu_reset_domain_schedule(adev->reset_domain, &work.base))
-		return -EAGAIN;
-
-	flush_work(&work.base);
-
-	return atomic_read(&adev->reset_domain->reset_res);
-}
-
 /**
  * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
  *
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover
  2022-05-25 19:04 [PATCH v3 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
                   ` (4 preceding siblings ...)
  2022-05-25 19:04 ` [PATCH v3 5/7] drm/amdgpu: Add work_struct for GPU reset from kfd Andrey Grodzovsky
@ 2022-05-25 19:04 ` Andrey Grodzovsky
  2022-05-30  7:55   ` Christian König
  2022-05-25 19:04 ` [PATCH v3 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
  6 siblings, 1 reply; 19+ messages in thread
From: Andrey Grodzovsky @ 2022-05-25 19:04 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

We removed the wrapper that was queueing the recover function
into reset domain queue who was using this name.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      | 2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      | 2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      | 2 +-
 9 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 8165ee5b0457..664ed0a6deab 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1244,7 +1244,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev);
 bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
 int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			      struct amdgpu_job* job);
-int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
+int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			      struct amdgpu_job *job);
 void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
 int amdgpu_device_pci_reset(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index a23abc0e86e7..513c57f839d8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -129,7 +129,7 @@ static void amdgpu_amdkfd_reset_work(struct work_struct *work)
 	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
 						  kfd.reset_work);
 
-	amdgpu_device_gpu_recover_imp(adev, NULL);
+	amdgpu_device_gpu_recover(adev, NULL);
 }
 
 void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e3e2a5d17cc2..424571e46cf5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5065,7 +5065,7 @@ static void amdgpu_device_recheck_guilty_jobs(
  * Returns 0 for success or an error on failure.
  */
 
-int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
+int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			      struct amdgpu_job *job)
 {
 	struct list_head device_list, *device_list_handle =  NULL;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index b0498ffcf7c3..957437a5558c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -819,7 +819,7 @@ static void amdgpu_debugfs_reset_work(struct work_struct *work)
 	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
 						  reset_work);
 
-	amdgpu_device_gpu_recover_imp(adev, NULL);
+	amdgpu_device_gpu_recover(adev, NULL);
 }
 
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index dfe7f2b8f0aa..10aa073600d4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -64,7 +64,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		  ti.process_name, ti.tgid, ti.task_name, ti.pid);
 
 	if (amdgpu_device_should_recover_gpu(ring->adev)) {
-		r = amdgpu_device_gpu_recover_imp(ring->adev, job);
+		r = amdgpu_device_gpu_recover(ring->adev, job);
 		if (r)
 			DRM_ERROR("GPU Recovery Failed: %d\n", r);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index a439c04223b5..bc0049308207 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1922,7 +1922,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
 	}
 
 	if (amdgpu_device_should_recover_gpu(ras->adev))
-		amdgpu_device_gpu_recover_imp(ras->adev, NULL);
+		amdgpu_device_gpu_recover(ras->adev, NULL);
 	atomic_set(&ras->in_recovery, 0);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index b81acf59870c..7ec5b5cf4bb9 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -284,7 +284,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	if (amdgpu_device_should_recover_gpu(adev)
 		&& (!amdgpu_device_has_job_running(adev) ||
 		adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT))
-		amdgpu_device_gpu_recover_imp(adev, NULL);
+		amdgpu_device_gpu_recover(adev, NULL);
 }
 
 static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 22c10b97ea81..e18b75c8fde6 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -311,7 +311,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 		adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
 		adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
 		adev->video_timeout == MAX_SCHEDULE_TIMEOUT))
-		amdgpu_device_gpu_recover_imp(adev, NULL);
+		amdgpu_device_gpu_recover(adev, NULL);
 }
 
 static int xgpu_nv_set_mailbox_rcv_irq(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index 7b63d30b9b79..c5016a926331 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -523,7 +523,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
 
 	/* Trigger recovery due to world switch failure */
 	if (amdgpu_device_should_recover_gpu(adev))
-		amdgpu_device_gpu_recover_imp(adev, NULL);
+		amdgpu_device_gpu_recover(adev, NULL);
 }
 
 static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 7/7] drm/amdgpu: Stop any pending reset if another in progress.
  2022-05-25 19:04 [PATCH v3 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
                   ` (5 preceding siblings ...)
  2022-05-25 19:04 ` [PATCH v3 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover Andrey Grodzovsky
@ 2022-05-25 19:04 ` Andrey Grodzovsky
  2022-05-30  7:56   ` Christian König
  2022-05-31 15:31   ` Felix Kuehling
  6 siblings, 2 replies; 19+ messages in thread
From: Andrey Grodzovsky @ 2022-05-25 19:04 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

We skip rest requests if another one is already in progress.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 424571e46cf5..e1f7ee604ea4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5054,6 +5054,27 @@ static void amdgpu_device_recheck_guilty_jobs(
 	}
 }
 
+static inline void amdggpu_device_stop_pedning_resets(struct amdgpu_device* adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+
+#if defined(CONFIG_DEBUG_FS)
+	if (!amdgpu_sriov_vf(adev))
+		cancel_work(&adev->reset_work);
+#endif
+
+	if (adev->kfd.dev)
+		cancel_work(&adev->kfd.reset_work);
+
+	if (amdgpu_sriov_vf(adev))
+		cancel_work(&adev->virt.flr_work);
+
+	if (con && adev->ras_enabled)
+		cancel_work(&con->recovery_work);
+
+}
+
+
 /**
  * amdgpu_device_gpu_recover - reset the asic and recover scheduler
  *
@@ -5209,6 +5230,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 				  r, adev_to_drm(tmp_adev)->unique);
 			tmp_adev->asic_reset_res = r;
 		}
+
+		/*
+		 * Drop all pending non scheduler resets. Scheduler resets
+		 * were already dropped during drm_sched_stop
+		 */
+		amdggpu_device_stop_pedning_resets(tmp_adev);
 	}
 
 	tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 2/7] drm/amdgpu: Cache result of last reset at reset domain level.
  2022-05-25 19:04 ` [PATCH v3 2/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
@ 2022-05-30  7:47   ` Christian König
  0 siblings, 0 replies; 19+ messages in thread
From: Christian König @ 2022-05-30  7:47 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar



Am 25.05.22 um 21:04 schrieb Andrey Grodzovsky:
> Will be read by executors of async reset like debugfs.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  | 1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 1 +
>   3 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 4daa0e893965..bfdd8883089a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5307,6 +5307,8 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>   
>   	if (r)
>   		dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
> +
> +	atomic_set(&adev->reset_domain->reset_res, r);
>   	return r;
>   }
>   
> @@ -5321,7 +5323,7 @@ static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
>   {
>   	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
>   
> -	recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
> +	amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>   }
>   /*
>    * Serialize gpu recover into reset domain single threaded wq
> @@ -5338,7 +5340,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   
>   	flush_work(&work.base);
>   
> -	return work.ret;
> +	return atomic_read(&adev->reset_domain->reset_res);
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index c80af0889773..32c86a0b145c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -132,6 +132,7 @@ struct amdgpu_reset_domain *amdgpu_reset_create_reset_domain(enum amdgpu_reset_d
>   	}
>   
>   	atomic_set(&reset_domain->in_gpu_reset, 0);
> +	atomic_set(&reset_domain->reset_res, 0);
>   	init_rwsem(&reset_domain->sem);
>   
>   	return reset_domain;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> index 1949dbe28a86..9e55a5d7a825 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> @@ -82,6 +82,7 @@ struct amdgpu_reset_domain {
>   	enum amdgpu_reset_domain_type type;
>   	struct rw_semaphore sem;
>   	atomic_t in_gpu_reset;
> +	atomic_t reset_res;

Maybe we should both atomics into "active" and "result" since they are 
already part of the reset domain.

But only a nit pick, feel free to add Reviewed-by: Christian König 
<christian.koenig@amd.com> either way.

Regards,
Christian.

>   };
>   
>   


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue.
  2022-05-25 19:04 ` [PATCH v3 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue Andrey Grodzovsky
@ 2022-05-30  7:49   ` Christian König
  2022-05-31  3:02     ` Luben Tuikov
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2022-05-30  7:49 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, Tuikov, Luben, lijo.lazar

Am 25.05.22 um 21:04 schrieb Andrey Grodzovsky:
> Save the extra usless work schedule.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Acked-by: Christian König <christian.koenig@amd.com>

Maybe Luben want to take a look as well, he has done some RAS review in 
the past.

Thanks,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 31207f7eec02..a439c04223b5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -35,6 +35,8 @@
>   #include "amdgpu_xgmi.h"
>   #include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
>   #include "atom.h"
> +#include "amdgpu_reset.h"
> +
>   #ifdef CONFIG_X86_MCE_AMD
>   #include <asm/mce.h>
>   
> @@ -1920,7 +1922,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
>   	}
>   
>   	if (amdgpu_device_should_recover_gpu(ras->adev))
> -		amdgpu_device_gpu_recover(ras->adev, NULL);
> +		amdgpu_device_gpu_recover_imp(ras->adev, NULL);
>   	atomic_set(&ras->in_recovery, 0);
>   }
>   
> @@ -2928,7 +2930,7 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
>   	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>   
>   	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
> -		schedule_work(&ras->recovery_work);
> +		amdgpu_reset_domain_schedule(ras->adev->reset_domain, &ras->recovery_work);
>   	return 0;
>   }
>   


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/7] drm/amdgpu: Add work_struct for GPU reset from debugfs
  2022-05-25 19:04 ` [PATCH v3 4/7] drm/amdgpu: Add work_struct for GPU reset from debugfs Andrey Grodzovsky
@ 2022-05-30  7:52   ` Christian König
  2022-05-30 15:46     ` Andrey Grodzovsky
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2022-05-30  7:52 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar



Am 25.05.22 um 21:04 schrieb Andrey Grodzovsky:
> We need to have a work_struct to cancel this reset if another
> already in progress.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h       |  2 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 +++++++++++++++++--
>   2 files changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 76df583663c7..8165ee5b0457 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1048,6 +1048,8 @@ struct amdgpu_device {
>   
>   	bool                            scpm_enabled;
>   	uint32_t                        scpm_status;
> +
> +	struct work_struct		reset_work;
>   };
>   
>   static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index d16c8c1f72db..b0498ffcf7c3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -39,6 +39,7 @@
>   #include <drm/drm_drv.h>
>   #include "amdgpu.h"
>   #include "amdgpu_trace.h"
> +#include "amdgpu_reset.h"
>   
>   /*
>    * Fences
> @@ -798,7 +799,10 @@ static int gpu_recover_get(void *data, u64 *val)
>   		return 0;
>   	}
>   
> -	*val = amdgpu_device_gpu_recover(adev, NULL);
> +	if (amdgpu_reset_domain_schedule(adev->reset_domain, &adev->reset_work))
> +		flush_work(&adev->reset_work);
> +
> +	*val = atomic_read(&adev->reset_domain->reset_res);
>   
>   	pm_runtime_mark_last_busy(dev->dev);
>   	pm_runtime_put_autosuspend(dev->dev);
> @@ -810,6 +814,14 @@ DEFINE_SHOW_ATTRIBUTE(amdgpu_debugfs_fence_info);
>   DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_gpu_recover_fops, gpu_recover_get, NULL,
>   			 "%lld\n");
>   
> +static void amdgpu_debugfs_reset_work(struct work_struct *work)
> +{
> +	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
> +						  reset_work);
> +
> +	amdgpu_device_gpu_recover_imp(adev, NULL);
> +}
> +
>   #endif
>   
>   void amdgpu_debugfs_fence_init(struct amdgpu_device *adev)
> @@ -821,9 +833,12 @@ void amdgpu_debugfs_fence_init(struct amdgpu_device *adev)
>   	debugfs_create_file("amdgpu_fence_info", 0444, root, adev,
>   			    &amdgpu_debugfs_fence_info_fops);
>   
> -	if (!amdgpu_sriov_vf(adev))
> +	if (!amdgpu_sriov_vf(adev)) {

I think we should drop the check for amdgpu_sriov_vf() here. It's a 
valid requirement to be able to trigger a GPU reset for a VF as well.

But not topic of this patch, feel free to add an Reviewed-by: Christian 
König <christian.koenig@amd.com>.

Regards,
Christian.

> +
> +		INIT_WORK(&adev->reset_work, amdgpu_debugfs_reset_work);
>   		debugfs_create_file("amdgpu_gpu_recover", 0444, root, adev,
>   				    &amdgpu_debugfs_gpu_recover_fops);
> +	}
>   #endif
>   }
>   


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 5/7] drm/amdgpu: Add work_struct for GPU reset from kfd.
  2022-05-25 19:04 ` [PATCH v3 5/7] drm/amdgpu: Add work_struct for GPU reset from kfd Andrey Grodzovsky
@ 2022-05-30  7:54   ` Christian König
  2022-05-31 15:31   ` Felix Kuehling
  1 sibling, 0 replies; 19+ messages in thread
From: Christian König @ 2022-05-30  7:54 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar

Am 25.05.22 um 21:04 schrieb Andrey Grodzovsky:
> We need to have a work_struct to cancel this reset if another
> already in progress.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 ++++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 31 ----------------------
>   3 files changed, 15 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 1f8161cd507f..a23abc0e86e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -33,6 +33,7 @@
>   #include <uapi/linux/kfd_ioctl.h>
>   #include "amdgpu_ras.h"
>   #include "amdgpu_umc.h"
> +#include "amdgpu_reset.h"
>   
>   /* Total memory size in system memory and all GPU VRAM. Used to
>    * estimate worst case amount of memory to reserve for page tables
> @@ -122,6 +123,15 @@ static void amdgpu_doorbell_get_kfd_info(struct amdgpu_device *adev,
>   	}
>   }
>   
> +
> +static void amdgpu_amdkfd_reset_work(struct work_struct *work)
> +{
> +	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
> +						  kfd.reset_work);
> +
> +	amdgpu_device_gpu_recover_imp(adev, NULL);
> +}
> +
>   void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>   {
>   	int i;
> @@ -180,6 +190,8 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>   
>   		adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
>   						adev_to_drm(adev), &gpu_resources);
> +
> +		INIT_WORK(&adev->kfd.reset_work, amdgpu_amdkfd_reset_work);
>   	}
>   }
>   
> @@ -247,7 +259,8 @@ int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev)
>   void amdgpu_amdkfd_gpu_reset(struct amdgpu_device *adev)
>   {
>   	if (amdgpu_device_should_recover_gpu(adev))
> -		amdgpu_device_gpu_recover(adev, NULL);
> +		amdgpu_reset_domain_schedule(adev->reset_domain,
> +					     &adev->kfd.reset_work);
>   }
>   
>   int amdgpu_amdkfd_alloc_gtt_mem(struct amdgpu_device *adev, size_t size,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index f8b9f27adcf5..e0709af5a326 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -96,6 +96,7 @@ struct amdgpu_kfd_dev {
>   	struct kfd_dev *dev;
>   	uint64_t vram_used;
>   	bool init_complete;
> +	struct work_struct reset_work;
>   };
>   
>   enum kgd_engine_type {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index bfdd8883089a..e3e2a5d17cc2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5312,37 +5312,6 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>   	return r;
>   }
>   
> -struct amdgpu_recover_work_struct {
> -	struct work_struct base;
> -	struct amdgpu_device *adev;
> -	struct amdgpu_job *job;
> -	int ret;
> -};
> -
> -static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
> -{
> -	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
> -
> -	amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
> -}
> -/*
> - * Serialize gpu recover into reset domain single threaded wq
> - */
> -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> -				    struct amdgpu_job *job)
> -{
> -	struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
> -
> -	INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
> -
> -	if (!amdgpu_reset_domain_schedule(adev->reset_domain, &work.base))
> -		return -EAGAIN;
> -
> -	flush_work(&work.base);
> -
> -	return atomic_read(&adev->reset_domain->reset_res);
> -}
> -
>   /**
>    * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
>    *


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover
  2022-05-25 19:04 ` [PATCH v3 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover Andrey Grodzovsky
@ 2022-05-30  7:55   ` Christian König
  0 siblings, 0 replies; 19+ messages in thread
From: Christian König @ 2022-05-30  7:55 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar

Am 25.05.22 um 21:04 schrieb Andrey Grodzovsky:
> We removed the wrapper that was queueing the recover function
> into reset domain queue who was using this name.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      | 2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      | 2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      | 2 +-
>   9 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 8165ee5b0457..664ed0a6deab 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1244,7 +1244,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev);
>   bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
>   int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   			      struct amdgpu_job* job);
> -int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   			      struct amdgpu_job *job);
>   void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
>   int amdgpu_device_pci_reset(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index a23abc0e86e7..513c57f839d8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -129,7 +129,7 @@ static void amdgpu_amdkfd_reset_work(struct work_struct *work)
>   	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
>   						  kfd.reset_work);
>   
> -	amdgpu_device_gpu_recover_imp(adev, NULL);
> +	amdgpu_device_gpu_recover(adev, NULL);
>   }
>   
>   void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e3e2a5d17cc2..424571e46cf5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5065,7 +5065,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>    * Returns 0 for success or an error on failure.
>    */
>   
> -int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
> +int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   			      struct amdgpu_job *job)
>   {
>   	struct list_head device_list, *device_list_handle =  NULL;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index b0498ffcf7c3..957437a5558c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -819,7 +819,7 @@ static void amdgpu_debugfs_reset_work(struct work_struct *work)
>   	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
>   						  reset_work);
>   
> -	amdgpu_device_gpu_recover_imp(adev, NULL);
> +	amdgpu_device_gpu_recover(adev, NULL);
>   }
>   
>   #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index dfe7f2b8f0aa..10aa073600d4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -64,7 +64,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   		  ti.process_name, ti.tgid, ti.task_name, ti.pid);
>   
>   	if (amdgpu_device_should_recover_gpu(ring->adev)) {
> -		r = amdgpu_device_gpu_recover_imp(ring->adev, job);
> +		r = amdgpu_device_gpu_recover(ring->adev, job);
>   		if (r)
>   			DRM_ERROR("GPU Recovery Failed: %d\n", r);
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index a439c04223b5..bc0049308207 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -1922,7 +1922,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
>   	}
>   
>   	if (amdgpu_device_should_recover_gpu(ras->adev))
> -		amdgpu_device_gpu_recover_imp(ras->adev, NULL);
> +		amdgpu_device_gpu_recover(ras->adev, NULL);
>   	atomic_set(&ras->in_recovery, 0);
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index b81acf59870c..7ec5b5cf4bb9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -284,7 +284,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>   	if (amdgpu_device_should_recover_gpu(adev)
>   		&& (!amdgpu_device_has_job_running(adev) ||
>   		adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT))
> -		amdgpu_device_gpu_recover_imp(adev, NULL);
> +		amdgpu_device_gpu_recover(adev, NULL);
>   }
>   
>   static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> index 22c10b97ea81..e18b75c8fde6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> @@ -311,7 +311,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>   		adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
>   		adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
>   		adev->video_timeout == MAX_SCHEDULE_TIMEOUT))
> -		amdgpu_device_gpu_recover_imp(adev, NULL);
> +		amdgpu_device_gpu_recover(adev, NULL);
>   }
>   
>   static int xgpu_nv_set_mailbox_rcv_irq(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> index 7b63d30b9b79..c5016a926331 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> @@ -523,7 +523,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
>   
>   	/* Trigger recovery due to world switch failure */
>   	if (amdgpu_device_should_recover_gpu(adev))
> -		amdgpu_device_gpu_recover_imp(adev, NULL);
> +		amdgpu_device_gpu_recover(adev, NULL);
>   }
>   
>   static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 7/7] drm/amdgpu: Stop any pending reset if another in progress.
  2022-05-25 19:04 ` [PATCH v3 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
@ 2022-05-30  7:56   ` Christian König
  2022-05-31 15:31   ` Felix Kuehling
  1 sibling, 0 replies; 19+ messages in thread
From: Christian König @ 2022-05-30  7:56 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar

Am 25.05.22 um 21:04 schrieb Andrey Grodzovsky:
> We skip rest requests if another one is already in progress.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++++++++++++++++++++++
>   1 file changed, 27 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 424571e46cf5..e1f7ee604ea4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5054,6 +5054,27 @@ static void amdgpu_device_recheck_guilty_jobs(
>   	}
>   }
>   
> +static inline void amdggpu_device_stop_pedning_resets(struct amdgpu_device* adev)
> +{
> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +
> +#if defined(CONFIG_DEBUG_FS)
> +	if (!amdgpu_sriov_vf(adev))
> +		cancel_work(&adev->reset_work);
> +#endif
> +
> +	if (adev->kfd.dev)
> +		cancel_work(&adev->kfd.reset_work);
> +
> +	if (amdgpu_sriov_vf(adev))
> +		cancel_work(&adev->virt.flr_work);
> +
> +	if (con && adev->ras_enabled)
> +		cancel_work(&con->recovery_work);
> +
> +}
> +
> +
>   /**
>    * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>    *
> @@ -5209,6 +5230,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   				  r, adev_to_drm(tmp_adev)->unique);
>   			tmp_adev->asic_reset_res = r;
>   		}
> +
> +		/*
> +		 * Drop all pending non scheduler resets. Scheduler resets
> +		 * were already dropped during drm_sched_stop
> +		 */
> +		amdggpu_device_stop_pedning_resets(tmp_adev);
>   	}
>   
>   	tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 4/7] drm/amdgpu: Add work_struct for GPU reset from debugfs
  2022-05-30  7:52   ` Christian König
@ 2022-05-30 15:46     ` Andrey Grodzovsky
  0 siblings, 0 replies; 19+ messages in thread
From: Andrey Grodzovsky @ 2022-05-30 15:46 UTC (permalink / raw)
  To: Christian König, amd-gfx, Liu, Monk; +Cc: Zoy.Bai, lijo.lazar

+ Monk

On 2022-05-30 03:52, Christian König wrote:
> 
> 
> Am 25.05.22 um 21:04 schrieb Andrey Grodzovsky:
>> We need to have a work_struct to cancel this reset if another
>> already in progress.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h       |  2 ++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 +++++++++++++++++--
>>   2 files changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 76df583663c7..8165ee5b0457 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -1048,6 +1048,8 @@ struct amdgpu_device {
>>       bool                            scpm_enabled;
>>       uint32_t                        scpm_status;
>> +
>> +    struct work_struct        reset_work;
>>   };
>>   static inline struct amdgpu_device *drm_to_adev(struct drm_device 
>> *ddev)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index d16c8c1f72db..b0498ffcf7c3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -39,6 +39,7 @@
>>   #include <drm/drm_drv.h>
>>   #include "amdgpu.h"
>>   #include "amdgpu_trace.h"
>> +#include "amdgpu_reset.h"
>>   /*
>>    * Fences
>> @@ -798,7 +799,10 @@ static int gpu_recover_get(void *data, u64 *val)
>>           return 0;
>>       }
>> -    *val = amdgpu_device_gpu_recover(adev, NULL);
>> +    if (amdgpu_reset_domain_schedule(adev->reset_domain, 
>> &adev->reset_work))
>> +        flush_work(&adev->reset_work);
>> +
>> +    *val = atomic_read(&adev->reset_domain->reset_res);
>>       pm_runtime_mark_last_busy(dev->dev);
>>       pm_runtime_put_autosuspend(dev->dev);
>> @@ -810,6 +814,14 @@ DEFINE_SHOW_ATTRIBUTE(amdgpu_debugfs_fence_info);
>>   DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_gpu_recover_fops, 
>> gpu_recover_get, NULL,
>>                "%lld\n");
>> +static void amdgpu_debugfs_reset_work(struct work_struct *work)
>> +{
>> +    struct amdgpu_device *adev = container_of(work, struct 
>> amdgpu_device,
>> +                          reset_work);
>> +
>> +    amdgpu_device_gpu_recover_imp(adev, NULL);
>> +}
>> +
>>   #endif
>>   void amdgpu_debugfs_fence_init(struct amdgpu_device *adev)
>> @@ -821,9 +833,12 @@ void amdgpu_debugfs_fence_init(struct 
>> amdgpu_device *adev)
>>       debugfs_create_file("amdgpu_fence_info", 0444, root, adev,
>>                   &amdgpu_debugfs_fence_info_fops);
>> -    if (!amdgpu_sriov_vf(adev))
>> +    if (!amdgpu_sriov_vf(adev)) {
> 
> I think we should drop the check for amdgpu_sriov_vf() here. It's a 
> valid requirement to be able to trigger a GPU reset for a VF as well.
> 
> But not topic of this patch, feel free to add an Reviewed-by: Christian 
> König <christian.koenig@amd.com>.
> 
> Regards,
> Christian.

Monk - any idea why we prevent from creation of debugfs GPU reset for VF ?

Andrey

> 
>> +
>> +        INIT_WORK(&adev->reset_work, amdgpu_debugfs_reset_work);
>>           debugfs_create_file("amdgpu_gpu_recover", 0444, root, adev,
>>                       &amdgpu_debugfs_gpu_recover_fops);
>> +    }
>>   #endif
>>   }
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue.
  2022-05-30  7:49   ` Christian König
@ 2022-05-31  3:02     ` Luben Tuikov
  0 siblings, 0 replies; 19+ messages in thread
From: Luben Tuikov @ 2022-05-31  3:02 UTC (permalink / raw)
  To: Christian König, Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar

On 2022-05-30 03:49, Christian König wrote:
> Am 25.05.22 um 21:04 schrieb Andrey Grodzovsky:
>> Save the extra usless work schedule.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> 
> Acked-by: Christian König <christian.koenig@amd.com>
> 
> Maybe Luben want to take a look as well, he has done some RAS review in 
> the past.

Looks good to me. I understand this is a change to sync up with the other changes
Andrey has been doing to GPU recovery.

We'll need to test this "in the wild", and thoroughly.

Acked-by: Luben Tuikov <luben.tuikov@amd.com>

Regards,
Luben

> 
> Thanks,
> Christian.
> 
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> index 31207f7eec02..a439c04223b5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
>> @@ -35,6 +35,8 @@
>>   #include "amdgpu_xgmi.h"
>>   #include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
>>   #include "atom.h"
>> +#include "amdgpu_reset.h"
>> +
>>   #ifdef CONFIG_X86_MCE_AMD
>>   #include <asm/mce.h>
>>   
>> @@ -1920,7 +1922,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
>>   	}
>>   
>>   	if (amdgpu_device_should_recover_gpu(ras->adev))
>> -		amdgpu_device_gpu_recover(ras->adev, NULL);
>> +		amdgpu_device_gpu_recover_imp(ras->adev, NULL);
>>   	atomic_set(&ras->in_recovery, 0);
>>   }
>>   
>> @@ -2928,7 +2930,7 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
>>   	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
>>   
>>   	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
>> -		schedule_work(&ras->recovery_work);
>> +		amdgpu_reset_domain_schedule(ras->adev->reset_domain, &ras->recovery_work);
>>   	return 0;
>>   }
>>   
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 7/7] drm/amdgpu: Stop any pending reset if another in progress.
  2022-05-25 19:04 ` [PATCH v3 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
  2022-05-30  7:56   ` Christian König
@ 2022-05-31 15:31   ` Felix Kuehling
  2022-05-31 15:35     ` Felix Kuehling
  1 sibling, 1 reply; 19+ messages in thread
From: Felix Kuehling @ 2022-05-31 15:31 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar, Christian.Koenig

Am 2022-05-25 um 15:04 schrieb Andrey Grodzovsky:
> We skip rest requests if another one is already in progress.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++++++++++++++++++++++
>   1 file changed, 27 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 424571e46cf5..e1f7ee604ea4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5054,6 +5054,27 @@ static void amdgpu_device_recheck_guilty_jobs(
>   	}
>   }
>   
> +static inline void amdggpu_device_stop_pedning_resets(struct amdgpu_device* adev)

Typo: pedning -> pending


> +{
> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +
> +#if defined(CONFIG_DEBUG_FS)
> +	if (!amdgpu_sriov_vf(adev))
> +		cancel_work(&adev->reset_work);
> +#endif
> +
> +	if (adev->kfd.dev)
> +		cancel_work(&adev->kfd.reset_work);

Do you also need to cancel resets from other GPUs in the same hive?

Regards,
   Felix


> +
> +	if (amdgpu_sriov_vf(adev))
> +		cancel_work(&adev->virt.flr_work);
> +
> +	if (con && adev->ras_enabled)
> +		cancel_work(&con->recovery_work);
> +
> +}
> +
> +
>   /**
>    * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>    *
> @@ -5209,6 +5230,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   				  r, adev_to_drm(tmp_adev)->unique);
>   			tmp_adev->asic_reset_res = r;
>   		}
> +
> +		/*
> +		 * Drop all pending non scheduler resets. Scheduler resets
> +		 * were already dropped during drm_sched_stop
> +		 */
> +		amdggpu_device_stop_pedning_resets(tmp_adev);
>   	}
>   
>   	tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 5/7] drm/amdgpu: Add work_struct for GPU reset from kfd.
  2022-05-25 19:04 ` [PATCH v3 5/7] drm/amdgpu: Add work_struct for GPU reset from kfd Andrey Grodzovsky
  2022-05-30  7:54   ` Christian König
@ 2022-05-31 15:31   ` Felix Kuehling
  1 sibling, 0 replies; 19+ messages in thread
From: Felix Kuehling @ 2022-05-31 15:31 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar, Christian.Koenig


Am 2022-05-25 um 15:04 schrieb Andrey Grodzovsky:
> We need to have a work_struct to cancel this reset if another
> already in progress.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 ++++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 31 ----------------------
>   3 files changed, 15 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 1f8161cd507f..a23abc0e86e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -33,6 +33,7 @@
>   #include <uapi/linux/kfd_ioctl.h>
>   #include "amdgpu_ras.h"
>   #include "amdgpu_umc.h"
> +#include "amdgpu_reset.h"
>   
>   /* Total memory size in system memory and all GPU VRAM. Used to
>    * estimate worst case amount of memory to reserve for page tables
> @@ -122,6 +123,15 @@ static void amdgpu_doorbell_get_kfd_info(struct amdgpu_device *adev,
>   	}
>   }
>   
> +
> +static void amdgpu_amdkfd_reset_work(struct work_struct *work)
> +{
> +	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
> +						  kfd.reset_work);
> +
> +	amdgpu_device_gpu_recover_imp(adev, NULL);
> +}
> +
>   void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>   {
>   	int i;
> @@ -180,6 +190,8 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
>   
>   		adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
>   						adev_to_drm(adev), &gpu_resources);
> +
> +		INIT_WORK(&adev->kfd.reset_work, amdgpu_amdkfd_reset_work);
>   	}
>   }
>   
> @@ -247,7 +259,8 @@ int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev)
>   void amdgpu_amdkfd_gpu_reset(struct amdgpu_device *adev)
>   {
>   	if (amdgpu_device_should_recover_gpu(adev))
> -		amdgpu_device_gpu_recover(adev, NULL);
> +		amdgpu_reset_domain_schedule(adev->reset_domain,
> +					     &adev->kfd.reset_work);
>   }
>   
>   int amdgpu_amdkfd_alloc_gtt_mem(struct amdgpu_device *adev, size_t size,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> index f8b9f27adcf5..e0709af5a326 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
> @@ -96,6 +96,7 @@ struct amdgpu_kfd_dev {
>   	struct kfd_dev *dev;
>   	uint64_t vram_used;
>   	bool init_complete;
> +	struct work_struct reset_work;
>   };
>   
>   enum kgd_engine_type {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index bfdd8883089a..e3e2a5d17cc2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5312,37 +5312,6 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>   	return r;
>   }
>   
> -struct amdgpu_recover_work_struct {
> -	struct work_struct base;
> -	struct amdgpu_device *adev;
> -	struct amdgpu_job *job;
> -	int ret;
> -};
> -
> -static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
> -{
> -	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
> -
> -	amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
> -}
> -/*
> - * Serialize gpu recover into reset domain single threaded wq
> - */
> -int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> -				    struct amdgpu_job *job)
> -{
> -	struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
> -
> -	INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
> -
> -	if (!amdgpu_reset_domain_schedule(adev->reset_domain, &work.base))
> -		return -EAGAIN;
> -
> -	flush_work(&work.base);
> -
> -	return atomic_read(&adev->reset_domain->reset_res);
> -}
> -
>   /**
>    * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
>    *

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 7/7] drm/amdgpu: Stop any pending reset if another in progress.
  2022-05-31 15:31   ` Felix Kuehling
@ 2022-05-31 15:35     ` Felix Kuehling
  0 siblings, 0 replies; 19+ messages in thread
From: Felix Kuehling @ 2022-05-31 15:35 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar, Christian.Koenig

Am 2022-05-31 um 11:31 schrieb Felix Kuehling:
> Am 2022-05-25 um 15:04 schrieb Andrey Grodzovsky:
>> We skip rest requests if another one is already in progress.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++++++++++++++++++++++
>>   1 file changed, 27 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 424571e46cf5..e1f7ee604ea4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -5054,6 +5054,27 @@ static void amdgpu_device_recheck_guilty_jobs(
>>       }
>>   }
>>   +static inline void amdggpu_device_stop_pedning_resets(struct 
>> amdgpu_device* adev)
>
> Typo: pedning -> pending
>
>
>> +{
>> +    struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
>> +
>> +#if defined(CONFIG_DEBUG_FS)
>> +    if (!amdgpu_sriov_vf(adev))
>> +        cancel_work(&adev->reset_work);
>> +#endif
>> +
>> +    if (adev->kfd.dev)
>> +        cancel_work(&adev->kfd.reset_work);
>
> Do you also need to cancel resets from other GPUs in the same hive?

Never mind. I see this is called in a loop over the GPUs in 
amdgpu_device_gpu_recover.

Other than the typo, this patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


>
> Regards,
>   Felix
>
>
>> +
>> +    if (amdgpu_sriov_vf(adev))
>> +        cancel_work(&adev->virt.flr_work);
>> +
>> +    if (con && adev->ras_enabled)
>> +        cancel_work(&con->recovery_work);
>> +
>> +}
>> +
>> +
>>   /**
>>    * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>>    *
>> @@ -5209,6 +5230,12 @@ int amdgpu_device_gpu_recover(struct 
>> amdgpu_device *adev,
>>                     r, adev_to_drm(tmp_adev)->unique);
>>               tmp_adev->asic_reset_res = r;
>>           }
>> +
>> +        /*
>> +         * Drop all pending non scheduler resets. Scheduler resets
>> +         * were already dropped during drm_sched_stop
>> +         */
>> +        amdggpu_device_stop_pedning_resets(tmp_adev);
>>       }
>>         tmp_vram_lost_counter = 
>> atomic_read(&((adev)->vram_lost_counter));

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-05-31 15:35 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-25 19:04 [PATCH v3 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
2022-05-25 19:04 ` [PATCH v3 1/7] Revert "workqueue: remove unused cancel_work()" Andrey Grodzovsky
2022-05-25 19:04 ` [PATCH v3 2/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
2022-05-30  7:47   ` Christian König
2022-05-25 19:04 ` [PATCH v3 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue Andrey Grodzovsky
2022-05-30  7:49   ` Christian König
2022-05-31  3:02     ` Luben Tuikov
2022-05-25 19:04 ` [PATCH v3 4/7] drm/amdgpu: Add work_struct for GPU reset from debugfs Andrey Grodzovsky
2022-05-30  7:52   ` Christian König
2022-05-30 15:46     ` Andrey Grodzovsky
2022-05-25 19:04 ` [PATCH v3 5/7] drm/amdgpu: Add work_struct for GPU reset from kfd Andrey Grodzovsky
2022-05-30  7:54   ` Christian König
2022-05-31 15:31   ` Felix Kuehling
2022-05-25 19:04 ` [PATCH v3 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover Andrey Grodzovsky
2022-05-30  7:55   ` Christian König
2022-05-25 19:04 ` [PATCH v3 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
2022-05-30  7:56   ` Christian König
2022-05-31 15:31   ` Felix Kuehling
2022-05-31 15:35     ` Felix Kuehling

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.