* [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
@ 2022-05-17 19:20 Andrey Grodzovsky
  2022-05-17 19:20 ` [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:20 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

Problem:
During a hive reset caused by a command timing out on a ring,
extra resets are triggered by KFD, which is unable to access
registers on the resetting ASIC.

Fix: Rework GPU reset to actively stop any pending reset
work while another is in progress.

v2: Switch from the generic list used in v1 [1] to explicit
stopping of each reset request from each reset source,
per request submitter.

[1] - https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/

Andrey Grodzovsky (7):
  drm/amdgpu: Cache result of last reset at reset domain level.
  drm/amdgpu: Switch to delayed work from work_struct.
  drm/admgpu: Serialize RAS recovery work directly into reset domain
    queue.
  drm/amdgpu: Add delayed work for GPU reset from debugfs
  drm/amdgpu: Add delayed work for GPU reset from kfd.
  drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
    amdgpu_device_gpu_recover
  drm/amdgpu: Stop any pending reset if another in progress.

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +--
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +--
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +--
 14 files changed, 87 insertions(+), 54 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level.
  2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
@ 2022-05-17 19:20 ` Andrey Grodzovsky
  2022-05-18  6:02   ` Christian König
  2022-05-17 19:20 ` [PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct Andrey Grodzovsky
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:20 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

The cached result will be read by submitters of async resets, such as debugfs.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b583026dc893..3948e7db6ad7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5303,6 +5303,8 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
 
 	if (r)
 		dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
+
+	atomic_set(&adev->reset_domain->reset_res, r);
 	return r;
 }
 
@@ -5317,7 +5319,7 @@ static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
 {
 	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
 
-	recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
+	amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
 }
 /*
  * Serialize gpu recover into reset domain single threaded wq
@@ -5334,7 +5336,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 	flush_work(&work.base);
 
-	return work.ret;
+	return atomic_read(&adev->reset_domain->reset_res);
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index c80af0889773..32c86a0b145c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -132,6 +132,7 @@ struct amdgpu_reset_domain *amdgpu_reset_create_reset_domain(enum amdgpu_reset_d
 	}
 
 	atomic_set(&reset_domain->in_gpu_reset, 0);
+	atomic_set(&reset_domain->reset_res, 0);
 	init_rwsem(&reset_domain->sem);
 
 	return reset_domain;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 1949dbe28a86..9e55a5d7a825 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -82,6 +82,7 @@ struct amdgpu_reset_domain {
 	enum amdgpu_reset_domain_type type;
 	struct rw_semaphore sem;
 	atomic_t in_gpu_reset;
+	atomic_t reset_res;
 };
 
 
-- 
2.25.1



* [PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct.
  2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
  2022-05-17 19:20 ` [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
@ 2022-05-17 19:20 ` Andrey Grodzovsky
  2022-05-18  6:03   ` Christian König
  2022-05-17 19:20 ` [PATCH v2 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue Andrey Grodzovsky
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:20 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

We need to be able to cancel pending reset work from within
GPU reset without blocking. Currently the kernel API allows
this only for delayed_work and not for work_struct, so switch
to delayed work and queue it with a delay of 0, which is
equivalent to queueing a work_struct.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 4 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   | 2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      | 4 ++--
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      | 4 ++--
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      | 4 ++--
 6 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3948e7db6ad7..ea41edf52a6f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5309,7 +5309,7 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
 }
 
 struct amdgpu_recover_work_struct {
-	struct work_struct base;
+	struct delayed_work base;
 	struct amdgpu_device *adev;
 	struct amdgpu_job *job;
 	int ret;
@@ -5317,7 +5317,7 @@ struct amdgpu_recover_work_struct {
 
 static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
 {
-	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
+	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base.work);
 
 	amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
 }
@@ -5329,12 +5329,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 {
 	struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
 
-	INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
+	INIT_DELAYED_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
 
 	if (!amdgpu_reset_domain_schedule(adev->reset_domain, &work.base))
 		return -EAGAIN;
 
-	flush_work(&work.base);
+	flush_delayed_work(&work.base);
 
 	return atomic_read(&adev->reset_domain->reset_res);
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 9e55a5d7a825..fee9376d495a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -114,9 +114,9 @@ static inline void amdgpu_reset_put_reset_domain(struct amdgpu_reset_domain *dom
 }
 
 static inline bool amdgpu_reset_domain_schedule(struct amdgpu_reset_domain *domain,
-						struct work_struct *work)
+						struct delayed_work *work)
 {
-	return queue_work(domain->wq, work);
+	return queue_delayed_work(domain->wq, work, 0);
 }
 
 void amdgpu_device_lock_reset_domain(struct amdgpu_reset_domain *reset_domain);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index 239f232f9c02..b53b7a794459 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -230,7 +230,7 @@ struct amdgpu_virt {
 	uint32_t			reg_val_offs;
 	struct amdgpu_irq_src		ack_irq;
 	struct amdgpu_irq_src		rcv_irq;
-	struct work_struct		flr_work;
+	struct delayed_work		flr_work;
 	struct amdgpu_mm_table		mm_table;
 	const struct amdgpu_virt_ops	*ops;
 	struct amdgpu_vf_error_buffer	vf_errors;
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index b81acf59870c..aa5f6d6ea1e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -251,7 +251,7 @@ static int xgpu_ai_set_mailbox_ack_irq(struct amdgpu_device *adev,
 
 static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 {
-	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
+	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work.work);
 	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
 
@@ -380,7 +380,7 @@ int xgpu_ai_mailbox_get_irq(struct amdgpu_device *adev)
 		return r;
 	}
 
-	INIT_WORK(&adev->virt.flr_work, xgpu_ai_mailbox_flr_work);
+	INIT_DELAYED_WORK(&adev->virt.flr_work, xgpu_ai_mailbox_flr_work);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 22c10b97ea81..dd9f6b6f62f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -275,7 +275,7 @@ static int xgpu_nv_set_mailbox_ack_irq(struct amdgpu_device *adev,
 
 static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 {
-	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
+	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work.work);
 	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
 
@@ -407,7 +407,7 @@ int xgpu_nv_mailbox_get_irq(struct amdgpu_device *adev)
 		return r;
 	}
 
-	INIT_WORK(&adev->virt.flr_work, xgpu_nv_mailbox_flr_work);
+	INIT_DELAYED_WORK(&adev->virt.flr_work, xgpu_nv_mailbox_flr_work);
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index 7b63d30b9b79..b11831da1b13 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -512,7 +512,7 @@ static int xgpu_vi_set_mailbox_ack_irq(struct amdgpu_device *adev,
 
 static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
 {
-	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
+	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work.work);
 	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 
 	/* wait until RCV_MSG become 3 */
@@ -610,7 +610,7 @@ int xgpu_vi_mailbox_get_irq(struct amdgpu_device *adev)
 		return r;
 	}
 
-	INIT_WORK(&adev->virt.flr_work, xgpu_vi_mailbox_flr_work);
+	INIT_DELAYED_WORK(&adev->virt.flr_work, xgpu_vi_mailbox_flr_work);
 
 	return 0;
 }
-- 
2.25.1



* [PATCH v2 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue.
  2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
  2022-05-17 19:20 ` [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
  2022-05-17 19:20 ` [PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct Andrey Grodzovsky
@ 2022-05-17 19:20 ` Andrey Grodzovsky
  2022-05-17 19:20 ` [PATCH v2 4/7] drm/amdgpu: Add delayed work for GPU reset from debugfs Andrey Grodzovsky
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:20 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

This saves an extra, useless work schedule. Also switch to delayed work.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 12 +++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h |  2 +-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index a653cf3b3d13..7e8c7bcc7303 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -35,6 +35,8 @@
 #include "amdgpu_xgmi.h"
 #include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
 #include "atom.h"
+#include "amdgpu_reset.h"
+
 #ifdef CONFIG_X86_MCE_AMD
 #include <asm/mce.h>
 
@@ -1889,7 +1891,7 @@ static int amdgpu_ras_badpages_read(struct amdgpu_device *adev,
 static void amdgpu_ras_do_recovery(struct work_struct *work)
 {
 	struct amdgpu_ras *ras =
-		container_of(work, struct amdgpu_ras, recovery_work);
+		container_of(work, struct amdgpu_ras, recovery_work.work);
 	struct amdgpu_device *remote_adev = NULL;
 	struct amdgpu_device *adev = ras->adev;
 	struct list_head device_list, *device_list_handle =  NULL;
@@ -1916,7 +1918,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
 	}
 
 	if (amdgpu_device_should_recover_gpu(ras->adev))
-		amdgpu_device_gpu_recover(ras->adev, NULL);
+		amdgpu_device_gpu_recover_imp(ras->adev, NULL);
 	atomic_set(&ras->in_recovery, 0);
 }
 
@@ -2148,7 +2150,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
 	}
 
 	mutex_init(&con->recovery_lock);
-	INIT_WORK(&con->recovery_work, amdgpu_ras_do_recovery);
+	INIT_DELAYED_WORK(&con->recovery_work, amdgpu_ras_do_recovery);
 	atomic_set(&con->in_recovery, 0);
 	con->eeprom_control.bad_channel_bitmap = 0;
 
@@ -2217,7 +2219,7 @@ static int amdgpu_ras_recovery_fini(struct amdgpu_device *adev)
 	if (!data)
 		return 0;
 
-	cancel_work_sync(&con->recovery_work);
+	cancel_delayed_work_sync(&con->recovery_work);
 
 	mutex_lock(&con->recovery_lock);
 	con->eh_data = NULL;
@@ -2910,7 +2912,7 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
 	struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
 
 	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
-		schedule_work(&ras->recovery_work);
+		amdgpu_reset_domain_schedule(ras->adev->reset_domain, &ras->recovery_work);
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index b9a6fac2b8b2..f7e21c2abc61 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -347,7 +347,7 @@ struct amdgpu_ras {
 	struct ras_manager *objs;
 
 	/* gpu recovery */
-	struct work_struct recovery_work;
+	struct delayed_work recovery_work;
 	atomic_t in_recovery;
 	struct amdgpu_device *adev;
 	/* error handler data */
-- 
2.25.1



* [PATCH v2 4/7] drm/amdgpu: Add delayed work for GPU reset from debugfs
  2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
                   ` (2 preceding siblings ...)
  2022-05-17 19:20 ` [PATCH v2 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue Andrey Grodzovsky
@ 2022-05-17 19:20 ` Andrey Grodzovsky
  2022-05-17 19:21 ` [PATCH v2 5/7] drm/amdgpu: Add delayed work for GPU reset from kfd Andrey Grodzovsky
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:20 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

We need a delayed work item so that this reset can be cancelled
if another is already in progress.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h       |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 +++++++++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 3c20c2eadf4e..4ef17c6d1a50 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1048,6 +1048,8 @@ struct amdgpu_device {
 
 	bool                            scpm_enabled;
 	uint32_t                        scpm_status;
+
+	struct delayed_work		reset_work;
 };
 
 static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index d16c8c1f72db..f980f1501c48 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -39,6 +39,7 @@
 #include <drm/drm_drv.h>
 #include "amdgpu.h"
 #include "amdgpu_trace.h"
+#include "amdgpu_reset.h"
 
 /*
  * Fences
@@ -798,7 +799,10 @@ static int gpu_recover_get(void *data, u64 *val)
 		return 0;
 	}
 
-	*val = amdgpu_device_gpu_recover(adev, NULL);
+	if (amdgpu_reset_domain_schedule(adev->reset_domain, &adev->reset_work))
+		flush_delayed_work(&adev->reset_work);
+
+	*val = atomic_read(&adev->reset_domain->reset_res);
 
 	pm_runtime_mark_last_busy(dev->dev);
 	pm_runtime_put_autosuspend(dev->dev);
@@ -810,6 +814,14 @@ DEFINE_SHOW_ATTRIBUTE(amdgpu_debugfs_fence_info);
 DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_gpu_recover_fops, gpu_recover_get, NULL,
 			 "%lld\n");
 
+static void amdgpu_debugfs_reset_work(struct work_struct *work)
+{
+	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
+						  reset_work.work);
+
+	amdgpu_device_gpu_recover_imp(adev, NULL);
+}
+
 #endif
 
 void amdgpu_debugfs_fence_init(struct amdgpu_device *adev)
@@ -821,9 +833,12 @@ void amdgpu_debugfs_fence_init(struct amdgpu_device *adev)
 	debugfs_create_file("amdgpu_fence_info", 0444, root, adev,
 			    &amdgpu_debugfs_fence_info_fops);
 
-	if (!amdgpu_sriov_vf(adev))
+	if (!amdgpu_sriov_vf(adev)) {
+
+		INIT_DELAYED_WORK(&adev->reset_work, amdgpu_debugfs_reset_work);
 		debugfs_create_file("amdgpu_gpu_recover", 0444, root, adev,
 				    &amdgpu_debugfs_gpu_recover_fops);
+	}
 #endif
 }
 
-- 
2.25.1



* [PATCH v2 5/7] drm/amdgpu: Add delayed work for GPU reset from kfd.
  2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
                   ` (3 preceding siblings ...)
  2022-05-17 19:20 ` [PATCH v2 4/7] drm/amdgpu: Add delayed work for GPU reset from debugfs Andrey Grodzovsky
@ 2022-05-17 19:21 ` Andrey Grodzovsky
  2022-05-17 19:21 ` [PATCH v2 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover Andrey Grodzovsky
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

We need a delayed work item so that this reset can be cancelled
if another is already in progress.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 ++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 31 ----------------------
 3 files changed, 15 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 1f8161cd507f..4cc846341394 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -33,6 +33,7 @@
 #include <uapi/linux/kfd_ioctl.h>
 #include "amdgpu_ras.h"
 #include "amdgpu_umc.h"
+#include "amdgpu_reset.h"
 
 /* Total memory size in system memory and all GPU VRAM. Used to
  * estimate worst case amount of memory to reserve for page tables
@@ -122,6 +123,15 @@ static void amdgpu_doorbell_get_kfd_info(struct amdgpu_device *adev,
 	}
 }
 
+
+static void amdgpu_amdkfd_reset_work(struct work_struct *work)
+{
+	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
+						  kfd.reset_work.work);
+
+	amdgpu_device_gpu_recover_imp(adev, NULL);
+}
+
 void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 {
 	int i;
@@ -180,6 +190,8 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
 
 		adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
 						adev_to_drm(adev), &gpu_resources);
+
+		INIT_DELAYED_WORK(&adev->kfd.reset_work, amdgpu_amdkfd_reset_work);
 	}
 }
 
@@ -247,7 +259,8 @@ int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev)
 void amdgpu_amdkfd_gpu_reset(struct amdgpu_device *adev)
 {
 	if (amdgpu_device_should_recover_gpu(adev))
-		amdgpu_device_gpu_recover(adev, NULL);
+		amdgpu_reset_domain_schedule(adev->reset_domain,
+					     &adev->kfd.reset_work);
 }
 
 int amdgpu_amdkfd_alloc_gtt_mem(struct amdgpu_device *adev, size_t size,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..5e04dba8c7f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -96,6 +96,7 @@ struct amdgpu_kfd_dev {
 	struct kfd_dev *dev;
 	uint64_t vram_used;
 	bool init_complete;
+	struct delayed_work reset_work;
 };
 
 enum kgd_engine_type {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ea41edf52a6f..ae4c37c89ac7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5308,37 +5308,6 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
 	return r;
 }
 
-struct amdgpu_recover_work_struct {
-	struct delayed_work base;
-	struct amdgpu_device *adev;
-	struct amdgpu_job *job;
-	int ret;
-};
-
-static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
-{
-	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base.work);
-
-	amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
-}
-/*
- * Serialize gpu recover into reset domain single threaded wq
- */
-int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
-				    struct amdgpu_job *job)
-{
-	struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
-
-	INIT_DELAYED_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
-
-	if (!amdgpu_reset_domain_schedule(adev->reset_domain, &work.base))
-		return -EAGAIN;
-
-	flush_delayed_work(&work.base);
-
-	return atomic_read(&adev->reset_domain->reset_res);
-}
-
 /**
  * amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
  *
-- 
2.25.1



* [PATCH v2 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover
  2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
                   ` (4 preceding siblings ...)
  2022-05-17 19:21 ` [PATCH v2 5/7] drm/amdgpu: Add delayed work for GPU reset from kfd Andrey Grodzovsky
@ 2022-05-17 19:21 ` Andrey Grodzovsky
  2022-05-17 19:21 ` [PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
  2022-05-18  6:07 ` [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Christian König
  7 siblings, 0 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

We removed the wrapper that queued the recover function into
the reset domain queue and that was using this name.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      | 2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      | 2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      | 2 +-
 9 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 4ef17c6d1a50..ee668f253c7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1244,7 +1244,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev);
 bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
 int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			      struct amdgpu_job* job);
-int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
+int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			      struct amdgpu_job *job);
 void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
 int amdgpu_device_pci_reset(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 4cc846341394..434053a9e027 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -129,7 +129,7 @@ static void amdgpu_amdkfd_reset_work(struct work_struct *work)
 	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
 						  kfd.reset_work.work);
 
-	amdgpu_device_gpu_recover_imp(adev, NULL);
+	amdgpu_device_gpu_recover(adev, NULL);
 }
 
 void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ae4c37c89ac7..65f738fd4761 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5065,7 +5065,7 @@ static void amdgpu_device_recheck_guilty_jobs(
  * Returns 0 for success or an error on failure.
  */
 
-int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
+int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			      struct amdgpu_job *job)
 {
 	struct list_head device_list, *device_list_handle =  NULL;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index f980f1501c48..7954ebf16885 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -819,7 +819,7 @@ static void amdgpu_debugfs_reset_work(struct work_struct *work)
 	struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
 						  reset_work.work);
 
-	amdgpu_device_gpu_recover_imp(adev, NULL);
+	amdgpu_device_gpu_recover(adev, NULL);
 }
 
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index dfe7f2b8f0aa..10aa073600d4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -64,7 +64,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		  ti.process_name, ti.tgid, ti.task_name, ti.pid);
 
 	if (amdgpu_device_should_recover_gpu(ring->adev)) {
-		r = amdgpu_device_gpu_recover_imp(ring->adev, job);
+		r = amdgpu_device_gpu_recover(ring->adev, job);
 		if (r)
 			DRM_ERROR("GPU Recovery Failed: %d\n", r);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 7e8c7bcc7303..221d24feb8c9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1918,7 +1918,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
 	}
 
 	if (amdgpu_device_should_recover_gpu(ras->adev))
-		amdgpu_device_gpu_recover_imp(ras->adev, NULL);
+		amdgpu_device_gpu_recover(ras->adev, NULL);
 	atomic_set(&ras->in_recovery, 0);
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index aa5f6d6ea1e3..3b7d9f171793 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -284,7 +284,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	if (amdgpu_device_should_recover_gpu(adev)
 		&& (!amdgpu_device_has_job_running(adev) ||
 		adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT))
-		amdgpu_device_gpu_recover_imp(adev, NULL);
+		amdgpu_device_gpu_recover(adev, NULL);
 }
 
 static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index dd9f6b6f62f5..82126b48477d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -311,7 +311,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 		adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
 		adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
 		adev->video_timeout == MAX_SCHEDULE_TIMEOUT))
-		amdgpu_device_gpu_recover_imp(adev, NULL);
+		amdgpu_device_gpu_recover(adev, NULL);
 }
 
 static int xgpu_nv_set_mailbox_rcv_irq(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index b11831da1b13..d4f721cd34d1 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -523,7 +523,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
 
 	/* Trigger recovery due to world switch failure */
 	if (amdgpu_device_should_recover_gpu(adev))
-		amdgpu_device_gpu_recover_imp(adev, NULL);
+		amdgpu_device_gpu_recover(adev, NULL);
 }
 
 static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,
-- 
2.25.1



* [PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress.
  2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
                   ` (5 preceding siblings ...)
  2022-05-17 19:21 ` [PATCH v2 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover Andrey Grodzovsky
@ 2022-05-17 19:21 ` Andrey Grodzovsky
  2022-05-17 20:56   ` Felix Kuehling
  2022-05-18  6:07 ` [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Christian König
  7 siblings, 1 reply; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig

We skip reset requests if another one is already in progress.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 65f738fd4761..43af5ea3eee5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5054,6 +5054,27 @@ static void amdgpu_device_recheck_guilty_jobs(
 	}
 }
 
+static inline void amdggpu_device_stop_pedning_resets(struct amdgpu_device* adev)
+{
+	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+
+#if defined(CONFIG_DEBUG_FS)
+	if (!amdgpu_sriov_vf(adev))
+		cancel_delayed_work(&adev->reset_work);
+#endif
+
+	if (adev->kfd.dev)
+		cancel_delayed_work(&adev->kfd.reset_work);
+
+	if (amdgpu_sriov_vf(adev))
+		cancel_delayed_work(&adev->virt.flr_work);
+
+	if (con && adev->ras_enabled)
+		cancel_delayed_work(&con->recovery_work);
+
+}
+
+
 /**
  * amdgpu_device_gpu_recover - reset the asic and recover scheduler
  *
@@ -5209,6 +5230,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 				  r, adev_to_drm(tmp_adev)->unique);
 			tmp_adev->asic_reset_res = r;
 		}
+
+		/*
+		 * Drop all pending non scheduler resets. Scheduler resets
+		 * were already dropped during drm_sched_stop
+		 */
+		amdggpu_device_stop_pedning_resets(tmp_adev);
 	}
 
 	tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress.
  2022-05-17 19:21 ` [PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
@ 2022-05-17 20:56   ` Felix Kuehling
  0 siblings, 0 replies; 15+ messages in thread
From: Felix Kuehling @ 2022-05-17 20:56 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar, Christian.Koenig

Am 2022-05-17 um 15:21 schrieb Andrey Grodzovsky:
> We skip reset requests if another one is already in progress.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++++++++++++++++++++++
>   1 file changed, 27 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 65f738fd4761..43af5ea3eee5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5054,6 +5054,27 @@ static void amdgpu_device_recheck_guilty_jobs(
>   	}
>   }
>   
> +static inline void amdggpu_device_stop_pedning_resets(struct amdgpu_device* adev)

Typo: pedning -> pending

Other than that the series is

Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>


> +{
> +	struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +
> +#if defined(CONFIG_DEBUG_FS)
> +	if (!amdgpu_sriov_vf(adev))
> +		cancel_delayed_work(&adev->reset_work);
> +#endif
> +
> +	if (adev->kfd.dev)
> +		cancel_delayed_work(&adev->kfd.reset_work);
> +
> +	if (amdgpu_sriov_vf(adev))
> +		cancel_delayed_work(&adev->virt.flr_work);
> +
> +	if (con && adev->ras_enabled)
> +		cancel_delayed_work(&con->recovery_work);
> +
> +}
> +
> +
>   /**
>    * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>    *
> @@ -5209,6 +5230,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   				  r, adev_to_drm(tmp_adev)->unique);
>   			tmp_adev->asic_reset_res = r;
>   		}
> +
> +		/*
> +		 * Drop all pending non scheduler resets. Scheduler resets
> +		 * were already dropped during drm_sched_stop
> +		 */
> +		amdggpu_device_stop_pedning_resets(tmp_adev);
>   	}
>   
>   	tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level.
  2022-05-17 19:20 ` [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
@ 2022-05-18  6:02   ` Christian König
  0 siblings, 0 replies; 15+ messages in thread
From: Christian König @ 2022-05-18  6:02 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar

Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
> Will be read by executors of async reset, such as debugfs.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  | 1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 1 +
>   3 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b583026dc893..3948e7db6ad7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5303,6 +5303,8 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>   
>   	if (r)
>   		dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
> +
> +	atomic_set(&adev->reset_domain->reset_res, r);
>   	return r;
>   }
>   
> @@ -5317,7 +5319,7 @@ static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
>   {
>   	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
>   
> -	recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
> +	amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>   }
>   /*
>    * Serialize gpu recover into reset domain single threaded wq
> @@ -5334,7 +5336,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   
>   	flush_work(&work.base);
>   
> -	return work.ret;
> +	return atomic_read(&adev->reset_domain->reset_res);
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index c80af0889773..32c86a0b145c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -132,6 +132,7 @@ struct amdgpu_reset_domain *amdgpu_reset_create_reset_domain(enum amdgpu_reset_d
>   	}
>   
>   	atomic_set(&reset_domain->in_gpu_reset, 0);
> +	atomic_set(&reset_domain->reset_res, 0);
>   	init_rwsem(&reset_domain->sem);
>   
>   	return reset_domain;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> index 1949dbe28a86..9e55a5d7a825 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> @@ -82,6 +82,7 @@ struct amdgpu_reset_domain {
>   	enum amdgpu_reset_domain_type type;
>   	struct rw_semaphore sem;
>   	atomic_t in_gpu_reset;
> +	atomic_t reset_res;

Maybe better name that last_result or something like that.

Christian.

>   };
>   
>   


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct.
  2022-05-17 19:20 ` [PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct Andrey Grodzovsky
@ 2022-05-18  6:03   ` Christian König
  0 siblings, 0 replies; 15+ messages in thread
From: Christian König @ 2022-05-18  6:03 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar

Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
> We need to be able to cancel pending reset works in a non-blocking
> way from within GPU reset. Currently the kernel API allows this only
> for delayed_work and not for work_struct. Switch to delayed work and
> queue it with delay 0, which is equivalent to queueing a work_struct.

Ok well that needs further explanation. Why should this be used exactly?

Christian.

>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 4 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   | 2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      | 4 ++--
>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      | 4 ++--
>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      | 4 ++--
>   6 files changed, 13 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 3948e7db6ad7..ea41edf52a6f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5309,7 +5309,7 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>   }
>   
>   struct amdgpu_recover_work_struct {
> -	struct work_struct base;
> +	struct delayed_work base;
>   	struct amdgpu_device *adev;
>   	struct amdgpu_job *job;
>   	int ret;
> @@ -5317,7 +5317,7 @@ struct amdgpu_recover_work_struct {
>   
>   static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
>   {
> -	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
> +	struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base.work);
>   
>   	amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
>   }
> @@ -5329,12 +5329,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   {
>   	struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
>   
> -	INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
> +	INIT_DELAYED_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
>   
>   	if (!amdgpu_reset_domain_schedule(adev->reset_domain, &work.base))
>   		return -EAGAIN;
>   
> -	flush_work(&work.base);
> +	flush_delayed_work(&work.base);
>   
>   	return atomic_read(&adev->reset_domain->reset_res);
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> index 9e55a5d7a825..fee9376d495a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> @@ -114,9 +114,9 @@ static inline void amdgpu_reset_put_reset_domain(struct amdgpu_reset_domain *dom
>   }
>   
>   static inline bool amdgpu_reset_domain_schedule(struct amdgpu_reset_domain *domain,
> -						struct work_struct *work)
> +						struct delayed_work *work)
>   {
> -	return queue_work(domain->wq, work);
> +	return queue_delayed_work(domain->wq, work, 0);
>   }
>   
>   void amdgpu_device_lock_reset_domain(struct amdgpu_reset_domain *reset_domain);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index 239f232f9c02..b53b7a794459 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -230,7 +230,7 @@ struct amdgpu_virt {
>   	uint32_t			reg_val_offs;
>   	struct amdgpu_irq_src		ack_irq;
>   	struct amdgpu_irq_src		rcv_irq;
> -	struct work_struct		flr_work;
> +	struct delayed_work		flr_work;
>   	struct amdgpu_mm_table		mm_table;
>   	const struct amdgpu_virt_ops	*ops;
>   	struct amdgpu_vf_error_buffer	vf_errors;
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index b81acf59870c..aa5f6d6ea1e3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -251,7 +251,7 @@ static int xgpu_ai_set_mailbox_ack_irq(struct amdgpu_device *adev,
>   
>   static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>   {
> -	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
> +	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work.work);
>   	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>   	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
>   
> @@ -380,7 +380,7 @@ int xgpu_ai_mailbox_get_irq(struct amdgpu_device *adev)
>   		return r;
>   	}
>   
> -	INIT_WORK(&adev->virt.flr_work, xgpu_ai_mailbox_flr_work);
> +	INIT_DELAYED_WORK(&adev->virt.flr_work, xgpu_ai_mailbox_flr_work);
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> index 22c10b97ea81..dd9f6b6f62f5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> @@ -275,7 +275,7 @@ static int xgpu_nv_set_mailbox_ack_irq(struct amdgpu_device *adev,
>   
>   static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
>   {
> -	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
> +	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work.work);
>   	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>   	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
>   
> @@ -407,7 +407,7 @@ int xgpu_nv_mailbox_get_irq(struct amdgpu_device *adev)
>   		return r;
>   	}
>   
> -	INIT_WORK(&adev->virt.flr_work, xgpu_nv_mailbox_flr_work);
> +	INIT_DELAYED_WORK(&adev->virt.flr_work, xgpu_nv_mailbox_flr_work);
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> index 7b63d30b9b79..b11831da1b13 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> @@ -512,7 +512,7 @@ static int xgpu_vi_set_mailbox_ack_irq(struct amdgpu_device *adev,
>   
>   static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
>   {
> -	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
> +	struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work.work);
>   	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
>   
>   	/* wait until RCV_MSG become 3 */
> @@ -610,7 +610,7 @@ int xgpu_vi_mailbox_get_irq(struct amdgpu_device *adev)
>   		return r;
>   	}
>   
> -	INIT_WORK(&adev->virt.flr_work, xgpu_vi_mailbox_flr_work);
> +	INIT_DELAYED_WORK(&adev->virt.flr_work, xgpu_vi_mailbox_flr_work);
>   
>   	return 0;
>   }


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
  2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
                   ` (6 preceding siblings ...)
  2022-05-17 19:21 ` [PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
@ 2022-05-18  6:07 ` Christian König
  2022-05-18 14:24   ` Andrey Grodzovsky
  7 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2022-05-18  6:07 UTC (permalink / raw)
  To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar

Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
> Problem:
> During a hive reset caused by a command timing out on a ring,
> extra resets are triggered by KFD, which is
> unable to access registers on the resetting ASIC.
>
> Fix: Rework GPU reset to actively stop any pending reset
> works while another is in progress.
>
> v2: Switch from the generic list used in v1 [1] to explicit
> stopping of each reset request from each reset source,
> per request submitter.

Looks mostly good to me.

Apart from the naming nit pick on patch #1, the only thing I couldn't 
offhand figure out is why you are using a delayed work everywhere instead 
of just a work item.

That needs a bit further explanation what's happening here.

Christian.

>
> [1] - https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>
> Andrey Grodzovsky (7):
>    drm/amdgpu: Cache result of last reset at reset domain level.
>    drm/amdgpu: Switch to delayed work from work_struct.
>    drm/admgpu: Serialize RAS recovery work directly into reset domain
>      queue.
>    drm/amdgpu: Add delayed work for GPU reset from debugfs
>    drm/amdgpu: Add delayed work for GPU reset from kfd.
>    drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
>      amdgpu_device_gpu_recover
>    drm/amdgpu: Stop any pending reset if another in progress.
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +--
>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +--
>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +--
>   14 files changed, 87 insertions(+), 54 deletions(-)
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
  2022-05-18  6:07 ` [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Christian König
@ 2022-05-18 14:24   ` Andrey Grodzovsky
  2022-05-19  7:58     ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-18 14:24 UTC (permalink / raw)
  To: Christian König, amd-gfx; +Cc: Zoy.Bai, lijo.lazar

[-- Attachment #1: Type: text/plain, Size: 3140 bytes --]


On 2022-05-18 02:07, Christian König wrote:
> Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
>> Problem:
>> During a hive reset caused by a command timing out on a ring,
>> extra resets are triggered by KFD, which is
>> unable to access registers on the resetting ASIC.
>>
>> Fix: Rework GPU reset to actively stop any pending reset
>> works while another is in progress.
>>
>> v2: Switch from the generic list used in v1 [1] to explicit
>> stopping of each reset request from each reset source,
>> per request submitter.
>
> Looks mostly good to me.
>
> Apart from the naming nit pick on patch #1, the only thing I couldn't 
> offhand figure out is why you are using a delayed work everywhere 
> instead of just a work item.
>
> That needs a bit further explanation what's happening here.
>
> Christian.


Check APIs for cancelling work vs. delayed work -

For work_struct the only public API is this - 
https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214 
- blocking cancel.

For delayed_work we have both blocking and non blocking public APIs -

https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295

I'd prefer not to get into convincing the core kernel people to expose 
another interface just for our sake - from my past experience, API changes 
in core code have slim chances of being accepted and cost a lot of time 
in back and forth arguments.

"If the mountain will not come to Muhammad, then Muhammad must go to the 
mountain" ;)

Andrey

>
>>
>> [1] - 
>> https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>>
>> Andrey Grodzovsky (7):
>>    drm/amdgpu: Cache result of last reset at reset domain level.
>>    drm/amdgpu: Switch to delayed work from work_struct.
>>    drm/admgpu: Serialize RAS recovery work directly into reset domain
>>      queue.
>>    drm/amdgpu: Add delayed work for GPU reset from debugfs
>>    drm/amdgpu: Add delayed work for GPU reset from kfd.
>>    drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
>>      amdgpu_device_gpu_recover
>>    drm/amdgpu: Stop any pending reset if another in progress.
>>
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++--
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +-
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +--
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +--
>>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +--
>>   14 files changed, 87 insertions(+), 54 deletions(-)
>>
>

[-- Attachment #2: Type: text/html, Size: 6524 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
  2022-05-18 14:24   ` Andrey Grodzovsky
@ 2022-05-19  7:58     ` Christian König
  2022-05-19 13:41       ` Andrey Grodzovsky
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2022-05-19  7:58 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, amd-gfx; +Cc: Zoy.Bai, lijo.lazar

[-- Attachment #1: Type: text/plain, Size: 3934 bytes --]



Am 18.05.22 um 16:24 schrieb Andrey Grodzovsky:
>
>
> On 2022-05-18 02:07, Christian König wrote:
>> Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
>>> Problem:
>>> During a hive reset caused by a command timing out on a ring,
>>> extra resets are triggered by KFD, which is
>>> unable to access registers on the resetting ASIC.
>>>
>>> Fix: Rework GPU reset to actively stop any pending reset
>>> works while another is in progress.
>>>
>>> v2: Switch from the generic list used in v1 [1] to explicit
>>> stopping of each reset request from each reset source,
>>> per request submitter.
>>
>> Looks mostly good to me.
>>
>> Apart from the naming nit pick on patch #1, the only thing I couldn't 
>> offhand figure out is why you are using a delayed work everywhere 
>> instead of just a work item.
>>
>> That needs a bit further explanation what's happening here.
>>
>> Christian.
>
>
> Check APIs for cancelling work vs. delayed work -
>
> For work_struct the only public API is this - 
> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214 
> - blocking cancel.
>
> For delayed_work we have both blocking and non blocking public APIs -
>
> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
>
> I'd prefer not to get into convincing the core kernel people to expose 
> another interface just for our sake - from my past experience, API 
> changes in core code have slim chances of being accepted and cost a 
> lot of time in back and forth arguments.
>
> "If the mountain will not come to Muhammad, then Muhammad must go to 
> the mountain" ;)
>

Ah, good point. The cancel_work() function was removed a few years ago:

commit 6417250d3f894e66a68ba1cd93676143f2376a6f
Author: Stephen Hemminger <stephen@networkplumber.org>
Date:   Tue Mar 6 19:34:42 2018 -0800

     workqueue: remove unused cancel_work()

     Found this by accident.
     There are no usages of bare cancel_work() in current kernel source.

     Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
     Signed-off-by: Tejun Heo <tj@kernel.org>


Maybe just revert that patch, export the function and use it. I think 
there is plenty of justification for this.

Thanks,
Christian.

>
> Andrey
>
>>
>>>
>>> [1] - 
>>> https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>>>
>>> Andrey Grodzovsky (7):
>>>    drm/amdgpu: Cache result of last reset at reset domain level.
>>>    drm/amdgpu: Switch to delayed work from work_struct.
>>>    drm/admgpu: Serialize RAS recovery work directly into reset domain
>>>      queue.
>>>    drm/amdgpu: Add delayed work for GPU reset from debugfs
>>>    drm/amdgpu: Add delayed work for GPU reset from kfd.
>>>    drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
>>>      amdgpu_device_gpu_recover
>>>    drm/amdgpu: Stop any pending reset if another in progress.
>>>
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 
>>> +++++++++++-----------
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++--
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +-
>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +--
>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +--
>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +--
>>>   14 files changed, 87 insertions(+), 54 deletions(-)
>>>
>>

[-- Attachment #2: Type: text/html, Size: 8327 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
  2022-05-19  7:58     ` Christian König
@ 2022-05-19 13:41       ` Andrey Grodzovsky
  0 siblings, 0 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-19 13:41 UTC (permalink / raw)
  To: Christian König, Christian König, amd-gfx; +Cc: Zoy.Bai, lijo.lazar

[-- Attachment #1: Type: text/plain, Size: 4170 bytes --]


On 2022-05-19 03:58, Christian König wrote:
>
>
> Am 18.05.22 um 16:24 schrieb Andrey Grodzovsky:
>>
>>
>> On 2022-05-18 02:07, Christian König wrote:
>>> Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
>>>> Problem:
>>>> During a hive reset caused by a command timing out on a ring,
>>>> extra resets are triggered by KFD, which is
>>>> unable to access registers on the resetting ASIC.
>>>>
>>>> Fix: Rework GPU reset to actively stop any pending reset
>>>> works while another is in progress.
>>>>
>>>> v2: Switch from the generic list used in v1 [1] to explicit
>>>> stopping of each reset request from each reset source,
>>>> per request submitter.
>>>
>>> Looks mostly good to me.
>>>
>>> Apart from the naming nit pick on patch #1, the only thing I couldn't 
>>> offhand figure out is why you are using a delayed work everywhere 
>>> instead of just a work item.
>>>
>>> That needs a bit further explanation what's happening here.
>>>
>>> Christian.
>>
>>
>> Check APIs for cancelling work vs. delayed work -
>>
>> For work_struct the only public API is this - 
>> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214 
>> - blocking cancel.
>>
>> For delayed_work we have both blocking and non blocking public APIs -
>>
>> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
>>
>> I'd prefer not to get into convincing the core kernel people to expose 
>> another interface just for our sake - from my past experience, API 
>> changes in core code have slim chances of being accepted and cost a 
>> lot of time in back and forth arguments.
>>
>> "If the mountain will not come to Muhammad, then Muhammad must go to 
>> the mountain" ;)
>>
>
> Ah, good point. The cancel_work() function was removed a few years ago:
>
> commit 6417250d3f894e66a68ba1cd93676143f2376a6f
> Author: Stephen Hemminger <stephen@networkplumber.org>
> Date:   Tue Mar 6 19:34:42 2018 -0800
>
>     workqueue: remove unused cancel_work()
>
>     Found this by accident.
>     There are no usages of bare cancel_work() in current kernel source.
>
>     Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>     Signed-off-by: Tejun Heo <tj@kernel.org>
>
>
> Maybe just revert that patch, export the function and use it. I think 
> there is plenty of justification for this.
>
> Thanks,
> Christian.


Ok - I will send them a patch - let's see what they say.

Andrey


>
>>
>> Andrey
>>
>>>
>>>>
>>>> [1] - 
>>>> https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>>>>
>>>> Andrey Grodzovsky (7):
>>>>    drm/amdgpu: Cache result of last reset at reset domain level.
>>>>    drm/amdgpu: Switch to delayed work from work_struct.
>>>>    drm/admgpu: Serialize RAS recovery work directly into reset domain
>>>>      queue.
>>>>    drm/amdgpu: Add delayed work for GPU reset from debugfs
>>>>    drm/amdgpu: Add delayed work for GPU reset from kfd.
>>>>    drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
>>>>      amdgpu_device_gpu_recover
>>>>    drm/amdgpu: Stop any pending reset if another in progress.
>>>>
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 
>>>> +++++++++++-----------
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++--
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +-
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +-
>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +--
>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +--
>>>>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +--
>>>>   14 files changed, 87 insertions(+), 54 deletions(-)
>>>>
>>>
>

[-- Attachment #2: Type: text/html, Size: 12100 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-05-19 13:41 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
2022-05-17 19:20 ` [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
2022-05-18  6:02   ` Christian König
2022-05-17 19:20 ` [PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct Andrey Grodzovsky
2022-05-18  6:03   ` Christian König
2022-05-17 19:20 ` [PATCH v2 3/7] drm/admgpu: Serialize RAS recovery work directly into reset domain queue Andrey Grodzovsky
2022-05-17 19:20 ` [PATCH v2 4/7] drm/amdgpu: Add delayed work for GPU reset from debugfs Andrey Grodzovsky
2022-05-17 19:21 ` [PATCH v2 5/7] drm/amdgpu: Add delayed work for GPU reset from kfd Andrey Grodzovsky
2022-05-17 19:21 ` [PATCH v2 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover Andrey Grodzovsky
2022-05-17 19:21 ` [PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
2022-05-17 20:56   ` Felix Kuehling
2022-05-18  6:07 ` [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Christian König
2022-05-18 14:24   ` Andrey Grodzovsky
2022-05-19  7:58     ` Christian König
2022-05-19 13:41       ` Andrey Grodzovsky
