* [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level.
2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
@ 2022-05-17 19:20 ` Andrey Grodzovsky
2022-05-18 6:02 ` Christian König
2022-05-17 19:20 ` [PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct Andrey Grodzovsky
` (6 subsequent siblings)
7 siblings, 1 reply; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:20 UTC (permalink / raw)
To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig
Will be read by executors of async resets, such as debugfs.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 +
3 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b583026dc893..3948e7db6ad7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5303,6 +5303,8 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
if (r)
dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
+
+ atomic_set(&adev->reset_domain->reset_res, r);
return r;
}
@@ -5317,7 +5319,7 @@ static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
{
struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
- recover_work->ret = amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
+ amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
}
/*
* Serialize gpu recover into reset domain single threaded wq
@@ -5334,7 +5336,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
flush_work(&work.base);
- return work.ret;
+ return atomic_read(&adev->reset_domain->reset_res);
}
/**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index c80af0889773..32c86a0b145c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -132,6 +132,7 @@ struct amdgpu_reset_domain *amdgpu_reset_create_reset_domain(enum amdgpu_reset_d
}
atomic_set(&reset_domain->in_gpu_reset, 0);
+ atomic_set(&reset_domain->reset_res, 0);
init_rwsem(&reset_domain->sem);
return reset_domain;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 1949dbe28a86..9e55a5d7a825 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -82,6 +82,7 @@ struct amdgpu_reset_domain {
enum amdgpu_reset_domain_type type;
struct rw_semaphore sem;
atomic_t in_gpu_reset;
+ atomic_t reset_res;
};
--
2.25.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level.
2022-05-17 19:20 ` [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
@ 2022-05-18 6:02 ` Christian König
0 siblings, 0 replies; 15+ messages in thread
From: Christian König @ 2022-05-18 6:02 UTC (permalink / raw)
To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar
Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
> Will be read by executors of async reset like debugfs.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> [SNIP]
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> index 1949dbe28a86..9e55a5d7a825 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> @@ -82,6 +82,7 @@ struct amdgpu_reset_domain {
> enum amdgpu_reset_domain_type type;
> struct rw_semaphore sem;
> atomic_t in_gpu_reset;
> + atomic_t reset_res;
Maybe better to name that last_result or something like that.
Christian.
> };
>
>
* [PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct.
2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
2022-05-17 19:20 ` [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
@ 2022-05-17 19:20 ` Andrey Grodzovsky
2022-05-18 6:03 ` Christian König
2022-05-17 19:20 ` [PATCH v2 3/7] drm/amdgpu: Serialize RAS recovery work directly into reset domain queue Andrey Grodzovsky
` (5 subsequent siblings)
7 siblings, 1 reply; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:20 UTC (permalink / raw)
To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig
We need to be able to cancel pending reset work non-blocking
from within GPU reset. Currently the kernel API allows this only
for delayed_work and not for work_struct. Switch to delayed
work and queue it with a delay of 0, which is equivalent to
queueing a work_struct.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++----
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 4 ++--
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 2 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 4 ++--
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 4 ++--
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 4 ++--
6 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3948e7db6ad7..ea41edf52a6f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5309,7 +5309,7 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
}
struct amdgpu_recover_work_struct {
- struct work_struct base;
+ struct delayed_work base;
struct amdgpu_device *adev;
struct amdgpu_job *job;
int ret;
@@ -5317,7 +5317,7 @@ struct amdgpu_recover_work_struct {
static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
{
- struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base);
+ struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base.work);
amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
}
@@ -5329,12 +5329,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
{
struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
- INIT_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
+ INIT_DELAYED_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
if (!amdgpu_reset_domain_schedule(adev->reset_domain, &work.base))
return -EAGAIN;
- flush_work(&work.base);
+ flush_delayed_work(&work.base);
return atomic_read(&adev->reset_domain->reset_res);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index 9e55a5d7a825..fee9376d495a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -114,9 +114,9 @@ static inline void amdgpu_reset_put_reset_domain(struct amdgpu_reset_domain *dom
}
static inline bool amdgpu_reset_domain_schedule(struct amdgpu_reset_domain *domain,
- struct work_struct *work)
+ struct delayed_work *work)
{
- return queue_work(domain->wq, work);
+ return queue_delayed_work(domain->wq, work, 0);
}
void amdgpu_device_lock_reset_domain(struct amdgpu_reset_domain *reset_domain);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index 239f232f9c02..b53b7a794459 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -230,7 +230,7 @@ struct amdgpu_virt {
uint32_t reg_val_offs;
struct amdgpu_irq_src ack_irq;
struct amdgpu_irq_src rcv_irq;
- struct work_struct flr_work;
+ struct delayed_work flr_work;
struct amdgpu_mm_table mm_table;
const struct amdgpu_virt_ops *ops;
struct amdgpu_vf_error_buffer vf_errors;
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index b81acf59870c..aa5f6d6ea1e3 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -251,7 +251,7 @@ static int xgpu_ai_set_mailbox_ack_irq(struct amdgpu_device *adev,
static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
{
- struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
+ struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work.work);
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
@@ -380,7 +380,7 @@ int xgpu_ai_mailbox_get_irq(struct amdgpu_device *adev)
return r;
}
- INIT_WORK(&adev->virt.flr_work, xgpu_ai_mailbox_flr_work);
+ INIT_DELAYED_WORK(&adev->virt.flr_work, xgpu_ai_mailbox_flr_work);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index 22c10b97ea81..dd9f6b6f62f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -275,7 +275,7 @@ static int xgpu_nv_set_mailbox_ack_irq(struct amdgpu_device *adev,
static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
{
- struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
+ struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work.work);
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
@@ -407,7 +407,7 @@ int xgpu_nv_mailbox_get_irq(struct amdgpu_device *adev)
return r;
}
- INIT_WORK(&adev->virt.flr_work, xgpu_nv_mailbox_flr_work);
+ INIT_DELAYED_WORK(&adev->virt.flr_work, xgpu_nv_mailbox_flr_work);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index 7b63d30b9b79..b11831da1b13 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -512,7 +512,7 @@ static int xgpu_vi_set_mailbox_ack_irq(struct amdgpu_device *adev,
static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
{
- struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work);
+ struct amdgpu_virt *virt = container_of(work, struct amdgpu_virt, flr_work.work);
struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
/* wait until RCV_MSG become 3 */
@@ -610,7 +610,7 @@ int xgpu_vi_mailbox_get_irq(struct amdgpu_device *adev)
return r;
}
- INIT_WORK(&adev->virt.flr_work, xgpu_vi_mailbox_flr_work);
+ INIT_DELAYED_WORK(&adev->virt.flr_work, xgpu_vi_mailbox_flr_work);
return 0;
}
--
2.25.1
* Re: [PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct.
2022-05-17 19:20 ` [PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct Andrey Grodzovsky
@ 2022-05-18 6:03 ` Christian König
0 siblings, 0 replies; 15+ messages in thread
From: Christian König @ 2022-05-18 6:03 UTC (permalink / raw)
To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar
Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
> We need to be able to cancel pending reset work non-blocking
> from within GPU reset. Currently the kernel API allows this only
> for delayed_work and not for work_struct. Switch to delayed
> work and queue it with a delay of 0, which is equivalent to
> queueing a work_struct.
Ok well that needs further explanation. Why should this be used exactly?
Christian.
> [SNIP]
* [PATCH v2 3/7] drm/amdgpu: Serialize RAS recovery work directly into reset domain queue.
2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
2022-05-17 19:20 ` [PATCH v2 1/7] drm/amdgpu: Cache result of last reset at reset domain level Andrey Grodzovsky
2022-05-17 19:20 ` [PATCH v2 2/7] drm/amdgpu: Switch to delayed work from work_struct Andrey Grodzovsky
@ 2022-05-17 19:20 ` Andrey Grodzovsky
2022-05-17 19:20 ` [PATCH v2 4/7] drm/amdgpu: Add delayed work for GPU reset from debugfs Andrey Grodzovsky
` (4 subsequent siblings)
7 siblings, 0 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:20 UTC (permalink / raw)
To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig
Save the extra useless work schedule. Also switch to delayed work.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 12 +++++++-----
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +-
2 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index a653cf3b3d13..7e8c7bcc7303 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -35,6 +35,8 @@
#include "amdgpu_xgmi.h"
#include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
#include "atom.h"
+#include "amdgpu_reset.h"
+
#ifdef CONFIG_X86_MCE_AMD
#include <asm/mce.h>
@@ -1889,7 +1891,7 @@ static int amdgpu_ras_badpages_read(struct amdgpu_device *adev,
static void amdgpu_ras_do_recovery(struct work_struct *work)
{
struct amdgpu_ras *ras =
- container_of(work, struct amdgpu_ras, recovery_work);
+ container_of(work, struct amdgpu_ras, recovery_work.work);
struct amdgpu_device *remote_adev = NULL;
struct amdgpu_device *adev = ras->adev;
struct list_head device_list, *device_list_handle = NULL;
@@ -1916,7 +1918,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
}
if (amdgpu_device_should_recover_gpu(ras->adev))
- amdgpu_device_gpu_recover(ras->adev, NULL);
+ amdgpu_device_gpu_recover_imp(ras->adev, NULL);
atomic_set(&ras->in_recovery, 0);
}
@@ -2148,7 +2150,7 @@ int amdgpu_ras_recovery_init(struct amdgpu_device *adev)
}
mutex_init(&con->recovery_lock);
- INIT_WORK(&con->recovery_work, amdgpu_ras_do_recovery);
+ INIT_DELAYED_WORK(&con->recovery_work, amdgpu_ras_do_recovery);
atomic_set(&con->in_recovery, 0);
con->eeprom_control.bad_channel_bitmap = 0;
@@ -2217,7 +2219,7 @@ static int amdgpu_ras_recovery_fini(struct amdgpu_device *adev)
if (!data)
return 0;
- cancel_work_sync(&con->recovery_work);
+ cancel_delayed_work_sync(&con->recovery_work);
mutex_lock(&con->recovery_lock);
con->eh_data = NULL;
@@ -2910,7 +2912,7 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
struct amdgpu_ras *ras = amdgpu_ras_get_context(adev);
if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
- schedule_work(&ras->recovery_work);
+ amdgpu_reset_domain_schedule(ras->adev->reset_domain, &ras->recovery_work);
return 0;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index b9a6fac2b8b2..f7e21c2abc61 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -347,7 +347,7 @@ struct amdgpu_ras {
struct ras_manager *objs;
/* gpu recovery */
- struct work_struct recovery_work;
+ struct delayed_work recovery_work;
atomic_t in_recovery;
struct amdgpu_device *adev;
/* error handler data */
--
2.25.1
* [PATCH v2 4/7] drm/amdgpu: Add delayed work for GPU reset from debugfs
2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
` (2 preceding siblings ...)
2022-05-17 19:20 ` [PATCH v2 3/7] drm/amdgpu: Serialize RAS recovery work directly into reset domain queue Andrey Grodzovsky
@ 2022-05-17 19:20 ` Andrey Grodzovsky
2022-05-17 19:21 ` [PATCH v2 5/7] drm/amdgpu: Add delayed work for GPU reset from kfd Andrey Grodzovsky
` (3 subsequent siblings)
7 siblings, 0 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:20 UTC (permalink / raw)
To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig
We need this reset to be scheduled as a delayed work so that it
can be cancelled if another reset is already in progress.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 +++++++++++++++++--
2 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 3c20c2eadf4e..4ef17c6d1a50 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1048,6 +1048,8 @@ struct amdgpu_device {
bool scpm_enabled;
uint32_t scpm_status;
+
+ struct delayed_work reset_work;
};
static inline struct amdgpu_device *drm_to_adev(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index d16c8c1f72db..f980f1501c48 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -39,6 +39,7 @@
#include <drm/drm_drv.h>
#include "amdgpu.h"
#include "amdgpu_trace.h"
+#include "amdgpu_reset.h"
/*
* Fences
@@ -798,7 +799,10 @@ static int gpu_recover_get(void *data, u64 *val)
return 0;
}
- *val = amdgpu_device_gpu_recover(adev, NULL);
+ if (amdgpu_reset_domain_schedule(adev->reset_domain, &adev->reset_work))
+ flush_delayed_work(&adev->reset_work);
+
+ *val = atomic_read(&adev->reset_domain->reset_res);
pm_runtime_mark_last_busy(dev->dev);
pm_runtime_put_autosuspend(dev->dev);
@@ -810,6 +814,14 @@ DEFINE_SHOW_ATTRIBUTE(amdgpu_debugfs_fence_info);
DEFINE_DEBUGFS_ATTRIBUTE(amdgpu_debugfs_gpu_recover_fops, gpu_recover_get, NULL,
"%lld\n");
+static void amdgpu_debugfs_reset_work(struct work_struct *work)
+{
+ struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
+ reset_work.work);
+
+ amdgpu_device_gpu_recover_imp(adev, NULL);
+}
+
#endif
void amdgpu_debugfs_fence_init(struct amdgpu_device *adev)
@@ -821,9 +833,12 @@ void amdgpu_debugfs_fence_init(struct amdgpu_device *adev)
debugfs_create_file("amdgpu_fence_info", 0444, root, adev,
&amdgpu_debugfs_fence_info_fops);
- if (!amdgpu_sriov_vf(adev))
+ if (!amdgpu_sriov_vf(adev)) {
+
+ INIT_DELAYED_WORK(&adev->reset_work, amdgpu_debugfs_reset_work);
debugfs_create_file("amdgpu_gpu_recover", 0444, root, adev,
&amdgpu_debugfs_gpu_recover_fops);
+ }
#endif
}
--
2.25.1
* [PATCH v2 5/7] drm/amdgpu: Add delayed work for GPU reset from kfd.
2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
` (3 preceding siblings ...)
2022-05-17 19:20 ` [PATCH v2 4/7] drm/amdgpu: Add delayed work for GPU reset from debugfs Andrey Grodzovsky
@ 2022-05-17 19:21 ` Andrey Grodzovsky
2022-05-17 19:21 ` [PATCH v2 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover Andrey Grodzovsky
` (2 subsequent siblings)
7 siblings, 0 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:21 UTC (permalink / raw)
To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig
We need this reset to be scheduled as a delayed work so that it
can be cancelled if another reset is already in progress.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 ++++++++++-
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 +
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 31 ----------------------
3 files changed, 15 insertions(+), 32 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 1f8161cd507f..4cc846341394 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -33,6 +33,7 @@
#include <uapi/linux/kfd_ioctl.h>
#include "amdgpu_ras.h"
#include "amdgpu_umc.h"
+#include "amdgpu_reset.h"
/* Total memory size in system memory and all GPU VRAM. Used to
* estimate worst case amount of memory to reserve for page tables
@@ -122,6 +123,15 @@ static void amdgpu_doorbell_get_kfd_info(struct amdgpu_device *adev,
}
}
+
+static void amdgpu_amdkfd_reset_work(struct work_struct *work)
+{
+ struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
+ kfd.reset_work.work);
+
+ amdgpu_device_gpu_recover_imp(adev, NULL);
+}
+
void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
{
int i;
@@ -180,6 +190,8 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
adev->kfd.init_complete = kgd2kfd_device_init(adev->kfd.dev,
adev_to_drm(adev), &gpu_resources);
+
+ INIT_DELAYED_WORK(&adev->kfd.reset_work, amdgpu_amdkfd_reset_work);
}
}
@@ -247,7 +259,8 @@ int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev)
void amdgpu_amdkfd_gpu_reset(struct amdgpu_device *adev)
{
if (amdgpu_device_should_recover_gpu(adev))
- amdgpu_device_gpu_recover(adev, NULL);
+ amdgpu_reset_domain_schedule(adev->reset_domain,
+ &adev->kfd.reset_work);
}
int amdgpu_amdkfd_alloc_gtt_mem(struct amdgpu_device *adev, size_t size,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
index f8b9f27adcf5..5e04dba8c7f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h
@@ -96,6 +96,7 @@ struct amdgpu_kfd_dev {
struct kfd_dev *dev;
uint64_t vram_used;
bool init_complete;
+ struct delayed_work reset_work;
};
enum kgd_engine_type {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ea41edf52a6f..ae4c37c89ac7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5308,37 +5308,6 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
return r;
}
-struct amdgpu_recover_work_struct {
- struct delayed_work base;
- struct amdgpu_device *adev;
- struct amdgpu_job *job;
- int ret;
-};
-
-static void amdgpu_device_queue_gpu_recover_work(struct work_struct *work)
-{
- struct amdgpu_recover_work_struct *recover_work = container_of(work, struct amdgpu_recover_work_struct, base.work);
-
- amdgpu_device_gpu_recover_imp(recover_work->adev, recover_work->job);
-}
-/*
- * Serialize gpu recover into reset domain single threaded wq
- */
-int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
- struct amdgpu_job *job)
-{
- struct amdgpu_recover_work_struct work = {.adev = adev, .job = job};
-
- INIT_DELAYED_WORK(&work.base, amdgpu_device_queue_gpu_recover_work);
-
- if (!amdgpu_reset_domain_schedule(adev->reset_domain, &work.base))
- return -EAGAIN;
-
- flush_delayed_work(&work.base);
-
- return atomic_read(&adev->reset_domain->reset_res);
-}
-
/**
* amdgpu_device_get_pcie_info - fence pcie info about the PCIE slot
*
--
2.25.1
* [PATCH v2 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover
2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
` (4 preceding siblings ...)
2022-05-17 19:21 ` [PATCH v2 5/7] drm/amdgpu: Add delayed work for GPU reset from kfd Andrey Grodzovsky
@ 2022-05-17 19:21 ` Andrey Grodzovsky
2022-05-17 19:21 ` [PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
2022-05-18 6:07 ` [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Christian König
7 siblings, 0 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:21 UTC (permalink / raw)
To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig
We removed the wrapper that was queueing the recover function
into the reset domain queue, which was using this name.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu.h | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 2 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 2 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 2 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 2 +-
9 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 4ef17c6d1a50..ee668f253c7a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1244,7 +1244,7 @@ bool amdgpu_device_has_job_running(struct amdgpu_device *adev);
bool amdgpu_device_should_recover_gpu(struct amdgpu_device *adev);
int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
struct amdgpu_job* job);
-int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
+int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
struct amdgpu_job *job);
void amdgpu_device_pci_config_reset(struct amdgpu_device *adev);
int amdgpu_device_pci_reset(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 4cc846341394..434053a9e027 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -129,7 +129,7 @@ static void amdgpu_amdkfd_reset_work(struct work_struct *work)
struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
kfd.reset_work.work);
- amdgpu_device_gpu_recover_imp(adev, NULL);
+ amdgpu_device_gpu_recover(adev, NULL);
}
void amdgpu_amdkfd_device_init(struct amdgpu_device *adev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ae4c37c89ac7..65f738fd4761 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5065,7 +5065,7 @@ static void amdgpu_device_recheck_guilty_jobs(
* Returns 0 for success or an error on failure.
*/
-int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
+int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
struct amdgpu_job *job)
{
struct list_head device_list, *device_list_handle = NULL;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index f980f1501c48..7954ebf16885 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -819,7 +819,7 @@ static void amdgpu_debugfs_reset_work(struct work_struct *work)
struct amdgpu_device *adev = container_of(work, struct amdgpu_device,
reset_work.work);
- amdgpu_device_gpu_recover_imp(adev, NULL);
+ amdgpu_device_gpu_recover(adev, NULL);
}
#endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index dfe7f2b8f0aa..10aa073600d4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -64,7 +64,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
ti.process_name, ti.tgid, ti.task_name, ti.pid);
if (amdgpu_device_should_recover_gpu(ring->adev)) {
- r = amdgpu_device_gpu_recover_imp(ring->adev, job);
+ r = amdgpu_device_gpu_recover(ring->adev, job);
if (r)
DRM_ERROR("GPU Recovery Failed: %d\n", r);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 7e8c7bcc7303..221d24feb8c9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -1918,7 +1918,7 @@ static void amdgpu_ras_do_recovery(struct work_struct *work)
}
if (amdgpu_device_should_recover_gpu(ras->adev))
- amdgpu_device_gpu_recover_imp(ras->adev, NULL);
+ amdgpu_device_gpu_recover(ras->adev, NULL);
atomic_set(&ras->in_recovery, 0);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index aa5f6d6ea1e3..3b7d9f171793 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -284,7 +284,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
if (amdgpu_device_should_recover_gpu(adev)
&& (!amdgpu_device_has_job_running(adev) ||
adev->sdma_timeout == MAX_SCHEDULE_TIMEOUT))
- amdgpu_device_gpu_recover_imp(adev, NULL);
+ amdgpu_device_gpu_recover(adev, NULL);
}
static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index dd9f6b6f62f5..82126b48477d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -311,7 +311,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
adev->gfx_timeout == MAX_SCHEDULE_TIMEOUT ||
adev->compute_timeout == MAX_SCHEDULE_TIMEOUT ||
adev->video_timeout == MAX_SCHEDULE_TIMEOUT))
- amdgpu_device_gpu_recover_imp(adev, NULL);
+ amdgpu_device_gpu_recover(adev, NULL);
}
static int xgpu_nv_set_mailbox_rcv_irq(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index b11831da1b13..d4f721cd34d1 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -523,7 +523,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
/* Trigger recovery due to world switch failure */
if (amdgpu_device_should_recover_gpu(adev))
- amdgpu_device_gpu_recover_imp(adev, NULL);
+ amdgpu_device_gpu_recover(adev, NULL);
}
static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,
--
2.25.1
* [PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress.
2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
` (5 preceding siblings ...)
2022-05-17 19:21 ` [PATCH v2 6/7] drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover Andrey Grodzovsky
@ 2022-05-17 19:21 ` Andrey Grodzovsky
2022-05-17 20:56 ` Felix Kuehling
2022-05-18 6:07 ` [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Christian König
7 siblings, 1 reply; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-17 19:21 UTC (permalink / raw)
To: amd-gfx; +Cc: Zoy.Bai, Andrey Grodzovsky, lijo.lazar, Christian.Koenig
We skip reset requests if another one is already in progress.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 65f738fd4761..43af5ea3eee5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5054,6 +5054,27 @@ static void amdgpu_device_recheck_guilty_jobs(
}
}
+static inline void amdggpu_device_stop_pedning_resets(struct amdgpu_device* adev)
+{
+ struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
+
+#if defined(CONFIG_DEBUG_FS)
+ if (!amdgpu_sriov_vf(adev))
+ cancel_delayed_work(&adev->reset_work);
+#endif
+
+ if (adev->kfd.dev)
+ cancel_delayed_work(&adev->kfd.reset_work);
+
+ if (amdgpu_sriov_vf(adev))
+ cancel_delayed_work(&adev->virt.flr_work);
+
+ if (con && adev->ras_enabled)
+ cancel_delayed_work(&con->recovery_work);
+
+}
+
+
/**
* amdgpu_device_gpu_recover - reset the asic and recover scheduler
*
@@ -5209,6 +5230,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
r, adev_to_drm(tmp_adev)->unique);
tmp_adev->asic_reset_res = r;
}
+
+ /*
+ * Drop all pending non scheduler resets. Scheduler resets
+ * were already dropped during drm_sched_stop
+ */
+ amdggpu_device_stop_pedning_resets(tmp_adev);
}
tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
--
2.25.1
* Re: [PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress.
2022-05-17 19:21 ` [PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
@ 2022-05-17 20:56 ` Felix Kuehling
0 siblings, 0 replies; 15+ messages in thread
From: Felix Kuehling @ 2022-05-17 20:56 UTC (permalink / raw)
To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar, Christian.Koenig
Am 2022-05-17 um 15:21 schrieb Andrey Grodzovsky:
> We skip reset requests if another one is already in progress.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 65f738fd4761..43af5ea3eee5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5054,6 +5054,27 @@ static void amdgpu_device_recheck_guilty_jobs(
> }
> }
>
> +static inline void amdggpu_device_stop_pedning_resets(struct amdgpu_device* adev)
Typo: pedning -> pending
Other than that the series is
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
> +{
> + struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> +
> +#if defined(CONFIG_DEBUG_FS)
> + if (!amdgpu_sriov_vf(adev))
> + cancel_delayed_work(&adev->reset_work);
> +#endif
> +
> + if (adev->kfd.dev)
> + cancel_delayed_work(&adev->kfd.reset_work);
> +
> + if (amdgpu_sriov_vf(adev))
> + cancel_delayed_work(&adev->virt.flr_work);
> +
> + if (con && adev->ras_enabled)
> + cancel_delayed_work(&con->recovery_work);
> +
> +}
> +
> +
> /**
> * amdgpu_device_gpu_recover - reset the asic and recover scheduler
> *
> @@ -5209,6 +5230,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> r, adev_to_drm(tmp_adev)->unique);
> tmp_adev->asic_reset_res = r;
> }
> +
> + /*
> + * Drop all pending non scheduler resets. Scheduler resets
> + * were already dropped during drm_sched_stop
> + */
> + amdggpu_device_stop_pedning_resets(tmp_adev);
> }
>
> tmp_vram_lost_counter = atomic_read(&((adev)->vram_lost_counter));
* Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
2022-05-17 19:20 [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Andrey Grodzovsky
` (6 preceding siblings ...)
2022-05-17 19:21 ` [PATCH v2 7/7] drm/amdgpu: Stop any pending reset if another in progress Andrey Grodzovsky
@ 2022-05-18 6:07 ` Christian König
2022-05-18 14:24 ` Andrey Grodzovsky
7 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2022-05-18 6:07 UTC (permalink / raw)
To: Andrey Grodzovsky, amd-gfx; +Cc: Zoy.Bai, lijo.lazar
Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
> Problem:
> During a hive reset caused by a command timing out on a ring,
> extra resets are triggered by KFD, which is
> unable to access registers on the resetting ASIC.
>
> Fix: Rework GPU reset to actively stop any pending reset
> work while another is in progress.
>
> v2: Switch from the generic list used in v1[1] to explicit
> stopping of each reset request from each reset source
> per request submitter.
Looks mostly good to me.
Apart from the naming nitpick on patch #1, the only thing I couldn't
figure out offhand is why you are using a delayed work everywhere instead
of just a work item.
That needs a bit further explanation what's happening here.
Christian.
>
> [1] - https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>
> Andrey Grodzovsky (7):
> drm/amdgpu: Cache result of last reset at reset domain level.
> drm/amdgpu: Switch to delayed work from work_struct.
> drm/admgpu: Serialize RAS recovery work directly into reset domain
> queue.
> drm/amdgpu: Add delayed work for GPU reset from debugfs
> drm/amdgpu: Add delayed work for GPU reset from kfd.
> drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
> amdgpu_device_gpu_recover
> drm/amdgpu: Stop any pending reset if another in progress.
>
> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 ++++++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 ++--
> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 5 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 2 +-
> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 6 +--
> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 6 +--
> drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 6 +--
> 14 files changed, 87 insertions(+), 54 deletions(-)
>
* Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
2022-05-18 6:07 ` [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive Christian König
@ 2022-05-18 14:24 ` Andrey Grodzovsky
2022-05-19 7:58 ` Christian König
0 siblings, 1 reply; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-18 14:24 UTC (permalink / raw)
To: Christian König, amd-gfx; +Cc: Zoy.Bai, lijo.lazar
On 2022-05-18 02:07, Christian König wrote:
> Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
>> Problem:
>> During a hive reset caused by a command timing out on a ring,
>> extra resets are triggered by KFD, which is
>> unable to access registers on the resetting ASIC.
>>
>> Fix: Rework GPU reset to actively stop any pending reset
>> work while another is in progress.
>>
>> v2: Switch from the generic list used in v1[1] to explicit
>> stopping of each reset request from each reset source
>> per request submitter.
>
> Looks mostly good to me.
>
> Apart from the naming nitpick on patch #1, the only thing I couldn't
> figure out offhand is why you are using a delayed work everywhere
> instead of just a work item.
>
> That needs a bit further explanation what's happening here.
>
> Christian.
Check APIs for cancelling work vs. delayed work -
For work_struct the only public API is this -
https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214
- blocking cancel.
For delayed_work we have both blocking and non blocking public APIs -
https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
I prefer not to go into convincing core kernel people to expose
another interface just for our sake - from my past experience, API changes
in core code have slim chances and cost a lot of time in back-and-forth
arguments.
"If the mountain will not come to Muhammad, then Muhammad must go to the
mountain" ;)
Andrey
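The distinction drawn above (a blocking cancel that also waits for a running handler vs. a non-blocking one that only drops a not-yet-started item) can be sketched in a small userspace model. This is a hypothetical illustration in plain C11 atomics, not the kernel workqueue API; the names `toy_work`, `toy_cancel` and `toy_cancel_sync` are made up for the sketch:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy userspace model (NOT the kernel API) of a cancellable work item. */
struct toy_work {
	atomic_bool pending; /* queued, handler not started yet */
	atomic_bool running; /* handler currently executing */
};

/* Non-blocking cancel, analogous to cancel_delayed_work(): clear the
 * pending flag and return immediately, even if a handler is already
 * running. Safe to call from inside a work handler. */
static bool toy_cancel(struct toy_work *w)
{
	return atomic_exchange(&w->pending, false);
}

/* Blocking cancel, analogous to cancel_delayed_work_sync(): also wait
 * until a running handler finishes. Calling this from the handler
 * itself would self-deadlock, which is exactly why the reset path
 * needs a non-blocking variant. */
static bool toy_cancel_sync(struct toy_work *w)
{
	bool was_pending = atomic_exchange(&w->pending, false);
	while (atomic_load(&w->running))
		; /* spin until the handler drains */
	return was_pending;
}
```

The kernel-side point is the same: only the non-blocking form is safe from a context that might itself be the running handler, which is why `work_struct` exposing only a blocking public cancel is a problem here.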
>
>>
>> [1] -
>> https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>>
>> Andrey Grodzovsky (7):
>> drm/amdgpu: Cache result of last reset at reset domain level.
>> drm/amdgpu: Switch to delayed work from work_struct.
>> drm/admgpu: Serialize RAS recovery work directly into reset domain
>> queue.
>> drm/amdgpu: Add delayed work for GPU reset from debugfs
>> drm/amdgpu: Add delayed work for GPU reset from kfd.
>> drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
>> amdgpu_device_gpu_recover
>> drm/amdgpu: Stop any pending reset if another in progress.
>>
>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 ++++++-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 ++--
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 5 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 2 +-
>> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 6 +--
>> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 6 +--
>> drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 6 +--
>> 14 files changed, 87 insertions(+), 54 deletions(-)
>>
>
* Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
2022-05-18 14:24 ` Andrey Grodzovsky
@ 2022-05-19 7:58 ` Christian König
2022-05-19 13:41 ` Andrey Grodzovsky
0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2022-05-19 7:58 UTC (permalink / raw)
To: Andrey Grodzovsky, Christian König, amd-gfx; +Cc: Zoy.Bai, lijo.lazar
Am 18.05.22 um 16:24 schrieb Andrey Grodzovsky:
>
>
> On 2022-05-18 02:07, Christian König wrote:
>> Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
>>> Problem:
>>> During a hive reset caused by a command timing out on a ring,
>>> extra resets are triggered by KFD, which is
>>> unable to access registers on the resetting ASIC.
>>>
>>> Fix: Rework GPU reset to actively stop any pending reset
>>> work while another is in progress.
>>>
>>> v2: Switch from the generic list used in v1[1] to explicit
>>> stopping of each reset request from each reset source
>>> per request submitter.
>>
>> Looks mostly good to me.
>>
>> Apart from the naming nitpick on patch #1, the only thing I couldn't
>> figure out offhand is why you are using a delayed work everywhere
>> instead of just a work item.
>>
>> That needs a bit further explanation what's happening here.
>>
>> Christian.
>
>
> Check APIs for cancelling work vs. delayed work -
>
> For work_struct the only public API is this -
> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214
> - blocking cancel.
>
> For delayed_work we have both blocking and non blocking public APIs -
>
> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
>
> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
>
> I prefer not to go into convincing core kernel people to expose
> another interface just for our sake - from my past experience, API
> changes in core code have slim chances and cost a lot of time in
> back-and-forth arguments.
>
> "If the mountain will not come to Muhammad, then Muhammad must go to
> the mountain" ;)
>
Ah, good point. The cancel_work() function was removed a few years ago:
commit 6417250d3f894e66a68ba1cd93676143f2376a6f
Author: Stephen Hemminger <stephen@networkplumber.org>
Date: Tue Mar 6 19:34:42 2018 -0800
workqueue: remove unused cancel_work()
Found this by accident.
There are no usages of bare cancel_work() in current kernel source.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Maybe just revert that patch, export the function and use it. I think
there is plenty of justification for this.
Thanks,
Christian.
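For reference, the revert being proposed would be small. A rough, untested sketch, assuming `__cancel_work()` still has the `(work, is_dwork)` signature it had when `cancel_work()` was removed:

```c
/* kernel/workqueue.c (sketch only, not a tested patch) */
bool cancel_work(struct work_struct *work)
{
	return __cancel_work(work, false);
}
EXPORT_SYMBOL(cancel_work);
```

plus a matching `bool cancel_work(struct work_struct *work);` declaration in `include/linux/workqueue.h`.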
>
> Andrey
>
>>
>>>
>>> [1] -
>>> https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>>>
>>> Andrey Grodzovsky (7):
>>> drm/amdgpu: Cache result of last reset at reset domain level.
>>> drm/amdgpu: Switch to delayed work from work_struct.
>>> drm/admgpu: Serialize RAS recovery work directly into reset domain
>>> queue.
>>> drm/amdgpu: Add delayed work for GPU reset from debugfs
>>> drm/amdgpu: Add delayed work for GPU reset from kfd.
>>> drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
>>> amdgpu_device_gpu_recover
>>> drm/amdgpu: Stop any pending reset if another in progress.
>>>
>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 +
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62
>>> +++++++++++-----------
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 ++++++-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 ++--
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 +
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 5 +-
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 2 +-
>>> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 6 +--
>>> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 6 +--
>>> drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 6 +--
>>> 14 files changed, 87 insertions(+), 54 deletions(-)
>>>
>>
* Re: [PATCH v2 0/7] Fix multiple GPU resets in XGMI hive.
2022-05-19 7:58 ` Christian König
@ 2022-05-19 13:41 ` Andrey Grodzovsky
0 siblings, 0 replies; 15+ messages in thread
From: Andrey Grodzovsky @ 2022-05-19 13:41 UTC (permalink / raw)
To: Christian König, Christian König, amd-gfx; +Cc: Zoy.Bai, lijo.lazar
On 2022-05-19 03:58, Christian König wrote:
>
>
> Am 18.05.22 um 16:24 schrieb Andrey Grodzovsky:
>>
>>
>> On 2022-05-18 02:07, Christian König wrote:
>>> Am 17.05.22 um 21:20 schrieb Andrey Grodzovsky:
>>>> Problem:
>>>> During a hive reset caused by a command timing out on a ring,
>>>> extra resets are triggered by KFD, which is
>>>> unable to access registers on the resetting ASIC.
>>>>
>>>> Fix: Rework GPU reset to actively stop any pending reset
>>>> work while another is in progress.
>>>>
>>>> v2: Switch from the generic list used in v1[1] to explicit
>>>> stopping of each reset request from each reset source
>>>> per request submitter.
>>>
>>> Looks mostly good to me.
>>>
>>> Apart from the naming nitpick on patch #1, the only thing I couldn't
>>> figure out offhand is why you are using a delayed work everywhere
>>> instead of just a work item.
>>>
>>> That needs a bit further explanation what's happening here.
>>>
>>> Christian.
>>
>>
>> Check APIs for cancelling work vs. delayed work -
>>
>> For work_struct the only public API is this -
>> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3214
>> - blocking cancel.
>>
>> For delayed_work we have both blocking and non blocking public APIs -
>>
>> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
>>
>> https://elixir.bootlin.com/linux/latest/source/kernel/workqueue.c#L3295
>>
>> I prefer not to go into convincing core kernel people to expose
>> another interface just for our sake - from my past experience, API
>> changes in core code have slim chances and cost a lot of time in
>> back-and-forth arguments.
>>
>> "If the mountain will not come to Muhammad, then Muhammad must go to
>> the mountain" ;)
>>
>
> Ah, good point. The cancel_work() function was removed a few years ago:
>
> commit 6417250d3f894e66a68ba1cd93676143f2376a6f
> Author: Stephen Hemminger <stephen@networkplumber.org>
> Date: Tue Mar 6 19:34:42 2018 -0800
>
> workqueue: remove unused cancel_work()
>
> Found this by accident.
> There are no usages of bare cancel_work() in current kernel source.
>
> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
>
>
> Maybe just revert that patch, export the function and use it. I think
> there is plenty of justification for this.
>
> Thanks,
> Christian.
Ok - I will send them a patch - let's see what they say.
Andrey
>
>>
>> Andrey
>>
>>>
>>>>
>>>> [1] -
>>>> https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@amd.com/
>>>>
>>>> Andrey Grodzovsky (7):
>>>> drm/amdgpu: Cache result of last reset at reset domain level.
>>>> drm/amdgpu: Switch to delayed work from work_struct.
>>>> drm/admgpu: Serialize RAS recovery work directly into reset domain
>>>> queue.
>>>> drm/amdgpu: Add delayed work for GPU reset from debugfs
>>>> drm/amdgpu: Add delayed work for GPU reset from kfd.
>>>> drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to
>>>> amdgpu_device_gpu_recover
>>>> drm/amdgpu: Stop any pending reset if another in progress.
>>>>
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 4 +-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 1 +
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62
>>>> +++++++++++-----------
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 19 ++++++-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 2 +-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 10 ++--
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h | 2 +-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 1 +
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 5 +-
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h | 2 +-
>>>> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 6 +--
>>>> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 6 +--
>>>> drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 6 +--
>>>> 14 files changed, 87 insertions(+), 54 deletions(-)
>>>>
>>>
>