* [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row
@ 2024-04-26 3:57 Yunxiang Li
2024-04-26 3:57 ` [PATCH v3 2/4] drm/amdgpu: Add reset_context flag for host FLR Yunxiang Li
` (4 more replies)
0 siblings, 5 replies; 18+ messages in thread
From: Yunxiang Li @ 2024-04-26 3:57 UTC (permalink / raw)
To: amd-gfx
Cc: Alexander.Deucher, christian.koenig, lijo.lazar, felix.kuehling,
emily.deng, Yunxiang Li
Some times a hang GPU causes multiple reset sources to schedule resets.
The second source will be able to trigger an unnecessary reset if they
schedule after we call amdgpu_device_stop_pending_resets.
Move amdgpu_device_stop_pending_resets to after the reset is done. Since
at this point the GPU is supposedly in a good state, any reset scheduled
after this point would be a legitimate reset.
Remove unnecessary and incorrect checks for amdgpu_in_reset that was
kinda serving this purpose.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v2: instead of adding amdgpu_in_reset check, move when we cancel pending
resets
v3: no changes from v2, collect all the patches in one series for easier review
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 ++++++++++---------
drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 2 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 2 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 2 +-
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 2 +-
5 files changed, 14 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 861ccff78af9..8befd10bf007 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5070,8 +5070,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
retry:
amdgpu_amdkfd_pre_reset(adev);
- amdgpu_device_stop_pending_resets(adev);
-
if (from_hypervisor)
r = amdgpu_virt_request_full_gpu(adev, true);
else
@@ -5823,13 +5821,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
r, adev_to_drm(tmp_adev)->unique);
tmp_adev->asic_reset_res = r;
}
-
- if (!amdgpu_sriov_vf(tmp_adev))
- /*
- * Drop all pending non scheduler resets. Scheduler resets
- * were already dropped during drm_sched_stop
- */
- amdgpu_device_stop_pending_resets(tmp_adev);
}
/* Actual ASIC resets if needed.*/
@@ -5851,6 +5842,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
goto retry;
}
+ list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
+ /*
+ * Drop any pending non scheduler resets queued before reset is done.
+ * Any reset scheduled after this point would be valid. Scheduler resets
+ * were already dropped during drm_sched_stop and no new ones can come
+ * in before drm_sched_start.
+ */
+ amdgpu_device_stop_pending_resets(tmp_adev);
+ }
+
skip_hw_reset:
/* Post ASIC reset for all devs .*/
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 54ab51a4ada7..c2385178d6b3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -597,7 +597,7 @@ static void amdgpu_virt_update_vf2pf_work_item(struct work_struct *work)
if (ret) {
adev->virt.vf2pf_update_retry_cnt++;
if ((adev->virt.vf2pf_update_retry_cnt >= AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT) &&
- amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev)) {
+ amdgpu_sriov_runtime(adev)) {
amdgpu_ras_set_fed(adev, true);
if (amdgpu_reset_domain_schedule(adev->reset_domain,
&adev->virt.flr_work))
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 0c7275bca8f7..c5ba9c4757a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -319,7 +319,7 @@ static int xgpu_ai_mailbox_rcv_irq(struct amdgpu_device *adev,
switch (event) {
case IDH_FLR_NOTIFICATION:
- if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
+ if (amdgpu_sriov_runtime(adev))
WARN_ONCE(!amdgpu_reset_domain_schedule(adev->reset_domain,
&adev->virt.flr_work),
"Failed to queue work! at %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index aba00d961627..fa9d1b02f391 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -358,7 +358,7 @@ static int xgpu_nv_mailbox_rcv_irq(struct amdgpu_device *adev,
switch (event) {
case IDH_FLR_NOTIFICATION:
- if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
+ if (amdgpu_sriov_runtime(adev))
WARN_ONCE(!amdgpu_reset_domain_schedule(adev->reset_domain,
&adev->virt.flr_work),
"Failed to queue work! at %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index 59f53c743362..14a065516ae4 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -560,7 +560,7 @@ static int xgpu_vi_mailbox_rcv_irq(struct amdgpu_device *adev,
r = xgpu_vi_mailbox_rcv_msg(adev, IDH_FLR_NOTIFICATION);
/* only handle FLR_NOTIFY now */
- if (!r && !amdgpu_in_reset(adev))
+ if (!r)
WARN_ONCE(!amdgpu_reset_domain_schedule(adev->reset_domain,
&adev->virt.flr_work),
"Failed to queue work! at %s",
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v3 2/4] drm/amdgpu: Add reset_context flag for host FLR
2024-04-26 3:57 [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Yunxiang Li
@ 2024-04-26 3:57 ` Yunxiang Li
2024-04-26 18:27 ` [PATCH v4 " Yunxiang Li
2024-04-26 3:57 ` [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic Yunxiang Li
` (3 subsequent siblings)
4 siblings, 1 reply; 18+ messages in thread
From: Yunxiang Li @ 2024-04-26 3:57 UTC (permalink / raw)
To: amd-gfx
Cc: Alexander.Deucher, christian.koenig, lijo.lazar, felix.kuehling,
emily.deng, Yunxiang Li
There are other reset sources that pass NULL as the job pointer, such as
amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if
the FLR comes from the host does not work.
Add a flag in reset_context to explicitly mark host triggered reset, and
set this flag when we receive host reset notification.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v2: fix typo
v3: pass reset_context directly
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++----
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 +
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 1 +
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 1 +
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 1 +
5 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8befd10bf007..1fd9637daafc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5055,13 +5055,13 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
* amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf
*
* @adev: amdgpu_device pointer
- * @from_hypervisor: request from hypervisor
+ * @reset_context: amdgpu reset context pointer
*
* do VF FLR and reinitialize Asic
* return 0 means succeeded otherwise failed
*/
static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
- bool from_hypervisor)
+ struct amdgpu_reset_context *reset_context)
{
int r;
struct amdgpu_hive_info *hive = NULL;
@@ -5070,7 +5070,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
retry:
amdgpu_amdkfd_pre_reset(adev);
- if (from_hypervisor)
+ if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags))
r = amdgpu_virt_request_full_gpu(adev, true);
else
r = amdgpu_virt_reset_gpu(adev);
@@ -5826,7 +5826,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
/* Actual ASIC resets if needed.*/
/* Host driver will handle XGMI hive reset for SRIOV */
if (amdgpu_sriov_vf(adev)) {
- r = amdgpu_device_reset_sriov(adev, job ? false : true);
+ r = amdgpu_device_reset_sriov(adev, reset_context);
if (r)
adev->asic_reset_res = r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index b11d190ece53..5a9cc043b858 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -33,6 +33,7 @@ enum AMDGPU_RESET_FLAGS {
AMDGPU_NEED_FULL_RESET = 0,
AMDGPU_SKIP_HW_RESET = 1,
AMDGPU_SKIP_COREDUMP = 2,
+ AMDGPU_HOST_FLR = 3,
};
struct amdgpu_reset_context {
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index c5ba9c4757a8..f4c47492e0cd 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -292,6 +292,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
reset_context.method = AMD_RESET_METHOD_NONE;
reset_context.reset_req_dev = adev;
clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+ set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
amdgpu_device_gpu_recover(adev, NULL, &reset_context);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index fa9d1b02f391..14cc7910e5cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -328,6 +328,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
reset_context.method = AMD_RESET_METHOD_NONE;
reset_context.reset_req_dev = adev;
clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+ set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
amdgpu_device_gpu_recover(adev, NULL, &reset_context);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index 14a065516ae4..78cd07744ebe 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -529,6 +529,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
reset_context.method = AMD_RESET_METHOD_NONE;
reset_context.reset_req_dev = adev;
clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+ set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
amdgpu_device_gpu_recover(adev, NULL, &reset_context);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
2024-04-26 3:57 [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Yunxiang Li
2024-04-26 3:57 ` [PATCH v3 2/4] drm/amdgpu: Add reset_context flag for host FLR Yunxiang Li
@ 2024-04-26 3:57 ` Yunxiang Li
2024-04-26 7:40 ` Christian König
` (2 more replies)
2024-04-26 3:57 ` [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function Yunxiang Li
` (2 subsequent siblings)
4 siblings, 3 replies; 18+ messages in thread
From: Yunxiang Li @ 2024-04-26 3:57 UTC (permalink / raw)
To: amd-gfx
Cc: Alexander.Deucher, christian.koenig, lijo.lazar, felix.kuehling,
emily.deng, Yunxiang Li
The retry loop for SRIOV reset have refcount and memory leak issue.
Depending on which function call fails it can potentially call
amdgpu_amdkfd_pre/post_reset different number of times and causes
kfd_locked count to be wrong. This will block all future attempts at
opening /dev/kfd. The retry loop also leakes resources by calling
amdgpu_virt_init_data_exchange multiple times without calling the
corresponding fini function.
Align with the bare-metal reset path which doesn't have these issues.
This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
reset loop and calling amdgpu_device_pre_asic_reset each retry which
properly free the resources from previous try by calling
amdgpu_virt_fini_data_exchange.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 50 ++++++++++------------
1 file changed, 22 insertions(+), 28 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1fd9637daafc..3c4755f3c116 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5063,19 +5063,14 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
struct amdgpu_reset_context *reset_context)
{
- int r;
+ int r = 0;
struct amdgpu_hive_info *hive = NULL;
- int retry_limit = 0;
-
-retry:
- amdgpu_amdkfd_pre_reset(adev);
if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags))
r = amdgpu_virt_request_full_gpu(adev, true);
else
r = amdgpu_virt_reset_gpu(adev);
- if (r)
- return r;
+
amdgpu_ras_set_fed(adev, false);
amdgpu_irq_gpu_reset_resume_helper(adev);
@@ -5085,7 +5080,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
/* Resume IP prior to SMC */
r = amdgpu_device_ip_reinit_early_sriov(adev);
if (r)
- goto error;
+ return r;
amdgpu_virt_init_data_exchange(adev);
@@ -5096,38 +5091,35 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
/* now we are okay to resume SMC/CP/SDMA */
r = amdgpu_device_ip_reinit_late_sriov(adev);
if (r)
- goto error;
+ return r;
hive = amdgpu_get_xgmi_hive(adev);
/* Update PSP FW topology after reset */
if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
r = amdgpu_xgmi_update_topology(hive, adev);
-
if (hive)
amdgpu_put_xgmi_hive(hive);
+ if (r)
+ return r;
- if (!r) {
- r = amdgpu_ib_ring_tests(adev);
-
- amdgpu_amdkfd_post_reset(adev);
- }
+ r = amdgpu_ib_ring_tests(adev);
+ if (r)
+ return r;
-error:
- if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
+ if (adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
amdgpu_inc_vram_lost(adev);
r = amdgpu_device_recover_vram(adev);
}
- amdgpu_virt_release_full_gpu(adev, true);
+ if (r)
+ return r;
- if (AMDGPU_RETRY_SRIOV_RESET(r)) {
- if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
- retry_limit++;
- goto retry;
- } else
- DRM_ERROR("GPU reset retry is beyond the retry limit\n");
- }
+ /* need to be called during full access so we can't do it later like
+ * bare-metal does.
+ */
+ amdgpu_amdkfd_post_reset(adev);
+ amdgpu_virt_release_full_gpu(adev, true);
- return r;
+ return 0;
}
/**
@@ -5686,6 +5678,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
int i, r = 0;
bool need_emergency_restart = false;
bool audio_suspended = false;
+ int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
/*
* Special case: RAS triggered and full reset isn't supported
@@ -5767,8 +5760,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
- if (!amdgpu_sriov_vf(tmp_adev))
- amdgpu_amdkfd_pre_reset(tmp_adev);
+ amdgpu_amdkfd_pre_reset(tmp_adev);
/*
* Mark these ASICs to be reseted as untracked first
@@ -5827,6 +5819,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
/* Host driver will handle XGMI hive reset for SRIOV */
if (amdgpu_sriov_vf(adev)) {
r = amdgpu_device_reset_sriov(adev, reset_context);
+ if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0)
+ goto retry;
if (r)
adev->asic_reset_res = r;
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function
2024-04-26 3:57 [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Yunxiang Li
2024-04-26 3:57 ` [PATCH v3 2/4] drm/amdgpu: Add reset_context flag for host FLR Yunxiang Li
2024-04-26 3:57 ` [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic Yunxiang Li
@ 2024-04-26 3:57 ` Yunxiang Li
2024-04-28 7:17 ` Deng, Emily
2024-04-30 18:32 ` Luo, Zhigang
2024-04-30 19:05 ` [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Li, Yunxiang (Teddy)
2024-04-30 19:09 ` Lazar, Lijo
4 siblings, 2 replies; 18+ messages in thread
From: Yunxiang Li @ 2024-04-26 3:57 UTC (permalink / raw)
To: amd-gfx
Cc: Alexander.Deucher, christian.koenig, lijo.lazar, felix.kuehling,
emily.deng, Yunxiang Li
This is part of the reset, move it into the reset function.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3c4755f3c116..8f2c1f71ed9a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5119,6 +5119,11 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
amdgpu_amdkfd_post_reset(adev);
amdgpu_virt_release_full_gpu(adev, true);
+ /* Aldebaran and gfx_11_0_3 support ras in SRIOV, so need resume ras during reset */
+ if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 2) ||
+ amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
+ amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3))
+ amdgpu_ras_resume(adev);
return 0;
}
@@ -5823,13 +5828,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
goto retry;
if (r)
adev->asic_reset_res = r;
-
- /* Aldebaran and gfx_11_0_3 support ras in SRIOV, so need resume ras during reset */
- if (amdgpu_ip_version(adev, GC_HWIP, 0) ==
- IP_VERSION(9, 4, 2) ||
- amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
- amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3))
- amdgpu_ras_resume(adev);
} else {
r = amdgpu_do_asic_reset(device_list_handle, reset_context);
if (r && r == -EAGAIN)
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
2024-04-26 3:57 ` [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic Yunxiang Li
@ 2024-04-26 7:40 ` Christian König
2024-04-26 8:42 ` Deng, Emily
2024-04-26 18:29 ` [PATCH v2 " Yunxiang Li
2 siblings, 0 replies; 18+ messages in thread
From: Christian König @ 2024-04-26 7:40 UTC (permalink / raw)
To: Yunxiang Li, amd-gfx
Cc: Alexander.Deucher, lijo.lazar, felix.kuehling, emily.deng
Am 26.04.24 um 05:57 schrieb Yunxiang Li:
> The retry loop for SRIOV reset have refcount and memory leak issue.
> Depending on which function call fails it can potentially call
> amdgpu_amdkfd_pre/post_reset different number of times and causes
> kfd_locked count to be wrong. This will block all future attempts at
> opening /dev/kfd. The retry loop also leakes resources by calling
> amdgpu_virt_init_data_exchange multiple times without calling the
> corresponding fini function.
>
> Align with the bare-metal reset path which doesn't have these issues.
> This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
> reset loop and calling amdgpu_device_pre_asic_reset each retry which
> properly free the resources from previous try by calling
> amdgpu_virt_fini_data_exchange.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 50 ++++++++++------------
> 1 file changed, 22 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1fd9637daafc..3c4755f3c116 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5063,19 +5063,14 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
> static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
> struct amdgpu_reset_context *reset_context)
> {
> - int r;
> + int r = 0;
Please don't initialize local variables if it isn't necessary.
As far as I can see that is initialized below and will generate a
warning from automated checkers.
Regards,
Christian.
> struct amdgpu_hive_info *hive = NULL;
> - int retry_limit = 0;
> -
> -retry:
> - amdgpu_amdkfd_pre_reset(adev);
>
> if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags))
> r = amdgpu_virt_request_full_gpu(adev, true);
> else
> r = amdgpu_virt_reset_gpu(adev);
> - if (r)
> - return r;
> +
> amdgpu_ras_set_fed(adev, false);
> amdgpu_irq_gpu_reset_resume_helper(adev);
>
> @@ -5085,7 +5080,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
> /* Resume IP prior to SMC */
> r = amdgpu_device_ip_reinit_early_sriov(adev);
> if (r)
> - goto error;
> + return r;
>
> amdgpu_virt_init_data_exchange(adev);
>
> @@ -5096,38 +5091,35 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
> /* now we are okay to resume SMC/CP/SDMA */
> r = amdgpu_device_ip_reinit_late_sriov(adev);
> if (r)
> - goto error;
> + return r;
>
> hive = amdgpu_get_xgmi_hive(adev);
> /* Update PSP FW topology after reset */
> if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
> r = amdgpu_xgmi_update_topology(hive, adev);
> -
> if (hive)
> amdgpu_put_xgmi_hive(hive);
> + if (r)
> + return r;
>
> - if (!r) {
> - r = amdgpu_ib_ring_tests(adev);
> -
> - amdgpu_amdkfd_post_reset(adev);
> - }
> + r = amdgpu_ib_ring_tests(adev);
> + if (r)
> + return r;
>
> -error:
> - if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
> + if (adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
> amdgpu_inc_vram_lost(adev);
> r = amdgpu_device_recover_vram(adev);
> }
> - amdgpu_virt_release_full_gpu(adev, true);
> + if (r)
> + return r;
>
> - if (AMDGPU_RETRY_SRIOV_RESET(r)) {
> - if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
> - retry_limit++;
> - goto retry;
> - } else
> - DRM_ERROR("GPU reset retry is beyond the retry limit\n");
> - }
> + /* need to be called during full access so we can't do it later like
> + * bare-metal does.
> + */
> + amdgpu_amdkfd_post_reset(adev);
> + amdgpu_virt_release_full_gpu(adev, true);
>
> - return r;
> + return 0;
> }
>
> /**
> @@ -5686,6 +5678,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> int i, r = 0;
> bool need_emergency_restart = false;
> bool audio_suspended = false;
> + int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
>
> /*
> * Special case: RAS triggered and full reset isn't supported
> @@ -5767,8 +5760,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>
> cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
>
> - if (!amdgpu_sriov_vf(tmp_adev))
> - amdgpu_amdkfd_pre_reset(tmp_adev);
> + amdgpu_amdkfd_pre_reset(tmp_adev);
>
> /*
> * Mark these ASICs to be reseted as untracked first
> @@ -5827,6 +5819,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> /* Host driver will handle XGMI hive reset for SRIOV */
> if (amdgpu_sriov_vf(adev)) {
> r = amdgpu_device_reset_sriov(adev, reset_context);
> + if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0)
> + goto retry;
> if (r)
> adev->asic_reset_res = r;
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
2024-04-26 3:57 ` [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic Yunxiang Li
2024-04-26 7:40 ` Christian König
@ 2024-04-26 8:42 ` Deng, Emily
2024-04-26 18:23 ` Li, Yunxiang (Teddy)
2024-04-26 18:29 ` [PATCH v2 " Yunxiang Li
2 siblings, 1 reply; 18+ messages in thread
From: Deng, Emily @ 2024-04-26 8:42 UTC (permalink / raw)
To: Li, Yunxiang (Teddy), amd-gfx
Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling, Felix
[AMD Official Use Only - General]
>-----Original Message-----
>From: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Sent: Friday, April 26, 2024 11:58 AM
>To: amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
><Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling,
>Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li,
>Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Subject: [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
>
>The retry loop for SRIOV reset have refcount and memory leak issue.
>Depending on which function call fails it can potentially call
>amdgpu_amdkfd_pre/post_reset different number of times and causes
>kfd_locked count to be wrong. This will block all future attempts at opening
>/dev/kfd. The retry loop also leakes resources by calling
>amdgpu_virt_init_data_exchange multiple times without calling the
>corresponding fini function.
>
>Align with the bare-metal reset path which doesn't have these issues.
>This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
>reset loop and calling amdgpu_device_pre_asic_reset each retry which
>properly free the resources from previous try by calling
>amdgpu_virt_fini_data_exchange.
>
>Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 50 ++++++++++------------
> 1 file changed, 22 insertions(+), 28 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index 1fd9637daafc..3c4755f3c116 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -5063,19 +5063,14 @@ static int amdgpu_device_recover_vram(struct
>amdgpu_device *adev) static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
> struct amdgpu_reset_context
>*reset_context) {
>- int r;
>+ int r = 0;
> struct amdgpu_hive_info *hive = NULL;
>- int retry_limit = 0;
>-
>-retry:
>- amdgpu_amdkfd_pre_reset(adev);
>
> if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags))
> r = amdgpu_virt_request_full_gpu(adev, true);
> else
> r = amdgpu_virt_reset_gpu(adev);
>- if (r)
>- return r;
Why remove this?
Emily Deng
Best Wishes
>+
> amdgpu_ras_set_fed(adev, false);
> amdgpu_irq_gpu_reset_resume_helper(adev);
>
>@@ -5085,7 +5080,7 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
> /* Resume IP prior to SMC */
> r = amdgpu_device_ip_reinit_early_sriov(adev);
> if (r)
>- goto error;
>+ return r;
Need to call amdgpu_virt_release_full_gpu(adev, true) before retry, and the same as below.
Emily Deng
Best Wishes
> amdgpu_virt_init_data_exchange(adev);
>
>@@ -5096,38 +5091,35 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
> /* now we are okay to resume SMC/CP/SDMA */
> r = amdgpu_device_ip_reinit_late_sriov(adev);
> if (r)
>- goto error;
>+ return r;
>
> hive = amdgpu_get_xgmi_hive(adev);
> /* Update PSP FW topology after reset */
> if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
> r = amdgpu_xgmi_update_topology(hive, adev);
>-
> if (hive)
> amdgpu_put_xgmi_hive(hive);
>+ if (r)
>+ return r;
>
>- if (!r) {
>- r = amdgpu_ib_ring_tests(adev);
>-
>- amdgpu_amdkfd_post_reset(adev);
>- }
>+ r = amdgpu_ib_ring_tests(adev);
>+ if (r)
>+ return r;
>
>-error:
>- if (!r && adev->virt.gim_feature &
>AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
>+ if (adev->virt.gim_feature &
>AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
> amdgpu_inc_vram_lost(adev);
> r = amdgpu_device_recover_vram(adev);
> }
>- amdgpu_virt_release_full_gpu(adev, true);
>+ if (r)
>+ return r;
>
>- if (AMDGPU_RETRY_SRIOV_RESET(r)) {
>- if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
>- retry_limit++;
>- goto retry;
>- } else
>- DRM_ERROR("GPU reset retry is beyond the retry
>limit\n");
>- }
>+ /* need to be called during full access so we can't do it later like
>+ * bare-metal does.
>+ */
>+ amdgpu_amdkfd_post_reset(adev);
>+ amdgpu_virt_release_full_gpu(adev, true);
>
>- return r;
>+ return 0;
> }
>
> /**
>@@ -5686,6 +5678,7 @@ int amdgpu_device_gpu_recover(struct
>amdgpu_device *adev,
> int i, r = 0;
> bool need_emergency_restart = false;
> bool audio_suspended = false;
>+ int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
>
> /*
> * Special case: RAS triggered and full reset isn't supported @@ -
>5767,8 +5760,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device
>*adev,
>
> cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
>
>- if (!amdgpu_sriov_vf(tmp_adev))
>- amdgpu_amdkfd_pre_reset(tmp_adev);
>+ amdgpu_amdkfd_pre_reset(tmp_adev);
>
> /*
> * Mark these ASICs to be reseted as untracked first @@ -
>5827,6 +5819,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device
>*adev,
> /* Host driver will handle XGMI hive reset for SRIOV */
> if (amdgpu_sriov_vf(adev)) {
> r = amdgpu_device_reset_sriov(adev, reset_context);
>+ if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0)
>+ goto retry;
> if (r)
> adev->asic_reset_res = r;
>
>--
>2.34.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
2024-04-26 8:42 ` Deng, Emily
@ 2024-04-26 18:23 ` Li, Yunxiang (Teddy)
0 siblings, 0 replies; 18+ messages in thread
From: Li, Yunxiang (Teddy) @ 2024-04-26 18:23 UTC (permalink / raw)
To: Deng, Emily, amd-gfx
Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling, Felix
[Public]
> Why remove this?
Oops it's a copy-paste error from the previous revision
> Need to call amdgpu_virt_release_full_gpu(adev, true) before retry, and the same as below.
I thought we talked about if we call amdgpu_virt_{reset,request_full}_gpu again we don't need to release full gpu, I must have mis-understood. I'll put it back. Yeah checking the code it seem we still need to release.
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v4 2/4] drm/amdgpu: Add reset_context flag for host FLR
2024-04-26 3:57 ` [PATCH v3 2/4] drm/amdgpu: Add reset_context flag for host FLR Yunxiang Li
@ 2024-04-26 18:27 ` Yunxiang Li
2024-04-28 7:16 ` Deng, Emily
2024-04-30 18:28 ` Luo, Zhigang
0 siblings, 2 replies; 18+ messages in thread
From: Yunxiang Li @ 2024-04-26 18:27 UTC (permalink / raw)
To: amd-gfx
Cc: Alexander.Deucher, christian.koenig, lijo.lazar, felix.kuehling,
emily.deng, Yunxiang Li
There are other reset sources that pass NULL as the job pointer, such as
amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if
the FLR comes from the host does not work.
Add a flag in reset_context to explicitly mark host triggered reset, and
set this flag when we receive host reset notification.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v2: fix typo
v3: pass reset_context directly
v4: clear the flag in case we retry
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 ++++++++-----
drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 +
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 1 +
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 1 +
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 1 +
5 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8befd10bf007..33c889c027a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5055,13 +5055,13 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
* amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf
*
* @adev: amdgpu_device pointer
- * @from_hypervisor: request from hypervisor
+ * @reset_context: amdgpu reset context pointer
*
* do VF FLR and reinitialize Asic
* return 0 means succeeded otherwise failed
*/
static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
- bool from_hypervisor)
+ struct amdgpu_reset_context *reset_context)
{
int r;
struct amdgpu_hive_info *hive = NULL;
@@ -5070,12 +5070,15 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
retry:
amdgpu_amdkfd_pre_reset(adev);
- if (from_hypervisor)
+ if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
+ clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
r = amdgpu_virt_request_full_gpu(adev, true);
- else
+ } else {
r = amdgpu_virt_reset_gpu(adev);
+ }
if (r)
return r;
+
amdgpu_ras_set_fed(adev, false);
amdgpu_irq_gpu_reset_resume_helper(adev);
@@ -5826,7 +5829,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
/* Actual ASIC resets if needed.*/
/* Host driver will handle XGMI hive reset for SRIOV */
if (amdgpu_sriov_vf(adev)) {
- r = amdgpu_device_reset_sriov(adev, job ? false : true);
+ r = amdgpu_device_reset_sriov(adev, reset_context);
if (r)
adev->asic_reset_res = r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index b11d190ece53..5a9cc043b858 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -33,6 +33,7 @@ enum AMDGPU_RESET_FLAGS {
AMDGPU_NEED_FULL_RESET = 0,
AMDGPU_SKIP_HW_RESET = 1,
AMDGPU_SKIP_COREDUMP = 2,
+ AMDGPU_HOST_FLR = 3,
};
struct amdgpu_reset_context {
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index c5ba9c4757a8..f4c47492e0cd 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -292,6 +292,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
reset_context.method = AMD_RESET_METHOD_NONE;
reset_context.reset_req_dev = adev;
clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+ set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
amdgpu_device_gpu_recover(adev, NULL, &reset_context);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index fa9d1b02f391..14cc7910e5cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -328,6 +328,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
reset_context.method = AMD_RESET_METHOD_NONE;
reset_context.reset_req_dev = adev;
clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+ set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
amdgpu_device_gpu_recover(adev, NULL, &reset_context);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index 14a065516ae4..78cd07744ebe 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -529,6 +529,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
reset_context.method = AMD_RESET_METHOD_NONE;
reset_context.reset_req_dev = adev;
clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+ set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
amdgpu_device_gpu_recover(adev, NULL, &reset_context);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v2 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
2024-04-26 3:57 ` [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic Yunxiang Li
2024-04-26 7:40 ` Christian König
2024-04-26 8:42 ` Deng, Emily
@ 2024-04-26 18:29 ` Yunxiang Li
2024-04-28 7:16 ` Deng, Emily
2024-04-30 18:25 ` Luo, Zhigang
2 siblings, 2 replies; 18+ messages in thread
From: Yunxiang Li @ 2024-04-26 18:29 UTC (permalink / raw)
To: amd-gfx
Cc: Alexander.Deucher, christian.koenig, lijo.lazar, felix.kuehling,
emily.deng, Yunxiang Li
The retry loop for SRIOV reset have refcount and memory leak issue.
Depending on which function call fails it can potentially call
amdgpu_amdkfd_pre/post_reset different number of times and causes
kfd_locked count to be wrong. This will block all future attempts at
opening /dev/kfd. The retry loop also leakes resources by calling
amdgpu_virt_init_data_exchange multiple times without calling the
corresponding fini function.
Align with the bare-metal reset path which doesn't have these issues.
This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
reset loop and calling amdgpu_device_pre_asic_reset each retry which
properly free the resources from previous try by calling
amdgpu_virt_fini_data_exchange.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v2: put back release full access and the missed return
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 47 ++++++++++------------
1 file changed, 22 insertions(+), 25 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 33c889c027a5..b23645f23a2e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5065,10 +5065,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
{
int r;
struct amdgpu_hive_info *hive = NULL;
- int retry_limit = 0;
-
-retry:
- amdgpu_amdkfd_pre_reset(adev);
if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
@@ -5088,7 +5084,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
/* Resume IP prior to SMC */
r = amdgpu_device_ip_reinit_early_sriov(adev);
if (r)
- goto error;
+ return r;
amdgpu_virt_init_data_exchange(adev);
@@ -5099,38 +5095,35 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
/* now we are okay to resume SMC/CP/SDMA */
r = amdgpu_device_ip_reinit_late_sriov(adev);
if (r)
- goto error;
+ return r;
hive = amdgpu_get_xgmi_hive(adev);
/* Update PSP FW topology after reset */
if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
r = amdgpu_xgmi_update_topology(hive, adev);
-
if (hive)
amdgpu_put_xgmi_hive(hive);
+ if (r)
+ return r;
- if (!r) {
- r = amdgpu_ib_ring_tests(adev);
-
- amdgpu_amdkfd_post_reset(adev);
- }
+ r = amdgpu_ib_ring_tests(adev);
+ if (r)
+ return r;
-error:
- if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
+ if (adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
amdgpu_inc_vram_lost(adev);
r = amdgpu_device_recover_vram(adev);
}
- amdgpu_virt_release_full_gpu(adev, true);
+ if (r)
+ return r;
- if (AMDGPU_RETRY_SRIOV_RESET(r)) {
- if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
- retry_limit++;
- goto retry;
- } else
- DRM_ERROR("GPU reset retry is beyond the retry limit\n");
- }
+ /* need to be called during full access so we can't do it later like
+ * bare-metal does.
+ */
+ amdgpu_amdkfd_post_reset(adev);
+ amdgpu_virt_release_full_gpu(adev, true);
- return r;
+ return 0;
}
/**
@@ -5689,6 +5682,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
int i, r = 0;
bool need_emergency_restart = false;
bool audio_suspended = false;
+ int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
/*
* Special case: RAS triggered and full reset isn't supported
@@ -5770,8 +5764,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
- if (!amdgpu_sriov_vf(tmp_adev))
- amdgpu_amdkfd_pre_reset(tmp_adev);
+ amdgpu_amdkfd_pre_reset(tmp_adev);
/*
* Mark these ASICs to be reseted as untracked first
@@ -5830,6 +5823,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
/* Host driver will handle XGMI hive reset for SRIOV */
if (amdgpu_sriov_vf(adev)) {
r = amdgpu_device_reset_sriov(adev, reset_context);
+ if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0) {
+ amdgpu_virt_release_full_gpu(adev, true);
+ goto retry;
+ }
if (r)
adev->asic_reset_res = r;
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* RE: [PATCH v2 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
2024-04-26 18:29 ` [PATCH v2 " Yunxiang Li
@ 2024-04-28 7:16 ` Deng, Emily
2024-04-30 18:25 ` Luo, Zhigang
1 sibling, 0 replies; 18+ messages in thread
From: Deng, Emily @ 2024-04-28 7:16 UTC (permalink / raw)
To: Li, Yunxiang (Teddy), amd-gfx
Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling, Felix
[AMD Official Use Only - General]
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Emily Deng
Best Wishes
>-----Original Message-----
>From: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Sent: Saturday, April 27, 2024 2:29 AM
>To: amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
><Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling,
>Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li,
>Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Subject: [PATCH v2 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry
>logic
>
>The retry loop for SRIOV reset have refcount and memory leak issue.
>Depending on which function call fails it can potentially call
>amdgpu_amdkfd_pre/post_reset different number of times and causes
>kfd_locked count to be wrong. This will block all future attempts at opening
>/dev/kfd. The retry loop also leakes resources by calling
>amdgpu_virt_init_data_exchange multiple times without calling the
>corresponding fini function.
>
>Align with the bare-metal reset path which doesn't have these issues.
>This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
>reset loop and calling amdgpu_device_pre_asic_reset each retry which
>properly free the resources from previous try by calling
>amdgpu_virt_fini_data_exchange.
>
>Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>---
>v2: put back release full access and the missed return
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 47 ++++++++++------------
> 1 file changed, 22 insertions(+), 25 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index 33c889c027a5..b23645f23a2e 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -5065,10 +5065,6 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev, {
> int r;
> struct amdgpu_hive_info *hive = NULL;
>- int retry_limit = 0;
>-
>-retry:
>- amdgpu_amdkfd_pre_reset(adev);
>
> if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
> clear_bit(AMDGPU_HOST_FLR, &reset_context->flags); @@ -
>5088,7 +5084,7 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
> /* Resume IP prior to SMC */
> r = amdgpu_device_ip_reinit_early_sriov(adev);
> if (r)
>- goto error;
>+ return r;
>
> amdgpu_virt_init_data_exchange(adev);
>
>@@ -5099,38 +5095,35 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
> /* now we are okay to resume SMC/CP/SDMA */
> r = amdgpu_device_ip_reinit_late_sriov(adev);
> if (r)
>- goto error;
>+ return r;
>
> hive = amdgpu_get_xgmi_hive(adev);
> /* Update PSP FW topology after reset */
> if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
> r = amdgpu_xgmi_update_topology(hive, adev);
>-
> if (hive)
> amdgpu_put_xgmi_hive(hive);
>+ if (r)
>+ return r;
>
>- if (!r) {
>- r = amdgpu_ib_ring_tests(adev);
>-
>- amdgpu_amdkfd_post_reset(adev);
>- }
>+ r = amdgpu_ib_ring_tests(adev);
>+ if (r)
>+ return r;
>
>-error:
>- if (!r && adev->virt.gim_feature &
>AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
>+ if (adev->virt.gim_feature &
>AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
> amdgpu_inc_vram_lost(adev);
> r = amdgpu_device_recover_vram(adev);
> }
>- amdgpu_virt_release_full_gpu(adev, true);
>+ if (r)
>+ return r;
>
>- if (AMDGPU_RETRY_SRIOV_RESET(r)) {
>- if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
>- retry_limit++;
>- goto retry;
>- } else
>- DRM_ERROR("GPU reset retry is beyond the retry
>limit\n");
>- }
>+ /* need to be called during full access so we can't do it later like
>+ * bare-metal does.
>+ */
>+ amdgpu_amdkfd_post_reset(adev);
>+ amdgpu_virt_release_full_gpu(adev, true);
>
>- return r;
>+ return 0;
> }
>
> /**
>@@ -5689,6 +5682,7 @@ int amdgpu_device_gpu_recover(struct
>amdgpu_device *adev,
> int i, r = 0;
> bool need_emergency_restart = false;
> bool audio_suspended = false;
>+ int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
>
> /*
> * Special case: RAS triggered and full reset isn't supported @@ -
>5770,8 +5764,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device
>*adev,
>
> cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
>
>- if (!amdgpu_sriov_vf(tmp_adev))
>- amdgpu_amdkfd_pre_reset(tmp_adev);
>+ amdgpu_amdkfd_pre_reset(tmp_adev);
>
> /*
> * Mark these ASICs to be reseted as untracked first @@ -
>5830,6 +5823,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device
>*adev,
> /* Host driver will handle XGMI hive reset for SRIOV */
> if (amdgpu_sriov_vf(adev)) {
> r = amdgpu_device_reset_sriov(adev, reset_context);
>+ if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0) {
>+ amdgpu_virt_release_full_gpu(adev, true);
>+ goto retry;
>+ }
> if (r)
> adev->asic_reset_res = r;
>
>--
>2.34.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH v4 2/4] drm/amdgpu: Add reset_context flag for host FLR
2024-04-26 18:27 ` [PATCH v4 " Yunxiang Li
@ 2024-04-28 7:16 ` Deng, Emily
2024-04-30 18:28 ` Luo, Zhigang
1 sibling, 0 replies; 18+ messages in thread
From: Deng, Emily @ 2024-04-28 7:16 UTC (permalink / raw)
To: Li, Yunxiang (Teddy), amd-gfx
Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling, Felix
[AMD Official Use Only - General]
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Emily Deng
Best Wishes
>-----Original Message-----
>From: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Sent: Saturday, April 27, 2024 2:27 AM
>To: amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
><Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling,
>Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li,
>Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Subject: [PATCH v4 2/4] drm/amdgpu: Add reset_context flag for host FLR
>
>There are other reset sources that pass NULL as the job pointer, such as
>amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if the
>FLR comes from the host does not work.
>
>Add a flag in reset_context to explicitly mark host triggered reset, and set
>this flag when we receive host reset notification.
>
>Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>---
>v2: fix typo
>v3: pass reset_context directly
>v4: clear the flag in case we retry
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 ++++++++-----
>drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 +
> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 1 +
> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 1 +
> drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 1 +
> 5 files changed, 12 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index 8befd10bf007..33c889c027a5 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -5055,13 +5055,13 @@ static int amdgpu_device_recover_vram(struct
>amdgpu_device *adev)
> * amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf
> *
> * @adev: amdgpu_device pointer
>- * @from_hypervisor: request from hypervisor
>+ * @reset_context: amdgpu reset context pointer
> *
> * do VF FLR and reinitialize Asic
> * return 0 means succeeded otherwise failed
> */
> static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>- bool from_hypervisor)
>+ struct amdgpu_reset_context
>*reset_context)
> {
> int r;
> struct amdgpu_hive_info *hive = NULL;
>@@ -5070,12 +5070,15 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
> retry:
> amdgpu_amdkfd_pre_reset(adev);
>
>- if (from_hypervisor)
>+ if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
>+ clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
> r = amdgpu_virt_request_full_gpu(adev, true);
>- else
>+ } else {
> r = amdgpu_virt_reset_gpu(adev);
>+ }
> if (r)
> return r;
>+
> amdgpu_ras_set_fed(adev, false);
> amdgpu_irq_gpu_reset_resume_helper(adev);
>
>@@ -5826,7 +5829,7 @@ int amdgpu_device_gpu_recover(struct
>amdgpu_device *adev,
> /* Actual ASIC resets if needed.*/
> /* Host driver will handle XGMI hive reset for SRIOV */
> if (amdgpu_sriov_vf(adev)) {
>- r = amdgpu_device_reset_sriov(adev, job ? false : true);
>+ r = amdgpu_device_reset_sriov(adev, reset_context);
> if (r)
> adev->asic_reset_res = r;
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>index b11d190ece53..5a9cc043b858 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>@@ -33,6 +33,7 @@ enum AMDGPU_RESET_FLAGS {
> AMDGPU_NEED_FULL_RESET = 0,
> AMDGPU_SKIP_HW_RESET = 1,
> AMDGPU_SKIP_COREDUMP = 2,
>+ AMDGPU_HOST_FLR = 3,
> };
>
> struct amdgpu_reset_context {
>diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>index c5ba9c4757a8..f4c47492e0cd 100644
>--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>@@ -292,6 +292,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>work_struct *work)
> reset_context.method = AMD_RESET_METHOD_NONE;
> reset_context.reset_req_dev = adev;
> clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>+ set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
>
> amdgpu_device_gpu_recover(adev, NULL, &reset_context);
> }
>diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>index fa9d1b02f391..14cc7910e5cf 100644
>--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>@@ -328,6 +328,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>work_struct *work)
> reset_context.method = AMD_RESET_METHOD_NONE;
> reset_context.reset_req_dev = adev;
> clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>+ set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
>
> amdgpu_device_gpu_recover(adev, NULL, &reset_context);
> }
>diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>index 14a065516ae4..78cd07744ebe 100644
>--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>@@ -529,6 +529,7 @@ static void xgpu_vi_mailbox_flr_work(struct
>work_struct *work)
> reset_context.method = AMD_RESET_METHOD_NONE;
> reset_context.reset_req_dev = adev;
> clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>+ set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
>
> amdgpu_device_gpu_recover(adev, NULL, &reset_context);
> }
>--
>2.34.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function
2024-04-26 3:57 ` [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function Yunxiang Li
@ 2024-04-28 7:17 ` Deng, Emily
2024-04-30 18:32 ` Luo, Zhigang
1 sibling, 0 replies; 18+ messages in thread
From: Deng, Emily @ 2024-04-28 7:17 UTC (permalink / raw)
To: Li, Yunxiang (Teddy), amd-gfx
Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling, Felix
[AMD Official Use Only - General]
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Emily Deng
Best Wishes
>-----Original Message-----
>From: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Sent: Friday, April 26, 2024 11:58 AM
>To: amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
><Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling,
>Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li,
>Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Subject: [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function
>
>This is part of the reset, move it into the reset function.
>
>Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++-------
> 1 file changed, 5 insertions(+), 7 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index 3c4755f3c116..8f2c1f71ed9a 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -5119,6 +5119,11 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
> amdgpu_amdkfd_post_reset(adev);
> amdgpu_virt_release_full_gpu(adev, true);
>
>+ /* Aldebaran and gfx_11_0_3 support ras in SRIOV, so need resume
>ras during reset */
>+ if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 2) ||
>+ amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
>+ amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3))
>+ amdgpu_ras_resume(adev);
> return 0;
> }
>
>@@ -5823,13 +5828,6 @@ int amdgpu_device_gpu_recover(struct
>amdgpu_device *adev,
> goto retry;
> if (r)
> adev->asic_reset_res = r;
>-
>- /* Aldebaran and gfx_11_0_3 support ras in SRIOV, so need
>resume ras during reset */
>- if (amdgpu_ip_version(adev, GC_HWIP, 0) ==
>- IP_VERSION(9, 4, 2) ||
>- amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4,
>3) ||
>- amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11,
>0, 3))
>- amdgpu_ras_resume(adev);
> } else {
> r = amdgpu_do_asic_reset(device_list_handle,
>reset_context);
> if (r && r == -EAGAIN)
>--
>2.34.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH v2 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
2024-04-26 18:29 ` [PATCH v2 " Yunxiang Li
2024-04-28 7:16 ` Deng, Emily
@ 2024-04-30 18:25 ` Luo, Zhigang
1 sibling, 0 replies; 18+ messages in thread
From: Luo, Zhigang @ 2024-04-30 18:25 UTC (permalink / raw)
To: Li, Yunxiang (Teddy), amd-gfx
Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling,
Felix, Deng, Emily, Li, Yunxiang (Teddy)
[AMD Official Use Only - General]
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Yunxiang Li
Sent: Friday, April 26, 2024 2:29 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
Subject: [PATCH v2 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
The retry loop for SRIOV reset have refcount and memory leak issue.
Depending on which function call fails it can potentially call amdgpu_amdkfd_pre/post_reset different number of times and causes kfd_locked count to be wrong. This will block all future attempts at opening /dev/kfd. The retry loop also leakes resources by calling amdgpu_virt_init_data_exchange multiple times without calling the corresponding fini function.
Align with the bare-metal reset path which doesn't have these issues.
This means taking the amdgpu_amdkfd_pre/post_reset functions out of the reset loop and calling amdgpu_device_pre_asic_reset each retry which properly free the resources from previous try by calling amdgpu_virt_fini_data_exchange.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v2: put back release full access and the missed return
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 47 ++++++++++------------
1 file changed, 22 insertions(+), 25 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 33c889c027a5..b23645f23a2e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5065,10 +5065,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev, {
int r;
struct amdgpu_hive_info *hive = NULL;
- int retry_limit = 0;
-
-retry:
- amdgpu_amdkfd_pre_reset(adev);
if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
clear_bit(AMDGPU_HOST_FLR, &reset_context->flags); @@ -5088,7 +5084,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
/* Resume IP prior to SMC */
r = amdgpu_device_ip_reinit_early_sriov(adev);
if (r)
- goto error;
+ return r;
amdgpu_virt_init_data_exchange(adev);
@@ -5099,38 +5095,35 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
/* now we are okay to resume SMC/CP/SDMA */
r = amdgpu_device_ip_reinit_late_sriov(adev);
if (r)
- goto error;
+ return r;
hive = amdgpu_get_xgmi_hive(adev);
/* Update PSP FW topology after reset */
if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
r = amdgpu_xgmi_update_topology(hive, adev);
-
if (hive)
amdgpu_put_xgmi_hive(hive);
+ if (r)
+ return r;
- if (!r) {
- r = amdgpu_ib_ring_tests(adev);
-
- amdgpu_amdkfd_post_reset(adev);
- }
+ r = amdgpu_ib_ring_tests(adev);
+ if (r)
+ return r;
-error:
- if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
+ if (adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
amdgpu_inc_vram_lost(adev);
r = amdgpu_device_recover_vram(adev);
}
- amdgpu_virt_release_full_gpu(adev, true);
+ if (r)
+ return r;
- if (AMDGPU_RETRY_SRIOV_RESET(r)) {
- if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
- retry_limit++;
- goto retry;
- } else
- DRM_ERROR("GPU reset retry is beyond the retry limit\n");
- }
+ /* need to be called during full access so we can't do it later like
+ * bare-metal does.
+ */
+ amdgpu_amdkfd_post_reset(adev);
+ amdgpu_virt_release_full_gpu(adev, true);
- return r;
+ return 0;
}
/**
@@ -5689,6 +5682,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
int i, r = 0;
bool need_emergency_restart = false;
bool audio_suspended = false;
+ int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
/*
* Special case: RAS triggered and full reset isn't supported @@ -5770,8 +5764,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
- if (!amdgpu_sriov_vf(tmp_adev))
- amdgpu_amdkfd_pre_reset(tmp_adev);
+ amdgpu_amdkfd_pre_reset(tmp_adev);
/*
* Mark these ASICs to be reseted as untracked first @@ -5830,6 +5823,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
/* Host driver will handle XGMI hive reset for SRIOV */
if (amdgpu_sriov_vf(adev)) {
r = amdgpu_device_reset_sriov(adev, reset_context);
+ if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0) {
+ amdgpu_virt_release_full_gpu(adev, true);
+ goto retry;
+ }
if (r)
adev->asic_reset_res = r;
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* RE: [PATCH v4 2/4] drm/amdgpu: Add reset_context flag for host FLR
2024-04-26 18:27 ` [PATCH v4 " Yunxiang Li
2024-04-28 7:16 ` Deng, Emily
@ 2024-04-30 18:28 ` Luo, Zhigang
1 sibling, 0 replies; 18+ messages in thread
From: Luo, Zhigang @ 2024-04-30 18:28 UTC (permalink / raw)
To: Li, Yunxiang (Teddy), amd-gfx
Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling,
Felix, Deng, Emily, Li, Yunxiang (Teddy)
[AMD Official Use Only - General]
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Yunxiang Li
Sent: Friday, April 26, 2024 2:27 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
Subject: [PATCH v4 2/4] drm/amdgpu: Add reset_context flag for host FLR
There are other reset sources that pass NULL as the job pointer, such as amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if the FLR comes from the host does not work.
Add a flag in reset_context to explicitly mark host triggered reset, and set this flag when we receive host reset notification.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v2: fix typo
v3: pass reset_context directly
v4: clear the flag in case we retry
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 ++++++++----- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 +
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 1 +
drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 1 +
drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 1 +
5 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8befd10bf007..33c889c027a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5055,13 +5055,13 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
* amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf
*
* @adev: amdgpu_device pointer
- * @from_hypervisor: request from hypervisor
+ * @reset_context: amdgpu reset context pointer
*
* do VF FLR and reinitialize Asic
* return 0 means succeeded otherwise failed
*/
static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
- bool from_hypervisor)
+ struct amdgpu_reset_context *reset_context)
{
int r;
struct amdgpu_hive_info *hive = NULL;
@@ -5070,12 +5070,15 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
retry:
amdgpu_amdkfd_pre_reset(adev);
- if (from_hypervisor)
+ if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
+ clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
r = amdgpu_virt_request_full_gpu(adev, true);
- else
+ } else {
r = amdgpu_virt_reset_gpu(adev);
+ }
if (r)
return r;
+
amdgpu_ras_set_fed(adev, false);
amdgpu_irq_gpu_reset_resume_helper(adev);
@@ -5826,7 +5829,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
/* Actual ASIC resets if needed.*/
/* Host driver will handle XGMI hive reset for SRIOV */
if (amdgpu_sriov_vf(adev)) {
- r = amdgpu_device_reset_sriov(adev, job ? false : true);
+ r = amdgpu_device_reset_sriov(adev, reset_context);
if (r)
adev->asic_reset_res = r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index b11d190ece53..5a9cc043b858 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -33,6 +33,7 @@ enum AMDGPU_RESET_FLAGS {
AMDGPU_NEED_FULL_RESET = 0,
AMDGPU_SKIP_HW_RESET = 1,
AMDGPU_SKIP_COREDUMP = 2,
+ AMDGPU_HOST_FLR = 3,
};
struct amdgpu_reset_context {
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index c5ba9c4757a8..f4c47492e0cd 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -292,6 +292,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
reset_context.method = AMD_RESET_METHOD_NONE;
reset_context.reset_req_dev = adev;
clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+ set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
amdgpu_device_gpu_recover(adev, NULL, &reset_context);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index fa9d1b02f391..14cc7910e5cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -328,6 +328,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
reset_context.method = AMD_RESET_METHOD_NONE;
reset_context.reset_req_dev = adev;
clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+ set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
amdgpu_device_gpu_recover(adev, NULL, &reset_context);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index 14a065516ae4..78cd07744ebe 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -529,6 +529,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
reset_context.method = AMD_RESET_METHOD_NONE;
reset_context.reset_req_dev = adev;
clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+ set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
amdgpu_device_gpu_recover(adev, NULL, &reset_context);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* RE: [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function
2024-04-26 3:57 ` [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function Yunxiang Li
2024-04-28 7:17 ` Deng, Emily
@ 2024-04-30 18:32 ` Luo, Zhigang
1 sibling, 0 replies; 18+ messages in thread
From: Luo, Zhigang @ 2024-04-30 18:32 UTC (permalink / raw)
To: Li, Yunxiang (Teddy), amd-gfx
Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling,
Felix, Deng, Emily, Li, Yunxiang (Teddy)
[AMD Official Use Only - General]
Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>
-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Yunxiang Li
Sent: Thursday, April 25, 2024 11:58 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
Subject: [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function
This is part of the reset, move it into the reset function.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3c4755f3c116..8f2c1f71ed9a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5119,6 +5119,11 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
amdgpu_amdkfd_post_reset(adev);
amdgpu_virt_release_full_gpu(adev, true);
+ /* Aldebaran and gfx_11_0_3 support ras in SRIOV, so need resume ras during reset */
+ if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 2) ||
+ amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
+ amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3))
+ amdgpu_ras_resume(adev);
return 0;
}
@@ -5823,13 +5828,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
goto retry;
if (r)
adev->asic_reset_res = r;
-
- /* Aldebaran and gfx_11_0_3 support ras in SRIOV, so need resume ras during reset */
- if (amdgpu_ip_version(adev, GC_HWIP, 0) ==
- IP_VERSION(9, 4, 2) ||
- amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
- amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3))
- amdgpu_ras_resume(adev);
} else {
r = amdgpu_do_asic_reset(device_list_handle, reset_context);
if (r && r == -EAGAIN)
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* RE: [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row
2024-04-26 3:57 [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Yunxiang Li
` (2 preceding siblings ...)
2024-04-26 3:57 ` [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function Yunxiang Li
@ 2024-04-30 19:05 ` Li, Yunxiang (Teddy)
2024-05-02 7:15 ` Christian König
2024-04-30 19:09 ` Lazar, Lijo
4 siblings, 1 reply; 18+ messages in thread
From: Li, Yunxiang (Teddy) @ 2024-04-30 19:05 UTC (permalink / raw)
To: amd-gfx, Koenig, Christian
Cc: Deucher, Alexander, Lazar, Lijo, Kuehling, Felix, Deng, Emily
[Public]
Hi Christ,
I got R-b from the SRIOV team for the rest of the patches, can you help review this last one? I think the concerns from the previous thread are all addressed https://patchwork.freedesktop.org/patch/590678/?series=132727
Regards,
Teddy
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row
2024-04-26 3:57 [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Yunxiang Li
` (3 preceding siblings ...)
2024-04-30 19:05 ` [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Li, Yunxiang (Teddy)
@ 2024-04-30 19:09 ` Lazar, Lijo
4 siblings, 0 replies; 18+ messages in thread
From: Lazar, Lijo @ 2024-04-30 19:09 UTC (permalink / raw)
To: Yunxiang Li, amd-gfx
Cc: Alexander.Deucher, christian.koenig, felix.kuehling, emily.deng
On 4/26/2024 9:27 AM, Yunxiang Li wrote:
> Some times a hang GPU causes multiple reset sources to schedule resets.
> The second source will be able to trigger an unnecessary reset if they
> schedule after we call amdgpu_device_stop_pending_resets.
>
> Move amdgpu_device_stop_pending_resets to after the reset is done. Since
> at this point the GPU is supposedly in a good state, any reset scheduled
> after this point would be a legitimate reset.
>
> Remove unnecessary and incorrect checks for amdgpu_in_reset that was
> kinda serving this purpose.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Thanks,
Lijo
> ---
> v2: instead of adding amdgpu_in_reset check, move when we cancel pending
> resets
> v3: no changes from v2, collect all the patches in one series for easier review
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 ++++++++++---------
> drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 2 +-
> drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c | 2 +-
> 5 files changed, 14 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 861ccff78af9..8befd10bf007 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5070,8 +5070,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
> retry:
> amdgpu_amdkfd_pre_reset(adev);
>
> - amdgpu_device_stop_pending_resets(adev);
> -
> if (from_hypervisor)
> r = amdgpu_virt_request_full_gpu(adev, true);
> else
> @@ -5823,13 +5821,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> r, adev_to_drm(tmp_adev)->unique);
> tmp_adev->asic_reset_res = r;
> }
> -
> - if (!amdgpu_sriov_vf(tmp_adev))
> - /*
> - * Drop all pending non scheduler resets. Scheduler resets
> - * were already dropped during drm_sched_stop
> - */
> - amdgpu_device_stop_pending_resets(tmp_adev);
> }
>
> /* Actual ASIC resets if needed.*/
> @@ -5851,6 +5842,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> goto retry;
> }
>
> + list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
> + /*
> + * Drop any pending non scheduler resets queued before reset is done.
> + * Any reset scheduled after this point would be valid. Scheduler resets
> + * were already dropped during drm_sched_stop and no new ones can come
> + * in before drm_sched_start.
> + */
> + amdgpu_device_stop_pending_resets(tmp_adev);
> + }
> +
> skip_hw_reset:
>
> /* Post ASIC reset for all devs .*/
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index 54ab51a4ada7..c2385178d6b3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -597,7 +597,7 @@ static void amdgpu_virt_update_vf2pf_work_item(struct work_struct *work)
> if (ret) {
> adev->virt.vf2pf_update_retry_cnt++;
> if ((adev->virt.vf2pf_update_retry_cnt >= AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT) &&
> - amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev)) {
> + amdgpu_sriov_runtime(adev)) {
> amdgpu_ras_set_fed(adev, true);
> if (amdgpu_reset_domain_schedule(adev->reset_domain,
> &adev->virt.flr_work))
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index 0c7275bca8f7..c5ba9c4757a8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -319,7 +319,7 @@ static int xgpu_ai_mailbox_rcv_irq(struct amdgpu_device *adev,
>
> switch (event) {
> case IDH_FLR_NOTIFICATION:
> - if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
> + if (amdgpu_sriov_runtime(adev))
> WARN_ONCE(!amdgpu_reset_domain_schedule(adev->reset_domain,
> &adev->virt.flr_work),
> "Failed to queue work! at %s",
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> index aba00d961627..fa9d1b02f391 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> @@ -358,7 +358,7 @@ static int xgpu_nv_mailbox_rcv_irq(struct amdgpu_device *adev,
>
> switch (event) {
> case IDH_FLR_NOTIFICATION:
> - if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
> + if (amdgpu_sriov_runtime(adev))
> WARN_ONCE(!amdgpu_reset_domain_schedule(adev->reset_domain,
> &adev->virt.flr_work),
> "Failed to queue work! at %s",
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> index 59f53c743362..14a065516ae4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> @@ -560,7 +560,7 @@ static int xgpu_vi_mailbox_rcv_irq(struct amdgpu_device *adev,
> r = xgpu_vi_mailbox_rcv_msg(adev, IDH_FLR_NOTIFICATION);
>
> /* only handle FLR_NOTIFY now */
> - if (!r && !amdgpu_in_reset(adev))
> + if (!r)
> WARN_ONCE(!amdgpu_reset_domain_schedule(adev->reset_domain,
> &adev->virt.flr_work),
> "Failed to queue work! at %s",
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row
2024-04-30 19:05 ` [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Li, Yunxiang (Teddy)
@ 2024-05-02 7:15 ` Christian König
0 siblings, 0 replies; 18+ messages in thread
From: Christian König @ 2024-05-02 7:15 UTC (permalink / raw)
To: Li, Yunxiang (Teddy), amd-gfx
Cc: Deucher, Alexander, Lazar, Lijo, Kuehling, Felix, Deng, Emily
Am 30.04.24 um 21:05 schrieb Li, Yunxiang (Teddy):
> [Public]
>
> Hi Christ,
>
> I got R-b from the SRIOV team for the rest of the patches, can you help review this last one? I think the concerns from the previous thread are all addressed https://patchwork.freedesktop.org/patch/590678/?series=132727
I don't think I can help here since I'm not familiar with the RAS code
either.
But I've seen that you already got an rb from Hawking's team, that
should be sufficient I think.
Regards,
Christian.
>
> Regards,
> Teddy
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2024-05-02 7:16 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-26 3:57 [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Yunxiang Li
2024-04-26 3:57 ` [PATCH v3 2/4] drm/amdgpu: Add reset_context flag for host FLR Yunxiang Li
2024-04-26 18:27 ` [PATCH v4 " Yunxiang Li
2024-04-28 7:16 ` Deng, Emily
2024-04-30 18:28 ` Luo, Zhigang
2024-04-26 3:57 ` [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic Yunxiang Li
2024-04-26 7:40 ` Christian König
2024-04-26 8:42 ` Deng, Emily
2024-04-26 18:23 ` Li, Yunxiang (Teddy)
2024-04-26 18:29 ` [PATCH v2 " Yunxiang Li
2024-04-28 7:16 ` Deng, Emily
2024-04-30 18:25 ` Luo, Zhigang
2024-04-26 3:57 ` [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function Yunxiang Li
2024-04-28 7:17 ` Deng, Emily
2024-04-30 18:32 ` Luo, Zhigang
2024-04-30 19:05 ` [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Li, Yunxiang (Teddy)
2024-05-02 7:15 ` Christian König
2024-04-30 19:09 ` Lazar, Lijo
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.