All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row
@ 2024-04-26  3:57 Yunxiang Li
  2024-04-26  3:57 ` [PATCH v3 2/4] drm/amdgpu: Add reset_context flag for host FLR Yunxiang Li
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Yunxiang Li @ 2024-04-26  3:57 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, lijo.lazar, felix.kuehling,
	emily.deng, Yunxiang Li

Some times a hang GPU causes multiple reset sources to schedule resets.
The second source will be able to trigger an unnecessary reset if they
schedule after we call amdgpu_device_stop_pending_resets.

Move amdgpu_device_stop_pending_resets to after the reset is done. Since
at this point the GPU is supposedly in a good state, any reset scheduled
after this point would be a legitimate reset.

Remove unnecessary and incorrect checks for amdgpu_in_reset that was
kinda serving this purpose.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v2: instead of adding amdgpu_in_reset check, move when we cancel pending
resets
v3: no changes from v2, collect all the patches in one series for easier review

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 ++++++++++---------
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  2 +-
 5 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 861ccff78af9..8befd10bf007 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5070,8 +5070,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 retry:
 	amdgpu_amdkfd_pre_reset(adev);
 
-	amdgpu_device_stop_pending_resets(adev);
-
 	if (from_hypervisor)
 		r = amdgpu_virt_request_full_gpu(adev, true);
 	else
@@ -5823,13 +5821,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 				  r, adev_to_drm(tmp_adev)->unique);
 			tmp_adev->asic_reset_res = r;
 		}
-
-		if (!amdgpu_sriov_vf(tmp_adev))
-			/*
-			* Drop all pending non scheduler resets. Scheduler resets
-			* were already dropped during drm_sched_stop
-			*/
-			amdgpu_device_stop_pending_resets(tmp_adev);
 	}
 
 	/* Actual ASIC resets if needed.*/
@@ -5851,6 +5842,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			goto retry;
 	}
 
+	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
+		/*
+		 * Drop any pending non scheduler resets queued before reset is done.
+		 * Any reset scheduled after this point would be valid. Scheduler resets
+		 * were already dropped during drm_sched_stop and no new ones can come
+		 * in before drm_sched_start.
+		 */
+		amdgpu_device_stop_pending_resets(tmp_adev);
+	}
+
 skip_hw_reset:
 
 	/* Post ASIC reset for all devs .*/
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index 54ab51a4ada7..c2385178d6b3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -597,7 +597,7 @@ static void amdgpu_virt_update_vf2pf_work_item(struct work_struct *work)
 	if (ret) {
 		adev->virt.vf2pf_update_retry_cnt++;
 		if ((adev->virt.vf2pf_update_retry_cnt >= AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT) &&
-		    amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev)) {
+		    amdgpu_sriov_runtime(adev)) {
 			amdgpu_ras_set_fed(adev, true);
 			if (amdgpu_reset_domain_schedule(adev->reset_domain,
 							  &adev->virt.flr_work))
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 0c7275bca8f7..c5ba9c4757a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -319,7 +319,7 @@ static int xgpu_ai_mailbox_rcv_irq(struct amdgpu_device *adev,
 
 	switch (event) {
 		case IDH_FLR_NOTIFICATION:
-		if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
+		if (amdgpu_sriov_runtime(adev))
 			WARN_ONCE(!amdgpu_reset_domain_schedule(adev->reset_domain,
 								&adev->virt.flr_work),
 				  "Failed to queue work! at %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index aba00d961627..fa9d1b02f391 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -358,7 +358,7 @@ static int xgpu_nv_mailbox_rcv_irq(struct amdgpu_device *adev,
 
 	switch (event) {
 	case IDH_FLR_NOTIFICATION:
-		if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
+		if (amdgpu_sriov_runtime(adev))
 			WARN_ONCE(!amdgpu_reset_domain_schedule(adev->reset_domain,
 				   &adev->virt.flr_work),
 				  "Failed to queue work! at %s",
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index 59f53c743362..14a065516ae4 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -560,7 +560,7 @@ static int xgpu_vi_mailbox_rcv_irq(struct amdgpu_device *adev,
 		r = xgpu_vi_mailbox_rcv_msg(adev, IDH_FLR_NOTIFICATION);
 
 		/* only handle FLR_NOTIFY now */
-		if (!r && !amdgpu_in_reset(adev))
+		if (!r)
 			WARN_ONCE(!amdgpu_reset_domain_schedule(adev->reset_domain,
 								&adev->virt.flr_work),
 				  "Failed to queue work! at %s",
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 2/4] drm/amdgpu: Add reset_context flag for host FLR
  2024-04-26  3:57 [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Yunxiang Li
@ 2024-04-26  3:57 ` Yunxiang Li
  2024-04-26 18:27   ` [PATCH v4 " Yunxiang Li
  2024-04-26  3:57 ` [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic Yunxiang Li
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Yunxiang Li @ 2024-04-26  3:57 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, lijo.lazar, felix.kuehling,
	emily.deng, Yunxiang Li

There are other reset sources that pass NULL as the job pointer, such as
amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if
the FLR comes from the host does not work.

Add a flag in reset_context to explicitly mark host triggered reset, and
set this flag when we receive host reset notification.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v2: fix typo
v3: pass reset_context directly

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 8 ++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      | 1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      | 1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      | 1 +
 5 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8befd10bf007..1fd9637daafc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5055,13 +5055,13 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
  * amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf
  *
  * @adev: amdgpu_device pointer
- * @from_hypervisor: request from hypervisor
+ * @reset_context: amdgpu reset context pointer
  *
  * do VF FLR and reinitialize Asic
  * return 0 means succeeded otherwise failed
  */
 static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
-				     bool from_hypervisor)
+				     struct amdgpu_reset_context *reset_context)
 {
 	int r;
 	struct amdgpu_hive_info *hive = NULL;
@@ -5070,7 +5070,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 retry:
 	amdgpu_amdkfd_pre_reset(adev);
 
-	if (from_hypervisor)
+	if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags))
 		r = amdgpu_virt_request_full_gpu(adev, true);
 	else
 		r = amdgpu_virt_reset_gpu(adev);
@@ -5826,7 +5826,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	/* Actual ASIC resets if needed.*/
 	/* Host driver will handle XGMI hive reset for SRIOV */
 	if (amdgpu_sriov_vf(adev)) {
-		r = amdgpu_device_reset_sriov(adev, job ? false : true);
+		r = amdgpu_device_reset_sriov(adev, reset_context);
 		if (r)
 			adev->asic_reset_res = r;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index b11d190ece53..5a9cc043b858 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -33,6 +33,7 @@ enum AMDGPU_RESET_FLAGS {
 	AMDGPU_NEED_FULL_RESET = 0,
 	AMDGPU_SKIP_HW_RESET = 1,
 	AMDGPU_SKIP_COREDUMP = 2,
+	AMDGPU_HOST_FLR = 3,
 };
 
 struct amdgpu_reset_context {
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index c5ba9c4757a8..f4c47492e0cd 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -292,6 +292,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 		reset_context.method = AMD_RESET_METHOD_NONE;
 		reset_context.reset_req_dev = adev;
 		clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+		set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
 
 		amdgpu_device_gpu_recover(adev, NULL, &reset_context);
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index fa9d1b02f391..14cc7910e5cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -328,6 +328,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 		reset_context.method = AMD_RESET_METHOD_NONE;
 		reset_context.reset_req_dev = adev;
 		clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+		set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
 
 		amdgpu_device_gpu_recover(adev, NULL, &reset_context);
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index 14a065516ae4..78cd07744ebe 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -529,6 +529,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
 		reset_context.method = AMD_RESET_METHOD_NONE;
 		reset_context.reset_req_dev = adev;
 		clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+		set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
 
 		amdgpu_device_gpu_recover(adev, NULL, &reset_context);
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
  2024-04-26  3:57 [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Yunxiang Li
  2024-04-26  3:57 ` [PATCH v3 2/4] drm/amdgpu: Add reset_context flag for host FLR Yunxiang Li
@ 2024-04-26  3:57 ` Yunxiang Li
  2024-04-26  7:40   ` Christian König
                     ` (2 more replies)
  2024-04-26  3:57 ` [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function Yunxiang Li
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 18+ messages in thread
From: Yunxiang Li @ 2024-04-26  3:57 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, lijo.lazar, felix.kuehling,
	emily.deng, Yunxiang Li

The retry loop for SRIOV reset have refcount and memory leak issue.
Depending on which function call fails it can potentially call
amdgpu_amdkfd_pre/post_reset different number of times and causes
kfd_locked count to be wrong. This will block all future attempts at
opening /dev/kfd. The retry loop also leakes resources by calling
amdgpu_virt_init_data_exchange multiple times without calling the
corresponding fini function.

Align with the bare-metal reset path which doesn't have these issues.
This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
reset loop and calling amdgpu_device_pre_asic_reset each retry which
properly free the resources from previous try by calling
amdgpu_virt_fini_data_exchange.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 50 ++++++++++------------
 1 file changed, 22 insertions(+), 28 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1fd9637daafc..3c4755f3c116 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5063,19 +5063,14 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
 static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 				     struct amdgpu_reset_context *reset_context)
 {
-	int r;
+	int r = 0;
 	struct amdgpu_hive_info *hive = NULL;
-	int retry_limit = 0;
-
-retry:
-	amdgpu_amdkfd_pre_reset(adev);
 
 	if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags))
 		r = amdgpu_virt_request_full_gpu(adev, true);
 	else
 		r = amdgpu_virt_reset_gpu(adev);
-	if (r)
-		return r;
+
 	amdgpu_ras_set_fed(adev, false);
 	amdgpu_irq_gpu_reset_resume_helper(adev);
 
@@ -5085,7 +5080,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 	/* Resume IP prior to SMC */
 	r = amdgpu_device_ip_reinit_early_sriov(adev);
 	if (r)
-		goto error;
+		return r;
 
 	amdgpu_virt_init_data_exchange(adev);
 
@@ -5096,38 +5091,35 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 	/* now we are okay to resume SMC/CP/SDMA */
 	r = amdgpu_device_ip_reinit_late_sriov(adev);
 	if (r)
-		goto error;
+		return r;
 
 	hive = amdgpu_get_xgmi_hive(adev);
 	/* Update PSP FW topology after reset */
 	if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
 		r = amdgpu_xgmi_update_topology(hive, adev);
-
 	if (hive)
 		amdgpu_put_xgmi_hive(hive);
+	if (r)
+		return r;
 
-	if (!r) {
-		r = amdgpu_ib_ring_tests(adev);
-
-		amdgpu_amdkfd_post_reset(adev);
-	}
+	r = amdgpu_ib_ring_tests(adev);
+	if (r)
+		return r;
 
-error:
-	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
+	if (adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
 		amdgpu_inc_vram_lost(adev);
 		r = amdgpu_device_recover_vram(adev);
 	}
-	amdgpu_virt_release_full_gpu(adev, true);
+	if (r)
+		return r;
 
-	if (AMDGPU_RETRY_SRIOV_RESET(r)) {
-		if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
-			retry_limit++;
-			goto retry;
-		} else
-			DRM_ERROR("GPU reset retry is beyond the retry limit\n");
-	}
+	/* need to be called during full access so we can't do it later like
+	 * bare-metal does.
+	 */
+	amdgpu_amdkfd_post_reset(adev);
+	amdgpu_virt_release_full_gpu(adev, true);
 
-	return r;
+	return 0;
 }
 
 /**
@@ -5686,6 +5678,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	int i, r = 0;
 	bool need_emergency_restart = false;
 	bool audio_suspended = false;
+	int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
 
 	/*
 	 * Special case: RAS triggered and full reset isn't supported
@@ -5767,8 +5760,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 		cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
 
-		if (!amdgpu_sriov_vf(tmp_adev))
-			amdgpu_amdkfd_pre_reset(tmp_adev);
+		amdgpu_amdkfd_pre_reset(tmp_adev);
 
 		/*
 		 * Mark these ASICs to be reseted as untracked first
@@ -5827,6 +5819,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	/* Host driver will handle XGMI hive reset for SRIOV */
 	if (amdgpu_sriov_vf(adev)) {
 		r = amdgpu_device_reset_sriov(adev, reset_context);
+		if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0)
+			goto retry;
 		if (r)
 			adev->asic_reset_res = r;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function
  2024-04-26  3:57 [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Yunxiang Li
  2024-04-26  3:57 ` [PATCH v3 2/4] drm/amdgpu: Add reset_context flag for host FLR Yunxiang Li
  2024-04-26  3:57 ` [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic Yunxiang Li
@ 2024-04-26  3:57 ` Yunxiang Li
  2024-04-28  7:17   ` Deng, Emily
  2024-04-30 18:32   ` Luo, Zhigang
  2024-04-30 19:05 ` [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Li, Yunxiang (Teddy)
  2024-04-30 19:09 ` Lazar, Lijo
  4 siblings, 2 replies; 18+ messages in thread
From: Yunxiang Li @ 2024-04-26  3:57 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, lijo.lazar, felix.kuehling,
	emily.deng, Yunxiang Li

This is part of the reset, move it into the reset function.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3c4755f3c116..8f2c1f71ed9a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5119,6 +5119,11 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 	amdgpu_amdkfd_post_reset(adev);
 	amdgpu_virt_release_full_gpu(adev, true);
 
+	/* Aldebaran and gfx_11_0_3 support ras in SRIOV, so need resume ras during reset */
+	if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 2) ||
+	    amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
+	    amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3))
+		amdgpu_ras_resume(adev);
 	return 0;
 }
 
@@ -5823,13 +5828,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			goto retry;
 		if (r)
 			adev->asic_reset_res = r;
-
-		/* Aldebaran and gfx_11_0_3 support ras in SRIOV, so need resume ras during reset */
-		if (amdgpu_ip_version(adev, GC_HWIP, 0) ==
-			    IP_VERSION(9, 4, 2) ||
-		    amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
-		    amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3))
-			amdgpu_ras_resume(adev);
 	} else {
 		r = amdgpu_do_asic_reset(device_list_handle, reset_context);
 		if (r && r == -EAGAIN)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
  2024-04-26  3:57 ` [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic Yunxiang Li
@ 2024-04-26  7:40   ` Christian König
  2024-04-26  8:42   ` Deng, Emily
  2024-04-26 18:29   ` [PATCH v2 " Yunxiang Li
  2 siblings, 0 replies; 18+ messages in thread
From: Christian König @ 2024-04-26  7:40 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx
  Cc: Alexander.Deucher, lijo.lazar, felix.kuehling, emily.deng



Am 26.04.24 um 05:57 schrieb Yunxiang Li:
> The retry loop for SRIOV reset have refcount and memory leak issue.
> Depending on which function call fails it can potentially call
> amdgpu_amdkfd_pre/post_reset different number of times and causes
> kfd_locked count to be wrong. This will block all future attempts at
> opening /dev/kfd. The retry loop also leakes resources by calling
> amdgpu_virt_init_data_exchange multiple times without calling the
> corresponding fini function.
>
> Align with the bare-metal reset path which doesn't have these issues.
> This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
> reset loop and calling amdgpu_device_pre_asic_reset each retry which
> properly free the resources from previous try by calling
> amdgpu_virt_fini_data_exchange.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 50 ++++++++++------------
>   1 file changed, 22 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1fd9637daafc..3c4755f3c116 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5063,19 +5063,14 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
>   static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>   				     struct amdgpu_reset_context *reset_context)
>   {
> -	int r;
> +	int r = 0;

Please don't initialize local variables if it isn't necessary.

As far as I can see that is initialized below and will generate a 
warning from automated checkers.

Regards,
Christian.

>   	struct amdgpu_hive_info *hive = NULL;
> -	int retry_limit = 0;
> -
> -retry:
> -	amdgpu_amdkfd_pre_reset(adev);
>   
>   	if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags))
>   		r = amdgpu_virt_request_full_gpu(adev, true);
>   	else
>   		r = amdgpu_virt_reset_gpu(adev);
> -	if (r)
> -		return r;
> +
>   	amdgpu_ras_set_fed(adev, false);
>   	amdgpu_irq_gpu_reset_resume_helper(adev);
>   
> @@ -5085,7 +5080,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>   	/* Resume IP prior to SMC */
>   	r = amdgpu_device_ip_reinit_early_sriov(adev);
>   	if (r)
> -		goto error;
> +		return r;
>   
>   	amdgpu_virt_init_data_exchange(adev);
>   
> @@ -5096,38 +5091,35 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>   	/* now we are okay to resume SMC/CP/SDMA */
>   	r = amdgpu_device_ip_reinit_late_sriov(adev);
>   	if (r)
> -		goto error;
> +		return r;
>   
>   	hive = amdgpu_get_xgmi_hive(adev);
>   	/* Update PSP FW topology after reset */
>   	if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
>   		r = amdgpu_xgmi_update_topology(hive, adev);
> -
>   	if (hive)
>   		amdgpu_put_xgmi_hive(hive);
> +	if (r)
> +		return r;
>   
> -	if (!r) {
> -		r = amdgpu_ib_ring_tests(adev);
> -
> -		amdgpu_amdkfd_post_reset(adev);
> -	}
> +	r = amdgpu_ib_ring_tests(adev);
> +	if (r)
> +		return r;
>   
> -error:
> -	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
> +	if (adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
>   		amdgpu_inc_vram_lost(adev);
>   		r = amdgpu_device_recover_vram(adev);
>   	}
> -	amdgpu_virt_release_full_gpu(adev, true);
> +	if (r)
> +		return r;
>   
> -	if (AMDGPU_RETRY_SRIOV_RESET(r)) {
> -		if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
> -			retry_limit++;
> -			goto retry;
> -		} else
> -			DRM_ERROR("GPU reset retry is beyond the retry limit\n");
> -	}
> +	/* need to be called during full access so we can't do it later like
> +	 * bare-metal does.
> +	 */
> +	amdgpu_amdkfd_post_reset(adev);
> +	amdgpu_virt_release_full_gpu(adev, true);
>   
> -	return r;
> +	return 0;
>   }
>   
>   /**
> @@ -5686,6 +5678,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	int i, r = 0;
>   	bool need_emergency_restart = false;
>   	bool audio_suspended = false;
> +	int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
>   
>   	/*
>   	 * Special case: RAS triggered and full reset isn't supported
> @@ -5767,8 +5760,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   
>   		cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
>   
> -		if (!amdgpu_sriov_vf(tmp_adev))
> -			amdgpu_amdkfd_pre_reset(tmp_adev);
> +		amdgpu_amdkfd_pre_reset(tmp_adev);
>   
>   		/*
>   		 * Mark these ASICs to be reseted as untracked first
> @@ -5827,6 +5819,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	/* Host driver will handle XGMI hive reset for SRIOV */
>   	if (amdgpu_sriov_vf(adev)) {
>   		r = amdgpu_device_reset_sriov(adev, reset_context);
> +		if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0)
> +			goto retry;
>   		if (r)
>   			adev->asic_reset_res = r;
>   


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
  2024-04-26  3:57 ` [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic Yunxiang Li
  2024-04-26  7:40   ` Christian König
@ 2024-04-26  8:42   ` Deng, Emily
  2024-04-26 18:23     ` Li, Yunxiang (Teddy)
  2024-04-26 18:29   ` [PATCH v2 " Yunxiang Li
  2 siblings, 1 reply; 18+ messages in thread
From: Deng, Emily @ 2024-04-26  8:42 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling, Felix

[AMD Official Use Only - General]

>-----Original Message-----
>From: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Sent: Friday, April 26, 2024 11:58 AM
>To: amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
><Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling,
>Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li,
>Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Subject: [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
>
>The retry loop for SRIOV reset have refcount and memory leak issue.
>Depending on which function call fails it can potentially call
>amdgpu_amdkfd_pre/post_reset different number of times and causes
>kfd_locked count to be wrong. This will block all future attempts at opening
>/dev/kfd. The retry loop also leakes resources by calling
>amdgpu_virt_init_data_exchange multiple times without calling the
>corresponding fini function.
>
>Align with the bare-metal reset path which doesn't have these issues.
>This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
>reset loop and calling amdgpu_device_pre_asic_reset each retry which
>properly free the resources from previous try by calling
>amdgpu_virt_fini_data_exchange.
>
>Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 50 ++++++++++------------
> 1 file changed, 22 insertions(+), 28 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index 1fd9637daafc..3c4755f3c116 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -5063,19 +5063,14 @@ static int amdgpu_device_recover_vram(struct
>amdgpu_device *adev)  static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
>                                    struct amdgpu_reset_context
>*reset_context)  {
>-      int r;
>+      int r = 0;
>       struct amdgpu_hive_info *hive = NULL;
>-      int retry_limit = 0;
>-
>-retry:
>-      amdgpu_amdkfd_pre_reset(adev);
>
>       if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags))
>               r = amdgpu_virt_request_full_gpu(adev, true);
>       else
>               r = amdgpu_virt_reset_gpu(adev);
>-      if (r)
>-              return r;
Why remove this?

Emily Deng
Best Wishes


>+
>       amdgpu_ras_set_fed(adev, false);
>       amdgpu_irq_gpu_reset_resume_helper(adev);
>
>@@ -5085,7 +5080,7 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
>       /* Resume IP prior to SMC */
>       r = amdgpu_device_ip_reinit_early_sriov(adev);
>       if (r)
>-              goto error;
>+              return r;
Need to call amdgpu_virt_release_full_gpu(adev, true) before retry, and the same as below.

Emily Deng
Best Wishes

>       amdgpu_virt_init_data_exchange(adev);
>
>@@ -5096,38 +5091,35 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
>       /* now we are okay to resume SMC/CP/SDMA */
>       r = amdgpu_device_ip_reinit_late_sriov(adev);
>       if (r)
>-              goto error;
>+              return r;
>
>       hive = amdgpu_get_xgmi_hive(adev);
>       /* Update PSP FW topology after reset */
>       if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
>               r = amdgpu_xgmi_update_topology(hive, adev);
>-
>       if (hive)
>               amdgpu_put_xgmi_hive(hive);
>+      if (r)
>+              return r;
>
>-      if (!r) {
>-              r = amdgpu_ib_ring_tests(adev);
>-
>-              amdgpu_amdkfd_post_reset(adev);
>-      }
>+      r = amdgpu_ib_ring_tests(adev);
>+      if (r)
>+              return r;
>
>-error:
>-      if (!r && adev->virt.gim_feature &
>AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
>+      if (adev->virt.gim_feature &
>AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
>               amdgpu_inc_vram_lost(adev);
>               r = amdgpu_device_recover_vram(adev);
>       }
>-      amdgpu_virt_release_full_gpu(adev, true);
>+      if (r)
>+              return r;
>
>-      if (AMDGPU_RETRY_SRIOV_RESET(r)) {
>-              if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
>-                      retry_limit++;
>-                      goto retry;
>-              } else
>-                      DRM_ERROR("GPU reset retry is beyond the retry
>limit\n");
>-      }
>+      /* need to be called during full access so we can't do it later like
>+       * bare-metal does.
>+       */
>+      amdgpu_amdkfd_post_reset(adev);
>+      amdgpu_virt_release_full_gpu(adev, true);
>
>-      return r;
>+      return 0;
> }
>
> /**
>@@ -5686,6 +5678,7 @@ int amdgpu_device_gpu_recover(struct
>amdgpu_device *adev,
>       int i, r = 0;
>       bool need_emergency_restart = false;
>       bool audio_suspended = false;
>+      int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
>
>       /*
>        * Special case: RAS triggered and full reset isn't supported @@ -
>5767,8 +5760,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device
>*adev,
>
>               cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
>
>-              if (!amdgpu_sriov_vf(tmp_adev))
>-                      amdgpu_amdkfd_pre_reset(tmp_adev);
>+              amdgpu_amdkfd_pre_reset(tmp_adev);
>
>               /*
>                * Mark these ASICs to be reseted as untracked first @@ -
>5827,6 +5819,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device
>*adev,
>       /* Host driver will handle XGMI hive reset for SRIOV */
>       if (amdgpu_sriov_vf(adev)) {
>               r = amdgpu_device_reset_sriov(adev, reset_context);
>+              if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0)
>+                      goto retry;
>               if (r)
>                       adev->asic_reset_res = r;
>
>--
>2.34.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
  2024-04-26  8:42   ` Deng, Emily
@ 2024-04-26 18:23     ` Li, Yunxiang (Teddy)
  0 siblings, 0 replies; 18+ messages in thread
From: Li, Yunxiang (Teddy) @ 2024-04-26 18:23 UTC (permalink / raw)
  To: Deng, Emily, amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling, Felix

[Public]

> Why remove this?
Oops it's a copy-paste error from the previous revision

> Need to call amdgpu_virt_release_full_gpu(adev, true) before retry, and the same as below.
I thought we talked about if we call amdgpu_virt_{reset,request_full}_gpu again we don't need to release full gpu, I must have mis-understood. I'll put it back. Yeah checking the code it seem we still need to release.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v4 2/4] drm/amdgpu: Add reset_context flag for host FLR
  2024-04-26  3:57 ` [PATCH v3 2/4] drm/amdgpu: Add reset_context flag for host FLR Yunxiang Li
@ 2024-04-26 18:27   ` Yunxiang Li
  2024-04-28  7:16     ` Deng, Emily
  2024-04-30 18:28     ` Luo, Zhigang
  0 siblings, 2 replies; 18+ messages in thread
From: Yunxiang Li @ 2024-04-26 18:27 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, lijo.lazar, felix.kuehling,
	emily.deng, Yunxiang Li

There are other reset sources that pass NULL as the job pointer, such as
amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if
the FLR comes from the host does not work.

Add a flag in reset_context to explicitly mark host triggered reset, and
set this flag when we receive host reset notification.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v2: fix typo
v3: pass reset_context directly
v4: clear the flag in case we retry

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 ++++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  1 +
 5 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8befd10bf007..33c889c027a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5055,13 +5055,13 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
  * amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf
  *
  * @adev: amdgpu_device pointer
- * @from_hypervisor: request from hypervisor
+ * @reset_context: amdgpu reset context pointer
  *
  * do VF FLR and reinitialize Asic
  * return 0 means succeeded otherwise failed
  */
 static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
-				     bool from_hypervisor)
+				     struct amdgpu_reset_context *reset_context)
 {
 	int r;
 	struct amdgpu_hive_info *hive = NULL;
@@ -5070,12 +5070,15 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 retry:
 	amdgpu_amdkfd_pre_reset(adev);
 
-	if (from_hypervisor)
+	if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
+		clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
 		r = amdgpu_virt_request_full_gpu(adev, true);
-	else
+	} else {
 		r = amdgpu_virt_reset_gpu(adev);
+	}
 	if (r)
 		return r;
+
 	amdgpu_ras_set_fed(adev, false);
 	amdgpu_irq_gpu_reset_resume_helper(adev);
 
@@ -5826,7 +5829,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	/* Actual ASIC resets if needed.*/
 	/* Host driver will handle XGMI hive reset for SRIOV */
 	if (amdgpu_sriov_vf(adev)) {
-		r = amdgpu_device_reset_sriov(adev, job ? false : true);
+		r = amdgpu_device_reset_sriov(adev, reset_context);
 		if (r)
 			adev->asic_reset_res = r;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index b11d190ece53..5a9cc043b858 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -33,6 +33,7 @@ enum AMDGPU_RESET_FLAGS {
 	AMDGPU_NEED_FULL_RESET = 0,
 	AMDGPU_SKIP_HW_RESET = 1,
 	AMDGPU_SKIP_COREDUMP = 2,
+	AMDGPU_HOST_FLR = 3,
 };
 
 struct amdgpu_reset_context {
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index c5ba9c4757a8..f4c47492e0cd 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -292,6 +292,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 		reset_context.method = AMD_RESET_METHOD_NONE;
 		reset_context.reset_req_dev = adev;
 		clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+		set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
 
 		amdgpu_device_gpu_recover(adev, NULL, &reset_context);
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index fa9d1b02f391..14cc7910e5cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -328,6 +328,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 		reset_context.method = AMD_RESET_METHOD_NONE;
 		reset_context.reset_req_dev = adev;
 		clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+		set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
 
 		amdgpu_device_gpu_recover(adev, NULL, &reset_context);
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index 14a065516ae4..78cd07744ebe 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -529,6 +529,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
 		reset_context.method = AMD_RESET_METHOD_NONE;
 		reset_context.reset_req_dev = adev;
 		clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+		set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
 
 		amdgpu_device_gpu_recover(adev, NULL, &reset_context);
 	}
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
  2024-04-26  3:57 ` [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic Yunxiang Li
  2024-04-26  7:40   ` Christian König
  2024-04-26  8:42   ` Deng, Emily
@ 2024-04-26 18:29   ` Yunxiang Li
  2024-04-28  7:16     ` Deng, Emily
  2024-04-30 18:25     ` Luo, Zhigang
  2 siblings, 2 replies; 18+ messages in thread
From: Yunxiang Li @ 2024-04-26 18:29 UTC (permalink / raw)
  To: amd-gfx
  Cc: Alexander.Deucher, christian.koenig, lijo.lazar, felix.kuehling,
	emily.deng, Yunxiang Li

The retry loop for SRIOV reset have refcount and memory leak issue.
Depending on which function call fails it can potentially call
amdgpu_amdkfd_pre/post_reset different number of times and causes
kfd_locked count to be wrong. This will block all future attempts at
opening /dev/kfd. The retry loop also leakes resources by calling
amdgpu_virt_init_data_exchange multiple times without calling the
corresponding fini function.

Align with the bare-metal reset path which doesn't have these issues.
This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
reset loop and calling amdgpu_device_pre_asic_reset each retry which
properly free the resources from previous try by calling
amdgpu_virt_fini_data_exchange.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v2: put back release full access and the missed return

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 47 ++++++++++------------
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 33c889c027a5..b23645f23a2e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5065,10 +5065,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 {
 	int r;
 	struct amdgpu_hive_info *hive = NULL;
-	int retry_limit = 0;
-
-retry:
-	amdgpu_amdkfd_pre_reset(adev);
 
 	if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
 		clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
@@ -5088,7 +5084,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 	/* Resume IP prior to SMC */
 	r = amdgpu_device_ip_reinit_early_sriov(adev);
 	if (r)
-		goto error;
+		return r;
 
 	amdgpu_virt_init_data_exchange(adev);
 
@@ -5099,38 +5095,35 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 	/* now we are okay to resume SMC/CP/SDMA */
 	r = amdgpu_device_ip_reinit_late_sriov(adev);
 	if (r)
-		goto error;
+		return r;
 
 	hive = amdgpu_get_xgmi_hive(adev);
 	/* Update PSP FW topology after reset */
 	if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
 		r = amdgpu_xgmi_update_topology(hive, adev);
-
 	if (hive)
 		amdgpu_put_xgmi_hive(hive);
+	if (r)
+		return r;
 
-	if (!r) {
-		r = amdgpu_ib_ring_tests(adev);
-
-		amdgpu_amdkfd_post_reset(adev);
-	}
+	r = amdgpu_ib_ring_tests(adev);
+	if (r)
+		return r;
 
-error:
-	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
+	if (adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
 		amdgpu_inc_vram_lost(adev);
 		r = amdgpu_device_recover_vram(adev);
 	}
-	amdgpu_virt_release_full_gpu(adev, true);
+	if (r)
+		return r;
 
-	if (AMDGPU_RETRY_SRIOV_RESET(r)) {
-		if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
-			retry_limit++;
-			goto retry;
-		} else
-			DRM_ERROR("GPU reset retry is beyond the retry limit\n");
-	}
+	/* need to be called during full access so we can't do it later like
+	 * bare-metal does.
+	 */
+	amdgpu_amdkfd_post_reset(adev);
+	amdgpu_virt_release_full_gpu(adev, true);
 
-	return r;
+	return 0;
 }
 
 /**
@@ -5689,6 +5682,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	int i, r = 0;
 	bool need_emergency_restart = false;
 	bool audio_suspended = false;
+	int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
 
 	/*
 	 * Special case: RAS triggered and full reset isn't supported
@@ -5770,8 +5764,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 		cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
 
-		if (!amdgpu_sriov_vf(tmp_adev))
-			amdgpu_amdkfd_pre_reset(tmp_adev);
+		amdgpu_amdkfd_pre_reset(tmp_adev);
 
 		/*
 		 * Mark these ASICs to be reseted as untracked first
@@ -5830,6 +5823,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	/* Host driver will handle XGMI hive reset for SRIOV */
 	if (amdgpu_sriov_vf(adev)) {
 		r = amdgpu_device_reset_sriov(adev, reset_context);
+		if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0) {
+			amdgpu_virt_release_full_gpu(adev, true);
+			goto retry;
+		}
 		if (r)
 			adev->asic_reset_res = r;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* RE: [PATCH v2 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
  2024-04-26 18:29   ` [PATCH v2 " Yunxiang Li
@ 2024-04-28  7:16     ` Deng, Emily
  2024-04-30 18:25     ` Luo, Zhigang
  1 sibling, 0 replies; 18+ messages in thread
From: Deng, Emily @ 2024-04-28  7:16 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling, Felix

[AMD Official Use Only - General]

Reviewed-by: Emily Deng <Emily.Deng@amd.com>

Emily Deng
Best Wishes



>-----Original Message-----
>From: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Sent: Saturday, April 27, 2024 2:29 AM
>To: amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
><Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling,
>Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li,
>Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Subject: [PATCH v2 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry
>logic
>
>The retry loop for SRIOV reset have refcount and memory leak issue.
>Depending on which function call fails it can potentially call
>amdgpu_amdkfd_pre/post_reset different number of times and causes
>kfd_locked count to be wrong. This will block all future attempts at opening
>/dev/kfd. The retry loop also leakes resources by calling
>amdgpu_virt_init_data_exchange multiple times without calling the
>corresponding fini function.
>
>Align with the bare-metal reset path which doesn't have these issues.
>This means taking the amdgpu_amdkfd_pre/post_reset functions out of the
>reset loop and calling amdgpu_device_pre_asic_reset each retry which
>properly free the resources from previous try by calling
>amdgpu_virt_fini_data_exchange.
>
>Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>---
>v2: put back release full access and the missed return
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 47 ++++++++++------------
> 1 file changed, 22 insertions(+), 25 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index 33c889c027a5..b23645f23a2e 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -5065,10 +5065,6 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,  {
>       int r;
>       struct amdgpu_hive_info *hive = NULL;
>-      int retry_limit = 0;
>-
>-retry:
>-      amdgpu_amdkfd_pre_reset(adev);
>
>       if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
>               clear_bit(AMDGPU_HOST_FLR, &reset_context->flags); @@ -
>5088,7 +5084,7 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
>       /* Resume IP prior to SMC */
>       r = amdgpu_device_ip_reinit_early_sriov(adev);
>       if (r)
>-              goto error;
>+              return r;
>
>       amdgpu_virt_init_data_exchange(adev);
>
>@@ -5099,38 +5095,35 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
>       /* now we are okay to resume SMC/CP/SDMA */
>       r = amdgpu_device_ip_reinit_late_sriov(adev);
>       if (r)
>-              goto error;
>+              return r;
>
>       hive = amdgpu_get_xgmi_hive(adev);
>       /* Update PSP FW topology after reset */
>       if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
>               r = amdgpu_xgmi_update_topology(hive, adev);
>-
>       if (hive)
>               amdgpu_put_xgmi_hive(hive);
>+      if (r)
>+              return r;
>
>-      if (!r) {
>-              r = amdgpu_ib_ring_tests(adev);
>-
>-              amdgpu_amdkfd_post_reset(adev);
>-      }
>+      r = amdgpu_ib_ring_tests(adev);
>+      if (r)
>+              return r;
>
>-error:
>-      if (!r && adev->virt.gim_feature &
>AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
>+      if (adev->virt.gim_feature &
>AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
>               amdgpu_inc_vram_lost(adev);
>               r = amdgpu_device_recover_vram(adev);
>       }
>-      amdgpu_virt_release_full_gpu(adev, true);
>+      if (r)
>+              return r;
>
>-      if (AMDGPU_RETRY_SRIOV_RESET(r)) {
>-              if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
>-                      retry_limit++;
>-                      goto retry;
>-              } else
>-                      DRM_ERROR("GPU reset retry is beyond the retry
>limit\n");
>-      }
>+      /* need to be called during full access so we can't do it later like
>+       * bare-metal does.
>+       */
>+      amdgpu_amdkfd_post_reset(adev);
>+      amdgpu_virt_release_full_gpu(adev, true);
>
>-      return r;
>+      return 0;
> }
>
> /**
>@@ -5689,6 +5682,7 @@ int amdgpu_device_gpu_recover(struct
>amdgpu_device *adev,
>       int i, r = 0;
>       bool need_emergency_restart = false;
>       bool audio_suspended = false;
>+      int retry_limit = AMDGPU_MAX_RETRY_LIMIT;
>
>       /*
>        * Special case: RAS triggered and full reset isn't supported @@ -
>5770,8 +5764,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device
>*adev,
>
>               cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
>
>-              if (!amdgpu_sriov_vf(tmp_adev))
>-                      amdgpu_amdkfd_pre_reset(tmp_adev);
>+              amdgpu_amdkfd_pre_reset(tmp_adev);
>
>               /*
>                * Mark these ASICs to be reseted as untracked first @@ -
>5830,6 +5823,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device
>*adev,
>       /* Host driver will handle XGMI hive reset for SRIOV */
>       if (amdgpu_sriov_vf(adev)) {
>               r = amdgpu_device_reset_sriov(adev, reset_context);
>+              if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0) {
>+                      amdgpu_virt_release_full_gpu(adev, true);
>+                      goto retry;
>+              }
>               if (r)
>                       adev->asic_reset_res = r;
>
>--
>2.34.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v4 2/4] drm/amdgpu: Add reset_context flag for host FLR
  2024-04-26 18:27   ` [PATCH v4 " Yunxiang Li
@ 2024-04-28  7:16     ` Deng, Emily
  2024-04-30 18:28     ` Luo, Zhigang
  1 sibling, 0 replies; 18+ messages in thread
From: Deng, Emily @ 2024-04-28  7:16 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling, Felix

[AMD Official Use Only - General]

Reviewed-by: Emily Deng <Emily.Deng@amd.com>

Emily Deng
Best Wishes



>-----Original Message-----
>From: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Sent: Saturday, April 27, 2024 2:27 AM
>To: amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
><Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling,
>Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li,
>Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Subject: [PATCH v4 2/4] drm/amdgpu: Add reset_context flag for host FLR
>
>There are other reset sources that pass NULL as the job pointer, such as
>amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if the
>FLR comes from the host does not work.
>
>Add a flag in reset_context to explicitly mark host triggered reset, and set
>this flag when we receive host reset notification.
>
>Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>---
>v2: fix typo
>v3: pass reset_context directly
>v4: clear the flag in case we retry
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 ++++++++-----
>drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  1 +
> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  1 +
> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  1 +
> drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  1 +
> 5 files changed, 12 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index 8befd10bf007..33c889c027a5 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -5055,13 +5055,13 @@ static int amdgpu_device_recover_vram(struct
>amdgpu_device *adev)
>  * amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf
>  *
>  * @adev: amdgpu_device pointer
>- * @from_hypervisor: request from hypervisor
>+ * @reset_context: amdgpu reset context pointer
>  *
>  * do VF FLR and reinitialize Asic
>  * return 0 means succeeded otherwise failed
>  */
> static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>-                                   bool from_hypervisor)
>+                                   struct amdgpu_reset_context
>*reset_context)
> {
>       int r;
>       struct amdgpu_hive_info *hive = NULL;
>@@ -5070,12 +5070,15 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
> retry:
>       amdgpu_amdkfd_pre_reset(adev);
>
>-      if (from_hypervisor)
>+      if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
>+              clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
>               r = amdgpu_virt_request_full_gpu(adev, true);
>-      else
>+      } else {
>               r = amdgpu_virt_reset_gpu(adev);
>+      }
>       if (r)
>               return r;
>+
>       amdgpu_ras_set_fed(adev, false);
>       amdgpu_irq_gpu_reset_resume_helper(adev);
>
>@@ -5826,7 +5829,7 @@ int amdgpu_device_gpu_recover(struct
>amdgpu_device *adev,
>       /* Actual ASIC resets if needed.*/
>       /* Host driver will handle XGMI hive reset for SRIOV */
>       if (amdgpu_sriov_vf(adev)) {
>-              r = amdgpu_device_reset_sriov(adev, job ? false : true);
>+              r = amdgpu_device_reset_sriov(adev, reset_context);
>               if (r)
>                       adev->asic_reset_res = r;
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>index b11d190ece53..5a9cc043b858 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>@@ -33,6 +33,7 @@ enum AMDGPU_RESET_FLAGS {
>       AMDGPU_NEED_FULL_RESET = 0,
>       AMDGPU_SKIP_HW_RESET = 1,
>       AMDGPU_SKIP_COREDUMP = 2,
>+      AMDGPU_HOST_FLR = 3,
> };
>
> struct amdgpu_reset_context {
>diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>index c5ba9c4757a8..f4c47492e0cd 100644
>--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>@@ -292,6 +292,7 @@ static void xgpu_ai_mailbox_flr_work(struct
>work_struct *work)
>               reset_context.method = AMD_RESET_METHOD_NONE;
>               reset_context.reset_req_dev = adev;
>               clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>+              set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
>
>               amdgpu_device_gpu_recover(adev, NULL, &reset_context);
>       }
>diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>index fa9d1b02f391..14cc7910e5cf 100644
>--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
>@@ -328,6 +328,7 @@ static void xgpu_nv_mailbox_flr_work(struct
>work_struct *work)
>               reset_context.method = AMD_RESET_METHOD_NONE;
>               reset_context.reset_req_dev = adev;
>               clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>+              set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
>
>               amdgpu_device_gpu_recover(adev, NULL, &reset_context);
>       }
>diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>index 14a065516ae4..78cd07744ebe 100644
>--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>@@ -529,6 +529,7 @@ static void xgpu_vi_mailbox_flr_work(struct
>work_struct *work)
>               reset_context.method = AMD_RESET_METHOD_NONE;
>               reset_context.reset_req_dev = adev;
>               clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
>+              set_bit(AMDGPU_HOST_FLR, &reset_context.flags);
>
>               amdgpu_device_gpu_recover(adev, NULL, &reset_context);
>       }
>--
>2.34.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function
  2024-04-26  3:57 ` [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function Yunxiang Li
@ 2024-04-28  7:17   ` Deng, Emily
  2024-04-30 18:32   ` Luo, Zhigang
  1 sibling, 0 replies; 18+ messages in thread
From: Deng, Emily @ 2024-04-28  7:17 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling, Felix

[AMD Official Use Only - General]

Reviewed-by: Emily Deng <Emily.Deng@amd.com>

Emily Deng
Best Wishes



>-----Original Message-----
>From: Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Sent: Friday, April 26, 2024 11:58 AM
>To: amd-gfx@lists.freedesktop.org
>Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian
><Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling,
>Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li,
>Yunxiang (Teddy) <Yunxiang.Li@amd.com>
>Subject: [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function
>
>This is part of the reset, move it into the reset function.
>
>Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++-------
> 1 file changed, 5 insertions(+), 7 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>index 3c4755f3c116..8f2c1f71ed9a 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>@@ -5119,6 +5119,11 @@ static int amdgpu_device_reset_sriov(struct
>amdgpu_device *adev,
>       amdgpu_amdkfd_post_reset(adev);
>       amdgpu_virt_release_full_gpu(adev, true);
>
>+      /* Aldebaran and gfx_11_0_3 support ras in SRIOV, so need resume
>ras during reset */
>+      if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 2) ||
>+          amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
>+          amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3))
>+              amdgpu_ras_resume(adev);
>       return 0;
> }
>
>@@ -5823,13 +5828,6 @@ int amdgpu_device_gpu_recover(struct
>amdgpu_device *adev,
>                       goto retry;
>               if (r)
>                       adev->asic_reset_res = r;
>-
>-              /* Aldebaran and gfx_11_0_3 support ras in SRIOV, so need
>resume ras during reset */
>-              if (amdgpu_ip_version(adev, GC_HWIP, 0) ==
>-                          IP_VERSION(9, 4, 2) ||
>-                  amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4,
>3) ||
>-                  amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11,
>0, 3))
>-                      amdgpu_ras_resume(adev);
>       } else {
>               r = amdgpu_do_asic_reset(device_list_handle,
>reset_context);
>               if (r && r == -EAGAIN)
>--
>2.34.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH v2 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic
  2024-04-26 18:29   ` [PATCH v2 " Yunxiang Li
  2024-04-28  7:16     ` Deng, Emily
@ 2024-04-30 18:25     ` Luo, Zhigang
  1 sibling, 0 replies; 18+ messages in thread
From: Luo, Zhigang @ 2024-04-30 18:25 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling,
	Felix, Deng, Emily, Li, Yunxiang (Teddy)

[AMD Official Use Only - General]

Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Yunxiang Li
Sent: Friday, April 26, 2024 2:29 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
Subject: [PATCH v2 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic

The retry loop for SRIOV reset have refcount and memory leak issue.
Depending on which function call fails it can potentially call amdgpu_amdkfd_pre/post_reset different number of times and causes kfd_locked count to be wrong. This will block all future attempts at opening /dev/kfd. The retry loop also leakes resources by calling amdgpu_virt_init_data_exchange multiple times without calling the corresponding fini function.

Align with the bare-metal reset path which doesn't have these issues.
This means taking the amdgpu_amdkfd_pre/post_reset functions out of the reset loop and calling amdgpu_device_pre_asic_reset each retry which properly free the resources from previous try by calling amdgpu_virt_fini_data_exchange.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v2: put back release full access and the missed return

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 47 ++++++++++------------
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 33c889c027a5..b23645f23a2e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5065,10 +5065,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,  {
        int r;
        struct amdgpu_hive_info *hive = NULL;
-       int retry_limit = 0;
-
-retry:
-       amdgpu_amdkfd_pre_reset(adev);

        if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
                clear_bit(AMDGPU_HOST_FLR, &reset_context->flags); @@ -5088,7 +5084,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
        /* Resume IP prior to SMC */
        r = amdgpu_device_ip_reinit_early_sriov(adev);
        if (r)
-               goto error;
+               return r;

        amdgpu_virt_init_data_exchange(adev);

@@ -5099,38 +5095,35 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
        /* now we are okay to resume SMC/CP/SDMA */
        r = amdgpu_device_ip_reinit_late_sriov(adev);
        if (r)
-               goto error;
+               return r;

        hive = amdgpu_get_xgmi_hive(adev);
        /* Update PSP FW topology after reset */
        if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
                r = amdgpu_xgmi_update_topology(hive, adev);
-
        if (hive)
                amdgpu_put_xgmi_hive(hive);
+       if (r)
+               return r;

-       if (!r) {
-               r = amdgpu_ib_ring_tests(adev);
-
-               amdgpu_amdkfd_post_reset(adev);
-       }
+       r = amdgpu_ib_ring_tests(adev);
+       if (r)
+               return r;

-error:
-       if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
+       if (adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
                amdgpu_inc_vram_lost(adev);
                r = amdgpu_device_recover_vram(adev);
        }
-       amdgpu_virt_release_full_gpu(adev, true);
+       if (r)
+               return r;

-       if (AMDGPU_RETRY_SRIOV_RESET(r)) {
-               if (retry_limit < AMDGPU_MAX_RETRY_LIMIT) {
-                       retry_limit++;
-                       goto retry;
-               } else
-                       DRM_ERROR("GPU reset retry is beyond the retry limit\n");
-       }
+       /* need to be called during full access so we can't do it later like
+        * bare-metal does.
+        */
+       amdgpu_amdkfd_post_reset(adev);
+       amdgpu_virt_release_full_gpu(adev, true);

-       return r;
+       return 0;
 }

 /**
@@ -5689,6 +5682,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
        int i, r = 0;
        bool need_emergency_restart = false;
        bool audio_suspended = false;
+       int retry_limit = AMDGPU_MAX_RETRY_LIMIT;

        /*
         * Special case: RAS triggered and full reset isn't supported @@ -5770,8 +5764,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,

                cancel_delayed_work_sync(&tmp_adev->delayed_init_work);

-               if (!amdgpu_sriov_vf(tmp_adev))
-                       amdgpu_amdkfd_pre_reset(tmp_adev);
+               amdgpu_amdkfd_pre_reset(tmp_adev);

                /*
                 * Mark these ASICs to be reseted as untracked first @@ -5830,6 +5823,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
        /* Host driver will handle XGMI hive reset for SRIOV */
        if (amdgpu_sriov_vf(adev)) {
                r = amdgpu_device_reset_sriov(adev, reset_context);
+               if (AMDGPU_RETRY_SRIOV_RESET(r) && (retry_limit--) > 0) {
+                       amdgpu_virt_release_full_gpu(adev, true);
+                       goto retry;
+               }
                if (r)
                        adev->asic_reset_res = r;

--
2.34.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* RE: [PATCH v4 2/4] drm/amdgpu: Add reset_context flag for host FLR
  2024-04-26 18:27   ` [PATCH v4 " Yunxiang Li
  2024-04-28  7:16     ` Deng, Emily
@ 2024-04-30 18:28     ` Luo, Zhigang
  1 sibling, 0 replies; 18+ messages in thread
From: Luo, Zhigang @ 2024-04-30 18:28 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling,
	Felix, Deng, Emily, Li, Yunxiang (Teddy)

[AMD Official Use Only - General]

Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Yunxiang Li
Sent: Friday, April 26, 2024 2:27 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
Subject: [PATCH v4 2/4] drm/amdgpu: Add reset_context flag for host FLR

There are other reset sources that pass NULL as the job pointer, such as amdgpu_amdkfd_reset_work. Therefore, using the job pointer to check if the FLR comes from the host does not work.

Add a flag in reset_context to explicitly mark host triggered reset, and set this flag when we receive host reset notification.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
v2: fix typo
v3: pass reset_context directly
v4: clear the flag in case we retry

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 13 ++++++++-----  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  1 +
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  1 +
 5 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 8befd10bf007..33c889c027a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5055,13 +5055,13 @@ static int amdgpu_device_recover_vram(struct amdgpu_device *adev)
  * amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf
  *
  * @adev: amdgpu_device pointer
- * @from_hypervisor: request from hypervisor
+ * @reset_context: amdgpu reset context pointer
  *
  * do VF FLR and reinitialize Asic
  * return 0 means succeeded otherwise failed
  */
 static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
-                                    bool from_hypervisor)
+                                    struct amdgpu_reset_context *reset_context)
 {
        int r;
        struct amdgpu_hive_info *hive = NULL;
@@ -5070,12 +5070,15 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 retry:
        amdgpu_amdkfd_pre_reset(adev);

-       if (from_hypervisor)
+       if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
+               clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
                r = amdgpu_virt_request_full_gpu(adev, true);
-       else
+       } else {
                r = amdgpu_virt_reset_gpu(adev);
+       }
        if (r)
                return r;
+
        amdgpu_ras_set_fed(adev, false);
        amdgpu_irq_gpu_reset_resume_helper(adev);

@@ -5826,7 +5829,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
        /* Actual ASIC resets if needed.*/
        /* Host driver will handle XGMI hive reset for SRIOV */
        if (amdgpu_sriov_vf(adev)) {
-               r = amdgpu_device_reset_sriov(adev, job ? false : true);
+               r = amdgpu_device_reset_sriov(adev, reset_context);
                if (r)
                        adev->asic_reset_res = r;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
index b11d190ece53..5a9cc043b858 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
@@ -33,6 +33,7 @@ enum AMDGPU_RESET_FLAGS {
        AMDGPU_NEED_FULL_RESET = 0,
        AMDGPU_SKIP_HW_RESET = 1,
        AMDGPU_SKIP_COREDUMP = 2,
+       AMDGPU_HOST_FLR = 3,
 };

 struct amdgpu_reset_context {
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index c5ba9c4757a8..f4c47492e0cd 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -292,6 +292,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
                reset_context.method = AMD_RESET_METHOD_NONE;
                reset_context.reset_req_dev = adev;
                clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+               set_bit(AMDGPU_HOST_FLR, &reset_context.flags);

                amdgpu_device_gpu_recover(adev, NULL, &reset_context);
        }
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index fa9d1b02f391..14cc7910e5cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -328,6 +328,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
                reset_context.method = AMD_RESET_METHOD_NONE;
                reset_context.reset_req_dev = adev;
                clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+               set_bit(AMDGPU_HOST_FLR, &reset_context.flags);

                amdgpu_device_gpu_recover(adev, NULL, &reset_context);
        }
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index 14a065516ae4..78cd07744ebe 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -529,6 +529,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
                reset_context.method = AMD_RESET_METHOD_NONE;
                reset_context.reset_req_dev = adev;
                clear_bit(AMDGPU_NEED_FULL_RESET, &reset_context.flags);
+               set_bit(AMDGPU_HOST_FLR, &reset_context.flags);

                amdgpu_device_gpu_recover(adev, NULL, &reset_context);
        }
--
2.34.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* RE: [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function
  2024-04-26  3:57 ` [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function Yunxiang Li
  2024-04-28  7:17   ` Deng, Emily
@ 2024-04-30 18:32   ` Luo, Zhigang
  1 sibling, 0 replies; 18+ messages in thread
From: Luo, Zhigang @ 2024-04-30 18:32 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian, Lazar, Lijo, Kuehling,
	Felix, Deng, Emily, Li, Yunxiang (Teddy)

[AMD Official Use Only - General]

Reviewed-by: Zhigang Luo <zhigang.luo@amd.com>

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Yunxiang Li
Sent: Thursday, April 25, 2024 11:58 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Lazar, Lijo <Lijo.Lazar@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>; Deng, Emily <Emily.Deng@amd.com>; Li, Yunxiang (Teddy) <Yunxiang.Li@amd.com>
Subject: [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function

This is part of the reset, move it into the reset function.

Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3c4755f3c116..8f2c1f71ed9a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5119,6 +5119,11 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
        amdgpu_amdkfd_post_reset(adev);
        amdgpu_virt_release_full_gpu(adev, true);

+       /* Aldebaran and gfx_11_0_3 support ras in SRIOV, so need resume ras during reset */
+       if (amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 2) ||
+           amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
+           amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3))
+               amdgpu_ras_resume(adev);
        return 0;
 }

@@ -5823,13 +5828,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
                        goto retry;
                if (r)
                        adev->asic_reset_res = r;
-
-               /* Aldebaran and gfx_11_0_3 support ras in SRIOV, so need resume ras during reset */
-               if (amdgpu_ip_version(adev, GC_HWIP, 0) ==
-                           IP_VERSION(9, 4, 2) ||
-                   amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(9, 4, 3) ||
-                   amdgpu_ip_version(adev, GC_HWIP, 0) == IP_VERSION(11, 0, 3))
-                       amdgpu_ras_resume(adev);
        } else {
                r = amdgpu_do_asic_reset(device_list_handle, reset_context);
                if (r && r == -EAGAIN)
--
2.34.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* RE: [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row
  2024-04-26  3:57 [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Yunxiang Li
                   ` (2 preceding siblings ...)
  2024-04-26  3:57 ` [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function Yunxiang Li
@ 2024-04-30 19:05 ` Li, Yunxiang (Teddy)
  2024-05-02  7:15   ` Christian König
  2024-04-30 19:09 ` Lazar, Lijo
  4 siblings, 1 reply; 18+ messages in thread
From: Li, Yunxiang (Teddy) @ 2024-04-30 19:05 UTC (permalink / raw)
  To: amd-gfx, Koenig,  Christian
  Cc: Deucher, Alexander, Lazar, Lijo, Kuehling, Felix, Deng, Emily

[Public]

Hi Christ,

I got R-b from the SRIOV team for the rest of the patches, can you help review this last one? I think the concerns from the previous thread are all addressed https://patchwork.freedesktop.org/patch/590678/?series=132727

Regards,
Teddy

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row
  2024-04-26  3:57 [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Yunxiang Li
                   ` (3 preceding siblings ...)
  2024-04-30 19:05 ` [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Li, Yunxiang (Teddy)
@ 2024-04-30 19:09 ` Lazar, Lijo
  4 siblings, 0 replies; 18+ messages in thread
From: Lazar, Lijo @ 2024-04-30 19:09 UTC (permalink / raw)
  To: Yunxiang Li, amd-gfx
  Cc: Alexander.Deucher, christian.koenig, felix.kuehling, emily.deng



On 4/26/2024 9:27 AM, Yunxiang Li wrote:
> Some times a hang GPU causes multiple reset sources to schedule resets.
> The second source will be able to trigger an unnecessary reset if they
> schedule after we call amdgpu_device_stop_pending_resets.
> 
> Move amdgpu_device_stop_pending_resets to after the reset is done. Since
> at this point the GPU is supposedly in a good state, any reset scheduled
> after this point would be a legitimate reset.
> 
> Remove unnecessary and incorrect checks for amdgpu_in_reset that was
> kinda serving this purpose.
> 
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>

Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>

Thanks,
Lijo
> ---
> v2: instead of adding amdgpu_in_reset check, move when we cancel pending
> resets
> v3: no changes from v2, collect all the patches in one series for easier review
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 ++++++++++---------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   |  2 +-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  2 +-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  2 +-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  2 +-
>  5 files changed, 14 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 861ccff78af9..8befd10bf007 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5070,8 +5070,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>  retry:
>  	amdgpu_amdkfd_pre_reset(adev);
>  
> -	amdgpu_device_stop_pending_resets(adev);
> -
>  	if (from_hypervisor)
>  		r = amdgpu_virt_request_full_gpu(adev, true);
>  	else
> @@ -5823,13 +5821,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  				  r, adev_to_drm(tmp_adev)->unique);
>  			tmp_adev->asic_reset_res = r;
>  		}
> -
> -		if (!amdgpu_sriov_vf(tmp_adev))
> -			/*
> -			* Drop all pending non scheduler resets. Scheduler resets
> -			* were already dropped during drm_sched_stop
> -			*/
> -			amdgpu_device_stop_pending_resets(tmp_adev);
>  	}
>  
>  	/* Actual ASIC resets if needed.*/
> @@ -5851,6 +5842,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  			goto retry;
>  	}
>  
> +	list_for_each_entry(tmp_adev, device_list_handle, reset_list) {
> +		/*
> +		 * Drop any pending non scheduler resets queued before reset is done.
> +		 * Any reset scheduled after this point would be valid. Scheduler resets
> +		 * were already dropped during drm_sched_stop and no new ones can come
> +		 * in before drm_sched_start.
> +		 */
> +		amdgpu_device_stop_pending_resets(tmp_adev);
> +	}
> +
>  skip_hw_reset:
>  
>  	/* Post ASIC reset for all devs .*/
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index 54ab51a4ada7..c2385178d6b3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -597,7 +597,7 @@ static void amdgpu_virt_update_vf2pf_work_item(struct work_struct *work)
>  	if (ret) {
>  		adev->virt.vf2pf_update_retry_cnt++;
>  		if ((adev->virt.vf2pf_update_retry_cnt >= AMDGPU_VF2PF_UPDATE_MAX_RETRY_LIMIT) &&
> -		    amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev)) {
> +		    amdgpu_sriov_runtime(adev)) {
>  			amdgpu_ras_set_fed(adev, true);
>  			if (amdgpu_reset_domain_schedule(adev->reset_domain,
>  							  &adev->virt.flr_work))
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index 0c7275bca8f7..c5ba9c4757a8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -319,7 +319,7 @@ static int xgpu_ai_mailbox_rcv_irq(struct amdgpu_device *adev,
>  
>  	switch (event) {
>  		case IDH_FLR_NOTIFICATION:
> -		if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
> +		if (amdgpu_sriov_runtime(adev))
>  			WARN_ONCE(!amdgpu_reset_domain_schedule(adev->reset_domain,
>  								&adev->virt.flr_work),
>  				  "Failed to queue work! at %s",
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> index aba00d961627..fa9d1b02f391 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
> @@ -358,7 +358,7 @@ static int xgpu_nv_mailbox_rcv_irq(struct amdgpu_device *adev,
>  
>  	switch (event) {
>  	case IDH_FLR_NOTIFICATION:
> -		if (amdgpu_sriov_runtime(adev) && !amdgpu_in_reset(adev))
> +		if (amdgpu_sriov_runtime(adev))
>  			WARN_ONCE(!amdgpu_reset_domain_schedule(adev->reset_domain,
>  				   &adev->virt.flr_work),
>  				  "Failed to queue work! at %s",
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> index 59f53c743362..14a065516ae4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> @@ -560,7 +560,7 @@ static int xgpu_vi_mailbox_rcv_irq(struct amdgpu_device *adev,
>  		r = xgpu_vi_mailbox_rcv_msg(adev, IDH_FLR_NOTIFICATION);
>  
>  		/* only handle FLR_NOTIFY now */
> -		if (!r && !amdgpu_in_reset(adev))
> +		if (!r)
>  			WARN_ONCE(!amdgpu_reset_domain_schedule(adev->reset_domain,
>  								&adev->virt.flr_work),
>  				  "Failed to queue work! at %s",

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row
  2024-04-30 19:05 ` [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Li, Yunxiang (Teddy)
@ 2024-05-02  7:15   ` Christian König
  0 siblings, 0 replies; 18+ messages in thread
From: Christian König @ 2024-05-02  7:15 UTC (permalink / raw)
  To: Li, Yunxiang (Teddy), amd-gfx
  Cc: Deucher, Alexander, Lazar, Lijo, Kuehling, Felix, Deng, Emily

Am 30.04.24 um 21:05 schrieb Li, Yunxiang (Teddy):
> [Public]
>
> Hi Christ,
>
> I got R-b from the SRIOV team for the rest of the patches, can you help review this last one? I think the concerns from the previous thread are all addressed https://patchwork.freedesktop.org/patch/590678/?series=132727

I don't think I can help here since I'm not familiar with the RAS code 
either.

But I've seen that you already got an rb from Hawking's team, that 
should be sufficient I think.

Regards,
Christian.

>
> Regards,
> Teddy


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2024-05-02  7:16 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-26  3:57 [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Yunxiang Li
2024-04-26  3:57 ` [PATCH v3 2/4] drm/amdgpu: Add reset_context flag for host FLR Yunxiang Li
2024-04-26 18:27   ` [PATCH v4 " Yunxiang Li
2024-04-28  7:16     ` Deng, Emily
2024-04-30 18:28     ` Luo, Zhigang
2024-04-26  3:57 ` [PATCH 3/4] drm/amdgpu: Fix amdgpu_device_reset_sriov retry logic Yunxiang Li
2024-04-26  7:40   ` Christian König
2024-04-26  8:42   ` Deng, Emily
2024-04-26 18:23     ` Li, Yunxiang (Teddy)
2024-04-26 18:29   ` [PATCH v2 " Yunxiang Li
2024-04-28  7:16     ` Deng, Emily
2024-04-30 18:25     ` Luo, Zhigang
2024-04-26  3:57 ` [PATCH 4/4] drm/amdgpu: Move ras resume into SRIOV function Yunxiang Li
2024-04-28  7:17   ` Deng, Emily
2024-04-30 18:32   ` Luo, Zhigang
2024-04-30 19:05 ` [PATCH v3 1/4] drm/amdgpu: Fix two reset triggered in a row Li, Yunxiang (Teddy)
2024-05-02  7:15   ` Christian König
2024-04-30 19:09 ` Lazar, Lijo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.