[PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF


* [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF
@ 2021-12-07 16:57 Zhigang Luo
  2021-12-07 16:57 ` [PATCH 2/4] drm/amdgpu: initialize XGMI for SRIOV VF during recover Zhigang Luo
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Zhigang Luo @ 2021-12-07 16:57 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zhigang Luo

On SRIOV, the host driver can perform an FLR (function level reset) on an
individual VF within the hive, which may bring that device back to normal
without the need for a hive reset. If the FLR fails, the host driver triggers
a hive reset, and each guest VF receives a reset notification before the real
hive reset is executed. The VF device can then handle the reset request
individually in its reset work handler.

This change updates the GPU recovery sequence to skip resetting the other
devices in the same hive for an SRIOV VF.

Signed-off-by: Zhigang Luo <zhigang.luo@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3c5afa45173c..474f8ea58aa5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4746,7 +4746,7 @@ static int amdgpu_device_lock_hive_adev(struct amdgpu_device *adev, struct amdgp
 {
 	struct amdgpu_device *tmp_adev = NULL;
 
-	if (adev->gmc.xgmi.num_physical_nodes > 1) {
+	if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) {
 		if (!hive) {
 			dev_err(adev->dev, "Hive is NULL while device has multiple xgmi nodes");
 			return -ENODEV;
@@ -4958,7 +4958,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	 * We always reset all schedulers for device and all devices for XGMI
 	 * hive so that should take care of them too.
 	 */
-	hive = amdgpu_get_xgmi_hive(adev);
+	if (!amdgpu_sriov_vf(adev))
+		hive = amdgpu_get_xgmi_hive(adev);
 	if (hive) {
 		if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
 			DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
@@ -4999,7 +5000,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	 * to put adev in the 1st position.
 	 */
 	INIT_LIST_HEAD(&device_list);
-	if (adev->gmc.xgmi.num_physical_nodes > 1) {
+	if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) {
 		list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
 			list_add_tail(&tmp_adev->reset_list, &device_list);
 		if (!list_is_first(&adev->reset_list, &device_list))
-- 
2.17.1
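
As a rough illustration of the behaviour patch 1/4 describes, the standalone
C sketch below models how the set of devices to reset could be chosen. It is
not the driver code: every sketch_* type, field and function is a hypothetical
stand-in, and only the condition (skip the hive when the device is an SRIOV VF
or has a single XGMI node, and treat a missing hive on a multi-node bare-metal
device as an error) is taken from the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_HIVE_NODES 8

/* Hypothetical, simplified stand-in for struct amdgpu_device. */
struct sketch_device {
	int id;
	bool is_sriov_vf;       /* models amdgpu_sriov_vf(adev) */
	int num_physical_nodes; /* models adev->gmc.xgmi.num_physical_nodes */
};

/* Hypothetical stand-in for the XGMI hive's device list. */
struct sketch_hive {
	struct sketch_device *nodes[SKETCH_MAX_HIVE_NODES];
	size_t count;
};

/*
 * Fill out[] with the devices to reset, requesting device first.
 * Returns the number of entries, or -1 if a multi-node bare-metal
 * device has no hive (the diff returns -ENODEV in that case).
 */
int sketch_build_reset_list(struct sketch_device *dev,
			    const struct sketch_hive *hive,
			    struct sketch_device **out, size_t cap)
{
	size_t n = 0;

	assert(dev != NULL && out != NULL && cap > 0);

	out[n++] = dev;

	/* An SRIOV VF (or a single-node device) only resets itself. */
	if (dev->is_sriov_vf || dev->num_physical_nodes <= 1)
		return (int)n;

	if (hive == NULL)
		return -1;

	/* Bare metal with XGMI: append every other device in the hive. */
	for (size_t i = 0; i < hive->count && n < cap; i++) {
		if (hive->nodes[i] != dev)
			out[n++] = hive->nodes[i];
	}
	return (int)n;
}
```

With a three-node hive, a bare-metal device yields a list of three entries
with itself first, while a VF in the same hive yields a list of one.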



* [PATCH 2/4] drm/amdgpu: initialize XGMI for SRIOV VF during recover
  2021-12-07 16:57 [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Zhigang Luo
@ 2021-12-07 16:57 ` Zhigang Luo
  2021-12-07 16:57 ` [PATCH 3/4] drm/amdgpu: recover XGMI topology for SRIOV VF after reset Zhigang Luo
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Zhigang Luo @ 2021-12-07 16:57 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zhigang Luo

For SRIOV VF, XGMI was not initialized during recovery. This change adds
XGMI initialization for SRIOV VF during recovery.

Signed-off-by: Zhigang Luo <zhigang.luo@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index b48d68d30d80..103bcadbc8b8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -2452,6 +2452,18 @@ static int psp_load_fw(struct amdgpu_device *adev)
 		return ret;
 	}
 
+	if (amdgpu_sriov_vf(adev) && amdgpu_in_reset(adev)) {
+		if (adev->gmc.xgmi.num_physical_nodes > 1) {
+			ret = psp_xgmi_initialize(psp, false, true);
+			/* Warn about the XGMI session initialization failure
+			 * instead of stopping driver initialization
+			 */
+			if (ret)
+				dev_err(psp->adev->dev,
+					"XGMI: Failed to initialize XGMI session\n");
+		}
+	}
+
 	if (psp->ta_fw) {
 		ret = psp_ras_initialize(psp);
 		if (ret)
-- 
2.17.1
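
The control flow added by patch 2/4 can be sketched in isolation as follows.
This is a minimal model under stated assumptions, not the PSP code: the
sketch_* names are hypothetical, and the XGMI session init is replaced by a
stub whose result is set by the caller. What it tries to show is that the
init only runs for an SRIOV VF that is in reset and has more than one XGMI
node, and that a failure is logged rather than propagated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical state; the real checks are amdgpu_sriov_vf() and
 * amdgpu_in_reset() on struct amdgpu_device. */
struct sketch_fw_state {
	bool is_sriov_vf;
	bool in_reset;
	int num_physical_nodes;
	int xgmi_init_result; /* what the stubbed init should return */
	int xgmi_init_calls;  /* how often the stub was invoked */
};

/* Stub standing in for the XGMI session initialization. */
int sketch_xgmi_session_init(struct sketch_fw_state *s)
{
	s->xgmi_init_calls++;
	return s->xgmi_init_result;
}

/* Models the tail of firmware loading: always returns 0 here because
 * an XGMI session failure is only reported, never treated as fatal. */
int sketch_load_fw_tail(struct sketch_fw_state *s)
{
	assert(s != NULL);

	if (s->is_sriov_vf && s->in_reset && s->num_physical_nodes > 1) {
		int ret = sketch_xgmi_session_init(s);

		if (ret)
			fprintf(stderr,
				"XGMI: Failed to initialize XGMI session (%d)\n",
				ret);
	}
	return 0;
}
```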



* [PATCH 3/4] drm/amdgpu: recover XGMI topology for SRIOV VF after reset
  2021-12-07 16:57 [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Zhigang Luo
  2021-12-07 16:57 ` [PATCH 2/4] drm/amdgpu: initialize XGMI for SRIOV VF during recover Zhigang Luo
@ 2021-12-07 16:57 ` Zhigang Luo
  2021-12-07 16:57 ` [PATCH 4/4] drm/amdgpu: extended waiting SRIOV VF reset completion timeout to 10s Zhigang Luo
  2021-12-07 19:14 ` [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Liu, Shaoyun
  3 siblings, 0 replies; 7+ messages in thread
From: Zhigang Luo @ 2021-12-07 16:57 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zhigang Luo

For SRIOV VF, the XGMI topology was not recovered after reset. This
change adds code to the SRIOV VF reset function to update the XGMI
topology after reset.

Signed-off-by: Zhigang Luo <zhigang.luo@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 474f8ea58aa5..7b07af1873bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4284,6 +4284,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 				     bool from_hypervisor)
 {
 	int r;
+	struct amdgpu_hive_info *hive = NULL;
 
 	amdgpu_amdkfd_pre_reset(adev);
 
@@ -4312,9 +4313,19 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 	if (r)
 		goto error;
 
-	amdgpu_irq_gpu_reset_resume_helper(adev);
-	r = amdgpu_ib_ring_tests(adev);
-	amdgpu_amdkfd_post_reset(adev);
+	hive = amdgpu_get_xgmi_hive(adev);
+	/* Update PSP FW topology after reset */
+	if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
+		r = amdgpu_xgmi_update_topology(hive, adev);
+
+	if (hive)
+		amdgpu_put_xgmi_hive(hive);
+
+	if (!r) {
+		amdgpu_irq_gpu_reset_resume_helper(adev);
+		r = amdgpu_ib_ring_tests(adev);
+		amdgpu_amdkfd_post_reset(adev);
+	}
 
 error:
 	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
-- 
2.17.1
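
The get/update/put sequence in patch 3/4 follows a common reference-counting
pattern, sketched below in plain C. The sketch_* names and the integer
refcount are assumptions for illustration only; the driver uses
amdgpu_get_xgmi_hive(), amdgpu_xgmi_update_topology() and
amdgpu_put_xgmi_hive() as shown in the diff. The points being modelled are
that the reference is dropped whether or not the update succeeds, and that
the ring tests only run when the update did not fail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hive with a plain counter instead of a kobject ref. */
struct sketch_hive {
	int refcount;
	int topology_updates;
	int update_result; /* what the stubbed topology update returns */
};

struct sketch_hive *sketch_get_hive(struct sketch_hive *h)
{
	if (h)
		h->refcount++;
	return h;
}

void sketch_put_hive(struct sketch_hive *h)
{
	h->refcount--;
}

int sketch_update_topology(struct sketch_hive *h)
{
	h->topology_updates++;
	return h->update_result;
}

/* Models the tail of the SRIOV reset path after the patch. */
int sketch_reset_tail(struct sketch_hive *registered,
		      int num_physical_nodes, int *ring_tests_run)
{
	struct sketch_hive *hive;
	int r = 0;

	assert(ring_tests_run != NULL);

	hive = sketch_get_hive(registered);
	/* Update the topology only when there is more than one node. */
	if (hive && num_physical_nodes > 1)
		r = sketch_update_topology(hive);

	/* Drop the reference on both the success and failure paths. */
	if (hive)
		sketch_put_hive(hive);

	/* Resume and test the rings only if the update succeeded. */
	if (!r)
		(*ring_tests_run)++;

	return r;
}
```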



* [PATCH 4/4] drm/amdgpu: extended waiting SRIOV VF reset completion timeout to 10s
  2021-12-07 16:57 [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Zhigang Luo
  2021-12-07 16:57 ` [PATCH 2/4] drm/amdgpu: initialize XGMI for SRIOV VF during recover Zhigang Luo
  2021-12-07 16:57 ` [PATCH 3/4] drm/amdgpu: recover XGMI topology for SRIOV VF after reset Zhigang Luo
@ 2021-12-07 16:57 ` Zhigang Luo
  2021-12-07 19:14 ` [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Liu, Shaoyun
  3 siblings, 0 replies; 7+ messages in thread
From: Zhigang Luo @ 2021-12-07 16:57 UTC (permalink / raw)
  To: amd-gfx; +Cc: Zhigang Luo

ASICs with a large FB need more time to clear the FB during reset.
This change extends the timeout for waiting on SRIOV VF reset completion
from 5s to 10s.

Signed-off-by: Zhigang Luo <zhigang.luo@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h
index bd3b23171579..f9aa4d0bb638 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h
@@ -26,7 +26,7 @@
 
 #define AI_MAILBOX_POLL_ACK_TIMEDOUT	500
 #define AI_MAILBOX_POLL_MSG_TIMEDOUT	6000
-#define AI_MAILBOX_POLL_FLR_TIMEDOUT	5000
+#define AI_MAILBOX_POLL_FLR_TIMEDOUT	10000
 #define AI_MAILBOX_POLL_MSG_REP_MAX	11
 
 enum idh_request {
-- 
2.17.1
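
To show why the larger value in patch 4/4 matters, here is a small polling
sketch with a simulated clock. It assumes the constant is in milliseconds
(consistent with the 5s to 10s wording in the commit message); the 10 ms poll
step, the sketch_* names and the callback interface are hypothetical and are
not taken from the mailbox code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_POLL_STEP_MS 10 /* assumed poll interval */

/*
 * Poll flr_done() until it reports completion or timeout_ms elapses.
 * Returns 0 on completion and -1 on timeout. Time is simulated; a
 * real driver would sleep between polls.
 */
int sketch_wait_flr_done(bool (*flr_done)(int elapsed_ms, void *ctx),
			 void *ctx, int timeout_ms)
{
	assert(flr_done != NULL && timeout_ms > 0);

	for (int elapsed = 0; elapsed < timeout_ms;
	     elapsed += SKETCH_POLL_STEP_MS) {
		if (flr_done(elapsed, ctx))
			return 0;
	}
	return -1;
}

/* Example condition: the reset completes once *ctx ms have passed. */
bool sketch_done_after(int elapsed_ms, void *ctx)
{
	return elapsed_ms >= *(int *)ctx;
}
```

Under these assumptions, a reset that needs 7 seconds to clear the FB times
out with the old 5000 ms limit and completes with the new 10000 ms limit.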



* RE: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF
  2021-12-07 16:57 [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Zhigang Luo
                   ` (2 preceding siblings ...)
  2021-12-07 16:57 ` [PATCH 4/4] drm/amdgpu: extended waiting SRIOV VF reset completion timeout to 10s Zhigang Luo
@ 2021-12-07 19:14 ` Liu, Shaoyun
  2021-12-07 21:55   ` Luo, Zhigang
  3 siblings, 1 reply; 7+ messages in thread
From: Liu, Shaoyun @ 2021-12-07 19:14 UTC (permalink / raw)
  To: Luo, Zhigang, amd-gfx; +Cc: Luo, Zhigang

[AMD Official Use Only]

This patch looks OK to me.
Patch 2 actually adds only the PSP XGMI init, not the whole XGMI init. Can you change the description accordingly?
Patch 3: you take the hive lock inside the reset SRIOV function, but the hive lock has already been taken in the gpu_recovery function before this function is called. Is it really necessary to get the hive inside the reset SRIOV function? Can you try removing the code that checks the hive? Or maybe pass the hive as a parameter into this function if it is needed?
Patch 4 looks OK to me, but an SRDC engineer may need to confirm that it won't have side effects on other AI ASICs.

Regards
Shaoyun.liu


* RE: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF
  2021-12-07 19:14 ` [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Liu, Shaoyun
@ 2021-12-07 21:55   ` Luo, Zhigang
  2021-12-08  2:23     ` Liu, Shaoyun
  0 siblings, 1 reply; 7+ messages in thread
From: Luo, Zhigang @ 2021-12-07 21:55 UTC (permalink / raw)
  To: Liu, Shaoyun, amd-gfx

[AMD Official Use Only]

Shaoyun, please see my comments inline.

Thanks,
Zhigang

-----Original Message-----
From: Liu, Shaoyun <Shaoyun.Liu@amd.com> 
Sent: December 7, 2021 2:15 PM
To: Luo, Zhigang <Zhigang.Luo@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang <Zhigang.Luo@amd.com>
Subject: RE: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF

[AMD Official Use Only]

This patch looks OK to me.
Patch 2 actually adds only the PSP XGMI init, not the whole XGMI init. Can you change the description accordingly?
[Zhigang] OK, will change it.
Patch 3: you take the hive lock inside the reset SRIOV function, but the hive lock has already been taken in the gpu_recovery function before this function is called. Is it really necessary to get the hive inside the reset SRIOV function? Can you try removing the code that checks the hive? Or maybe pass the hive as a parameter into this function if it is needed?
[Zhigang] In patch 1, we changed gpu_recovery to skip getting the XGMI hive for an SRIOV VF, as we don't want to reset the other VFs in the same hive.
Patch 4 looks OK to me, but an SRDC engineer may need to confirm that it won't have side effects on other AI ASICs.

Regards
Shaoyun.liu


* RE: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF
  2021-12-07 21:55   ` Luo, Zhigang
@ 2021-12-08  2:23     ` Liu, Shaoyun
  0 siblings, 0 replies; 7+ messages in thread
From: Liu, Shaoyun @ 2021-12-08  2:23 UTC (permalink / raw)
  To: Luo, Zhigang, amd-gfx

[AMD Official Use Only]

OK, sounds reasonable. With the suggested modification,
patches 1, 2 and 3 are Reviewed-by: Shaoyun.liu <Shaoyun.liu@amd.com>. Patch 4 is Acked-by: Shaoyun.liu <Shaoyun.liu@amd.com>.

Regards
Shaoyun.liu

