* [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF
@ 2021-12-07 16:57 Zhigang Luo
2021-12-07 16:57 ` [PATCH 2/4] drm/amdgpu: initialize XGMI for SRIOV VF during recover Zhigang Luo
` (3 more replies)
0 siblings, 4 replies; 7+ messages in thread
From: Zhigang Luo @ 2021-12-07 16:57 UTC (permalink / raw)
To: amd-gfx; +Cc: Zhigang Luo
On SRIOV, the host driver can support FLR (function level reset) on an
individual VF within the hive, which might bring that device back to normal
without the need to execute a hive reset. If the FLR fails, the host driver
will trigger the hive reset, and each guest VF will get a reset notification
before the real hive reset is executed. The VF device can handle the reset
request individually in its reset work handler.
This change updates the GPU recover sequence to skip resetting other devices
in the same hive for an SRIOV VF.
Signed-off-by: Zhigang Luo <zhigang.luo@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 3c5afa45173c..474f8ea58aa5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4746,7 +4746,7 @@ static int amdgpu_device_lock_hive_adev(struct amdgpu_device *adev, struct amdgp
{
struct amdgpu_device *tmp_adev = NULL;
- if (adev->gmc.xgmi.num_physical_nodes > 1) {
+ if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) {
if (!hive) {
dev_err(adev->dev, "Hive is NULL while device has multiple xgmi nodes");
return -ENODEV;
@@ -4958,7 +4958,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
* We always reset all schedulers for device and all devices for XGMI
* hive so that should take care of them too.
*/
- hive = amdgpu_get_xgmi_hive(adev);
+ if (!amdgpu_sriov_vf(adev))
+ hive = amdgpu_get_xgmi_hive(adev);
if (hive) {
if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
@@ -4999,7 +5000,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
* to put adev in the 1st position.
*/
INIT_LIST_HEAD(&device_list);
- if (adev->gmc.xgmi.num_physical_nodes > 1) {
+ if (!amdgpu_sriov_vf(adev) && (adev->gmc.xgmi.num_physical_nodes > 1)) {
list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
list_add_tail(&tmp_adev->reset_list, &device_list);
if (!list_is_first(&adev->reset_list, &device_list))
--
2.17.1
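The effect of the two `num_physical_nodes` checks above can be pictured with a small stand-alone model. This is only a sketch: `struct sketch_dev` and `sketch_reset_list_len()` are hypothetical names, not driver code, and it assumes that a device which does not take the hive branch ends up resetting only itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct amdgpu_device. */
struct sketch_dev {
	bool is_sriov_vf;             /* models amdgpu_sriov_vf(adev) */
	unsigned int num_physical_nodes; /* models adev->gmc.xgmi.num_physical_nodes */
};

/*
 * Returns how many devices the recover path would put on its reset list.
 * hive_size is the number of devices in the XGMI hive (assumed known).
 */
size_t sketch_reset_list_len(const struct sketch_dev *dev, size_t hive_size)
{
	assert(dev != NULL);

	/* Only a non-VF device in a multi-node hive pulls in its peers. */
	if (!dev->is_sriov_vf && dev->num_physical_nodes > 1)
		return hive_size;

	/* An SRIOV VF, or a single-node device, resets only itself. */
	return 1;
}
```

With this model, a VF in a four-node hive yields a list of one device, while a bare-metal device in the same hive yields four.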
* [PATCH 2/4] drm/amdgpu: initialize XGMI for SRIOV VF during recover
2021-12-07 16:57 [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Zhigang Luo
@ 2021-12-07 16:57 ` Zhigang Luo
2021-12-07 16:57 ` [PATCH 3/4] drm/amdgpu: recover XGMI topology for SRIOV VF after reset Zhigang Luo
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: Zhigang Luo @ 2021-12-07 16:57 UTC (permalink / raw)
To: amd-gfx; +Cc: Zhigang Luo
For an SRIOV VF, XGMI was not initialized during recover. This change adds
XGMI initialization for an SRIOV VF during recover.
Signed-off-by: Zhigang Luo <zhigang.luo@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index b48d68d30d80..103bcadbc8b8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -2452,6 +2452,18 @@ static int psp_load_fw(struct amdgpu_device *adev)
return ret;
}
+ if (amdgpu_sriov_vf(adev) && amdgpu_in_reset(adev)) {
+ if (adev->gmc.xgmi.num_physical_nodes > 1) {
+ ret = psp_xgmi_initialize(psp, false, true);
+ /* Warn about the XGMI session initialization failure
+ * instead of stopping driver initialization
+ */
+ if (ret)
+ dev_err(psp->adev->dev,
+ "XGMI: Failed to initialize XGMI session\n");
+ }
+ }
+
if (psp->ta_fw) {
ret = psp_ras_initialize(psp);
if (ret)
--
2.17.1
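The hunk above logs an XGMI session failure but does not abort firmware loading. A minimal stand-alone sketch of that "warn and continue" pattern follows; `struct sketch_state`, `sketch_load_fw_tail()` and the two helper callbacks are hypothetical names used for illustration only, and the real `psp_xgmi_initialize()` is replaced by a function pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for the state the hunk inspects. */
struct sketch_state {
	bool is_sriov_vf;                /* models amdgpu_sriov_vf(adev) */
	bool in_reset;                   /* models amdgpu_in_reset(adev) */
	unsigned int num_physical_nodes; /* models gmc.xgmi.num_physical_nodes */
	int xgmi_init_calls;             /* bookkeeping for this sketch only */
};

typedef int (*sketch_xgmi_init_fn)(struct sketch_state *s);

/* Illustrative callbacks: one that succeeds, one that fails. */
int sketch_init_ok(struct sketch_state *s)
{
	s->xgmi_init_calls++;
	return 0;
}

int sketch_init_fail(struct sketch_state *s)
{
	s->xgmi_init_calls++;
	return -1;
}

/* Runs XGMI session init only for a VF in reset with more than one node. */
int sketch_load_fw_tail(struct sketch_state *s, sketch_xgmi_init_fn init)
{
	assert(s != NULL && init != NULL);

	if (s->is_sriov_vf && s->in_reset && s->num_physical_nodes > 1) {
		int ret = init(s);

		/* Report the failure, but do not propagate it to the caller. */
		if (ret)
			fprintf(stderr, "XGMI: failed to initialize XGMI session (%d)\n", ret);
	}

	return 0;
}
```

In this sketch the function returns 0 even when the callback fails, which mirrors the comment in the patch about not stopping driver initialization.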
* [PATCH 3/4] drm/amdgpu: recover XGMI topology for SRIOV VF after reset
2021-12-07 16:57 [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Zhigang Luo
2021-12-07 16:57 ` [PATCH 2/4] drm/amdgpu: initialize XGMI for SRIOV VF during recover Zhigang Luo
@ 2021-12-07 16:57 ` Zhigang Luo
2021-12-07 16:57 ` [PATCH 4/4] drm/amdgpu: extended waiting SRIOV VF reset completion timeout to 10s Zhigang Luo
2021-12-07 19:14 ` [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Liu, Shaoyun
3 siblings, 0 replies; 7+ messages in thread
From: Zhigang Luo @ 2021-12-07 16:57 UTC (permalink / raw)
To: amd-gfx; +Cc: Zhigang Luo
For an SRIOV VF, the XGMI topology was not recovered after reset. This
change adds code to the SRIOV VF reset function to update the XGMI
topology after reset.
Signed-off-by: Zhigang Luo <zhigang.luo@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 474f8ea58aa5..7b07af1873bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4284,6 +4284,7 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
bool from_hypervisor)
{
int r;
+ struct amdgpu_hive_info *hive = NULL;
amdgpu_amdkfd_pre_reset(adev);
@@ -4312,9 +4313,19 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
if (r)
goto error;
- amdgpu_irq_gpu_reset_resume_helper(adev);
- r = amdgpu_ib_ring_tests(adev);
- amdgpu_amdkfd_post_reset(adev);
+ hive = amdgpu_get_xgmi_hive(adev);
+ /* Update PSP FW topology after reset */
+ if (hive && adev->gmc.xgmi.num_physical_nodes > 1)
+ r = amdgpu_xgmi_update_topology(hive, adev);
+
+ if (hive)
+ amdgpu_put_xgmi_hive(hive);
+
+ if (!r) {
+ amdgpu_irq_gpu_reset_resume_helper(adev);
+ r = amdgpu_ib_ring_tests(adev);
+ amdgpu_amdkfd_post_reset(adev);
+ }
error:
if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
--
2.17.1
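Two things happen in the hunk above: the hive reference is taken and dropped in a balanced way, and the resume steps run only if the topology update did not fail. The following stand-alone sketch models both. All names (`struct sketch_hive`, `sketch_get_hive()`, `sketch_put_hive()`, `sketch_reset_tail()`) are hypothetical, the reference count is a plain integer rather than a kernel kref, and the topology update result is passed in as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted hive object. */
struct sketch_hive {
	int refcount;
};

/* Takes a reference if a hive exists; returns NULL otherwise. */
struct sketch_hive *sketch_get_hive(struct sketch_hive *h)
{
	if (h)
		h->refcount++;
	return h;
}

/* Drops a reference previously taken with sketch_get_hive(). */
void sketch_put_hive(struct sketch_hive *h)
{
	assert(h != NULL && h->refcount > 0);
	h->refcount--;
}

/*
 * topo_result models the return value of the topology update.
 * *resumed is set to true only if the post-reset resume steps would run.
 */
int sketch_reset_tail(struct sketch_hive *maybe_hive, unsigned int nodes,
		      int topo_result, bool *resumed)
{
	int r = 0;
	struct sketch_hive *hive = sketch_get_hive(maybe_hive);

	assert(resumed != NULL);
	*resumed = false;

	/* Update the topology only when the device is part of a multi-node hive. */
	if (hive && nodes > 1)
		r = topo_result;

	/* Drop the reference on every path, success or failure. */
	if (hive)
		sketch_put_hive(hive);

	/* IRQ resume, ring tests and KFD post-reset are skipped on error. */
	if (!r)
		*resumed = true;

	return r;
}
```

The point of the sketch is that the reference count is unchanged after the call regardless of the topology result, and that a topology failure is returned to the caller without running the resume steps.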
* [PATCH 4/4] drm/amdgpu: extended waiting SRIOV VF reset completion timeout to 10s
2021-12-07 16:57 [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Zhigang Luo
2021-12-07 16:57 ` [PATCH 2/4] drm/amdgpu: initialize XGMI for SRIOV VF during recover Zhigang Luo
2021-12-07 16:57 ` [PATCH 3/4] drm/amdgpu: recover XGMI topology for SRIOV VF after reset Zhigang Luo
@ 2021-12-07 16:57 ` Zhigang Luo
2021-12-07 19:14 ` [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Liu, Shaoyun
3 siblings, 0 replies; 7+ messages in thread
From: Zhigang Luo @ 2021-12-07 16:57 UTC (permalink / raw)
To: amd-gfx; +Cc: Zhigang Luo
For ASICs with a large FB, more time is needed to clear the FB during
reset. This change extends the timeout for an SRIOV VF waiting for reset
completion from 5s to 10s.
Signed-off-by: Zhigang Luo <zhigang.luo@amd.com>
---
drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h
index bd3b23171579..f9aa4d0bb638 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.h
@@ -26,7 +26,7 @@
#define AI_MAILBOX_POLL_ACK_TIMEDOUT 500
#define AI_MAILBOX_POLL_MSG_TIMEDOUT 6000
-#define AI_MAILBOX_POLL_FLR_TIMEDOUT 5000
+#define AI_MAILBOX_POLL_FLR_TIMEDOUT 10000
#define AI_MAILBOX_POLL_MSG_REP_MAX 11
enum idh_request {
--
2.17.1
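To illustrate why the larger constant matters, here is a stand-alone simulation of a bounded polling loop. It is a sketch under assumptions: the polling code that consumes `AI_MAILBOX_POLL_FLR_TIMEDOUT` is not part of this patch, the constant is assumed to be in milliseconds (consistent with "5s to 10s" in the commit message), and `sketch_poll_flr()` with its simulated clock is a hypothetical helper.

```c
#include <assert.h>

/*
 * Simulates polling for reset completion.
 * timeout_ms:     how long the guest is willing to wait
 * step_ms:        interval between polls
 * ready_after_ms: simulated time at which the host reports completion
 * Returns 0 if completion is seen before the timeout, -1 otherwise.
 */
int sketch_poll_flr(unsigned int timeout_ms, unsigned int step_ms,
		    unsigned int ready_after_ms)
{
	unsigned int elapsed;

	assert(step_ms > 0);

	for (elapsed = 0; ; elapsed += step_ms) {
		/* Completion notification has arrived. */
		if (elapsed >= ready_after_ms)
			return 0;
		/* Waited as long as allowed without seeing completion. */
		if (elapsed >= timeout_ms)
			return -1;
	}
}
```

In this model, a reset that needs 8000 ms to clear the FB times out with a 5000 ms limit but completes with a 10000 ms limit.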
* RE: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF
2021-12-07 16:57 [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Zhigang Luo
` (2 preceding siblings ...)
2021-12-07 16:57 ` [PATCH 4/4] drm/amdgpu: extended waiting SRIOV VF reset completion timeout to 10s Zhigang Luo
@ 2021-12-07 19:14 ` Liu, Shaoyun
2021-12-07 21:55 ` Luo, Zhigang
3 siblings, 1 reply; 7+ messages in thread
From: Liu, Shaoyun @ 2021-12-07 19:14 UTC (permalink / raw)
To: Luo, Zhigang, amd-gfx; +Cc: Luo, Zhigang
[AMD Official Use Only]
This patch looks OK to me.
Patch 2 actually adds the PSP XGMI init, not the whole XGMI init. Can you change the description accordingly?
Patch 3: you take the hive lock inside the reset SRIOV function, but the hive lock has already been taken in the gpu_recovery function before this function is called. Is it really necessary to get the hive inside the reset SRIOV function? Can you try removing the code that checks the hive? Or maybe pass the hive as a parameter into this function if the hive is needed?
Patch 4 looks OK to me, but an SRDC engineer may need to confirm it won't have side effects on other AI ASICs.
Regards
Shaoyun.liu
* RE: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF
2021-12-07 19:14 ` [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF Liu, Shaoyun
@ 2021-12-07 21:55 ` Luo, Zhigang
2021-12-08 2:23 ` Liu, Shaoyun
0 siblings, 1 reply; 7+ messages in thread
From: Luo, Zhigang @ 2021-12-07 21:55 UTC (permalink / raw)
To: Liu, Shaoyun, amd-gfx
[AMD Official Use Only]
Shaoyun, please see my comments inline.
Thanks,
Zhigang
-----Original Message-----
From: Liu, Shaoyun <Shaoyun.Liu@amd.com>
Sent: December 7, 2021 2:15 PM
To: Luo, Zhigang <Zhigang.Luo@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Luo, Zhigang <Zhigang.Luo@amd.com>
Subject: RE: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF
[AMD Official Use Only]
This patch looks OK to me.
Patch 2 actually adds the PSP XGMI init, not the whole XGMI init. Can you change the description accordingly?
[Zhigang] OK. Will change it.
Patch 3: you take the hive lock inside the reset SRIOV function, but the hive lock has already been taken in the gpu_recovery function before this function is called. Is it really necessary to get the hive inside the reset SRIOV function? Can you try removing the code that checks the hive? Or maybe pass the hive as a parameter into this function if the hive is needed?
[Zhigang] In patch 1, we changed gpu_recovery to skip getting the XGMI hive for an SRIOV VF, as we don't want to reset other VFs in the same hive.
Patch 4 looks OK to me, but an SRDC engineer may need to confirm it won't have side effects on other AI ASICs.
Regards
Shaoyun.liu
* RE: [PATCH 1/4] drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF
2021-12-07 21:55 ` Luo, Zhigang
@ 2021-12-08 2:23 ` Liu, Shaoyun
0 siblings, 0 replies; 7+ messages in thread
From: Liu, Shaoyun @ 2021-12-08 2:23 UTC (permalink / raw)
To: Luo, Zhigang, amd-gfx
[AMD Official Use Only]
OK, sounds reasonable. With the suggested modification,
patches 1, 2, and 3 are Reviewed-by: Shaoyun.liu <Shaoyun.liu@amd.com>. Patch 4 is Acked-by: Shaoyun.liu <Shaoyun.liu@amd.com>.
Regards
Shaoyun.liu