* [PATCH 0/4] Some XGMI related gpu reset fixes and cleanups
@ 2020-04-21  5:23 Evan Quan
  2020-04-21  5:23 ` [PATCH 1/4] drm/amdgpu: correct fbdev suspend on gpu reset Evan Quan
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Evan Quan @ 2020-04-21  5:23 UTC
  To: amd-gfx
  Cc: Alexander.Deucher, Andrey.Grodzovsky, Jonathan.Kim, Evan Quan,
	hawking.zhang

Patches 1 and 2 are necessary fixes for XGMI setups: these operations
must also be performed on the other devices in the same hive, which is
currently missing.
Patches 3 and 4 are basically cosmetic code cleanups.

Evan Quan (4):
  drm/amdgpu: correct fbdev suspend on gpu reset
  drm/amdgpu: correct cancel_delayed_work_sync on gpu reset
  drm/amdgpu: optimize the gpu reset for XGMI setup V2
  drm/amdgpu: code cleanup around gpu reset

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 92 ++++++++--------------
 1 file changed, 31 insertions(+), 61 deletions(-)

-- 
2.26.2

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

* [PATCH 1/4] drm/amdgpu: correct fbdev suspend on gpu reset
  2020-04-21  5:23 [PATCH 0/4] Some XGMI related gpu reset fixes and cleanups Evan Quan
@ 2020-04-21  5:23 ` Evan Quan
  2020-04-21  5:23 ` [PATCH 2/4] drm/amdgpu: correct cancel_delayed_work_sync " Evan Quan
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Evan Quan @ 2020-04-21  5:23 UTC
  To: amd-gfx
  Cc: Alexander.Deucher, Andrey.Grodzovsky, Jonathan.Kim, Evan Quan,
	hawking.zhang

For an XGMI setup, the fbdev suspend needs to be performed on
all the devices in the same hive, not only on the current device (adev).
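
A minimal userspace sketch of the kind of bug this one-line change
addresses. Everything here is a hypothetical stand-in and not amdgpu code:
struct toy_dev, suspend_fb() and the two loop helpers are assumptions made
for illustration. Calling a per-device operation on the function argument
inside the loop touches one device only, while calling it on the loop
variable covers every device in the hive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device in a hive; not the real amdgpu_device. */
struct toy_dev {
	bool fb_suspended;
};

/* Stand-in for a per-device operation such as suspending fbdev. */
static void suspend_fb(struct toy_dev *dev)
{
	dev->fb_suspended = true;
}

/* Before: the loop visits every device but always acts on 'adev'. */
static void suspend_hive_buggy(struct toy_dev *hive, size_t n,
			       struct toy_dev *adev)
{
	assert(adev != NULL);
	for (size_t i = 0; i < n; i++) {
		struct toy_dev *tmp_adev = &hive[i];

		(void)tmp_adev;		/* the loop variable is ignored */
		suspend_fb(adev);
	}
}

/* After: the loop acts on the device it is currently visiting. */
static void suspend_hive_fixed(struct toy_dev *hive, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		struct toy_dev *tmp_adev = &hive[i];

		suspend_fb(tmp_adev);
	}
}

/* Count how many devices ended up suspended. */
static size_t count_suspended(const struct toy_dev *hive, size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (hive[i].fb_suspended)
			count++;
	return count;
}
```

With four devices, the first variant leaves three of them untouched, while
the second suspends all four.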

Change-Id: I25e6364d31f0b34938cf424a410628aa54dd2edd
Signed-off-by: Evan Quan <evan.quan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 079c9c5ef381..6cbe5140b873 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4254,7 +4254,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		 */
 		amdgpu_unregister_gpu_instance(tmp_adev);
 
-		amdgpu_fbdev_set_suspend(adev, 1);
+		amdgpu_fbdev_set_suspend(tmp_adev, 1);
 
 		/* disable ras on ALL IPs */
 		if (!(in_ras_intr && !use_baco) &&
-- 
2.26.2

* [PATCH 2/4] drm/amdgpu: correct cancel_delayed_work_sync on gpu reset
  2020-04-21  5:23 [PATCH 0/4] Some XGMI related gpu reset fixes and cleanups Evan Quan
  2020-04-21  5:23 ` [PATCH 1/4] drm/amdgpu: correct fbdev suspend on gpu reset Evan Quan
@ 2020-04-21  5:23 ` Evan Quan
  2020-04-21  5:23 ` [PATCH 3/4] drm/amdgpu: optimize the gpu reset for XGMI setup V2 Evan Quan
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Evan Quan @ 2020-04-21  5:23 UTC
  To: amd-gfx
  Cc: Alexander.Deucher, Andrey.Grodzovsky, Jonathan.Kim, Evan Quan,
	hawking.zhang

For an XGMI setup, cancel_delayed_work_sync() should also be performed
on the other devices in the hive.

Change-Id: I08554c27216efa21c2c46c0b3379d856b5264c9e
Signed-off-by: Evan Quan <evan.quan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 6cbe5140b873..c8fe867d6ee3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4248,6 +4248,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			                amdgpu_amdkfd_pre_reset(tmp_adev);
 		}
 
+		cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
+
 		/*
 		 * Mark these ASICs to be reseted as untracked first
 		 * And add them back after reset completed
-- 
2.26.2

* [PATCH 3/4] drm/amdgpu: optimize the gpu reset for XGMI setup V2
  2020-04-21  5:23 [PATCH 0/4] Some XGMI related gpu reset fixes and cleanups Evan Quan
  2020-04-21  5:23 ` [PATCH 1/4] drm/amdgpu: correct fbdev suspend on gpu reset Evan Quan
  2020-04-21  5:23 ` [PATCH 2/4] drm/amdgpu: correct cancel_delayed_work_sync " Evan Quan
@ 2020-04-21  5:23 ` Evan Quan
  2020-04-21  5:23 ` [PATCH 4/4] drm/amdgpu: code cleanup around gpu reset Evan Quan
  2020-04-21 21:08 ` [PATCH 0/4] Some XGMI related gpu reset fixes and cleanups Andrey Grodzovsky
  4 siblings, 0 replies; 6+ messages in thread
From: Evan Quan @ 2020-04-21  5:23 UTC
  To: amd-gfx
  Cc: Alexander.Deucher, Andrey.Grodzovsky, Jonathan.Kim, Evan Quan,
	hawking.zhang

This is basically a cosmetic code change. The current design for
GPU reset on an XGMI setup is to operate on the current device (adev)
first and then on the other devices in the hive (in another 'for' loop).
But we can instead reorder the device list (putting the current
device in the 1st position) and handle all the devices in a single
'for' loop.
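
The reordering described above can be sketched in userspace as follows.
This is an illustration under stated assumptions, not the driver code: the
list helpers are small re-implementations modelled on the kernel's
struct list_head API, and struct toy_dev and toy_reset_order() are
hypothetical names. Rotating the list moves the list head so that the
chosen entry becomes the first one, after which a single loop visits every
device.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of a circular doubly linked list (illustration only). */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* True if 'entry' is the first element after 'head'. */
static int list_is_first(const struct list_head *entry,
			 const struct list_head *head)
{
	return entry->prev == head;
}

/* Move 'head' so that it sits just before 'entry'; 'entry' becomes first. */
static void list_rotate_to_front(struct list_head *entry,
				 struct list_head *head)
{
	/* unlink the head from the ring */
	head->prev->next = head->next;
	head->next->prev = head->prev;
	/* reinsert the head immediately before 'entry' */
	head->prev = entry->prev;
	head->next = entry;
	entry->prev->next = head;
	entry->prev = head;
}

/* Hypothetical device node; not the real amdgpu_device. */
struct toy_dev {
	int id;
	struct list_head node;
};

#define TOY_DEV_COUNT 4

/*
 * Build a hive of TOY_DEV_COUNT devices, put device 'pick' in the first
 * position, and record the order in which a single loop visits them.
 */
static void toy_reset_order(int pick, int order[TOY_DEV_COUNT])
{
	struct list_head hive;
	struct toy_dev devs[TOY_DEV_COUNT];
	struct list_head *pos;
	int i = 0;

	assert(pick >= 0 && pick < TOY_DEV_COUNT);

	init_list_head(&hive);
	for (int id = 0; id < TOY_DEV_COUNT; id++) {
		devs[id].id = id;
		list_add_tail(&devs[id].node, &hive);
	}

	if (!list_is_first(&devs[pick].node, &hive))
		list_rotate_to_front(&devs[pick].node, &hive);

	for (pos = hive.next; pos != &hive; pos = pos->next) {
		struct toy_dev *dev = (struct toy_dev *)
			((char *)pos - offsetof(struct toy_dev, node));

		order[i++] = dev->id;
	}
}
```

With four devices and 'pick' set to 2, the visiting order becomes
2, 3, 0, 1: the chosen device comes first and the others keep their
relative circular order.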

V2: added missing hive->hive_lock protection
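
A rough single-threaded model of the locking discipline the V2 note refers
to, assuming (as the diff suggests) that hive_lock is held once the hive
has been looked up and therefore has to be dropped on every exit path.
struct toy_lock, struct toy_hive and toy_recover() are hypothetical names;
the toy lock only tracks whether it is held and is not a real mutex.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock that only records whether it is held (no real concurrency). */
struct toy_lock {
	bool held;
};

static bool toy_trylock(struct toy_lock *lock)
{
	if (lock->held)
		return false;
	lock->held = true;
	return true;
}

static void toy_unlock(struct toy_lock *lock)
{
	assert(lock->held);
	lock->held = false;
}

/* Hypothetical hive with the two locks named in the patch. */
struct toy_hive {
	struct toy_lock hive_lock;
	struct toy_lock reset_lock;
};

/*
 * Returns true if the reset ran, false if it bailed out because another
 * reset was already in progress. Both paths leave hive_lock released.
 */
static bool toy_recover(struct toy_hive *hive)
{
	bool got_hive_lock;

	assert(hive != NULL);

	/* models taking hive_lock when the hive is looked up */
	got_hive_lock = toy_trylock(&hive->hive_lock);
	assert(got_hive_lock);
	(void)got_hive_lock;

	if (!toy_trylock(&hive->reset_lock)) {
		/* early exit: hive_lock must not stay held */
		toy_unlock(&hive->hive_lock);
		return false;
	}

	/* ... the reset work would happen here ... */

	toy_unlock(&hive->reset_lock);
	toy_unlock(&hive->hive_lock);
	return true;
}
```

The point of the sketch is only that each return statement is preceded by
a matching release of hive_lock, whether or not reset_lock was obtained.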

Change-Id: I84dca425f1ae778c4b4b8bc3a0d2b9a3d1b50043
Signed-off-by: Evan Quan <evan.quan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 76 +++++++---------------
 1 file changed, 25 insertions(+), 51 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c8fe867d6ee3..b415c1e5ea0d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4182,16 +4182,11 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	}
 
 	need_full_reset = job_signaled = false;
-	INIT_LIST_HEAD(&device_list);
-
-	amdgpu_ras_set_error_query_ready(adev, false);
 
 	dev_info(adev->dev, "GPU %s begin!\n",
 		(in_ras_intr && !use_baco) ? "jobs stop":"reset");
 
-	cancel_delayed_work_sync(&adev->delayed_init_work);
-
-	hive = amdgpu_get_xgmi_hive(adev, false);
+	hive = amdgpu_get_xgmi_hive(adev, true);
 
 	/*
 	 * Here we trylock to avoid chain of resets executing from
@@ -4204,35 +4199,21 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	if (hive && !mutex_trylock(&hive->reset_lock)) {
 		DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
 			  job ? job->base.id : -1, hive->hive_id);
+		mutex_unlock(&hive->hive_lock);
 		return 0;
 	}
 
-	/* Start with adev pre asic reset first for soft reset check.*/
-	if (!amdgpu_device_lock_adev(adev, !hive)) {
-		DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
-			  job ? job->base.id : -1);
-		return 0;
-	}
-
-	/* Block kfd: SRIOV would do it separately */
-	if (!amdgpu_sriov_vf(adev))
-                amdgpu_amdkfd_pre_reset(adev);
-
-	/* Build list of devices to reset */
-	if  (adev->gmc.xgmi.num_physical_nodes > 1) {
-		if (!hive) {
-			/*unlock kfd: SRIOV would do it separately */
-			if (!amdgpu_sriov_vf(adev))
-		                amdgpu_amdkfd_post_reset(adev);
-			amdgpu_device_unlock_adev(adev);
+	/*
+	 * Build list of devices to reset.
+	 * In case we are in XGMI hive mode, resort the device list
+	 * to put adev in the 1st position.
+	 */
+	INIT_LIST_HEAD(&device_list);
+	if (adev->gmc.xgmi.num_physical_nodes > 1) {
+		if (!hive)
 			return -ENODEV;
-		}
-
-		/*
-		 * In case we are in XGMI hive mode device reset is done for all the
-		 * nodes in the hive to retrain all XGMI links and hence the reset
-		 * sequence is executed in loop on all nodes.
-		 */
+		if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
+			list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
 		device_list_handle = &hive->device_list;
 	} else {
 		list_add_tail(&adev->gmc.xgmi.head, &device_list);
@@ -4241,15 +4222,20 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 	/* block all schedulers and reset given job's ring */
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-		if (tmp_adev != adev) {
-			amdgpu_ras_set_error_query_ready(tmp_adev, false);
-			amdgpu_device_lock_adev(tmp_adev, false);
-			if (!amdgpu_sriov_vf(tmp_adev))
-			                amdgpu_amdkfd_pre_reset(tmp_adev);
+		if (!amdgpu_device_lock_adev(tmp_adev, !hive)) {
+			DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
+				  job ? job->base.id : -1);
+			mutex_unlock(&hive->hive_lock);
+			return 0;
 		}
 
+		amdgpu_ras_set_error_query_ready(tmp_adev, false);
+
 		cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
 
+		if (!amdgpu_sriov_vf(tmp_adev))
+			amdgpu_amdkfd_pre_reset(tmp_adev);
+
 		/*
 		 * Mark these ASICs to be reseted as untracked first
 		 * And add them back after reset completed
@@ -4295,22 +4281,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		goto skip_hw_reset;
 	}
 
-
-	/* Guilty job will be freed after this*/
-	r = amdgpu_device_pre_asic_reset(adev, job, &need_full_reset);
-	if (r) {
-		/*TODO Should we stop ?*/
-		DRM_ERROR("GPU pre asic reset failed with err, %d for drm dev, %s ",
-			  r, adev->ddev->unique);
-		adev->asic_reset_res = r;
-	}
-
 retry:	/* Rest of adevs pre asic reset from XGMI hive. */
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-
-		if (tmp_adev == adev)
-			continue;
-
 		r = amdgpu_device_pre_asic_reset(tmp_adev,
 						 NULL,
 						 &need_full_reset);
@@ -4375,8 +4347,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		amdgpu_device_unlock_adev(tmp_adev);
 	}
 
-	if (hive)
+	if (hive) {
 		mutex_unlock(&hive->reset_lock);
+		mutex_unlock(&hive->hive_lock);
+	}
 
 	if (r)
 		dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
-- 
2.26.2

* [PATCH 4/4] drm/amdgpu: code cleanup around gpu reset
  2020-04-21  5:23 [PATCH 0/4] Some XGMI related gpu reset fixes and cleanups Evan Quan
                   ` (2 preceding siblings ...)
  2020-04-21  5:23 ` [PATCH 3/4] drm/amdgpu: optimize the gpu reset for XGMI setup V2 Evan Quan
@ 2020-04-21  5:23 ` Evan Quan
  2020-04-21 21:08 ` [PATCH 0/4] Some XGMI related gpu reset fixes and cleanups Andrey Grodzovsky
  4 siblings, 0 replies; 6+ messages in thread
From: Evan Quan @ 2020-04-21  5:23 UTC
  To: amd-gfx
  Cc: Alexander.Deucher, Andrey.Grodzovsky, Jonathan.Kim, Evan Quan,
	hawking.zhang

Make code more readable.

Change-Id: I28444f285b23aac16be421e3447d0de6c3a57ee8
Signed-off-by: Evan Quan <evan.quan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b415c1e5ea0d..349c8f85fc8c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4160,7 +4160,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			      struct amdgpu_job *job)
 {
 	struct list_head device_list, *device_list_handle =  NULL;
-	bool need_full_reset, job_signaled;
+	bool need_full_reset = false;
+	bool job_signaled = false;
 	struct amdgpu_hive_info *hive = NULL;
 	struct amdgpu_device *tmp_adev = NULL;
 	int i, r = 0;
@@ -4181,13 +4182,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		emergency_restart();
 	}
 
-	need_full_reset = job_signaled = false;
-
 	dev_info(adev->dev, "GPU %s begin!\n",
 		(in_ras_intr && !use_baco) ? "jobs stop":"reset");
 
-	hive = amdgpu_get_xgmi_hive(adev, true);
-
 	/*
 	 * Here we trylock to avoid chain of resets executing from
 	 * either trigger by jobs on different adevs in XGMI hive or jobs on
@@ -4195,7 +4192,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	 * We always reset all schedulers for device and all devices for XGMI
 	 * hive so that should take care of them too.
 	 */
-
+	hive = amdgpu_get_xgmi_hive(adev, true);
 	if (hive && !mutex_trylock(&hive->reset_lock)) {
 		DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
 			  job ? job->base.id : -1, hive->hive_id);
@@ -4262,7 +4259,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		}
 	}
 
-
 	if (in_ras_intr && !use_baco)
 		goto skip_sched_resume;
 
@@ -4273,10 +4269,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	 * job->base holds a reference to parent fence
 	 */
 	if (job && job->base.s_fence->parent &&
-	    dma_fence_is_signaled(job->base.s_fence->parent))
+	    dma_fence_is_signaled(job->base.s_fence->parent)) {
 		job_signaled = true;
-
-	if (job_signaled) {
 		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
 		goto skip_hw_reset;
 	}
-- 
2.26.2

* Re: [PATCH 0/4] Some XGMI related gpu reset fixes and cleanups
  2020-04-21  5:23 [PATCH 0/4] Some XGMI related gpu reset fixes and cleanups Evan Quan
                   ` (3 preceding siblings ...)
  2020-04-21  5:23 ` [PATCH 4/4] drm/amdgpu: code cleanup around gpu reset Evan Quan
@ 2020-04-21 21:08 ` Andrey Grodzovsky
  4 siblings, 0 replies; 6+ messages in thread
From: Andrey Grodzovsky @ 2020-04-21 21:08 UTC
  To: Evan Quan, amd-gfx; +Cc: Alexander.Deucher, Jonathan.Kim, hawking.zhang

Patch 1 Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Patches 2-4 Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Andrey

On 4/21/20 1:23 AM, Evan Quan wrote:
> Patches 1 and 2 are necessary fixes for XGMI setups: these operations
> must also be performed on the other devices in the same hive, which is
> currently missing.
> Patches 3 and 4 are basically cosmetic code cleanups.
>
> Evan Quan (4):
>    drm/amdgpu: correct fbdev suspend on gpu reset
>    drm/amdgpu: correct cancel_delayed_work_sync on gpu reset
>    drm/amdgpu: optimize the gpu reset for XGMI setup V2
>    drm/amdgpu: code cleanup around gpu reset
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 92 ++++++++--------------
>   1 file changed, 31 insertions(+), 61 deletions(-)
>