* RE: [PATCH] drm/amdgpu: fix issue when 2 ring job timeout
       [not found] <20210106124258.31129-1-horace.chen@amd.com>
@ 2021-01-08 11:34 ` Chen, Horace
  2021-01-08 12:17   ` Christian König
  2021-01-11  4:07   ` Quan, Evan
  0 siblings, 2 replies; 3+ messages in thread
From: Chen, Horace @ 2021-01-08 11:34 UTC (permalink / raw)
  To: Chen, Horace, amd-gfx
  Cc: Xiao, Jack, Xu, Feifei, Wang,  Kevin(Yang),
	Tuikov, Luben, Deucher, Alexander, Koenig, Christian, Liu, Monk,
	Zhang, Hawking

[AMD Public Use]

Hi Christian,

Can you help review this change?

This issue happens when 2 jobs on 2 schedulers time out at the same time, which leads 2 threads to enter amdgpu_device_gpu_recover() at the same time. The problem is that if the device is not an XGMI node, adev->gmc.xgmi.head is added to device_list, which is a stack variable.
So the first thread will add the device to its own device list and start iterating it, while the second thread may steal the device away from the first thread and add it to its own device list. This leaves the first thread's iteration in a bad state.

The solution is to lock the device early, before we add the device to the local device list.
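
To make the race concrete, here is a minimal sketch (illustration only, not part of the patch; it relies only on the usual <linux/list.h> semantics):

	/* Two TDR threads, each with a stack-local list head. */
	struct list_head device_list_a;		/* on thread A's stack */
	struct list_head device_list_b;		/* on thread B's stack */

	/* Thread A publishes the device on its list and starts iterating. */
	INIT_LIST_HEAD(&device_list_a);
	list_add_tail(&adev->gmc.xgmi.head, &device_list_a);
	/* list_for_each_entry(tmp_adev, &device_list_a, gmc.xgmi.head) ... */

	/* Thread B races in and re-links the very same embedded node. */
	INIT_LIST_HEAD(&device_list_b);
	list_add_tail(&adev->gmc.xgmi.head, &device_list_b);

	/*
	 * adev->gmc.xgmi.head->next now points at thread B's stack-local
	 * head, so thread A's iteration follows pointers it no longer owns
	 * once thread B's handler returns -> wild pointer access.
	 */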

Thanks & Regards,
Horace.

-----Original Message-----
From: Horace Chen <horace.chen@amd.com> 
Sent: Wednesday, January 6, 2021 8:43 PM
To: amd-gfx@lists.freedesktop.org
Cc: Chen, Horace <Horace.Chen@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Xiao, Jack <Jack.Xiao@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Xu, Feifei <Feifei.Xu@amd.com>; Wang, Kevin(Yang) <Kevin1.Wang@amd.com>; Xiaojie Yuan <xiaojie.yuan@amd.com>
Subject: [PATCH] drm/amdgpu: fix issue when 2 ring job timeout

Fix a race when jobs on 2 rings time out simultaneously.

If 2 rings time out at the same time,
amdgpu_device_gpu_recover() will be reentered. Then
adev->gmc.xgmi.head will be grabbed by 2 local linked lists,
which may cause wild pointer access while iterating.

Lock the device early to prevent the node from being added to 2
different lists.

Signed-off-by: Horace Chen <horace.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 ++++++++++++++++------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9a3cb98d03be..233dae27c8eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4620,23 +4620,34 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	if (adev->gmc.xgmi.num_physical_nodes > 1) {
 		if (!hive)
 			return -ENODEV;
+
+		list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
+			if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
+				dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
+						job ? job->base.id : -1);
+				r = 0;
+				goto skip_recovery;
+			}
+		}
+
 		if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
 			list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
 		device_list_handle = &hive->device_list;
 	} else {
+		/* if current dev is already in reset, skip adding list to prevent race issue */
+		if (!amdgpu_device_lock_adev(adev, hive)) {
+			dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
+					job ? job->base.id : -1);
+			r = 0;
+			goto skip_recovery;
+		}
+
 		list_add_tail(&adev->gmc.xgmi.head, &device_list);
 		device_list_handle = &device_list;
 	}
 
 	/* block all schedulers and reset given job's ring */
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-		if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
-			dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
-				  job ? job->base.id : -1);
-			r = 0;
-			goto skip_recovery;
-		}
-
 		/*
 		 * Try to put the audio codec into suspend state
 		 * before gpu reset started.
-- 
2.17.1
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH] drm/amdgpu: fix issue when 2 ring job timeout
  2021-01-08 11:34 ` [PATCH] drm/amdgpu: fix issue when 2 ring job timeout Chen, Horace
@ 2021-01-08 12:17   ` Christian König
  2021-01-11  4:07   ` Quan, Evan
  1 sibling, 0 replies; 3+ messages in thread
From: Christian König @ 2021-01-08 12:17 UTC (permalink / raw)
  To: Chen, Horace, amd-gfx
  Cc: Grodzovsky, Andrey, Xiao, Jack, Xu, Feifei, Wang, Kevin(Yang),
	Tuikov, Luben, Deucher, Alexander, Liu, Monk, Zhang, Hawking

Hi Horace,

Andrey needs to take a look here since I haven't worked on that code in 
years.

In general I still suggest unifying the reset code for the hive and 
non-hive code paths, which would solve this quite gracefully.
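
One way such a unification could look, just to make the idea concrete (a 
rough sketch only, not a tested patch; amdgpu_device_recovery_prepare() 
is an invented name):

	/*
	 * Invented helper: try-lock every device that will be reset and
	 * hand back the list to iterate, identically for hive and
	 * non-hive.  Unwinding of already-locked devices on failure is
	 * left to the caller in this sketch.
	 */
	static struct list_head *
	amdgpu_device_recovery_prepare(struct amdgpu_device *adev,
				       struct amdgpu_hive_info *hive,
				       struct list_head *local_list)
	{
		struct amdgpu_device *tmp_adev;

		if (!hive) {
			/* single GPU: lock first, only then publish the node */
			if (!amdgpu_device_lock_adev(adev, NULL))
				return NULL;
			list_add_tail(&adev->gmc.xgmi.head, local_list);
			return local_list;
		}

		list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head)
			if (!amdgpu_device_lock_adev(tmp_adev, hive))
				return NULL;
		return &hive->device_list;
	}

amdgpu_device_gpu_recover() would then take the returned list as 
device_list_handle and keep a single reset loop for both cases.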

Regards,
Christian.

Am 08.01.21 um 12:34 schrieb Chen, Horace:
> [AMD Public Use]
>
> Hi Christian,
>
> Can you help review this change?
>
> This issue happens when 2 jobs on 2 schedulers time out at the same time, which leads 2 threads to enter amdgpu_device_gpu_recover() at the same time. The problem is that if the device is not an XGMI node, adev->gmc.xgmi.head is added to device_list, which is a stack variable.
> So the first thread will add the device to its own device list and start iterating it, while the second thread may steal the device away from the first thread and add it to its own device list. This leaves the first thread's iteration in a bad state.
>
> The solution is to lock the device early, before we add the device to the local device list.
>
> Thanks & Regards,
> Horace.
>
> -----Original Message-----
> From: Horace Chen <horace.chen@amd.com>
> Sent: Wednesday, January 6, 2021 8:43 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Chen, Horace <Horace.Chen@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Xiao, Jack <Jack.Xiao@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Xu, Feifei <Feifei.Xu@amd.com>; Wang, Kevin(Yang) <Kevin1.Wang@amd.com>; Xiaojie Yuan <xiaojie.yuan@amd.com>
> Subject: [PATCH] drm/amdgpu: fix issue when 2 ring job timeout
>
> Fix a race when jobs on 2 rings time out simultaneously.
>
> If 2 rings time out at the same time,
> amdgpu_device_gpu_recover() will be reentered. Then
> adev->gmc.xgmi.head will be grabbed by 2 local linked lists,
> which may cause wild pointer access while iterating.
>
> Lock the device early to prevent the node from being added to 2
> different lists.
>
> Signed-off-by: Horace Chen <horace.chen@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 ++++++++++++++++------
>   1 file changed, 18 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 9a3cb98d03be..233dae27c8eb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4620,23 +4620,34 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	if (adev->gmc.xgmi.num_physical_nodes > 1) {
>   		if (!hive)
>   			return -ENODEV;
> +
> +		list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
> +			if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
> +				dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> +						job ? job->base.id : -1);
> +				r = 0;
> +				goto skip_recovery;
> +			}
> +		}
> +
>   		if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
>   			list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
>   		device_list_handle = &hive->device_list;
>   	} else {
> +		/* if current dev is already in reset, skip adding list to prevent race issue */
> +		if (!amdgpu_device_lock_adev(adev, hive)) {
> +			dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> +					job ? job->base.id : -1);
> +			r = 0;
> +			goto skip_recovery;
> +		}
> +
>   		list_add_tail(&adev->gmc.xgmi.head, &device_list);
>   		device_list_handle = &device_list;
>   	}
>   
>   	/* block all schedulers and reset given job's ring */
>   	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> -		if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
> -			dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
> -				  job ? job->base.id : -1);
> -			r = 0;
> -			goto skip_recovery;
> -		}
> -
>   		/*
>   		 * Try to put the audio codec into suspend state
>   		 * before gpu reset started.

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* RE: [PATCH] drm/amdgpu: fix issue when 2 ring job timeout
  2021-01-08 11:34 ` [PATCH] drm/amdgpu: fix issue when 2 ring job timeout Chen, Horace
  2021-01-08 12:17   ` Christian König
@ 2021-01-11  4:07   ` Quan, Evan
  1 sibling, 0 replies; 3+ messages in thread
From: Quan, Evan @ 2021-01-11  4:07 UTC (permalink / raw)
  To: Chen, Horace, Chen, Horace, amd-gfx
  Cc: Xiao, Jack, Xu, Feifei, Wang,  Kevin(Yang),
	Tuikov, Luben, Deucher, Alexander, Koenig, Christian, Liu, Monk,
	Zhang, Hawking

Hi Horace,

The XGMI part should already be well protected by hive->hive_lock, so I think you only need the non-XGMI part.
Also, it seems better to place the modification within the hive check:
        /*
         * Here we trylock to avoid chain of resets executing from
         * either trigger by jobs on different adevs in XGMI hive or jobs on
         * different schedulers for same device while this TO handler is running.
         * We always reset all schedulers for device and all devices for XGMI
         * hive so that should take care of them too.
         */
        hive = amdgpu_get_xgmi_hive(adev);
        if (hive) {
                if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
                        DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
                                job ? job->base.id : -1, hive->hive_id);
                        amdgpu_put_xgmi_hive(hive);
                        return 0;
                }
                mutex_lock(&hive->hive_lock);
        } else {
+		/* if current dev is already in reset, skip adding list to prevent race issue */
+		if (!amdgpu_device_lock_adev(adev, hive)) {
+			dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
+					job ? job->base.id : -1);
+			r = 0;
+			goto skip_recovery;
+		}
}
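
The later list handling should then be able to stay as it is; only the lock moves up, so by the time list_add_tail() runs the device is already held (sketch of the unchanged part, for context only):

        if (adev->gmc.xgmi.num_physical_nodes > 1) {
                ...
                device_list_handle = &hive->device_list;
        } else {
                /* adev was already try-locked in the hive check above */
                list_add_tail(&adev->gmc.xgmi.head, &device_list);
                device_list_handle = &device_list;
        }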

BR
Evan
-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Chen, Horace
Sent: Friday, January 8, 2021 7:35 PM
To: Chen, Horace <Horace.Chen@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Xiao, Jack <Jack.Xiao@amd.com>; Xu, Feifei <Feifei.Xu@amd.com>; Wang, Kevin(Yang) <Kevin1.Wang@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
Subject: RE: [PATCH] drm/amdgpu: fix issue when 2 ring job timeout

[AMD Public Use]

Hi Christian,

Can you help review this change?

This issue happens when 2 jobs on 2 schedulers time out at the same time, which leads 2 threads to enter amdgpu_device_gpu_recover() at the same time. The problem is that if the device is not an XGMI node, adev->gmc.xgmi.head is added to device_list, which is a stack variable.
So the first thread will add the device to its own device list and start iterating it, while the second thread may steal the device away from the first thread and add it to its own device list. This leaves the first thread's iteration in a bad state.

The solution is to lock the device early, before we add the device to the local device list.

Thanks & Regards,
Horace.

-----Original Message-----
From: Horace Chen <horace.chen@amd.com> 
Sent: Wednesday, January 6, 2021 8:43 PM
To: amd-gfx@lists.freedesktop.org
Cc: Chen, Horace <Horace.Chen@amd.com>; Tuikov, Luben <Luben.Tuikov@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Xiao, Jack <Jack.Xiao@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Xu, Feifei <Feifei.Xu@amd.com>; Wang, Kevin(Yang) <Kevin1.Wang@amd.com>; Xiaojie Yuan <xiaojie.yuan@amd.com>
Subject: [PATCH] drm/amdgpu: fix issue when 2 ring job timeout

Fix a race when jobs on 2 rings time out simultaneously.

If 2 rings time out at the same time,
amdgpu_device_gpu_recover() will be reentered. Then
adev->gmc.xgmi.head will be grabbed by 2 local linked lists,
which may cause wild pointer access while iterating.

Lock the device early to prevent the node from being added to 2
different lists.

Signed-off-by: Horace Chen <horace.chen@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 ++++++++++++++++------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9a3cb98d03be..233dae27c8eb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4620,23 +4620,34 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	if (adev->gmc.xgmi.num_physical_nodes > 1) {
 		if (!hive)
 			return -ENODEV;
+
+		list_for_each_entry(tmp_adev, &hive->device_list, gmc.xgmi.head) {
+			if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
+				dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
+						job ? job->base.id : -1);
+				r = 0;
+				goto skip_recovery;
+			}
+		}
+
 		if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
 			list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
 		device_list_handle = &hive->device_list;
 	} else {
+		/* if current dev is already in reset, skip adding list to prevent race issue */
+		if (!amdgpu_device_lock_adev(adev, hive)) {
+			dev_info(adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
+					job ? job->base.id : -1);
+			r = 0;
+			goto skip_recovery;
+		}
+
 		list_add_tail(&adev->gmc.xgmi.head, &device_list);
 		device_list_handle = &device_list;
 	}
 
 	/* block all schedulers and reset given job's ring */
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-		if (!amdgpu_device_lock_adev(tmp_adev, hive)) {
-			dev_info(tmp_adev->dev, "Bailing on TDR for s_job:%llx, as another already in progress",
-				  job ? job->base.id : -1);
-			r = 0;
-			goto skip_recovery;
-		}
-
 		/*
 		 * Try to put the audio codec into suspend state
 		 * before gpu reset started.
-- 
2.17.1
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

