From: Christian König
Subject: Re: [PATCH] drm/amdgpu: Fix the dead lock issue.
Date: Tue, 11 Sep 2018 08:40:29 +0200
References: <1536634293-26099-1-git-send-email-Emily.Deng@amd.com> <83f7a45f-1aee-a2a8-bc82-f3433157c6cb@amd.com>
In-Reply-To: <83f7a45f-1aee-a2a8-bc82-f3433157c6cb-5C7GfCeVMHo@public.gmane.org>
Reply-To: christian.koenig-5C7GfCeVMHo@public.gmane.org
List-Id: Discussion list for AMD gfx
To: zhoucm1, "Deng, Emily", "Zhou, David(ChunMing)", amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org

That won't work correctly. The TTM BO is unreferenced in a couple more places that we don't have control over.

To make matters worse, we can't actually take the reservation lock during GPU reset, because the reservation object might already be destroyed when we remove the BO from the list.
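
To illustrate the hazard: a list that holds no reference of its own can still contain an object whose last reference has already been dropped. Below is a minimal userspace sketch of the "take a reference only if the count is not yet zero" idiom, analogous to the kernel's kref_get_unless_zero(). The names demo_bo, demo_bo_get_unless_zero and demo_bo_put are hypothetical, and the sketch assumes the object's memory stays valid while the walker holds the list lock. It is not the solution referred to in this mail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a buffer object; not the amdgpu_bo layout. */
struct demo_bo {
	atomic_int refcount;	/* 0 means teardown has already started */
};

/*
 * Take a reference only if the object is still alive. Returns false when
 * the count has already dropped to zero, in which case the caller must
 * skip the object instead of touching it.
 */
static bool demo_bo_get_unless_zero(struct demo_bo *bo)
{
	int old = atomic_load(&bo->refcount);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&bo->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when this was the last one. */
static bool demo_bo_put(struct demo_bo *bo)
{
	int old = atomic_fetch_sub(&bo->refcount, 1);

	assert(old > 0);	/* dropping a reference nobody holds is a bug */
	return old == 1;
}
```

A list walker would call demo_bo_get_unless_zero() on each entry while holding the list lock and skip any entry for which it returns false.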

I will take a look at this myself today and try to find a solution that works.

Regards,
Christian.

On 11.09.2018 at 07:41, zhoucm1 wrote:


On 2018-09-11 11:37, zhoucm1 wrote:


On 2018-09-11 11:32, Deng, Emily wrote:
-----Original Message-----
From: amd-gfx <amd-gfx-bounces-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org> On Behalf Of zhoucm1
Sent: Tuesday, September 11, 2018 11:28 AM
To: Deng, Emily <Emily.Deng-5C7GfCeVMHo@public.gmane.org>; Zhou, David(ChunMing) <David1.Zhou-5C7GfCeVMHo@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: Re: [PATCH] drm/amdgpu: Fix the dead lock issue.



On 2018-09-11 11:23, Deng, Emily wrote:
-----Original Message-----
From: Zhou, David(ChunMing)
Sent: Tuesday, September 11, 2018 11:03 AM
To: Deng, Emily <Emily.Deng-5C7GfCeVMHo@public.gmane.org>; amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
Subject: Re: [PATCH] drm/amdgpu: Fix the dead lock issue.



On 2018-09-11 10:51, Emily Deng wrote:
A deadlock occurs randomly when testing TDR:
1. amdgpu_device_handle_vram_lost takes shadow_list_lock.
2. amdgpu_bo_create takes the BO's resv lock.
3. amdgpu_bo_create_shadow waits for shadow_list_lock.
4. amdgpu_device_recover_vram_from_shadow waits for the BO's resv lock.
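
The four steps above are a classic ABBA lock-order inversion: one path nests the resv lock inside shadow_list_lock, the other nests them the other way round. As a rough illustration (similar in spirit to what the kernel's lockdep validator checks, but far simpler), here is a minimal sketch that records which lock is taken while another is held and reports when the opposite order has already been seen. The enum values and demo_record_order are hypothetical names, and the sketch only catches cycles between two locks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ids for the two locks named in the report. */
enum demo_lock { DEMO_SHADOW_LIST_LOCK, DEMO_BO_RESV, DEMO_NR_LOCKS };

/* before[a][b] is true once some path took b while already holding a. */
struct demo_order {
	bool before[DEMO_NR_LOCKS][DEMO_NR_LOCKS];
};

/*
 * Record that a path acquires 'wanted' while holding 'held'. Returns true
 * if the opposite order was recorded earlier, i.e. two paths nest the
 * same pair of locks in opposite order and can deadlock (ABBA).
 */
static bool demo_record_order(struct demo_order *o,
			      enum demo_lock held, enum demo_lock wanted)
{
	assert(held != wanted);
	o->before[held][wanted] = true;
	return o->before[wanted][held];
}
```

Recording the recovery path (shadow_list_lock, then resv) followed by the creation path (resv, then shadow_list_lock) makes the second call return true, which is the inversion described in steps 1 to 4.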

v2:
      Make a local copy of the list
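
For readers unfamiliar with the pattern, "make a local copy of the list" generally means detaching the shared list under its lock, working on the private copy, and splicing it back afterwards. A minimal userspace sketch with a pthread mutex and a singly linked list follows. demo_dev, demo_node, demo_take_all and demo_give_back are hypothetical names, and the sketch assumes nobody frees or unlinks a detached node in the meantime, which is exactly the assumption questioned for BOs later in this thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct demo_node {
	struct demo_node *next;
	int id;
};

struct demo_dev {
	pthread_mutex_t list_lock;
	struct demo_node *shadow_list;	/* protected by list_lock */
};

/* Detach the whole shared list under the lock and hand it to the caller. */
static struct demo_node *demo_take_all(struct demo_dev *dev)
{
	struct demo_node *local;

	pthread_mutex_lock(&dev->list_lock);
	local = dev->shadow_list;
	dev->shadow_list = NULL;
	pthread_mutex_unlock(&dev->list_lock);
	return local;
}

/* Put the private list back in front of whatever was added meanwhile. */
static void demo_give_back(struct demo_dev *dev, struct demo_node *local)
{
	struct demo_node *tail = local;

	if (!local)
		return;
	while (tail->next)
		tail = tail->next;

	pthread_mutex_lock(&dev->list_lock);
	tail->next = dev->shadow_list;
	dev->shadow_list = local;
	pthread_mutex_unlock(&dev->list_lock);
}
```

Between demo_take_all() and demo_give_back() the caller can walk the private list without holding list_lock, so a path that takes another lock first and list_lock second is no longer blocked by the walker.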

Signed-off-by: Emily Deng <Emily.Deng-5C7GfCeVMHo@public.gmane.org>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 21 ++++++++++++++++++++-
    1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 2a21267..8c81404 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3105,6 +3105,9 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
        long r = 1;
        int i = 0;
        long tmo;
+    struct list_head local_shadow_list;
+
+    INIT_LIST_HEAD(&local_shadow_list);

        if (amdgpu_sriov_runtime(adev))
            tmo = msecs_to_jiffies(8000);
@@ -3112,8 +3115,19 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)
            tmo = msecs_to_jiffies(100);

        DRM_INFO("recover vram bo from shadow start\n");
+
+    mutex_lock(&adev->shadow_list_lock);
+    list_splice_init(&adev->shadow_list, &local_shadow_list);
+    mutex_unlock(&adev->shadow_list_lock);
+
+
        mutex_lock(&adev->shadow_list_lock);
>>>>>> local_shadow_list is a local variable, so I don't think it needs a lock
>>>>>> at all; no one else changes it. Otherwise looks good to me.
>>>>> A BO whose bo->shadow_list entry is now on local_shadow_list may be destroyed
>>>>> if it is already in amdgpu_bo_destroy, which would then modify
>>>>> local_shadow_list, so shadow_list_lock is still needed.
>>>> Ah, sorry for the noise, I forgot that you don't reference these BOs.
>>> Yes, I don't reference these BOs, because I found that even with a reference taken it still can't avoid the case where another process is already
>>> in amdgpu_bo_destroy.
>> ??? That shouldn't happen, the reference belongs to the list. But back to the point here: we don't need to reference them.
>> And since no shadow BO is added to the local list after the splice, we'd better use list_next_entry to iterate the local shadow list instead of list_for_each_entry_safe.
>>
>> Thanks,
>> David Zhou
>>>> Thanks,
>>>> David Zhou
>>>>> Best wishes
>>>>> Emily Deng
>>>>>> Thanks,
>>>>>> David Zhou
-    list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
+    list_for_each_entry_safe(bo, tmp, &local_shadow_list, shadow_list) {
> Because the shadow list doesn't take a BO reference, we should take one with amdgpu_bo_ref(bo) before unlocking, as in the attached patch.
> You can give it a try.
>
> Thanks,
> David Zhou
+        mutex_unlock(&adev->shadow_list_lock);
+
+        if (!bo)
+            continue;
+
            next = NULL;
            amdgpu_device_recover_vram_from_shadow(adev, ring, bo, &next);
            if (fence) {
@@ -3132,9 +3146,14 @@ static int amdgpu_device_handle_vram_lost(struct amdgpu_device *adev)

            dma_fence_put(fence);
            fence = next;
+        mutex_lock(&adev->shadow_list_lock);
        }
        mutex_unlock(&adev->shadow_list_lock);

+    mutex_lock(&adev->shadow_list_lock);
+    list_splice_init(&local_shadow_list, &adev->shadow_list);
+    mutex_unlock(&adev->shadow_list_lock);
+
        if (fence) {
            r = dma_fence_wait_timeout(fence, false, tmo);
            if (r == 0)
_______________________________________________
amd-gfx mailing list
amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

