All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: Fix a BUG_ON due to resv trylock fails
@ 2021-05-22  2:11 xinhui pan
  2021-05-22  2:57 ` Felix Kuehling
  0 siblings, 1 reply; 3+ messages in thread
From: xinhui pan @ 2021-05-22  2:11 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, Felix.Kuehling, xinhui pan, christian.koenig

The reservation object might be locked again by evict/swap after
individualized. The race is like below.
cpu 0                                   cpu 1
BO release				BO evict or swap
					lock lru_lock
ttm_bo_individualize_resv {resv = &_resv}
                                        ttm_bo_evict_swapout_allowable
                                                dma_resv_trylock(resv)
->release_notify() {BUG_ON(!trylock(resv))}
                                        if (!ttm_bo_get_unless_zero))
                                                dma_resv_unlock(resv)
					unlock lru_lock
To fix it simply, let's acquire lru_lock before resv trylock to avoid
the race above.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 928e8d57cd08..8f6da0034db9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -318,7 +318,9 @@ int amdgpu_amdkfd_remove_fence_on_pt_pd_bos(struct amdgpu_bo *bo)
 	ef = container_of(dma_fence_get(&info->eviction_fence->base),
 			struct amdgpu_amdkfd_fence, base);
 
+	spin_lock(&bo->tbo.bdev->lru_lock);
 	BUG_ON(!dma_resv_trylock(bo->tbo.base.resv));
+	spin_unlock(&bo->tbo.bdev->lru_lock);
 	ret = amdgpu_amdkfd_remove_eviction_fence(bo, ef);
 	dma_resv_unlock(bo->tbo.base.resv);
 
-- 
2.25.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix a BUG_ON due to resv trylock fails
  2021-05-22  2:11 [PATCH] drm/amdgpu: Fix a BUG_ON due to resv trylock fails xinhui pan
@ 2021-05-22  2:57 ` Felix Kuehling
  2021-05-23 16:55   ` Christian König
  0 siblings, 1 reply; 3+ messages in thread
From: Felix Kuehling @ 2021-05-22  2:57 UTC (permalink / raw)
  To: xinhui pan, amd-gfx; +Cc: alexander.deucher, christian.koenig

When the BO gets individualized, there is an assumption that nobody is
accessing it any more. See this comment in ttm_bo_individualize_resv:

                /* This works because the BO is about to be destroyed and nobody
                 * reference it any more. The only tricky case is the trylock on
                 * the resv object while holding the lru_lock.
                 */

That is violated when the BO is still being swapped out at this stage.
You can kind of paper that over by taking the LRU lock. But there are
probably other race conditions going on when the reservation gets
swapped by "individualize" during an eviction.

I think to avoid all that TTM needs to make sure that the BO is no
longer on the LRU list when it gets individualized.

Regards,
  Felix


Am 2021-05-21 um 10:11 p.m. schrieb xinhui pan:
> The reservation object might be locked again by evict/swap after
> individualized. The race is like below.
> cpu 0                                   cpu 1
> BO release				BO evict or swap
> 					lock lru_lock
> ttm_bo_individualize_resv {resv = &_resv}
>                                         ttm_bo_evict_swapout_allowable
>                                                 dma_resv_trylock(resv)
> ->release_notify() {BUG_ON(!trylock(resv))}
>                                         if (!ttm_bo_get_unless_zero))
>                                                 dma_resv_unlock(resv)
> 					unlock lru_lock
> To fix it simply, let's acquire lru_lock before resv trylock to avoid
> the race above.
>
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 928e8d57cd08..8f6da0034db9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -318,7 +318,9 @@ int amdgpu_amdkfd_remove_fence_on_pt_pd_bos(struct amdgpu_bo *bo)
>  	ef = container_of(dma_fence_get(&info->eviction_fence->base),
>  			struct amdgpu_amdkfd_fence, base);
>  
> +	spin_lock(&bo->tbo.bdev->lru_lock);
>  	BUG_ON(!dma_resv_trylock(bo->tbo.base.resv));
> +	spin_unlock(&bo->tbo.bdev->lru_lock);
>  	ret = amdgpu_amdkfd_remove_eviction_fence(bo, ef);
>  	dma_resv_unlock(bo->tbo.base.resv);
>  
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix a BUG_ON due to resv trylock fails
  2021-05-22  2:57 ` Felix Kuehling
@ 2021-05-23 16:55   ` Christian König
  0 siblings, 0 replies; 3+ messages in thread
From: Christian König @ 2021-05-23 16:55 UTC (permalink / raw)
  To: Felix Kuehling, xinhui pan, amd-gfx; +Cc: alexander.deucher, christian.koenig

Am 22.05.21 um 04:57 schrieb Felix Kuehling:
> When the BO gets individualized, there is an assumption that nobody is
> accessing it any more. See this comment in ttm_bo_individualize_resv:
>
>                  /* This works because the BO is about to be destroyed and nobody
>                   * reference it any more. The only tricky case is the trylock on
>                   * the resv object while holding the lru_lock.
>                   */
>
> That is violated when the BO is still being swapped out at this stage.
> You can kind of paper that over by taking the LRU lock. But there are
> probably other race conditions going on when the reservation gets
> swapped by "individualize" during an eviction.
>
> I think to avoid all that TTM needs to make sure that the BO is no
> longer on the LRU list when it gets individualized.

Exactly that is what we try to avoid because we want to keep the 
position on the LRU the same.

Individualizing works because the swap and evict code ignores BOs with 
zero reference count.

So yes I think Xinhui's patch should work fine.

Regards,
Christian.

>
> Regards,
>    Felix
>
>
> Am 2021-05-21 um 10:11 p.m. schrieb xinhui pan:
>> The reservation object might be locked again by evict/swap after
>> individualized. The race is like below.
>> cpu 0                                   cpu 1
>> BO release				BO evict or swap
>> 					lock lru_lock
>> ttm_bo_individualize_resv {resv = &_resv}
>>                                          ttm_bo_evict_swapout_allowable
>>                                                  dma_resv_trylock(resv)
>> ->release_notify() {BUG_ON(!trylock(resv))}
>>                                          if (!ttm_bo_get_unless_zero))
>>                                                  dma_resv_unlock(resv)
>> 					unlock lru_lock
>> To fix it simply, let's acquire lru_lock before resv trylock to avoid
>> the race above.
>>
>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 928e8d57cd08..8f6da0034db9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -318,7 +318,9 @@ int amdgpu_amdkfd_remove_fence_on_pt_pd_bos(struct amdgpu_bo *bo)
>>   	ef = container_of(dma_fence_get(&info->eviction_fence->base),
>>   			struct amdgpu_amdkfd_fence, base);
>>   
>> +	spin_lock(&bo->tbo.bdev->lru_lock);
>>   	BUG_ON(!dma_resv_trylock(bo->tbo.base.resv));
>> +	spin_unlock(&bo->tbo.bdev->lru_lock);
>>   	ret = amdgpu_amdkfd_remove_eviction_fence(bo, ef);
>>   	dma_resv_unlock(bo->tbo.base.resv);
>>   
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-05-23 16:55 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-22  2:11 [PATCH] drm/amdgpu: Fix a BUG_ON due to resv trylock fails xinhui pan
2021-05-22  2:57 ` Felix Kuehling
2021-05-23 16:55   ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.