* [PATCH] drm/amdgpu: Unpin MMIO and DOORBELL BOs only after map count goes to zero
@ 2022-06-08 11:51 Ramesh Errabolu
  2022-06-08 19:39 ` Felix Kuehling
  0 siblings, 1 reply; 5+ messages in thread
From: Ramesh Errabolu @ 2022-06-08 11:51 UTC (permalink / raw)
  To: amd-gfx; +Cc: Ramesh Errabolu

The existing code unpins MMIO and DOORBELL BOs without first checking
that their map count has reached zero. Unpinning without this check
can lead to an error while the BO is being freed. This patch moves the
unpin so that it happens only after the check has passed.

Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index a1de900ba677..e5dc94b745b1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1832,13 +1832,6 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
 
 	mutex_lock(&mem->lock);
 
-	/* Unpin MMIO/DOORBELL BO's that were pinned during allocation */
-	if (mem->alloc_flags &
-	    (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
-	     KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP)) {
-		amdgpu_amdkfd_gpuvm_unpin_bo(mem->bo);
-	}
-
 	mapped_to_gpu_memory = mem->mapped_to_gpu_memory;
 	is_imported = mem->is_imported;
 	mutex_unlock(&mem->lock);
@@ -1855,7 +1848,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
 	/* Make sure restore workers don't access the BO any more */
 	bo_list_entry = &mem->validate_list;
 	mutex_lock(&process_info->lock);
-	list_del(&bo_list_entry->head);
+	list_del_init(&bo_list_entry->head);
 	mutex_unlock(&process_info->lock);
 
 	/* No more MMU notifiers */
@@ -1880,6 +1873,12 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
 
 	ret = unreserve_bo_and_vms(&ctx, false, false);
 
+	/* Unpin MMIO/DOORBELL BO's that were pinned during allocation */
+	if (mem->alloc_flags &
+	    (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
+	     KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP))
+		amdgpu_amdkfd_gpuvm_unpin_bo(mem->bo);
+
 	/* Free the sync object */
 	amdgpu_sync_free(&mem->sync);
 
-- 
2.35.1



* Re: [PATCH] drm/amdgpu: Unpin MMIO and DOORBELL BOs only after map count goes to zero
  2022-06-08 11:51 [PATCH] drm/amdgpu: Unpin MMIO and DOORBELL BOs only after map count goes to zero Ramesh Errabolu
@ 2022-06-08 19:39 ` Felix Kuehling
  2022-06-08 20:03   ` Errabolu, Ramesh
  0 siblings, 1 reply; 5+ messages in thread
From: Felix Kuehling @ 2022-06-08 19:39 UTC (permalink / raw)
  To: amd-gfx, Errabolu, Ramesh


On 2022-06-08 07:51, Ramesh Errabolu wrote:
> In existing code MMIO and DOORBELL BOs are unpinned without ensuring the
> condition that their map count has reached zero. Unpinning without checking
> this constraint could lead to an error while BO is being freed. The patch
> fixes this issue.
>
> Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 15 +++++++--------
>   1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index a1de900ba677..e5dc94b745b1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1832,13 +1832,6 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>   
>   	mutex_lock(&mem->lock);
>   
> -	/* Unpin MMIO/DOORBELL BO's that were pinned during allocation */
> -	if (mem->alloc_flags &
> -	    (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
> -	     KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP)) {
> -		amdgpu_amdkfd_gpuvm_unpin_bo(mem->bo);
> -	}
> -
>   	mapped_to_gpu_memory = mem->mapped_to_gpu_memory;
>   	is_imported = mem->is_imported;
>   	mutex_unlock(&mem->lock);
> @@ -1855,7 +1848,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>   	/* Make sure restore workers don't access the BO any more */
>   	bo_list_entry = &mem->validate_list;
>   	mutex_lock(&process_info->lock);
> -	list_del(&bo_list_entry->head);
> +	list_del_init(&bo_list_entry->head);

Is this an unrelated fix? What is this needed for? I vaguely remember 
discussing this before, but can't remember the reason.

Regards,
   Felix


>   	mutex_unlock(&process_info->lock);
>   
>   	/* No more MMU notifiers */
> @@ -1880,6 +1873,12 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>   
>   	ret = unreserve_bo_and_vms(&ctx, false, false);
>   
> +	/* Unpin MMIO/DOORBELL BO's that were pinned during allocation */
> +	if (mem->alloc_flags &
> +	    (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
> +	     KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP))
> +		amdgpu_amdkfd_gpuvm_unpin_bo(mem->bo);
> +
>   	/* Free the sync object */
>   	amdgpu_sync_free(&mem->sync);
>   


* RE: [PATCH] drm/amdgpu: Unpin MMIO and DOORBELL BOs only after map count goes to zero
  2022-06-08 19:39 ` Felix Kuehling
@ 2022-06-08 20:03   ` Errabolu, Ramesh
  2022-06-08 20:44     ` Felix Kuehling
  0 siblings, 1 reply; 5+ messages in thread
From: Errabolu, Ramesh @ 2022-06-08 20:03 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx

[AMD Official Use Only - General]

My response is inline.

Regards,
Ramesh

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@amd.com> 
Sent: Thursday, June 9, 2022 1:10 AM
To: amd-gfx@lists.freedesktop.org; Errabolu, Ramesh <Ramesh.Errabolu@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Unpin MMIO and DOORBELL BOs only after map count goes to zero


On 2022-06-08 07:51, Ramesh Errabolu wrote:
> In existing code MMIO and DOORBELL BOs are unpinned without ensuring 
> the condition that their map count has reached zero. Unpinning without 
> checking this constraint could lead to an error while BO is being 
> freed. The patch fixes this issue.
>
> Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 15 +++++++--------
>   1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index a1de900ba677..e5dc94b745b1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1832,13 +1832,6 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>   
>   	mutex_lock(&mem->lock);
>   
> -	/* Unpin MMIO/DOORBELL BO's that were pinned during allocation */
> -	if (mem->alloc_flags &
> -	    (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
> -	     KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP)) {
> -		amdgpu_amdkfd_gpuvm_unpin_bo(mem->bo);
> -	}
> -
>   	mapped_to_gpu_memory = mem->mapped_to_gpu_memory;
>   	is_imported = mem->is_imported;
>   	mutex_unlock(&mem->lock);
> @@ -1855,7 +1848,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>   	/* Make sure restore workers don't access the BO any more */
>   	bo_list_entry = &mem->validate_list;
>   	mutex_lock(&process_info->lock);
> -	list_del(&bo_list_entry->head);
> +	list_del_init(&bo_list_entry->head);

Is this an unrelated fix? What is this needed for? I vaguely remember discussing this before, but can't remember the reason.

Ramesh: This fix is unrelated to P2P work. I raised this issue while working on IOMMU support on the DKMS branch. Basically, a user could call free() before the map count goes to zero. The patch is trying to fix that.
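
The ordering problem can be modelled in user space. The sketch below is not driver code: struct bo_model, its fields, and both free functions are hypothetical, and it assumes the free path refuses with an error such as -EBUSY while mappings remain, which the quoted diff does not show.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of a pinned buffer object. The field names are
 * invented for illustration and do not mirror the driver's structures. */
struct bo_model {
	int pin_count;	/* pins taken at allocation time */
	int map_count;	/* outstanding GPU mappings */
};

/* Old ordering: unpin first, then notice that the BO is still mapped. */
static int free_unpin_first(struct bo_model *bo)
{
	bo->pin_count--;
	if (bo->map_count > 0)
		return -EBUSY;	/* free refused, but the pin is already gone */
	return 0;
}

/* Patched ordering: refuse early, unpin only once the free goes through. */
static int free_unpin_last(struct bo_model *bo)
{
	if (bo->map_count > 0)
		return -EBUSY;
	bo->pin_count--;
	return 0;
}

/* Pin count left after a refused free followed by a successful retry. */
static int pins_after_retry(int (*free_fn)(struct bo_model *))
{
	struct bo_model bo = { .pin_count = 1, .map_count = 1 };
	int ret;

	ret = free_fn(&bo);	/* still mapped: refused */
	assert(ret == -EBUSY);
	bo.map_count = 0;	/* user unmaps, then retries */
	ret = free_fn(&bo);
	assert(ret == 0);
	return bo.pin_count;
}
```

In this model pins_after_retry(free_unpin_first) returns -1, meaning the pin count has been unbalanced by the refused call, while pins_after_retry(free_unpin_last) returns 0.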

Regards,
   Felix


>   	mutex_unlock(&process_info->lock);
>   
>   	/* No more MMU notifiers */
> @@ -1880,6 +1873,12 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>   
>   	ret = unreserve_bo_and_vms(&ctx, false, false);
>   
> +	/* Unpin MMIO/DOORBELL BO's that were pinned during allocation */
> +	if (mem->alloc_flags &
> +	    (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
> +	     KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP))
> +		amdgpu_amdkfd_gpuvm_unpin_bo(mem->bo);
> +
>   	/* Free the sync object */
>   	amdgpu_sync_free(&mem->sync);
>   


* Re: [PATCH] drm/amdgpu: Unpin MMIO and DOORBELL BOs only after map count goes to zero
  2022-06-08 20:03   ` Errabolu, Ramesh
@ 2022-06-08 20:44     ` Felix Kuehling
  2022-06-09 14:44       ` Errabolu, Ramesh
  0 siblings, 1 reply; 5+ messages in thread
From: Felix Kuehling @ 2022-06-08 20:44 UTC (permalink / raw)
  To: Errabolu, Ramesh, amd-gfx

On 2022-06-08 16:03, Errabolu, Ramesh wrote:
> [AMD Official Use Only - General]
>
> My response is inline.
>
> Regards,
> Ramesh
>
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@amd.com>
> Sent: Thursday, June 9, 2022 1:10 AM
> To: amd-gfx@lists.freedesktop.org; Errabolu, Ramesh <Ramesh.Errabolu@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Unpin MMIO and DOORBELL BOs only after map count goes to zero
>
>
> On 2022-06-08 07:51, Ramesh Errabolu wrote:
>> In existing code MMIO and DOORBELL BOs are unpinned without ensuring
>> the condition that their map count has reached zero. Unpinning without
>> checking this constraint could lead to an error while BO is being
>> freed. The patch fixes this issue.
>>
>> Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 15 +++++++--------
>>    1 file changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index a1de900ba677..e5dc94b745b1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -1832,13 +1832,6 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>>    
>>    	mutex_lock(&mem->lock);
>>    
>> -	/* Unpin MMIO/DOORBELL BO's that were pinned during allocation */
>> -	if (mem->alloc_flags &
>> -	    (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
>> -	     KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP)) {
>> -		amdgpu_amdkfd_gpuvm_unpin_bo(mem->bo);
>> -	}
>> -
>>    	mapped_to_gpu_memory = mem->mapped_to_gpu_memory;
>>    	is_imported = mem->is_imported;
>>    	mutex_unlock(&mem->lock);
>> @@ -1855,7 +1848,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>>    	/* Make sure restore workers don't access the BO any more */
>>    	bo_list_entry = &mem->validate_list;
>>    	mutex_lock(&process_info->lock);
>> -	list_del(&bo_list_entry->head);
>> +	list_del_init(&bo_list_entry->head);
> Is this an unrelated fix? What is this needed for? I vaguely remember discussing this before, but can't remember the reason.
>
> Ramesh: This fix is unrelated to P2P work. I raised this issue while working on IOMMU support on the DKMS branch. Basically, a user could call free() before the map count goes to zero. The patch is trying to fix that.

I get that, but I couldn't remember why I suggested list_del_init here. 
It has nothing to do with unpinning of BOs.

Now I recall that it had something to do with restarting the ioctl after 
it was interrupted by a signal. reserve_bo_and_cond_vms can fail with 
-ERESTARTSYS. In that case the ioctl is reentered. We need to make sure 
it doesn't crash the second time around. list_del will remove 
bo_list_entry from the list but leave its prev and next pointers 
invalid. The second time around that will probably cause corruption or 
an oops. Using list_del_init avoids that by reinitializing the entry so 
that its prev and next pointers point back to the entry itself.
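
The re-entry case can be sketched in user space. The code below is an illustration only: the struct and helpers are stand-ins written for this example (the real ones live in <linux/list.h>), and demo_double_delete() is a hypothetical name. It shows that an entry which was unlinked and reinitialized is self-linked, so unlinking it a second time is harmless.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Insert entry right after head. */
static void list_add_entry(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink the entry, then make it point at itself (list_del_init style). */
static void list_del_init_entry(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	init_list_head(entry);
}

/* Delete the same entry twice, as a restarted ioctl would.
 * Returns 1 if the remaining list is still intact. */
static int demo_double_delete(void)
{
	struct list_head head, a, b;

	init_list_head(&head);
	list_add_entry(&a, &head);
	list_add_entry(&b, &head);

	list_del_init_entry(&a);	/* first pass through the free path */
	assert(a.next == &a && a.prev == &a);
	list_del_init_entry(&a);	/* re-entered: only touches a itself */

	return head.next == &b && head.prev == &b &&
	       b.next == &head && b.prev == &head;
}
```

demo_double_delete() returns 1 here. With a plain unlink that does not reinitialize the entry, the second call would instead dereference whatever the first call left in a.prev and a.next.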

See one more little fix below.


>
> Regards,
>     Felix
>
>
>>    	mutex_unlock(&process_info->lock);
>>    
>>    	/* No more MMU notifiers */
>> @@ -1880,6 +1873,12 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>>    
>>    	ret = unreserve_bo_and_vms(&ctx, false, false);

This unreserve_bo_and_vms call cannot fail because the wait parameter is 
false. If it did fail, the error handling would be broken. I'd add a 
WARN_ONCE to make that assumption explicit, and change the return at the 
end of this function to return 0. Basically, if we got this far, we are 
not turning back, and we should return success.
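
As a rough user-space sketch of that suggestion: WARN_ONCE_SKETCH below only approximates the kernel's WARN_ONCE() (no stack trace, message only), and fake_unreserve() and free_tail_sketch() are hypothetical stand-ins rather than the driver functions.

```c
#include <assert.h>
#include <stdio.h>

static int warn_count;	/* number of warnings actually printed */

/* Approximation of WARN_ONCE(): report the first time the condition is
 * true at this call site, stay silent afterwards. */
#define WARN_ONCE_SKETCH(cond, msg)				\
	do {							\
		static int warned;				\
		if ((cond) && !warned) {			\
			warned = 1;				\
			warn_count++;				\
			fprintf(stderr, "WARNING: %s\n", (msg));	\
		}						\
	} while (0)

/* Hypothetical stand-in for the unreserve call; the result is injected. */
static int fake_unreserve(int injected_ret)
{
	return injected_ret;
}

/* Tail of a free path that is past the point of no return. */
static int free_tail_sketch(int injected_ret)
{
	int ret = fake_unreserve(injected_ret);

	WARN_ONCE_SKETCH(ret, "unreserve failed although wait == false");

	/* ...unpin, free the sync object, drop references... */

	return 0;	/* report success even if the warning fired */
}

/* Two failing calls in a row: both succeed, the warning prints once.
 * Returns the number of warnings printed so far. */
static int demo_warn_once(void)
{
	int first = free_tail_sketch(-5);	/* any nonzero error code */
	int second = free_tail_sketch(-5);

	assert(first == 0 && second == 0);
	return warn_count;
}
```

The point of the design is that the unexpected error is made loud without being propagated: the caller always sees 0, so it never retries a free that has already torn down part of the object.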

You could update the commit headline to be more general. Something like: 
Fix error handling in amdgpu_amdkfd_gpuvm_free_memory_of_gpu.

Regards,
   Felix


>>    
>> +	/* Unpin MMIO/DOORBELL BO's that were pinned during allocation */
>> +	if (mem->alloc_flags &
>> +	    (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
>> +	     KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP))
>> +		amdgpu_amdkfd_gpuvm_unpin_bo(mem->bo);
>> +
>>    	/* Free the sync object */
>>    	amdgpu_sync_free(&mem->sync);
>>    


* RE: [PATCH] drm/amdgpu: Unpin MMIO and DOORBELL BOs only after map count goes to zero
  2022-06-08 20:44     ` Felix Kuehling
@ 2022-06-09 14:44       ` Errabolu, Ramesh
  0 siblings, 0 replies; 5+ messages in thread
From: Errabolu, Ramesh @ 2022-06-09 14:44 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx

[AMD Official Use Only - General]

My response is inline.

Regards,
Ramesh

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@amd.com> 
Sent: Thursday, June 9, 2022 2:14 AM
To: Errabolu, Ramesh <Ramesh.Errabolu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Unpin MMIO and DOORBELL BOs only after map count goes to zero

On 2022-06-08 16:03, Errabolu, Ramesh wrote:
> [AMD Official Use Only - General]
>
> My response is inline.
>
> Regards,
> Ramesh
>
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@amd.com>
> Sent: Thursday, June 9, 2022 1:10 AM
> To: amd-gfx@lists.freedesktop.org; Errabolu, Ramesh 
> <Ramesh.Errabolu@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Unpin MMIO and DOORBELL BOs only 
> after map count goes to zero
>
>
> On 2022-06-08 07:51, Ramesh Errabolu wrote:
>> In existing code MMIO and DOORBELL BOs are unpinned without ensuring 
>> the condition that their map count has reached zero. Unpinning 
>> without checking this constraint could lead to an error while BO is 
>> being freed. The patch fixes this issue.
>>
>> Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 15 +++++++--------
>>    1 file changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index a1de900ba677..e5dc94b745b1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -1832,13 +1832,6 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>>    
>>    	mutex_lock(&mem->lock);
>>    
>> -	/* Unpin MMIO/DOORBELL BO's that were pinned during allocation */
>> -	if (mem->alloc_flags &
>> -	    (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
>> -	     KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP)) {
>> -		amdgpu_amdkfd_gpuvm_unpin_bo(mem->bo);
>> -	}
>> -
>>    	mapped_to_gpu_memory = mem->mapped_to_gpu_memory;
>>    	is_imported = mem->is_imported;
>>    	mutex_unlock(&mem->lock);
>> @@ -1855,7 +1848,7 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>>    	/* Make sure restore workers don't access the BO any more */
>>    	bo_list_entry = &mem->validate_list;
>>    	mutex_lock(&process_info->lock);
>> -	list_del(&bo_list_entry->head);
>> +	list_del_init(&bo_list_entry->head);
> Is this an unrelated fix? What is this needed for? I vaguely remember discussing this before, but can't remember the reason.
>
> Ramesh: This fix is unrelated to P2P work. I raised this issue while working on IOMMU support on the DKMS branch. Basically, a user could call free() before the map count goes to zero. The patch is trying to fix that.

I get that, but I couldn't remember why I suggested list_del_init here. 
It has nothing to do with unpinning of BOs.

Now I recall that it had something to do with restarting the ioctl after it was interrupted by a signal. reserve_bo_and_cond_vms can fail with -ERESTARTSYS. In that case the ioctl is reentered. We need to make sure it doesn't crash the second time around. list_del will remove bo_list_entry from the list but leave its prev and next pointers invalid. The second time around that will probably cause corruption or an oops. Using list_del_init avoids that by reinitializing the entry so that its prev and next pointers point back to the entry itself.

Ramesh: I see the same idiom in the method remove_kgd_mem_from_kfd_bo_list(). Should we call that method rather than rewriting the same code block? Also, the name remove_xyz_kfd_bo_list() is misleading. Should it be changed?

See one more little fix below.


>
> Regards,
>     Felix
>
>
>>    	mutex_unlock(&process_info->lock);
>>    
>>    	/* No more MMU notifiers */
>> @@ -1880,6 +1873,12 @@ int amdgpu_amdkfd_gpuvm_free_memory_of_gpu(
>>    
>>    	ret = unreserve_bo_and_vms(&ctx, false, false);

This unreserve_bo_and_vms call cannot fail because the wait parameter is false. If it did fail, the error handling would be broken. I'd add a WARN_ONCE to make that assumption explicit, and change the return at the end of this function to return 0. Basically, if we got this far, we are not turning back, and we should return success.

You could update the commit headline to be more general. Something like: 
Fix error handling in amdgpu_amdkfd_gpuvm_free_memory_of_gpu.

Regards,
   Felix


>>    
>> +	/* Unpin MMIO/DOORBELL BO's that were pinned during allocation */
>> +	if (mem->alloc_flags &
>> +	    (KFD_IOC_ALLOC_MEM_FLAGS_DOORBELL |
>> +	     KFD_IOC_ALLOC_MEM_FLAGS_MMIO_REMAP))
>> +		amdgpu_amdkfd_gpuvm_unpin_bo(mem->bo);
>> +
>>    	/* Free the sync object */
>>    	amdgpu_sync_free(&mem->sync);
>>    

