All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: Fix a NULL pointer of fence
@ 2022-07-07  9:50 xinhui pan
  2022-07-07  9:54 ` Christian König
  2022-07-07 18:23 ` Mike Lothian
  0 siblings, 2 replies; 9+ messages in thread
From: xinhui pan @ 2022-07-07  9:50 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, Felix.Kuehling, xinhui pan, christian.koenig

Fence is accessed by dma_resv_add_fence() now.
Use amdgpu_amdkfd_remove_eviction_fence instead.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 0036c9e405af..1e25c400ce4f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev,
 
 	if (!process_info)
 		return;
-
 	/* Release eviction fence from PD */
 	amdgpu_bo_reserve(pd, false);
-	amdgpu_bo_fence(pd, NULL, false);
+	amdgpu_amdkfd_remove_eviction_fence(pd,
+					process_info->eviction_fence);
 	amdgpu_bo_unreserve(pd);
 
 	/* Update process info */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
  2022-07-07  9:50 [PATCH] drm/amdgpu: Fix a NULL pointer of fence xinhui pan
@ 2022-07-07  9:54 ` Christian König
  2022-07-07 15:47   ` Felix Kuehling
  2022-07-07 18:23 ` Mike Lothian
  1 sibling, 1 reply; 9+ messages in thread
From: Christian König @ 2022-07-07  9:54 UTC (permalink / raw)
  To: xinhui pan, amd-gfx; +Cc: alexander.deucher, Felix.Kuehling, christian.koenig

Am 07.07.22 um 11:50 schrieb xinhui pan:
> Fence is accessed by dma_resv_add_fence() now.
> Use amdgpu_amdkfd_remove_eviction_fence instead.
>
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 0036c9e405af..1e25c400ce4f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev,
>   
>   	if (!process_info)
>   		return;
> -
>   	/* Release eviction fence from PD */
>   	amdgpu_bo_reserve(pd, false);
> -	amdgpu_bo_fence(pd, NULL, false);
> +	amdgpu_amdkfd_remove_eviction_fence(pd,
> +					process_info->eviction_fence);

Good catch as well, but Felix needs to take a look at this.

Regards,
Christian.

>   	amdgpu_bo_unreserve(pd);
>   
>   	/* Update process info */


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
  2022-07-07  9:54 ` Christian König
@ 2022-07-07 15:47   ` Felix Kuehling
  2022-07-08  1:08     ` Pan, Xinhui
  0 siblings, 1 reply; 9+ messages in thread
From: Felix Kuehling @ 2022-07-07 15:47 UTC (permalink / raw)
  To: Christian König, xinhui pan, amd-gfx
  Cc: alexander.deucher, christian.koenig

Am 2022-07-07 um 05:54 schrieb Christian König:
> Am 07.07.22 um 11:50 schrieb xinhui pan:
>> Fence is accessed by dma_resv_add_fence() now.
>> Use amdgpu_amdkfd_remove_eviction_fence instead.
>>
>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 0036c9e405af..1e25c400ce4f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct 
>> amdgpu_device *adev,
>>         if (!process_info)
>>           return;
>> -
>>       /* Release eviction fence from PD */
>>       amdgpu_bo_reserve(pd, false);
>> -    amdgpu_bo_fence(pd, NULL, false);
>> +    amdgpu_amdkfd_remove_eviction_fence(pd,
>> +                    process_info->eviction_fence);
>
> Good catch as well, but Felix needs to take a look at this.

This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which 
would have removed an exclusive fence. But as far as I can tell we added 
the fence as a shared fence in init_kfd_vm and 
amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked 
as intended.

You could try if this is really needed. Just remove the eviction fence 
removal. Then enable eviction debugging with

     echo Y > /sys/module/amdgpu/parameters/debug_evictions

Run some simple tests and check the kernel log to see if process 
termination is causing any unexpected evictions.

Regards,
   Felix


>
> Regards,
> Christian.
>
>>       amdgpu_bo_unreserve(pd);
>>         /* Update process info */
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
  2022-07-07  9:50 [PATCH] drm/amdgpu: Fix a NULL pointer of fence xinhui pan
  2022-07-07  9:54 ` Christian König
@ 2022-07-07 18:23 ` Mike Lothian
  1 sibling, 0 replies; 9+ messages in thread
From: Mike Lothian @ 2022-07-07 18:23 UTC (permalink / raw)
  To: xinhui pan; +Cc: alexander.deucher, Felix.Kuehling, christian.koenig, amd-gfx

Hi

This appears to fix https://gitlab.freedesktop.org/drm/amd/-/issues/2074

Feel free to add my tested by

Thanks

Mike

On Thu, 7 Jul 2022 at 10:51, xinhui pan <xinhui.pan@amd.com> wrote:
>
> Fence is accessed by dma_resv_add_fence() now.
> Use amdgpu_amdkfd_remove_eviction_fence instead.
>
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 0036c9e405af..1e25c400ce4f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct amdgpu_device *adev,
>
>         if (!process_info)
>                 return;
> -
>         /* Release eviction fence from PD */
>         amdgpu_bo_reserve(pd, false);
> -       amdgpu_bo_fence(pd, NULL, false);
> +       amdgpu_amdkfd_remove_eviction_fence(pd,
> +                                       process_info->eviction_fence);
>         amdgpu_bo_unreserve(pd);
>
>         /* Update process info */
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
  2022-07-07 15:47   ` Felix Kuehling
@ 2022-07-08  1:08     ` Pan, Xinhui
  2022-07-08  9:03       ` Christian König
  0 siblings, 1 reply; 9+ messages in thread
From: Pan, Xinhui @ 2022-07-08  1:08 UTC (permalink / raw)
  To: Kuehling, Felix, Christian König, amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian

[AMD Official Use Only - General]

Felix,
Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉
From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken.

And I also try the debug_evictions parameter. No unexpected eviction shows anyway.
I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it.

So I would make a new patch to remove this duplicated ef removal.

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@amd.com>
Sent: Thursday, July 7, 2022 11:47 PM
To: Christian König <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence

Am 2022-07-07 um 05:54 schrieb Christian König:
> Am 07.07.22 um 11:50 schrieb xinhui pan:
>> Fence is accessed by dma_resv_add_fence() now.
>> Use amdgpu_amdkfd_remove_eviction_fence instead.
>>
>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 0036c9e405af..1e25c400ce4f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct
>> amdgpu_device *adev,
>>         if (!process_info)
>>           return;
>> -
>>       /* Release eviction fence from PD */
>>       amdgpu_bo_reserve(pd, false);
>> -    amdgpu_bo_fence(pd, NULL, false);
>> +    amdgpu_amdkfd_remove_eviction_fence(pd,
>> +                    process_info->eviction_fence);
>
> Good catch as well, but Felix needs to take a look at this.

This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended.

You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with

     echo Y > /sys/module/amdgpu/parameters/debug_evictions

Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions.

Regards,
   Felix


>
> Regards,
> Christian.
>
>>       amdgpu_bo_unreserve(pd);
>>         /* Update process info */
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
  2022-07-08  1:08     ` Pan, Xinhui
@ 2022-07-08  9:03       ` Christian König
  2022-07-18 14:58         ` Mike Lothian
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2022-07-08  9:03 UTC (permalink / raw)
  To: Pan, Xinhui, Kuehling, Felix, amd-gfx
  Cc: Deucher, Alexander, Koenig, Christian

Hi guys,

well the practice to remove all fences by adding a NULL exclusive fence 
was pretty much illegal in the first place because this also removes 
kernel internal fences which can lead to freeing up memory which is 
still accessed.

I've just didn't noticed that this was used by the KFD code as well 
otherwise I would have pushed to clean that up much earlier.

Regards,
Christian.

Am 08.07.22 um 03:08 schrieb Pan, Xinhui:
> [AMD Official Use Only - General]
>
> Felix,
> Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉
>  From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken.
>
> And I also try the debug_evictions parameter. No unexpected eviction shows anyway.
> I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it.
>
> So I would make a new patch to remove this duplicated ef removal.
>
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@amd.com>
> Sent: Thursday, July 7, 2022 11:47 PM
> To: Christian König <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
>
> Am 2022-07-07 um 05:54 schrieb Christian König:
>> Am 07.07.22 um 11:50 schrieb xinhui pan:
>>> Fence is accessed by dma_resv_add_fence() now.
>>> Use amdgpu_amdkfd_remove_eviction_fence instead.
>>>
>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> index 0036c9e405af..1e25c400ce4f 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct
>>> amdgpu_device *adev,
>>>          if (!process_info)
>>>            return;
>>> -
>>>        /* Release eviction fence from PD */
>>>        amdgpu_bo_reserve(pd, false);
>>> -    amdgpu_bo_fence(pd, NULL, false);
>>> +    amdgpu_amdkfd_remove_eviction_fence(pd,
>>> +                    process_info->eviction_fence);
>> Good catch as well, but Felix needs to take a look at this.
> This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended.
>
> You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with
>
>       echo Y > /sys/module/amdgpu/parameters/debug_evictions
>
> Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions.
>
> Regards,
>     Felix
>
>
>> Regards,
>> Christian.
>>
>>>        amdgpu_bo_unreserve(pd);
>>>          /* Update process info */


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
  2022-07-08  9:03       ` Christian König
@ 2022-07-18 14:58         ` Mike Lothian
  2022-07-18 15:29           ` Felix Kuehling
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Lothian @ 2022-07-18 14:58 UTC (permalink / raw)
  To: Christian König
  Cc: Deucher, Alexander, Kuehling, Felix, Pan, Xinhui, Koenig,
	Christian, amd-gfx

Is this likely to land before 5.19 final? It's been nearly 2 weeks
since I said if fixed an issue I was seeing
https://gitlab.freedesktop.org/drm/amd/-/issues/2074

On Fri, 8 Jul 2022 at 10:05, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Hi guys,
>
> well the practice to remove all fences by adding a NULL exclusive fence
> was pretty much illegal in the first place because this also removes
> kernel internal fences which can lead to freeing up memory which is
> still accessed.
>
> I've just didn't noticed that this was used by the KFD code as well
> otherwise I would have pushed to clean that up much earlier.
>
> Regards,
> Christian.
>
> Am 08.07.22 um 03:08 schrieb Pan, Xinhui:
> > [AMD Official Use Only - General]
> >
> > Felix,
> > Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉
> >  From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken.
> >
> > And I also try the debug_evictions parameter. No unexpected eviction shows anyway.
> > I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it.
> >
> > So I would make a new patch to remove this duplicated ef removal.
> >
> > -----Original Message-----
> > From: Kuehling, Felix <Felix.Kuehling@amd.com>
> > Sent: Thursday, July 7, 2022 11:47 PM
> > To: Christian König <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
> > Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> > Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
> >
> > Am 2022-07-07 um 05:54 schrieb Christian König:
> >> Am 07.07.22 um 11:50 schrieb xinhui pan:
> >>> Fence is accessed by dma_resv_add_fence() now.
> >>> Use amdgpu_amdkfd_remove_eviction_fence instead.
> >>>
> >>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> >>> ---
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
> >>>    1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>> index 0036c9e405af..1e25c400ce4f 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct
> >>> amdgpu_device *adev,
> >>>          if (!process_info)
> >>>            return;
> >>> -
> >>>        /* Release eviction fence from PD */
> >>>        amdgpu_bo_reserve(pd, false);
> >>> -    amdgpu_bo_fence(pd, NULL, false);
> >>> +    amdgpu_amdkfd_remove_eviction_fence(pd,
> >>> +                    process_info->eviction_fence);
> >> Good catch as well, but Felix needs to take a look at this.
> > This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended.
> >
> > You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with
> >
> >       echo Y > /sys/module/amdgpu/parameters/debug_evictions
> >
> > Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions.
> >
> > Regards,
> >     Felix
> >
> >
> >> Regards,
> >> Christian.
> >>
> >>>        amdgpu_bo_unreserve(pd);
> >>>          /* Update process info */
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
  2022-07-18 14:58         ` Mike Lothian
@ 2022-07-18 15:29           ` Felix Kuehling
  2022-07-18 15:46             ` Deucher, Alexander
  0 siblings, 1 reply; 9+ messages in thread
From: Felix Kuehling @ 2022-07-18 15:29 UTC (permalink / raw)
  To: Mike Lothian, Christian König
  Cc: Deucher, Alexander, Pan, Xinhui, Koenig, Christian, amd-gfx

Xinhui submitted this patch instead, which should address the same 
issue: "drm/amdgpu: Remove one duplicated ef removal"

Alex, can you pick up that patch for drm-fixes for 5.19, if it's not too 
late?

Thanks,
   Felix


On 2022-07-18 10:58, Mike Lothian wrote:
> Is this likely to land before 5.19 final? It's been nearly 2 weeks
> since I said if fixed an issue I was seeing
> https://gitlab.freedesktop.org/drm/amd/-/issues/2074
>
> On Fri, 8 Jul 2022 at 10:05, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Hi guys,
>>
>> well the practice to remove all fences by adding a NULL exclusive fence
>> was pretty much illegal in the first place because this also removes
>> kernel internal fences which can lead to freeing up memory which is
>> still accessed.
>>
>> I've just didn't noticed that this was used by the KFD code as well
>> otherwise I would have pushed to clean that up much earlier.
>>
>> Regards,
>> Christian.
>>
>> Am 08.07.22 um 03:08 schrieb Pan, Xinhui:
>>> [AMD Official Use Only - General]
>>>
>>> Felix,
>>> Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉
>>>   From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken.
>>>
>>> And I also try the debug_evictions parameter. No unexpected eviction shows anyway.
>>> I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it.
>>>
>>> So I would make a new patch to remove this duplicated ef removal.
>>>
>>> -----Original Message-----
>>> From: Kuehling, Felix <Felix.Kuehling@amd.com>
>>> Sent: Thursday, July 7, 2022 11:47 PM
>>> To: Christian König <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
>>>
>>> Am 2022-07-07 um 05:54 schrieb Christian König:
>>>> Am 07.07.22 um 11:50 schrieb xinhui pan:
>>>>> Fence is accessed by dma_resv_add_fence() now.
>>>>> Use amdgpu_amdkfd_remove_eviction_fence instead.
>>>>>
>>>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
>>>>>     1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> index 0036c9e405af..1e25c400ce4f 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct
>>>>> amdgpu_device *adev,
>>>>>           if (!process_info)
>>>>>             return;
>>>>> -
>>>>>         /* Release eviction fence from PD */
>>>>>         amdgpu_bo_reserve(pd, false);
>>>>> -    amdgpu_bo_fence(pd, NULL, false);
>>>>> +    amdgpu_amdkfd_remove_eviction_fence(pd,
>>>>> +                    process_info->eviction_fence);
>>>> Good catch as well, but Felix needs to take a look at this.
>>> This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended.
>>>
>>> You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with
>>>
>>>        echo Y > /sys/module/amdgpu/parameters/debug_evictions
>>>
>>> Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions.
>>>
>>> Regards,
>>>      Felix
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>         amdgpu_bo_unreserve(pd);
>>>>>           /* Update process info */

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
  2022-07-18 15:29           ` Felix Kuehling
@ 2022-07-18 15:46             ` Deucher, Alexander
  0 siblings, 0 replies; 9+ messages in thread
From: Deucher, Alexander @ 2022-07-18 15:46 UTC (permalink / raw)
  To: Kuehling, Felix, Mike Lothian, Christian König
  Cc: Pan, Xinhui, Koenig, Christian, amd-gfx

[-- Attachment #1: Type: text/plain, Size: 4779 bytes --]

[Public]

Will do.

Alex

________________________________
From: Kuehling, Felix <Felix.Kuehling@amd.com>
Sent: Monday, July 18, 2022 11:29 AM
To: Mike Lothian <mike@fireburn.co.uk>; Christian König <ckoenig.leichtzumerken@gmail.com>
Cc: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence

Xinhui submitted this patch instead, which should address the same
issue: "drm/amdgpu: Remove one duplicated ef removal"

Alex, can you pick up that patch for drm-fixes for 5.19, if it's not too
late?

Thanks,
   Felix


On 2022-07-18 10:58, Mike Lothian wrote:
> Is this likely to land before 5.19 final? It's been nearly 2 weeks
> since I said if fixed an issue I was seeing
> https://gitlab.freedesktop.org/drm/amd/-/issues/2074
>
> On Fri, 8 Jul 2022 at 10:05, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Hi guys,
>>
>> well the practice to remove all fences by adding a NULL exclusive fence
>> was pretty much illegal in the first place because this also removes
>> kernel internal fences which can lead to freeing up memory which is
>> still accessed.
>>
>> I've just didn't noticed that this was used by the KFD code as well
>> otherwise I would have pushed to clean that up much earlier.
>>
>> Regards,
>> Christian.
>>
>> Am 08.07.22 um 03:08 schrieb Pan, Xinhui:
>>> [AMD Official Use Only - General]
>>>
>>> Felix,
>>> Shared fences depend on exclusive fence, so add a new exclusive fence, say NULL would also remove all shared fences. That works before 5.18 . 😉
>>>   From 5.18, adding a new exclusive fence(the write usage fence) did not remove any shared fences(the read usage fence). So that is broken.
>>>
>>> And I also try the debug_evictions parameter. No unexpected eviction shows anyway.
>>> I did a quick check and found amdgpu implement BO release notify and it will remove kfd ef on pt/pd BO. So we don’t need this duplicated ef removal. The interesting thing is that is done by patch f4a3c42b5c52("drm/amdgpu: Remove kfd eviction fence before release bo (v2)") which is from me 😉 I totally forgot it.
>>>
>>> So I would make a new patch to remove this duplicated ef removal.
>>>
>>> -----Original Message-----
>>> From: Kuehling, Felix <Felix.Kuehling@amd.com>
>>> Sent: Thursday, July 7, 2022 11:47 PM
>>> To: Christian König <ckoenig.leichtzumerken@gmail.com>; Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: Fix a NULL pointer of fence
>>>
>>> Am 2022-07-07 um 05:54 schrieb Christian König:
>>>> Am 07.07.22 um 11:50 schrieb xinhui pan:
>>>>> Fence is accessed by dma_resv_add_fence() now.
>>>>> Use amdgpu_amdkfd_remove_eviction_fence instead.
>>>>>
>>>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>>>> ---
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 4 ++--
>>>>>     1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> index 0036c9e405af..1e25c400ce4f 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>>>> @@ -1558,10 +1558,10 @@ void amdgpu_amdkfd_gpuvm_destroy_cb(struct
>>>>> amdgpu_device *adev,
>>>>>           if (!process_info)
>>>>>             return;
>>>>> -
>>>>>         /* Release eviction fence from PD */
>>>>>         amdgpu_bo_reserve(pd, false);
>>>>> -    amdgpu_bo_fence(pd, NULL, false);
>>>>> +    amdgpu_amdkfd_remove_eviction_fence(pd,
>>>>> +                    process_info->eviction_fence);
>>>> Good catch as well, but Felix needs to take a look at this.
>>> This is weird. We used amdgpu_bo_fence(pd, NULL, false) here, which would have removed an exclusive fence. But as far as I can tell we added the fence as a shared fence in init_kfd_vm and amdgpu_amdkfd_gpuvm_restore_process_bos. So this probably never worked as intended.
>>>
>>> You could try if this is really needed. Just remove the eviction fence removal. Then enable eviction debugging with
>>>
>>>        echo Y > /sys/module/amdgpu/parameters/debug_evictions
>>>
>>> Run some simple tests and check the kernel log to see if process termination is causing any unexpected evictions.
>>>
>>> Regards,
>>>      Felix
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>         amdgpu_bo_unreserve(pd);
>>>>>           /* Update process info */

[-- Attachment #2: Type: text/html, Size: 7926 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-07-18 15:46 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-07  9:50 [PATCH] drm/amdgpu: Fix a NULL pointer of fence xinhui pan
2022-07-07  9:54 ` Christian König
2022-07-07 15:47   ` Felix Kuehling
2022-07-08  1:08     ` Pan, Xinhui
2022-07-08  9:03       ` Christian König
2022-07-18 14:58         ` Mike Lothian
2022-07-18 15:29           ` Felix Kuehling
2022-07-18 15:46             ` Deucher, Alexander
2022-07-07 18:23 ` Mike Lothian

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.