* [PATCH] drm/amdkfd: drop process ref count when xnack disable
@ 2021-09-01  2:41 Alex Sierra
  2021-09-01 11:29 ` philip yang
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Sierra @ 2021-09-01  2:41 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Sierra

In the svm restore pages interrupt handler, the kfd_process ref count was
never dropped when xnack was disabled. Therefore, the object was never
released.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 8f9b5b53dab5..110c46cd7fac 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -2484,7 +2484,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
 	}
 	if (!p->xnack_enabled) {
 		pr_debug("XNACK not enabled for pasid 0x%x\n", pasid);
-		return -EFAULT;
+		r = -EFAULT;
+		goto out;
 	}
 	svms = &p->svms;
 
-- 
2.32.0
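
The leak this patch addresses can be modelled in miniature outside the kernel. The sketch below is a userspace illustration under stated assumptions, not the driver code: `struct proc`, `proc_get`, `proc_put`, and the two `restore_pages_*` functions are hypothetical stand-ins, and a plain integer takes the place of the kernel's reference counting.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct kfd_process; only what the sketch needs. */
struct proc {
	int refcount;        /* plain int instead of the kernel's refcounting */
	bool xnack_enabled;
};

/* Stand-in for the lookup that takes a reference on the process. */
static struct proc *proc_get(struct proc *p)
{
	p->refcount++;
	return p;
}

/* Stand-in for the matching unref. */
static void proc_put(struct proc *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}

/* Pre-patch shape: the early return skips the put, so one reference leaks. */
static int restore_pages_leaky(struct proc *entry)
{
	struct proc *p = proc_get(entry);

	if (!p->xnack_enabled)
		return -EFAULT;

	/* ... range lookup and page restore would happen here ... */
	proc_put(p);
	return 0;
}

/* Post-patch shape: every exit after the get goes through the out label. */
static int restore_pages_fixed(struct proc *entry)
{
	struct proc *p = proc_get(entry);
	int r = 0;

	if (!p->xnack_enabled) {
		r = -EFAULT;
		goto out;
	}
	/* ... range lookup and page restore would happen here ... */
out:
	proc_put(p);
	return r;
}
```

Calling each variant three times on a process that starts with a reference count of 1 and has XNACK disabled leaves the leaky variant at 4 and the fixed variant at 1, which mirrors one stale reference per fault.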


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable
  2021-09-01  2:41 [PATCH] drm/amdkfd: drop process ref count when xnack disable Alex Sierra
@ 2021-09-01 11:29 ` philip yang
  2021-09-01 13:45   ` Kim, Jonathan
  0 siblings, 1 reply; 7+ messages in thread
From: philip yang @ 2021-09-01 11:29 UTC (permalink / raw)
  To: Alex Sierra, amd-gfx

[-- Attachment #1: Type: text/html, Size: 1677 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] drm/amdkfd: drop process ref count when xnack disable
  2021-09-01 11:29 ` philip yang
@ 2021-09-01 13:45   ` Kim, Jonathan
  2021-09-01 16:29     ` philip yang
  0 siblings, 1 reply; 7+ messages in thread
From: Kim, Jonathan @ 2021-09-01 13:45 UTC (permalink / raw)
  To: Yang, Philip, Sierra Guiza, Alejandro (Alex), amd-gfx

[-- Attachment #1: Type: text/plain, Size: 1950 bytes --]

[AMD Official Use Only]

We were seeing process leaks on a couple of machines running certain tests that triggered vm faults on purpose.
I think svm_range_restore_pages gets called unconditionally on vm fault handling (unless the retry interrupt payload bit is supposed to be clear with xnack off)?

Either way, this patch prevents the process leaks we were seeing and is also:
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>

Thanks,

Jon


From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of philip yang
Sent: Wednesday, September 1, 2021 7:30 AM
To: Sierra Guiza, Alejandro (Alex) <Alex.Sierra@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable

On 2021-08-31 10:41 p.m., Alex Sierra wrote:
> During svm restore pages interrupt handler, kfd_process ref count was
> never dropped when xnack was disabled. Therefore, the object was never
> released.

Good catch, but if xnack is off, we should not get here to recover the fault.

The fix looks good to me.

Reviewed-by: Philip Yang <philip.yang@amd.com>



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable
  2021-09-01 13:45   ` Kim, Jonathan
@ 2021-09-01 16:29     ` philip yang
  2021-09-01 16:59       ` Kim, Jonathan
  0 siblings, 1 reply; 7+ messages in thread
From: philip yang @ 2021-09-01 16:29 UTC (permalink / raw)
  To: Kim, Jonathan, Yang, Philip, Sierra Guiza, Alejandro (Alex), amd-gfx

[-- Attachment #1: Type: text/html, Size: 7946 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] drm/amdkfd: drop process ref count when xnack disable
  2021-09-01 16:29     ` philip yang
@ 2021-09-01 16:59       ` Kim, Jonathan
  2021-09-01 17:54         ` Felix Kuehling
  0 siblings, 1 reply; 7+ messages in thread
From: Kim, Jonathan @ 2021-09-01 16:59 UTC (permalink / raw)
  To: Yang, Philip, Sierra Guiza, Alejandro (Alex), amd-gfx

[-- Attachment #1: Type: text/plain, Size: 3333 bytes --]

[Public]

I wouldn’t know if it was another bug elsewhere.
From what I was seeing, the leak came from the !p->xnack_enabled path in the svm_range_restore_pages call.

If it helps, I saw this on Aldebaran, where a shader deliberately does a bad memory access in a debugged, ptraced child process.
The VM fault message shows up in dmesg, and one stale KFD process accumulates per run without this fix.
I’m just assuming at this point that the IV retry bit is set, but I never confirmed that.

Thanks,

Jon
From: Yang, Philip <Philip.Yang@amd.com>
Sent: Wednesday, September 1, 2021 12:30 PM
To: Kim, Jonathan <Jonathan.Kim@amd.com>; Yang, Philip <Philip.Yang@amd.com>; Sierra Guiza, Alejandro (Alex) <Alex.Sierra@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable



On 2021-09-01 9:45 a.m., Kim, Jonathan wrote:
> [AMD Official Use Only]
>
> We were seeing process leaks on a couple of machines running certain tests that triggered vm faults on purpose.
> I think svm_range_restore_pages gets called unconditionally on vm fault handling (unless the retry interrupt payload bit is supposed to be clear with xnack off)?

Yes. With xnack off, sh_mem_config retry should be off and the retry bit is supposed to be clear in the fault interrupt vector, so we should not try to recover the vm fault; we should just report it back to the application and evict the user queues. Maybe another bug causes p->xnack_enabled and sh_mem_config retry to mismatch under specific conditions?

Regards,

Philip

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable
  2021-09-01 16:59       ` Kim, Jonathan
@ 2021-09-01 17:54         ` Felix Kuehling
  2021-09-01 19:43           ` Felix Kuehling
  0 siblings, 1 reply; 7+ messages in thread
From: Felix Kuehling @ 2021-09-01 17:54 UTC (permalink / raw)
  To: amd-gfx, Kim, Jonathan, Alex Sierra, Yang, Philip

On 2021-09-01 at 12:59 p.m., Kim, Jonathan wrote:
>
> [Public]
>
> I wouldn’t know if it was another bug elsewhere.
>
> From what I was seeing, the leak was coming from !p->xnack_enable on
> the svm_range_restore_pages call.
>
> If it helps, I saw this on Aldebaran where a shader does some bad
> memory access on purpose on a debugged ptraced child process.
>
On Aldebaran the XNACK mode can be changed per process. But the page
fault interrupts are retry faults (until they get turned into no-retry
faults by updating the PTE in amdgpu_vm_handle_fault). The retry faults
therefore reach svm_range_restore_pages before the handler finds out
that the process in question doesn't use XNACK.

The patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
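
The fault path described above can be sketched as a small userspace model. This is an illustration under assumptions, not the amdgpu code: `struct fault`, `struct proc`, `svm_restore`, `handle_fault`, and the outcome names are all hypothetical, and clearing `retry` merely stands in for the PTE update that turns later faults on the page into no-retry faults.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcomes of handling one fault interrupt. */
enum fault_outcome {
	FAULT_RECOVERED,            /* SVM restored the pages */
	FAULT_CONVERTED_TO_NO_RETRY,/* recovery refused, page marked no-retry */
	FAULT_REPORTED              /* no-retry fault, reported to the application */
};

struct fault { bool retry; };   /* stand-in for the interrupt payload */

struct proc {
	bool xnack_enabled;
	int refcount;               /* plain int instead of kernel refcounting */
};

/* Stand-in for the restore call, using the post-patch exit path. */
static int svm_restore(struct proc *p)
{
	int r = 0;

	p->refcount++;              /* reference taken by the process lookup */
	if (!p->xnack_enabled) {
		r = -EFAULT;            /* only here is the XNACK mode checked */
		goto out;
	}
	/* ... page restore would happen here ... */
out:
	p->refcount--;              /* dropped on every exit */
	return r;
}

static enum fault_outcome handle_fault(struct proc *p, struct fault *f)
{
	if (!f->retry)
		return FAULT_REPORTED;

	/* A retry fault is routed to the restore path first. */
	if (svm_restore(p) == 0)
		return FAULT_RECOVERED;

	/* Stand-in for the PTE update: the next fault arrives as no-retry. */
	f->retry = false;
	return FAULT_CONVERTED_TO_NO_RETRY;
}
```

In this model a process with XNACK disabled still enters `svm_restore` once per retry fault, which is why the reference taken there has to be dropped on the error exit as well.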


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdkfd: drop process ref count when xnack disable
  2021-09-01 17:54         ` Felix Kuehling
@ 2021-09-01 19:43           ` Felix Kuehling
  0 siblings, 0 replies; 7+ messages in thread
From: Felix Kuehling @ 2021-09-01 19:43 UTC (permalink / raw)
  To: amd-gfx, Kim, Jonathan, Alex Sierra, Yang, Philip

If it's not too late, please add

Fixes: 2383f56bbe4a ("drm/amdkfd: page table restore through svm API")

Thanks,
  Felix


^ permalink raw reply	[flat|nested] 7+ messages in thread

Thread overview: 7+ messages
2021-09-01  2:41 [PATCH] drm/amdkfd: drop process ref count when xnack disable Alex Sierra
2021-09-01 11:29 ` philip yang
2021-09-01 13:45   ` Kim, Jonathan
2021-09-01 16:29     ` philip yang
2021-09-01 16:59       ` Kim, Jonathan
2021-09-01 17:54         ` Felix Kuehling
2021-09-01 19:43           ` Felix Kuehling
