All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/1] drm/amdgpu: Take a reference to an exported BO
@ 2020-05-01 14:21 Felix Kuehling
  2020-05-01 14:29 ` Christian König
  0 siblings, 1 reply; 9+ messages in thread
From: Felix Kuehling @ 2020-05-01 14:21 UTC (permalink / raw)
  To: amd-gfx; +Cc: Felix Kuehling

From: Felix Kuehling <felix.kuehling@gmail.com>

That reference gets dropped when the the dma-buf is freed. Not incrementing
the refcount can lead to use-after-free errors.

Signed-off-by: Felix Kuehling <felix.kuehling@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index ffeb20f11c07..a0f9b3ef4aad 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -398,8 +398,15 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_gem_object *gobj,
 		return ERR_PTR(-EPERM);
 
 	buf = drm_gem_prime_export(gobj, flags);
-	if (!IS_ERR(buf))
+	if (!IS_ERR(buf)) {
 		buf->ops = &amdgpu_dmabuf_ops;
+		/* GEM needs a reference to the underlying object
+		 * that gets dropped when the dma-buf is released,
+		 * through the amdgpu_gem_object_free callback
+		 * from drm_gem_object_put_unlocked.
+		 */
+		amdgpu_bo_ref(bo);
+	}
 
 	return buf;
 }
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] drm/amdgpu: Take a reference to an exported BO
  2020-05-01 14:21 [PATCH 1/1] drm/amdgpu: Take a reference to an exported BO Felix Kuehling
@ 2020-05-01 14:29 ` Christian König
  2020-05-01 14:44   ` Felix Kuehling
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2020-05-01 14:29 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx; +Cc: Felix Kuehling

Am 01.05.20 um 16:21 schrieb Felix Kuehling:
> From: Felix Kuehling <felix.kuehling@gmail.com>
>
> That reference gets dropped when the the dma-buf is freed. Not incrementing
> the refcount can lead to use-after-free errors.
>
> Signed-off-by: Felix Kuehling <felix.kuehling@gmail.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> index ffeb20f11c07..a0f9b3ef4aad 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> @@ -398,8 +398,15 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_gem_object *gobj,
>   		return ERR_PTR(-EPERM);
>   
>   	buf = drm_gem_prime_export(gobj, flags);
> -	if (!IS_ERR(buf))
> +	if (!IS_ERR(buf)) {
>   		buf->ops = &amdgpu_dmabuf_ops;
> +		/* GEM needs a reference to the underlying object
> +		 * that gets dropped when the dma-buf is released,
> +		 * through the amdgpu_gem_object_free callback
> +		 * from drm_gem_object_put_unlocked.
> +		 */
> +		amdgpu_bo_ref(bo);
> +	}

Of hand that doesn't sounds correct to me. Why should the exported bo be 
closed through amdgpu_gem_object_free()?

Regards,
Christian.

>   
>   	return buf;
>   }

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] drm/amdgpu: Take a reference to an exported BO
  2020-05-01 14:29 ` Christian König
@ 2020-05-01 14:44   ` Felix Kuehling
  2020-05-05  7:47     ` Christian König
  0 siblings, 1 reply; 9+ messages in thread
From: Felix Kuehling @ 2020-05-01 14:44 UTC (permalink / raw)
  To: christian.koenig, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 4154 bytes --]

[dropping my gmail address]

We saw this backtrace showing the call chain while investigating a
kernel oops caused by this issue on the DKMS branch with the KFD IPC
API. It happens after a dma-buf file is released with fput:

[ 1255.049330] BUG: kernel NULL pointer dereference, address: 000000000000051e
[ 1255.049727] #PF: supervisor read access in kernel mode
[ 1255.050092] #PF: error_code(0x0000) - not-present page
[ 1255.050416] PGD 0 P4D 0
[ 1255.050736] Oops: 0000 [#1] SMP PTI
[ 1255.051060] CPU: 27 PID: 2292 Comm: kworker/27:2 Tainted: G           OE     5.3.0-46-generic #38~18.04.1-Ubuntu
[ 1255.051400] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 3.0a 02/26/2019
[ 1255.051752] Workqueue: events delayed_fput
[ 1255.052111] RIP: 0010:drm_gem_object_put_unlocked+0x1c/0x70 [drm]
[ 1255.052465] Code: 4d 80 c8 ee 0f 0b eb d8 66 0f 1f 44 00 00 0f 1f 44 00 00 48 85 ff 74 34 55 48 89 e5 41 54 53 48 89 fb 48 8b 7f 08 48 8b 47 20 <48> 83 b8 a0 00 00 00 00 74 1a 4c 8d 67 68 48 89 df 4c 89 e6 e8 9b
[ 1255.053224] RSP: 0018:ffffb4b62035fdc8 EFLAGS: 00010286
[ 1255.053613] RAX: 000000000000047e RBX: ffff9f2add197850 RCX: 0000000000000000
[ 1255.054032] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f2aa2548aa0
[ 1255.054440] RBP: ffffb4b62035fdd8 R08: 0000000000000000 R09: 0000000000000000
[ 1255.054860] R10: 0000000000000010 R11: ffff9f2a4b1cc310 R12: 0000000000080005
[ 1255.055268] R13: ffff9f2a4b1cc310 R14: ffff9f4e369161e0 R15: ffff9f2a1b2f9080
[ 1255.055674] FS:  0000000000000000(0000) GS:ffff9f4e3f740000(0000) knlGS:0000000000000000
[ 1255.056087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1255.056501] CR2: 000000000000051e CR3: 00000002df00a004 CR4: 00000000007606e0
[ 1255.056923] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1255.057345] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1255.057763] PKRU: 55555554
[ 1255.058179] Call Trace:
[ 1255.058603]  drm_gem_dmabuf_release+0x1a/0x30 [drm]
[ 1255.059025]  dma_buf_release+0x56/0x130
[ 1255.059443]  __fput+0xc6/0x260
[ 1255.059856]  delayed_fput+0x20/0x30
[ 1255.060272]  process_one_work+0x1fd/0x3f0
[ 1255.060686]  worker_thread+0x34/0x410
[ 1255.061099]  kthread+0x121/0x140
[ 1255.061510]  ? process_one_work+0x3f0/0x3f0
[ 1255.061923]  ? kthread_park+0xb0/0xb0
[ 1255.062336]  ret_from_fork+0x35/0x40

drm_gem_object_put_unlocked calls drm_gem_object_free when the
obj->refcount reaches 0. From there it calls
dev->driver->gem_free_object_unlocked, which is amdgpu_gem_object_free
in amdgpu.

Regards,
  Felix

Am 2020-05-01 um 10:29 a.m. schrieb Christian König:
> Am 01.05.20 um 16:21 schrieb Felix Kuehling:
>> From: Felix Kuehling <felix.kuehling@gmail.com>
>>
>> That reference gets dropped when the the dma-buf is freed. Not
>> incrementing
>> the refcount can lead to use-after-free errors.
>>
>> Signed-off-by: Felix Kuehling <felix.kuehling@gmail.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>> index ffeb20f11c07..a0f9b3ef4aad 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>> @@ -398,8 +398,15 @@ struct dma_buf *amdgpu_gem_prime_export(struct
>> drm_gem_object *gobj,
>>           return ERR_PTR(-EPERM);
>>         buf = drm_gem_prime_export(gobj, flags);
>> -    if (!IS_ERR(buf))
>> +    if (!IS_ERR(buf)) {
>>           buf->ops = &amdgpu_dmabuf_ops;
>> +        /* GEM needs a reference to the underlying object
>> +         * that gets dropped when the dma-buf is released,
>> +         * through the amdgpu_gem_object_free callback
>> +         * from drm_gem_object_put_unlocked.
>> +         */
>> +        amdgpu_bo_ref(bo);
>> +    }
>
> Of hand that doesn't sounds correct to me. Why should the exported bo
> be closed through amdgpu_gem_object_free()?
>
> Regards,
> Christian.
>
>>         return buf;
>>   }
>

[-- Attachment #1.2: Type: text/html, Size: 5918 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] drm/amdgpu: Take a reference to an exported BO
  2020-05-01 14:44   ` Felix Kuehling
@ 2020-05-05  7:47     ` Christian König
  2020-05-05 14:58       ` Felix Kuehling
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2020-05-05  7:47 UTC (permalink / raw)
  To: Felix Kuehling, christian.koenig, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 4760 bytes --]

Just to reply here once more, this patch is a clear NAK.

The references are grabbed in the call path of drm_gem_prime_export() 
and dropped again in drm_gem_dmabuf_release().

So they are perfectly balanced as far as I can see.

Regards,
Christian.

Am 01.05.20 um 16:44 schrieb Felix Kuehling:
>
> [dropping my gmail address]
>
> We saw this backtrace showing the call chain while investigating a 
> kernel oops caused by this issue on the DKMS branch with the KFD IPC 
> API. It happens after a dma-buf file is released with fput:
>
> [ 1255.049330] BUG: kernel NULL pointer dereference, address: 000000000000051e
> [ 1255.049727] #PF: supervisor read access in kernel mode
> [ 1255.050092] #PF: error_code(0x0000) - not-present page
> [ 1255.050416] PGD 0 P4D 0
> [ 1255.050736] Oops: 0000 [#1] SMP PTI
> [ 1255.051060] CPU: 27 PID: 2292 Comm: kworker/27:2 Tainted: G           OE     5.3.0-46-generic #38~18.04.1-Ubuntu
> [ 1255.051400] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 3.0a 02/26/2019
> [ 1255.051752] Workqueue: events delayed_fput
> [ 1255.052111] RIP: 0010:drm_gem_object_put_unlocked+0x1c/0x70 [drm]
> [ 1255.052465] Code: 4d 80 c8 ee 0f 0b eb d8 66 0f 1f 44 00 00 0f 1f 44 00 00 48 85 ff 74 34 55 48 89 e5 41 54 53 48 89 fb 48 8b 7f 08 48 8b 47 20 <48> 83 b8 a0 00 00 00 00 74 1a 4c 8d 67 68 48 89 df 4c 89 e6 e8 9b
> [ 1255.053224] RSP: 0018:ffffb4b62035fdc8 EFLAGS: 00010286
> [ 1255.053613] RAX: 000000000000047e RBX: ffff9f2add197850 RCX: 0000000000000000
> [ 1255.054032] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f2aa2548aa0
> [ 1255.054440] RBP: ffffb4b62035fdd8 R08: 0000000000000000 R09: 0000000000000000
> [ 1255.054860] R10: 0000000000000010 R11: ffff9f2a4b1cc310 R12: 0000000000080005
> [ 1255.055268] R13: ffff9f2a4b1cc310 R14: ffff9f4e369161e0 R15: ffff9f2a1b2f9080
> [ 1255.055674] FS:  0000000000000000(0000) GS:ffff9f4e3f740000(0000) knlGS:0000000000000000
> [ 1255.056087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1255.056501] CR2: 000000000000051e CR3: 00000002df00a004 CR4: 00000000007606e0
> [ 1255.056923] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1255.057345] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1255.057763] PKRU: 55555554
> [ 1255.058179] Call Trace:
> [ 1255.058603]  drm_gem_dmabuf_release+0x1a/0x30 [drm]
> [ 1255.059025]  dma_buf_release+0x56/0x130
> [ 1255.059443]  __fput+0xc6/0x260
> [ 1255.059856]  delayed_fput+0x20/0x30
> [ 1255.060272]  process_one_work+0x1fd/0x3f0
> [ 1255.060686]  worker_thread+0x34/0x410
> [ 1255.061099]  kthread+0x121/0x140
> [ 1255.061510]  ? process_one_work+0x3f0/0x3f0
> [ 1255.061923]  ? kthread_park+0xb0/0xb0
> [ 1255.062336]  ret_from_fork+0x35/0x40
>
> drm_gem_object_put_unlocked calls drm_gem_object_free when the 
> obj->refcount reaches 0. From there it calls 
> dev->driver->gem_free_object_unlocked, which is amdgpu_gem_object_free 
> in amdgpu.
>
> Regards,
>   Felix
>
> Am 2020-05-01 um 10:29 a.m. schrieb Christian König:
>> Am 01.05.20 um 16:21 schrieb Felix Kuehling:
>>> From: Felix Kuehling <felix.kuehling@gmail.com>
>>>
>>> That reference gets dropped when the the dma-buf is freed. Not 
>>> incrementing
>>> the refcount can lead to use-after-free errors.
>>>
>>> Signed-off-by: Felix Kuehling <felix.kuehling@gmail.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>> index ffeb20f11c07..a0f9b3ef4aad 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>> @@ -398,8 +398,15 @@ struct dma_buf *amdgpu_gem_prime_export(struct 
>>> drm_gem_object *gobj,
>>>           return ERR_PTR(-EPERM);
>>>         buf = drm_gem_prime_export(gobj, flags);
>>> -    if (!IS_ERR(buf))
>>> +    if (!IS_ERR(buf)) {
>>>           buf->ops = &amdgpu_dmabuf_ops;
>>> +        /* GEM needs a reference to the underlying object
>>> +         * that gets dropped when the dma-buf is released,
>>> +         * through the amdgpu_gem_object_free callback
>>> +         * from drm_gem_object_put_unlocked.
>>> +         */
>>> +        amdgpu_bo_ref(bo);
>>> +    }
>>
>> Of hand that doesn't sounds correct to me. Why should the exported bo 
>> be closed through amdgpu_gem_object_free()?
>>
>> Regards,
>> Christian.
>>
>>>         return buf;
>>>   }
>>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[-- Attachment #1.2: Type: text/html, Size: 6606 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] drm/amdgpu: Take a reference to an exported BO
  2020-05-05  7:47     ` Christian König
@ 2020-05-05 14:58       ` Felix Kuehling
  2020-05-05 15:19         ` Christian König
  0 siblings, 1 reply; 9+ messages in thread
From: Felix Kuehling @ 2020-05-05 14:58 UTC (permalink / raw)
  To: christian.koenig, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 5718 bytes --]

Am 2020-05-05 um 3:47 a.m. schrieb Christian König:
> Just to reply here once more, this patch is a clear NAK.

Agreed. But see below. I don't think all is well.


>
> The references are grabbed in the call path of drm_gem_prime_export()
> and dropped again in drm_gem_dmabuf_release().
>
> So they are perfectly balanced as far as I can see.

That is true for the GEM object references. But I believe there is still
a problem with the TTM BO references.

As far as I can tell amdgpu_bo_unref can free the TTM BO while there are
still references to the GEM object from DMA buf exports. I think that's
a fundamental problem with how we have two reference counts for the same
physical object (TTM BO and the embedded GEM BO).

I think the correct solution is for amdgpu_bo_ref/unref to delegate its
reference counting to drm_gem_object_get/put instead of ttm_bo_get/put.
The amdgpu BO would hold one token reference to the TTM BO, which it can
drop when the GEM BO refcount drops to 0. Finally, the amdgpu BO should
only be freed once the TTM BO refcount also becomes 0.

Regards,
  Felix


>
> Regards,
> Christian.
>
> Am 01.05.20 um 16:44 schrieb Felix Kuehling:
>>
>> [dropping my gmail address]
>>
>> We saw this backtrace showing the call chain while investigating a
>> kernel oops caused by this issue on the DKMS branch with the KFD IPC
>> API. It happens after a dma-buf file is released with fput:
>>
>> [ 1255.049330] BUG: kernel NULL pointer dereference, address: 000000000000051e
>> [ 1255.049727] #PF: supervisor read access in kernel mode
>> [ 1255.050092] #PF: error_code(0x0000) - not-present page
>> [ 1255.050416] PGD 0 P4D 0
>> [ 1255.050736] Oops: 0000 [#1] SMP PTI
>> [ 1255.051060] CPU: 27 PID: 2292 Comm: kworker/27:2 Tainted: G           OE     5.3.0-46-generic #38~18.04.1-Ubuntu
>> [ 1255.051400] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 3.0a 02/26/2019
>> [ 1255.051752] Workqueue: events delayed_fput
>> [ 1255.052111] RIP: 0010:drm_gem_object_put_unlocked+0x1c/0x70 [drm]
>> [ 1255.052465] Code: 4d 80 c8 ee 0f 0b eb d8 66 0f 1f 44 00 00 0f 1f 44 00 00 48 85 ff 74 34 55 48 89 e5 41 54 53 48 89 fb 48 8b 7f 08 48 8b 47 20 <48> 83 b8 a0 00 00 00 00 74 1a 4c 8d 67 68 48 89 df 4c 89 e6 e8 9b
>> [ 1255.053224] RSP: 0018:ffffb4b62035fdc8 EFLAGS: 00010286
>> [ 1255.053613] RAX: 000000000000047e RBX: ffff9f2add197850 RCX: 0000000000000000
>> [ 1255.054032] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f2aa2548aa0
>> [ 1255.054440] RBP: ffffb4b62035fdd8 R08: 0000000000000000 R09: 0000000000000000
>> [ 1255.054860] R10: 0000000000000010 R11: ffff9f2a4b1cc310 R12: 0000000000080005
>> [ 1255.055268] R13: ffff9f2a4b1cc310 R14: ffff9f4e369161e0 R15: ffff9f2a1b2f9080
>> [ 1255.055674] FS:  0000000000000000(0000) GS:ffff9f4e3f740000(0000) knlGS:0000000000000000
>> [ 1255.056087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 1255.056501] CR2: 000000000000051e CR3: 00000002df00a004 CR4: 00000000007606e0
>> [ 1255.056923] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 1255.057345] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 1255.057763] PKRU: 55555554
>> [ 1255.058179] Call Trace:
>> [ 1255.058603]  drm_gem_dmabuf_release+0x1a/0x30 [drm]
>> [ 1255.059025]  dma_buf_release+0x56/0x130
>> [ 1255.059443]  __fput+0xc6/0x260
>> [ 1255.059856]  delayed_fput+0x20/0x30
>> [ 1255.060272]  process_one_work+0x1fd/0x3f0
>> [ 1255.060686]  worker_thread+0x34/0x410
>> [ 1255.061099]  kthread+0x121/0x140
>> [ 1255.061510]  ? process_one_work+0x3f0/0x3f0
>> [ 1255.061923]  ? kthread_park+0xb0/0xb0
>> [ 1255.062336]  ret_from_fork+0x35/0x40
>>
>> drm_gem_object_put_unlocked calls drm_gem_object_free when the
>> obj->refcount reaches 0. From there it calls
>> dev->driver->gem_free_object_unlocked, which is
>> amdgpu_gem_object_free in amdgpu.
>>
>> Regards,
>>   Felix
>>
>> Am 2020-05-01 um 10:29 a.m. schrieb Christian König:
>>> Am 01.05.20 um 16:21 schrieb Felix Kuehling:
>>>> From: Felix Kuehling <felix.kuehling@gmail.com>
>>>>
>>>> That reference gets dropped when the the dma-buf is freed. Not
>>>> incrementing
>>>> the refcount can lead to use-after-free errors.
>>>>
>>>> Signed-off-by: Felix Kuehling <felix.kuehling@gmail.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
>>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>> index ffeb20f11c07..a0f9b3ef4aad 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>> @@ -398,8 +398,15 @@ struct dma_buf *amdgpu_gem_prime_export(struct
>>>> drm_gem_object *gobj,
>>>>           return ERR_PTR(-EPERM);
>>>>         buf = drm_gem_prime_export(gobj, flags);
>>>> -    if (!IS_ERR(buf))
>>>> +    if (!IS_ERR(buf)) {
>>>>           buf->ops = &amdgpu_dmabuf_ops;
>>>> +        /* GEM needs a reference to the underlying object
>>>> +         * that gets dropped when the dma-buf is released,
>>>> +         * through the amdgpu_gem_object_free callback
>>>> +         * from drm_gem_object_put_unlocked.
>>>> +         */
>>>> +        amdgpu_bo_ref(bo);
>>>> +    }
>>>
>>> Of hand that doesn't sounds correct to me. Why should the exported
>>> bo be closed through amdgpu_gem_object_free()?
>>>
>>> Regards,
>>> Christian.
>>>
>>>>         return buf;
>>>>   }
>>>
>>
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>

[-- Attachment #1.2: Type: text/html, Size: 8678 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] drm/amdgpu: Take a reference to an exported BO
  2020-05-05 14:58       ` Felix Kuehling
@ 2020-05-05 15:19         ` Christian König
  2020-05-05 17:29           ` Felix Kuehling
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2020-05-05 15:19 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 6427 bytes --]

Am 05.05.20 um 16:58 schrieb Felix Kuehling:
> Am 2020-05-05 um 3:47 a.m. schrieb Christian König:
>> Just to reply here once more, this patch is a clear NAK.
>
> Agreed. But see below. I don't think all is well.
>
>>
>> The references are grabbed in the call path of drm_gem_prime_export() 
>> and dropped again in drm_gem_dmabuf_release().
>>
>> So they are perfectly balanced as far as I can see.
>
> That is true for the GEM object references. But I believe there is 
> still a problem with the TTM BO references.
>
> As far as I can tell amdgpu_bo_unref can free the TTM BO while there 
> are still references to the GEM object from DMA buf exports. I think 
> that's a fundamental problem with how we have two reference counts for 
> the same physical object (TTM BO and the embedded GEM BO).
>

Completely agree, I also mentioned that problem during my talk on 
FOSDEM. But calling amdgpu_bo_unref() to often is a bug in itself.

What we could probably do to detect this is adding a BUG_ON() in TTMs 
release function and checking if the GEM reference count is really dead.

> I think the correct solution is for amdgpu_bo_ref/unref to delegate 
> its reference counting to drm_gem_object_get/put instead of 
> ttm_bo_get/put. The amdgpu BO would hold one token reference to the 
> TTM BO, which it can drop when the GEM BO refcount drops to 0. 
> Finally, the amdgpu BO should only be freed once the TTM BO refcount 
> also becomes 0.

Just the other way around, but yes the long term plan should probably be 
to merge the two.

The difficult is currently we have a mismatch what locks could be taken 
when we drop the references.

Regards,
Christian.

> Regards,
>   Felix
>
>
>>
>> Regards,
>> Christian.
>>
>> Am 01.05.20 um 16:44 schrieb Felix Kuehling:
>>>
>>> [dropping my gmail address]
>>>
>>> We saw this backtrace showing the call chain while investigating a 
>>> kernel oops caused by this issue on the DKMS branch with the KFD IPC 
>>> API. It happens after a dma-buf file is released with fput:
>>>
>>> [ 1255.049330] BUG: kernel NULL pointer dereference, address: 000000000000051e
>>> [ 1255.049727] #PF: supervisor read access in kernel mode
>>> [ 1255.050092] #PF: error_code(0x0000) - not-present page
>>> [ 1255.050416] PGD 0 P4D 0
>>> [ 1255.050736] Oops: 0000 [#1] SMP PTI
>>> [ 1255.051060] CPU: 27 PID: 2292 Comm: kworker/27:2 Tainted: G           OE     5.3.0-46-generic #38~18.04.1-Ubuntu
>>> [ 1255.051400] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 3.0a 02/26/2019
>>> [ 1255.051752] Workqueue: events delayed_fput
>>> [ 1255.052111] RIP: 0010:drm_gem_object_put_unlocked+0x1c/0x70 [drm]
>>> [ 1255.052465] Code: 4d 80 c8 ee 0f 0b eb d8 66 0f 1f 44 00 00 0f 1f 44 00 00 48 85 ff 74 34 55 48 89 e5 41 54 53 48 89 fb 48 8b 7f 08 48 8b 47 20 <48> 83 b8 a0 00 00 00 00 74 1a 4c 8d 67 68 48 89 df 4c 89 e6 e8 9b
>>> [ 1255.053224] RSP: 0018:ffffb4b62035fdc8 EFLAGS: 00010286
>>> [ 1255.053613] RAX: 000000000000047e RBX: ffff9f2add197850 RCX: 0000000000000000
>>> [ 1255.054032] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f2aa2548aa0
>>> [ 1255.054440] RBP: ffffb4b62035fdd8 R08: 0000000000000000 R09: 0000000000000000
>>> [ 1255.054860] R10: 0000000000000010 R11: ffff9f2a4b1cc310 R12: 0000000000080005
>>> [ 1255.055268] R13: ffff9f2a4b1cc310 R14: ffff9f4e369161e0 R15: ffff9f2a1b2f9080
>>> [ 1255.055674] FS:  0000000000000000(0000) GS:ffff9f4e3f740000(0000) knlGS:0000000000000000
>>> [ 1255.056087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 1255.056501] CR2: 000000000000051e CR3: 00000002df00a004 CR4: 00000000007606e0
>>> [ 1255.056923] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 1255.057345] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [ 1255.057763] PKRU: 55555554
>>> [ 1255.058179] Call Trace:
>>> [ 1255.058603]  drm_gem_dmabuf_release+0x1a/0x30 [drm]
>>> [ 1255.059025]  dma_buf_release+0x56/0x130
>>> [ 1255.059443]  __fput+0xc6/0x260
>>> [ 1255.059856]  delayed_fput+0x20/0x30
>>> [ 1255.060272]  process_one_work+0x1fd/0x3f0
>>> [ 1255.060686]  worker_thread+0x34/0x410
>>> [ 1255.061099]  kthread+0x121/0x140
>>> [ 1255.061510]  ? process_one_work+0x3f0/0x3f0
>>> [ 1255.061923]  ? kthread_park+0xb0/0xb0
>>> [ 1255.062336]  ret_from_fork+0x35/0x40
>>>
>>> drm_gem_object_put_unlocked calls drm_gem_object_free when the 
>>> obj->refcount reaches 0. From there it calls 
>>> dev->driver->gem_free_object_unlocked, which is 
>>> amdgpu_gem_object_free in amdgpu.
>>>
>>> Regards,
>>>   Felix
>>>
>>> Am 2020-05-01 um 10:29 a.m. schrieb Christian König:
>>>> Am 01.05.20 um 16:21 schrieb Felix Kuehling:
>>>>> From: Felix Kuehling <felix.kuehling@gmail.com>
>>>>>
>>>>> That reference gets dropped when the the dma-buf is freed. Not 
>>>>> incrementing
>>>>> the refcount can lead to use-after-free errors.
>>>>>
>>>>> Signed-off-by: Felix Kuehling <felix.kuehling@gmail.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
>>>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>> index ffeb20f11c07..a0f9b3ef4aad 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>> @@ -398,8 +398,15 @@ struct dma_buf 
>>>>> *amdgpu_gem_prime_export(struct drm_gem_object *gobj,
>>>>>           return ERR_PTR(-EPERM);
>>>>>         buf = drm_gem_prime_export(gobj, flags);
>>>>> -    if (!IS_ERR(buf))
>>>>> +    if (!IS_ERR(buf)) {
>>>>>           buf->ops = &amdgpu_dmabuf_ops;
>>>>> +        /* GEM needs a reference to the underlying object
>>>>> +         * that gets dropped when the dma-buf is released,
>>>>> +         * through the amdgpu_gem_object_free callback
>>>>> +         * from drm_gem_object_put_unlocked.
>>>>> +         */
>>>>> +        amdgpu_bo_ref(bo);
>>>>> +    }
>>>>
>>>> Of hand that doesn't sounds correct to me. Why should the exported 
>>>> bo be closed through amdgpu_gem_object_free()?
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>         return buf;
>>>>>   }
>>>>
>>>
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>


[-- Attachment #1.2: Type: text/html, Size: 9999 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] drm/amdgpu: Take a reference to an exported BO
  2020-05-05 15:19         ` Christian König
@ 2020-05-05 17:29           ` Felix Kuehling
  2020-05-05 20:10             ` Felix Kuehling
  2020-05-06  7:15             ` Christian König
  0 siblings, 2 replies; 9+ messages in thread
From: Felix Kuehling @ 2020-05-05 17:29 UTC (permalink / raw)
  To: Christian König, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 8092 bytes --]

Am 2020-05-05 um 11:19 a.m. schrieb Christian König:
> Am 05.05.20 um 16:58 schrieb Felix Kuehling:
>> Am 2020-05-05 um 3:47 a.m. schrieb Christian König:
>>> Just to reply here once more, this patch is a clear NAK.
>>
>> Agreed. But see below. I don't think all is well.
>>
>>>
>>> The references are grabbed in the call path of
>>> drm_gem_prime_export() and dropped again in drm_gem_dmabuf_release().
>>>
>>> So they are perfectly balanced as far as I can see.
>>
>> That is true for the GEM object references. But I believe there is
>> still a problem with the TTM BO references.
>>
>> As far as I can tell amdgpu_bo_unref can free the TTM BO while there
>> are still references to the GEM object from DMA buf exports. I think
>> that's a fundamental problem with how we have two reference counts
>> for the same physical object (TTM BO and the embedded GEM BO).
>>
>
> Completely agree, I also mentioned that problem during my talk on
> FOSDEM. But calling amdgpu_bo_unref() to often is a bug in itself.

That's not the problem here.


>
> What we could probably do to detect this is adding a BUG_ON() in TTMs
> release function and checking if the GEM reference count is really dead.

The problem is, that we have to guess whether there are still any dmabuf
references to the GEM BO. There is no way amdgpu can know that. You
can't make amdgpu responsible for keeping a reference to the TTM BO
while the GEM BO is still referenced by entities completely out of the
control of amdgpu.

Another weird thing I see is that amdgpu_gem_free_object calls
amdgpu_bo_unref. That implies that the GEM object conceptually holds a
reference to the amdpu/TTM BO. But that is not really the case. Amdgpu
never takes that reference that GEM is supposed to own. If it did, we
would leak all our memory because nobody would ever drop that reference.


>
>> I think the correct solution is for amdgpu_bo_ref/unref to delegate
>> its reference counting to drm_gem_object_get/put instead of
>> ttm_bo_get/put. The amdgpu BO would hold one token reference to the
>> TTM BO, which it can drop when the GEM BO refcount drops to 0.
>> Finally, the amdgpu BO should only be freed once the TTM BO refcount
>> also becomes 0.
>
> Just the other way around, but yes the long term plan should probably
> be to merge the two.

I need a short term solution. Because I have a bug that causes a kernel
oops with applications that are valid and correct, as far as I can tell.

I'm thinking a solution that doesn't require major changes to the way
TTM and GEM interact would put amdgpu in charge of coordinating the two.
Unfortunately that would mean adding a third reference count in
amdgpu_bo, in addition to the ones in TTM and GEM. The amdgpu BO would
hold one token reference to each of the GEM and TTM BO. When amdgpu
refcount goes to 0 it releases that GEM BO token reference. When the GEM
BO refcount goes  to 0, we get a callback amdgpu_gem_object_free. There
we can drop the token reference to the TTM BO. Once the TTM BO reference
goes to 0 we free the memory.

Does this sound feasible?

Regards,
  Felix


>
> The difficult is currently we have a mismatch what locks could be
> taken when we drop the references.
>
> Regards,
> Christian.
>
>> Regards,
>>   Felix
>>
>>
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 01.05.20 um 16:44 schrieb Felix Kuehling:
>>>>
>>>> [dropping my gmail address]
>>>>
>>>> We saw this backtrace showing the call chain while investigating a
>>>> kernel oops caused by this issue on the DKMS branch with the KFD
>>>> IPC API. It happens after a dma-buf file is released with fput:
>>>>
>>>> [ 1255.049330] BUG: kernel NULL pointer dereference, address: 000000000000051e
>>>> [ 1255.049727] #PF: supervisor read access in kernel mode
>>>> [ 1255.050092] #PF: error_code(0x0000) - not-present page
>>>> [ 1255.050416] PGD 0 P4D 0
>>>> [ 1255.050736] Oops: 0000 [#1] SMP PTI
>>>> [ 1255.051060] CPU: 27 PID: 2292 Comm: kworker/27:2 Tainted: G           OE     5.3.0-46-generic #38~18.04.1-Ubuntu
>>>> [ 1255.051400] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 3.0a 02/26/2019
>>>> [ 1255.051752] Workqueue: events delayed_fput
>>>> [ 1255.052111] RIP: 0010:drm_gem_object_put_unlocked+0x1c/0x70 [drm]
>>>> [ 1255.052465] Code: 4d 80 c8 ee 0f 0b eb d8 66 0f 1f 44 00 00 0f 1f 44 00 00 48 85 ff 74 34 55 48 89 e5 41 54 53 48 89 fb 48 8b 7f 08 48 8b 47 20 <48> 83 b8 a0 00 00 00 00 74 1a 4c 8d 67 68 48 89 df 4c 89 e6 e8 9b
>>>> [ 1255.053224] RSP: 0018:ffffb4b62035fdc8 EFLAGS: 00010286
>>>> [ 1255.053613] RAX: 000000000000047e RBX: ffff9f2add197850 RCX: 0000000000000000
>>>> [ 1255.054032] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f2aa2548aa0
>>>> [ 1255.054440] RBP: ffffb4b62035fdd8 R08: 0000000000000000 R09: 0000000000000000
>>>> [ 1255.054860] R10: 0000000000000010 R11: ffff9f2a4b1cc310 R12: 0000000000080005
>>>> [ 1255.055268] R13: ffff9f2a4b1cc310 R14: ffff9f4e369161e0 R15: ffff9f2a1b2f9080
>>>> [ 1255.055674] FS:  0000000000000000(0000) GS:ffff9f4e3f740000(0000) knlGS:0000000000000000
>>>> [ 1255.056087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 1255.056501] CR2: 000000000000051e CR3: 00000002df00a004 CR4: 00000000007606e0
>>>> [ 1255.056923] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [ 1255.057345] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> [ 1255.057763] PKRU: 55555554
>>>> [ 1255.058179] Call Trace:
>>>> [ 1255.058603]  drm_gem_dmabuf_release+0x1a/0x30 [drm]
>>>> [ 1255.059025]  dma_buf_release+0x56/0x130
>>>> [ 1255.059443]  __fput+0xc6/0x260
>>>> [ 1255.059856]  delayed_fput+0x20/0x30
>>>> [ 1255.060272]  process_one_work+0x1fd/0x3f0
>>>> [ 1255.060686]  worker_thread+0x34/0x410
>>>> [ 1255.061099]  kthread+0x121/0x140
>>>> [ 1255.061510]  ? process_one_work+0x3f0/0x3f0
>>>> [ 1255.061923]  ? kthread_park+0xb0/0xb0
>>>> [ 1255.062336]  ret_from_fork+0x35/0x40
>>>>
>>>> drm_gem_object_put_unlocked calls drm_gem_object_free when the
>>>> obj->refcount reaches 0. From there it calls
>>>> dev->driver->gem_free_object_unlocked, which is
>>>> amdgpu_gem_object_free in amdgpu.
>>>>
>>>> Regards,
>>>>   Felix
>>>>
>>>> Am 2020-05-01 um 10:29 a.m. schrieb Christian König:
>>>>> Am 01.05.20 um 16:21 schrieb Felix Kuehling:
>>>>>> From: Felix Kuehling <felix.kuehling@gmail.com>
>>>>>>
>>>>>> That reference gets dropped when the the dma-buf is freed. Not
>>>>>> incrementing
>>>>>> the refcount can lead to use-after-free errors.
>>>>>>
>>>>>> Signed-off-by: Felix Kuehling <felix.kuehling@gmail.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
>>>>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>> index ffeb20f11c07..a0f9b3ef4aad 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>> @@ -398,8 +398,15 @@ struct dma_buf
>>>>>> *amdgpu_gem_prime_export(struct drm_gem_object *gobj,
>>>>>>           return ERR_PTR(-EPERM);
>>>>>>         buf = drm_gem_prime_export(gobj, flags);
>>>>>> -    if (!IS_ERR(buf))
>>>>>> +    if (!IS_ERR(buf)) {
>>>>>>           buf->ops = &amdgpu_dmabuf_ops;
>>>>>> +        /* GEM needs a reference to the underlying object
>>>>>> +         * that gets dropped when the dma-buf is released,
>>>>>> +         * through the amdgpu_gem_object_free callback
>>>>>> +         * from drm_gem_object_put_unlocked.
>>>>>> +         */
>>>>>> +        amdgpu_bo_ref(bo);
>>>>>> +    }
>>>>>
>>>>> Of hand that doesn't sounds correct to me. Why should the exported
>>>>> bo be closed through amdgpu_gem_object_free()?
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>>         return buf;
>>>>>>   }
>>>>>
>>>>
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>
>

[-- Attachment #1.2: Type: text/html, Size: 12479 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] drm/amdgpu: Take a reference to an exported BO
  2020-05-05 17:29           ` Felix Kuehling
@ 2020-05-05 20:10             ` Felix Kuehling
  2020-05-06  7:15             ` Christian König
  1 sibling, 0 replies; 9+ messages in thread
From: Felix Kuehling @ 2020-05-05 20:10 UTC (permalink / raw)
  To: Christian König, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 8864 bytes --]

Am 2020-05-05 um 1:29 p.m. schrieb Felix Kuehling:
> Am 2020-05-05 um 11:19 a.m. schrieb Christian König:
>> Am 05.05.20 um 16:58 schrieb Felix Kuehling:
>>> Am 2020-05-05 um 3:47 a.m. schrieb Christian König:
>>>> Just to reply here once more, this patch is a clear NAK.
>>>
>>> Agreed. But see below. I don't think all is well.
>>>
>>>>
>>>> The references are grabbed in the call path of
>>>> drm_gem_prime_export() and dropped again in drm_gem_dmabuf_release().
>>>>
>>>> So they are perfectly balanced as far as I can see.
>>>
>>> That is true for the GEM object references. But I believe there is
>>> still a problem with the TTM BO references.
>>>
>>> As far as I can tell amdgpu_bo_unref can free the TTM BO while there
>>> are still references to the GEM object from DMA buf exports. I think
>>> that's a fundamental problem with how we have two reference counts
>>> for the same physical object (TTM BO and the embedded GEM BO).
>>>
>>
>> Completely agree, I also mentioned that problem during my talk on
>> FOSDEM. But calling amdgpu_bo_unref() to often is a bug in itself.
>
> That's not the problem here.
>
I think we may have a fundamental issue with how we release memory in
KFD using amdgpu_bo_unref. That worked OK before the GEM object was
embedded in the TTM BO, and it continued working for memory that's not
shared via DMABufs.

Alejandro is testing a patch that changes KFD to release its BOs with
drm_gem_object_put_unlocked instead of amdgpu_bo_unref.

Ignore my musings below. I think they were based on incorrect
assumptions that it's OK to release GEM BOs with amdgpu_bo_unref.

Regards,
  Felix


>
>>
>> What we could probably do to detect this is adding a BUG_ON() in TTMs
>> release function and checking if the GEM reference count is really dead.
>
> The problem is, that we have to guess whether there are still any
> dmabuf references to the GEM BO. There is no way amdgpu can know that.
> You can't make amdgpu responsible for keeping a reference to the TTM
> BO while the GEM BO is still referenced by entities completely out of
> the control of amdgpu.
>
> Another weird thing I see is that amdgpu_gem_free_object calls
> amdgpu_bo_unref. That implies that the GEM object conceptually holds a
> reference to the amdpu/TTM BO. But that is not really the case. Amdgpu
> never takes that reference that GEM is supposed to own. If it did, we
> would leak all our memory because nobody would ever drop that reference.
>
>>
>>> I think the correct solution is for amdgpu_bo_ref/unref to delegate
>>> its reference counting to drm_gem_object_get/put instead of
>>> ttm_bo_get/put. The amdgpu BO would hold one token reference to the
>>> TTM BO, which it can drop when the GEM BO refcount drops to 0.
>>> Finally, the amdgpu BO should only be freed once the TTM BO refcount
>>> also becomes 0.
>>
>> Just the other way around, but yes the long term plan should probably
>> be to merge the two.
>
> I need a short term solution. Because I have a bug that causes a
> kernel oops with applications that are valid and correct, as far as I
> can tell.
>
> I'm thinking a solution that doesn't require major changes to the way
> TTM and GEM interact would put amdgpu in charge of coordinating the
> two. Unfortunately that would mean adding a third reference count in
> amdgpu_bo, in addition to the ones in TTM and GEM. The amdgpu BO would
> hold one token reference to each of the GEM and TTM BO. When amdgpu
> refcount goes to 0 it releases that GEM BO token reference. When the
> GEM BO refcount goes  to 0, we get a callback amdgpu_gem_object_free.
> There we can drop the token reference to the TTM BO. Once the TTM BO
> reference goes to 0 we free the memory.
>
> Does this sound feasible?
>
> Regards,
>   Felix
>
>
>>
>> The difficult is currently we have a mismatch what locks could be
>> taken when we drop the references.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>   Felix
>>>
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 01.05.20 um 16:44 schrieb Felix Kuehling:
>>>>>
>>>>> [dropping my gmail address]
>>>>>
>>>>> We saw this backtrace showing the call chain while investigating a
>>>>> kernel oops caused by this issue on the DKMS branch with the KFD
>>>>> IPC API. It happens after a dma-buf file is released with fput:
>>>>>
>>>>> [ 1255.049330] BUG: kernel NULL pointer dereference, address: 000000000000051e
>>>>> [ 1255.049727] #PF: supervisor read access in kernel mode
>>>>> [ 1255.050092] #PF: error_code(0x0000) - not-present page
>>>>> [ 1255.050416] PGD 0 P4D 0
>>>>> [ 1255.050736] Oops: 0000 [#1] SMP PTI
>>>>> [ 1255.051060] CPU: 27 PID: 2292 Comm: kworker/27:2 Tainted: G           OE     5.3.0-46-generic #38~18.04.1-Ubuntu
>>>>> [ 1255.051400] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 3.0a 02/26/2019
>>>>> [ 1255.051752] Workqueue: events delayed_fput
>>>>> [ 1255.052111] RIP: 0010:drm_gem_object_put_unlocked+0x1c/0x70 [drm]
>>>>> [ 1255.052465] Code: 4d 80 c8 ee 0f 0b eb d8 66 0f 1f 44 00 00 0f 1f 44 00 00 48 85 ff 74 34 55 48 89 e5 41 54 53 48 89 fb 48 8b 7f 08 48 8b 47 20 <48> 83 b8 a0 00 00 00 00 74 1a 4c 8d 67 68 48 89 df 4c 89 e6 e8 9b
>>>>> [ 1255.053224] RSP: 0018:ffffb4b62035fdc8 EFLAGS: 00010286
>>>>> [ 1255.053613] RAX: 000000000000047e RBX: ffff9f2add197850 RCX: 0000000000000000
>>>>> [ 1255.054032] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f2aa2548aa0
>>>>> [ 1255.054440] RBP: ffffb4b62035fdd8 R08: 0000000000000000 R09: 0000000000000000
>>>>> [ 1255.054860] R10: 0000000000000010 R11: ffff9f2a4b1cc310 R12: 0000000000080005
>>>>> [ 1255.055268] R13: ffff9f2a4b1cc310 R14: ffff9f4e369161e0 R15: ffff9f2a1b2f9080
>>>>> [ 1255.055674] FS:  0000000000000000(0000) GS:ffff9f4e3f740000(0000) knlGS:0000000000000000
>>>>> [ 1255.056087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ 1255.056501] CR2: 000000000000051e CR3: 00000002df00a004 CR4: 00000000007606e0
>>>>> [ 1255.056923] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>> [ 1255.057345] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>> [ 1255.057763] PKRU: 55555554
>>>>> [ 1255.058179] Call Trace:
>>>>> [ 1255.058603]  drm_gem_dmabuf_release+0x1a/0x30 [drm]
>>>>> [ 1255.059025]  dma_buf_release+0x56/0x130
>>>>> [ 1255.059443]  __fput+0xc6/0x260
>>>>> [ 1255.059856]  delayed_fput+0x20/0x30
>>>>> [ 1255.060272]  process_one_work+0x1fd/0x3f0
>>>>> [ 1255.060686]  worker_thread+0x34/0x410
>>>>> [ 1255.061099]  kthread+0x121/0x140
>>>>> [ 1255.061510]  ? process_one_work+0x3f0/0x3f0
>>>>> [ 1255.061923]  ? kthread_park+0xb0/0xb0
>>>>> [ 1255.062336]  ret_from_fork+0x35/0x40
>>>>>
>>>>> drm_gem_object_put_unlocked calls drm_gem_object_free when the
>>>>> obj->refcount reaches 0. From there it calls
>>>>> dev->driver->gem_free_object_unlocked, which is
>>>>> amdgpu_gem_object_free in amdgpu.
>>>>>
>>>>> Regards,
>>>>>   Felix
>>>>>
>>>>> Am 2020-05-01 um 10:29 a.m. schrieb Christian König:
>>>>>> Am 01.05.20 um 16:21 schrieb Felix Kuehling:
>>>>>>> From: Felix Kuehling <felix.kuehling@gmail.com>
>>>>>>>
>>>>>>> That reference gets dropped when the the dma-buf is freed. Not
>>>>>>> incrementing
>>>>>>> the refcount can lead to use-after-free errors.
>>>>>>>
>>>>>>> Signed-off-by: Felix Kuehling <felix.kuehling@gmail.com>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
>>>>>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>>> index ffeb20f11c07..a0f9b3ef4aad 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>>> @@ -398,8 +398,15 @@ struct dma_buf
>>>>>>> *amdgpu_gem_prime_export(struct drm_gem_object *gobj,
>>>>>>>           return ERR_PTR(-EPERM);
>>>>>>>         buf = drm_gem_prime_export(gobj, flags);
>>>>>>> -    if (!IS_ERR(buf))
>>>>>>> +    if (!IS_ERR(buf)) {
>>>>>>>           buf->ops = &amdgpu_dmabuf_ops;
>>>>>>> +        /* GEM needs a reference to the underlying object
>>>>>>> +         * that gets dropped when the dma-buf is released,
>>>>>>> +         * through the amdgpu_gem_object_free callback
>>>>>>> +         * from drm_gem_object_put_unlocked.
>>>>>>> +         */
>>>>>>> +        amdgpu_bo_ref(bo);
>>>>>>> +    }
>>>>>>
>>>>>> Of hand that doesn't sounds correct to me. Why should the
>>>>>> exported bo be closed through amdgpu_gem_object_free()?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>         return buf;
>>>>>>>   }
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>

[-- Attachment #1.2: Type: text/html, Size: 13938 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/1] drm/amdgpu: Take a reference to an exported BO
  2020-05-05 17:29           ` Felix Kuehling
  2020-05-05 20:10             ` Felix Kuehling
@ 2020-05-06  7:15             ` Christian König
  1 sibling, 0 replies; 9+ messages in thread
From: Christian König @ 2020-05-06  7:15 UTC (permalink / raw)
  To: Felix Kuehling, amd-gfx


[-- Attachment #1.1: Type: text/plain, Size: 8956 bytes --]

Am 05.05.20 um 19:29 schrieb Felix Kuehling:
>>
>> What we could probably do to detect this is adding a BUG_ON() in TTMs 
>> release function and checking if the GEM reference count is really dead.
>
> The problem is, that we have to guess whether there are still any 
> dmabuf references to the GEM BO. There is no way amdgpu can know that. 
> You can't make amdgpu responsible for keeping a reference to the TTM 
> BO while the GEM BO is still referenced by entities completely out of 
> the control of amdgpu.
>

That sounds like you don't understand how this works on the graphics 
side, so here is a brief overview once more.

The last object still around is always the TTM BO, which can even be 
resurrected from the dead (kref_init() called again) if necessary for 
delayed delete.

Then we have the GEM object which holds a reference to the TTM BO. This 
reference gets dropped when the GEM object gets destroyed.

Then we optionally have the DMA-buf object for exported BOs which holds 
a reference to the GEM object and the driver.


This construct guarantees that the GEM object is never destroyed nor the 
driver unloaded before the DMA-buf object referencing it is destroyed.

Additional to that the reference from the GEM object to the TTM BO 
guarantees that the TTM BO is never destroyed before the GEM object is 
destroyed.

> Another weird thing I see is that amdgpu_gem_free_object calls 
> amdgpu_bo_unref. That implies that the GEM object conceptually holds a 
> reference to the amdpu/TTM BO. But that is not really the case. Amdgpu 
> never takes that reference that GEM is supposed to own. If it did, we 
> would leak all our memory because nobody would ever drop that reference. 

See amdgpu_bo_do_create(), the GEM object is initialized with 
drm_gem_private_object_init() with a reference count of 1. Then the TTM 
BO is initialized with ttm_bo_init_reserved() with a reference count of 1.

In general there are two use cases here, the first one is userspace 
allocations with a GEM object handle. In this case the initial reference 
is owned by the GEM object and dropped in amdgpu_gem_free_object().

The other use case are kernel internal allocations like page tables and 
other general buffers. In this case the GEM object is ignored and the 
last TTM BO reference is dropped by calling amdgpu_bo_unref().

>>> I think the correct solution is for amdgpu_bo_ref/unref to delegate 
>>> its reference counting to drm_gem_object_get/put instead of 
>>> ttm_bo_get/put. The amdgpu BO would hold one token reference to the 
>>> TTM BO, which it can drop when the GEM BO refcount drops to 0. 
>>> Finally, the amdgpu BO should only be freed once the TTM BO refcount 
>>> also becomes 0.
>>
>> Just the other way around, but yes the long term plan should probably 
>> be to merge the two.
>
> I need a short term solution. Because I have a bug that causes a 
> kernel oops with applications that are valid and correct, as far as I 
> can tell.
>
> I'm thinking a solution that doesn't require major changes to the way 
> TTM and GEM interact would put amdgpu in charge of coordinating the 
> two. Unfortunately that would mean adding a third reference count in 
> amdgpu_bo, in addition to the ones in TTM and GEM. The amdgpu BO would 
> hold one token reference to each of the GEM and TTM BO. When amdgpu 
> refcount goes to 0 it releases that GEM BO token reference. When the 
> GEM BO refcount goes  to 0, we get a callback amdgpu_gem_object_free. 
> There we can drop the token reference to the TTM BO. Once the TTM BO 
> reference goes to 0 we free the memory.
>
> Does this sound feasible?
>

Well of hand it sounds like it makes the whole thing much more 
complicated without any gain. I probably still haven't understood what 
the core problem here is.

Regards,
Christian.

> Regards,
>   Felix
>
>
>>
>> The difficult is currently we have a mismatch what locks could be 
>> taken when we drop the references.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>   Felix
>>>
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 01.05.20 um 16:44 schrieb Felix Kuehling:
>>>>>
>>>>> [dropping my gmail address]
>>>>>
>>>>> We saw this backtrace showing the call chain while investigating a 
>>>>> kernel oops caused by this issue on the DKMS branch with the KFD 
>>>>> IPC API. It happens after a dma-buf file is released with fput:
>>>>>
>>>>> [ 1255.049330] BUG: kernel NULL pointer dereference, address: 000000000000051e
>>>>> [ 1255.049727] #PF: supervisor read access in kernel mode
>>>>> [ 1255.050092] #PF: error_code(0x0000) - not-present page
>>>>> [ 1255.050416] PGD 0 P4D 0
>>>>> [ 1255.050736] Oops: 0000 [#1] SMP PTI
>>>>> [ 1255.051060] CPU: 27 PID: 2292 Comm: kworker/27:2 Tainted: G           OE     5.3.0-46-generic #38~18.04.1-Ubuntu
>>>>> [ 1255.051400] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, BIOS 3.0a 02/26/2019
>>>>> [ 1255.051752] Workqueue: events delayed_fput
>>>>> [ 1255.052111] RIP: 0010:drm_gem_object_put_unlocked+0x1c/0x70 [drm]
>>>>> [ 1255.052465] Code: 4d 80 c8 ee 0f 0b eb d8 66 0f 1f 44 00 00 0f 1f 44 00 00 48 85 ff 74 34 55 48 89 e5 41 54 53 48 89 fb 48 8b 7f 08 48 8b 47 20 <48> 83 b8 a0 00 00 00 00 74 1a 4c 8d 67 68 48 89 df 4c 89 e6 e8 9b
>>>>> [ 1255.053224] RSP: 0018:ffffb4b62035fdc8 EFLAGS: 00010286
>>>>> [ 1255.053613] RAX: 000000000000047e RBX: ffff9f2add197850 RCX: 0000000000000000
>>>>> [ 1255.054032] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9f2aa2548aa0
>>>>> [ 1255.054440] RBP: ffffb4b62035fdd8 R08: 0000000000000000 R09: 0000000000000000
>>>>> [ 1255.054860] R10: 0000000000000010 R11: ffff9f2a4b1cc310 R12: 0000000000080005
>>>>> [ 1255.055268] R13: ffff9f2a4b1cc310 R14: ffff9f4e369161e0 R15: ffff9f2a1b2f9080
>>>>> [ 1255.055674] FS:  0000000000000000(0000) GS:ffff9f4e3f740000(0000) knlGS:0000000000000000
>>>>> [ 1255.056087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ 1255.056501] CR2: 000000000000051e CR3: 00000002df00a004 CR4: 00000000007606e0
>>>>> [ 1255.056923] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>> [ 1255.057345] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>> [ 1255.057763] PKRU: 55555554
>>>>> [ 1255.058179] Call Trace:
>>>>> [ 1255.058603]  drm_gem_dmabuf_release+0x1a/0x30 [drm]
>>>>> [ 1255.059025]  dma_buf_release+0x56/0x130
>>>>> [ 1255.059443]  __fput+0xc6/0x260
>>>>> [ 1255.059856]  delayed_fput+0x20/0x30
>>>>> [ 1255.060272]  process_one_work+0x1fd/0x3f0
>>>>> [ 1255.060686]  worker_thread+0x34/0x410
>>>>> [ 1255.061099]  kthread+0x121/0x140
>>>>> [ 1255.061510]  ? process_one_work+0x3f0/0x3f0
>>>>> [ 1255.061923]  ? kthread_park+0xb0/0xb0
>>>>> [ 1255.062336]  ret_from_fork+0x35/0x40
>>>>>
>>>>> drm_gem_object_put_unlocked calls drm_gem_object_free when the 
>>>>> obj->refcount reaches 0. From there it calls 
>>>>> dev->driver->gem_free_object_unlocked, which is 
>>>>> amdgpu_gem_object_free in amdgpu.
>>>>>
>>>>> Regards,
>>>>>   Felix
>>>>>
>>>>> Am 2020-05-01 um 10:29 a.m. schrieb Christian König:
>>>>>> Am 01.05.20 um 16:21 schrieb Felix Kuehling:
>>>>>>> From: Felix Kuehling <felix.kuehling@gmail.com>
>>>>>>>
>>>>>>> That reference gets dropped when the the dma-buf is freed. Not 
>>>>>>> incrementing
>>>>>>> the refcount can lead to use-after-free errors.
>>>>>>>
>>>>>>> Signed-off-by: Felix Kuehling <felix.kuehling@gmail.com>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
>>>>>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>>> index ffeb20f11c07..a0f9b3ef4aad 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>>> @@ -398,8 +398,15 @@ struct dma_buf 
>>>>>>> *amdgpu_gem_prime_export(struct drm_gem_object *gobj,
>>>>>>>           return ERR_PTR(-EPERM);
>>>>>>>         buf = drm_gem_prime_export(gobj, flags);
>>>>>>> -    if (!IS_ERR(buf))
>>>>>>> +    if (!IS_ERR(buf)) {
>>>>>>>           buf->ops = &amdgpu_dmabuf_ops;
>>>>>>> +        /* GEM needs a reference to the underlying object
>>>>>>> +         * that gets dropped when the dma-buf is released,
>>>>>>> +         * through the amdgpu_gem_object_free callback
>>>>>>> +         * from drm_gem_object_put_unlocked.
>>>>>>> +         */
>>>>>>> +        amdgpu_bo_ref(bo);
>>>>>>> +    }
>>>>>>
>>>>>> Of hand that doesn't sounds correct to me. Why should the 
>>>>>> exported bo be closed through amdgpu_gem_object_free()?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>         return buf;
>>>>>>>   }
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> amd-gfx@lists.freedesktop.org
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>


[-- Attachment #1.2: Type: text/html, Size: 13372 bytes --]

[-- Attachment #2: Type: text/plain, Size: 154 bytes --]

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-05-06  7:15 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-01 14:21 [PATCH 1/1] drm/amdgpu: Take a reference to an exported BO Felix Kuehling
2020-05-01 14:29 ` Christian König
2020-05-01 14:44   ` Felix Kuehling
2020-05-05  7:47     ` Christian König
2020-05-05 14:58       ` Felix Kuehling
2020-05-05 15:19         ` Christian König
2020-05-05 17:29           ` Felix Kuehling
2020-05-05 20:10             ` Felix Kuehling
2020-05-06  7:15             ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.