* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
@ 2021-03-22 10:34 ` Christian König
0 siblings, 0 replies; 31+ messages in thread
From: Christian König @ 2021-03-22 10:34 UTC (permalink / raw)
To: Daniel Gomez, Felix Kuehling
Cc: David Airlie, linux-kernel, amd-gfx, Alex Deucher, dri-devel,
Deucher, Alexander, Koenig, Christian, dagmcr
Hi Daniel,
Am 22.03.21 um 10:38 schrieb Daniel Gomez:
> On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
>> This caused a regression in kfdtest in a large-buffer stress test after
>> memory allocation for user pages fails:
> I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
> not this one.
> Just some background for the mem leak patch if helps to understand this:
> The leak was introduce here:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb
> where the bound status was introduced for all drm drivers including
> radeon and amdgpu. So this patch just reverts the logic to the
> original code but keeping the bound status. In my case, the binding
> code allocates the user pages memory and returns without bounding (at
> amdgpu_gtt_mgr_has_gart_addr). So,
> when the unbinding happens, the memory needs to be cleared to prevent the leak.
Ah, now I understand what's happening here. Daniel your patch is not
really correct.
The problem is rather that we don't set the tt object to bound if it
doesn't have a GTT address.
Going to provide a patch for this.
Regards,
Christian.
>
>> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
>> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
>> [17359.551494] #PF: supervisor read access in kernel mode
>> [17359.557375] #PF: error_code(0x0000) - not-present page
>> [17359.563247] PGD 0 P4D 0
>> [17359.566514] Oops: 0000 [#1] SMP PTI
>> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
>> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
>> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
>> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
>> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
>> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
>> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
>> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
>> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
>> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
>> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
>> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
>> [17359.682883] Call Trace:
>> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
>> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
>> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
>> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
>> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
>> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
>> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
>> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
>> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
>> [17359.738796] do_syscall_64+0x2d/0x40
>> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [17359.749205] RIP: 0033:0x7febb083b6d7
>> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
>> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
>> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
>> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
>> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
>> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
>> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
>> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
>> [17359.837776] CR2: 0000000000000000
>> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
>> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
>> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
>> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
>> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
>> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
>> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
>> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
>> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
>> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
>> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> From what I understand, the init_user_pages fails (returns EBUSY) and
> the code goes to allocate_init_user_pages_failed where the unbind and
> the userptr clear occurs.
> Can we prevent this if we save the bounding status + userptr alloc? so
> the function amdgpu_ttm_backend_unbind returns without trying to clear
> the userptr memory?
>
> Something like:
>
> amdgpu_ttm_backend_bind:
> if (gtt->userptr) {
> r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
> if (r) ...
> gtt->sg_table = true;
> }
>
> amdgpu_ttm_backend_unbind:
> if (gtt->sg_table) {
> if (gtt->user_ptr) ...
> }
>
> If you agree, I'll send a v2 patch. Otherwise, maybe we could return
> within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
>
> Any other ideas?
>
> Regards,
> Daniel
>
>> Reverting this patch fixes the problem for me.
>>
>> Regards,
>> Felix
>>
>> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
>>> Applied. Thanks!
>>>
>>> Alex
>>>
>>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
>>> <Christian.Koenig@amd.com> wrote:
>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>> ________________________________
>>>> Von: Daniel Gomez <daniel@qtec.com>
>>>> Gesendet: Donnerstag, 18. März 2021 09:32
>>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
>>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
>>>>
>>>> If userptr pages have been pinned but not bounded,
>>>> they remain uncleared.
>>>>
>>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
>>>> ---
>>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
>>>> index e8c66d10478f..bbcc6264d48f 100644
>>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
>>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
>>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
>>>> struct radeon_ttm_tt *gtt = (void *)ttm;
>>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
>>>>
>>>> + if (gtt->userptr)
>>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
>>>> +
>>>> if (!gtt->bound)
>>>> return;
>>>>
>>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
>>>>
>>>> - if (gtt->userptr)
>>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
>>>> gtt->bound = false;
>>>> }
>>>>
>>>> --
>>>> 2.30.2
>>>>
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
@ 2021-03-22 10:34 ` Christian König
0 siblings, 0 replies; 31+ messages in thread
From: Christian König @ 2021-03-22 10:34 UTC (permalink / raw)
To: Daniel Gomez, Felix Kuehling
Cc: David Airlie, linux-kernel, amd-gfx, dri-devel, Deucher,
Alexander, Koenig, Christian, dagmcr
Hi Daniel,
Am 22.03.21 um 10:38 schrieb Daniel Gomez:
> On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
>> This caused a regression in kfdtest in a large-buffer stress test after
>> memory allocation for user pages fails:
> I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
> not this one.
> Just some background for the mem leak patch if helps to understand this:
> The leak was introduce here:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb
> where the bound status was introduced for all drm drivers including
> radeon and amdgpu. So this patch just reverts the logic to the
> original code but keeping the bound status. In my case, the binding
> code allocates the user pages memory and returns without bounding (at
> amdgpu_gtt_mgr_has_gart_addr). So,
> when the unbinding happens, the memory needs to be cleared to prevent the leak.
Ah, now I understand what's happening here. Daniel your patch is not
really correct.
The problem is rather that we don't set the tt object to bound if it
doesn't have a GTT address.
Going to provide a patch for this.
Regards,
Christian.
>
>> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
>> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
>> [17359.551494] #PF: supervisor read access in kernel mode
>> [17359.557375] #PF: error_code(0x0000) - not-present page
>> [17359.563247] PGD 0 P4D 0
>> [17359.566514] Oops: 0000 [#1] SMP PTI
>> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
>> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
>> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
>> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
>> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
>> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
>> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
>> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
>> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
>> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
>> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
>> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
>> [17359.682883] Call Trace:
>> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
>> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
>> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
>> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
>> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
>> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
>> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
>> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
>> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
>> [17359.738796] do_syscall_64+0x2d/0x40
>> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [17359.749205] RIP: 0033:0x7febb083b6d7
>> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
>> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
>> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
>> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
>> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
>> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
>> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
>> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
>> [17359.837776] CR2: 0000000000000000
>> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
>> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
>> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
>> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
>> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
>> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
>> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
>> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
>> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
>> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
>> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> From what I understand, the init_user_pages fails (returns EBUSY) and
> the code goes to allocate_init_user_pages_failed where the unbind and
> the userptr clear occurs.
> Can we prevent this if we save the bounding status + userptr alloc? so
> the function amdgpu_ttm_backend_unbind returns without trying to clear
> the userptr memory?
>
> Something like:
>
> amdgpu_ttm_backend_bind:
> if (gtt->userptr) {
> r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
> if (r) ...
> gtt->sg_table = true;
> }
>
> amdgpu_ttm_backend_unbind:
> if (gtt->sg_table) {
> if (gtt->user_ptr) ...
> }
>
> If you agree, I'll send a v2 patch. Otherwise, maybe we could return
> within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
>
> Any other ideas?
>
> Regards,
> Daniel
>
>> Reverting this patch fixes the problem for me.
>>
>> Regards,
>> Felix
>>
>> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
>>> Applied. Thanks!
>>>
>>> Alex
>>>
>>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
>>> <Christian.Koenig@amd.com> wrote:
>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>> ________________________________
>>>> Von: Daniel Gomez <daniel@qtec.com>
>>>> Gesendet: Donnerstag, 18. März 2021 09:32
>>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
>>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
>>>>
>>>> If userptr pages have been pinned but not bounded,
>>>> they remain uncleared.
>>>>
>>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
>>>> ---
>>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
>>>> index e8c66d10478f..bbcc6264d48f 100644
>>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
>>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
>>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
>>>> struct radeon_ttm_tt *gtt = (void *)ttm;
>>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
>>>>
>>>> + if (gtt->userptr)
>>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
>>>> +
>>>> if (!gtt->bound)
>>>> return;
>>>>
>>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
>>>>
>>>> - if (gtt->userptr)
>>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
>>>> gtt->bound = false;
>>>> }
>>>>
>>>> --
>>>> 2.30.2
>>>>
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel@lists.freedesktop.org
>>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
2021-03-22 10:34 ` Christian König
(?)
@ 2021-03-22 10:49 ` Daniel Gomez
-1 siblings, 0 replies; 31+ messages in thread
From: Daniel Gomez @ 2021-03-22 10:49 UTC (permalink / raw)
To: Christian König
Cc: Felix Kuehling, David Airlie, linux-kernel, dri-devel, Deucher,
Alexander, amd-gfx, Alex Deucher, Koenig, Christian, dagmcr
On Mon, 22 Mar 2021 at 11:34, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Hi Daniel,
>
> Am 22.03.21 um 10:38 schrieb Daniel Gomez:
> > On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
> >> This caused a regression in kfdtest in a large-buffer stress test after
> >> memory allocation for user pages fails:
> > I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
> > not this one.
> > Just some background for the mem leak patch if helps to understand this:
> > The leak was introduce here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb
> > where the bound status was introduced for all drm drivers including
> > radeon and amdgpu. So this patch just reverts the logic to the
> > original code but keeping the bound status. In my case, the binding
> > code allocates the user pages memory and returns without bounding (at
> > amdgpu_gtt_mgr_has_gart_addr). So,
> > when the unbinding happens, the memory needs to be cleared to prevent the leak.
>
> Ah, now I understand what's happening here. Daniel your patch is not
> really correct.
>
> The problem is rather that we don't set the tt object to bound if it
> doesn't have a GTT address.
Okay, I understand.
>
> Going to provide a patch for this.
Looking forward to your patch. Thanks Christian!
>
> Regards,
> Christian.
>
> >
> >> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
> >> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
> >> [17359.551494] #PF: supervisor read access in kernel mode
> >> [17359.557375] #PF: error_code(0x0000) - not-present page
> >> [17359.563247] PGD 0 P4D 0
> >> [17359.566514] Oops: 0000 [#1] SMP PTI
> >> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
> >> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
> >> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> >> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> >> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> >> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> >> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> >> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> >> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> >> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> >> [17359.682883] Call Trace:
> >> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
> >> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
> >> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
> >> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
> >> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
> >> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
> >> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
> >> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
> >> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
> >> [17359.738796] do_syscall_64+0x2d/0x40
> >> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> [17359.749205] RIP: 0033:0x7febb083b6d7
> >> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
> >> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
> >> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
> >> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
> >> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
> >> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
> >> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
> >> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
> >> [17359.837776] CR2: 0000000000000000
> >> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
> >> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> >> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> >> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> >> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> >> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> >> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> >> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> >> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> > From what I understand, the init_user_pages fails (returns EBUSY) and
> > the code goes to allocate_init_user_pages_failed where the unbind and
> > the userptr clear occurs.
> > Can we prevent this if we save the bounding status + userptr alloc? so
> > the function amdgpu_ttm_backend_unbind returns without trying to clear
> > the userptr memory?
> >
> > Something like:
> >
> > amdgpu_ttm_backend_bind:
> > if (gtt->userptr) {
> > r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
> > if (r) ...
> > gtt->sg_table = true;
> > }
> >
> > amdgpu_ttm_backend_unbind:
> > if (gtt->sg_table) {
> > if (gtt->user_ptr) ...
> > }
> >
> > If you agree, I'll send a v2 patch. Otherwise, maybe we could return
> > within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
> >
> > Any other ideas?
> >
> > Regards,
> > Daniel
> >
> >> Reverting this patch fixes the problem for me.
> >>
> >> Regards,
> >> Felix
> >>
> >> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
> >>> Applied. Thanks!
> >>>
> >>> Alex
> >>>
> >>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
> >>> <Christian.Koenig@amd.com> wrote:
> >>>> Reviewed-by: Christian König <christian.koenig@amd.com>
> >>>> ________________________________
> >>>> Von: Daniel Gomez <daniel@qtec.com>
> >>>> Gesendet: Donnerstag, 18. März 2021 09:32
> >>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
> >>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
> >>>>
> >>>> If userptr pages have been pinned but not bounded,
> >>>> they remain uncleared.
> >>>>
> >>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
> >>>> ---
> >>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
> >>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> index e8c66d10478f..bbcc6264d48f 100644
> >>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
> >>>> struct radeon_ttm_tt *gtt = (void *)ttm;
> >>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
> >>>>
> >>>> + if (gtt->userptr)
> >>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>> +
> >>>> if (!gtt->bound)
> >>>> return;
> >>>>
> >>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
> >>>>
> >>>> - if (gtt->userptr)
> >>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>> gtt->bound = false;
> >>>> }
> >>>>
> >>>> --
> >>>> 2.30.2
> >>>>
> >>>> _______________________________________________
> >>>> dri-devel mailing list
> >>>> dri-devel@lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >>> _______________________________________________
> >>> dri-devel mailing list
> >>> dri-devel@lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
@ 2021-03-22 10:49 ` Daniel Gomez
0 siblings, 0 replies; 31+ messages in thread
From: Daniel Gomez @ 2021-03-22 10:49 UTC (permalink / raw)
To: Christian König
Cc: David Airlie, Felix Kuehling, linux-kernel, amd-gfx,
Alex Deucher, dri-devel, Deucher, Alexander, Koenig, Christian,
dagmcr
On Mon, 22 Mar 2021 at 11:34, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Hi Daniel,
>
> Am 22.03.21 um 10:38 schrieb Daniel Gomez:
> > On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
> >> This caused a regression in kfdtest in a large-buffer stress test after
> >> memory allocation for user pages fails:
> > I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
> > not this one.
> > Just some background for the mem leak patch if helps to understand this:
> > The leak was introduce here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb
> > where the bound status was introduced for all drm drivers including
> > radeon and amdgpu. So this patch just reverts the logic to the
> > original code but keeping the bound status. In my case, the binding
> > code allocates the user pages memory and returns without bounding (at
> > amdgpu_gtt_mgr_has_gart_addr). So,
> > when the unbinding happens, the memory needs to be cleared to prevent the leak.
>
> Ah, now I understand what's happening here. Daniel your patch is not
> really correct.
>
> The problem is rather that we don't set the tt object to bound if it
> doesn't have a GTT address.
Okay, I understand.
>
> Going to provide a patch for this.
Looking forward to your patch. Thanks Christian!
>
> Regards,
> Christian.
>
> >
> >> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
> >> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
> >> [17359.551494] #PF: supervisor read access in kernel mode
> >> [17359.557375] #PF: error_code(0x0000) - not-present page
> >> [17359.563247] PGD 0 P4D 0
> >> [17359.566514] Oops: 0000 [#1] SMP PTI
> >> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
> >> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
> >> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> >> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> >> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> >> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> >> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> >> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> >> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> >> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> >> [17359.682883] Call Trace:
> >> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
> >> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
> >> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
> >> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
> >> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
> >> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
> >> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
> >> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
> >> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
> >> [17359.738796] do_syscall_64+0x2d/0x40
> >> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> [17359.749205] RIP: 0033:0x7febb083b6d7
> >> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
> >> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
> >> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
> >> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
> >> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
> >> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
> >> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
> >> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
> >> [17359.837776] CR2: 0000000000000000
> >> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
> >> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> >> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> >> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> >> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> >> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> >> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> >> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> >> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> > From what I understand, the init_user_pages fails (returns EBUSY) and
> > the code goes to allocate_init_user_pages_failed where the unbind and
> > the userptr clear occurs.
> > Can we prevent this if we save the bounding status + userptr alloc? so
> > the function amdgpu_ttm_backend_unbind returns without trying to clear
> > the userptr memory?
> >
> > Something like:
> >
> > amdgpu_ttm_backend_bind:
> > if (gtt->userptr) {
> > r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
> > if (r) ...
> > gtt->sg_table = true;
> > }
> >
> > amdgpu_ttm_backend_unbind:
> > if (gtt->sg_table) {
> > if (gtt->user_ptr) ...
> > }
> >
> > If you agree, I'll send a v2 patch. Otherwise, maybe we could return
> > within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
> >
> > Any other ideas?
> >
> > Regards,
> > Daniel
> >
> >> Reverting this patch fixes the problem for me.
> >>
> >> Regards,
> >> Felix
> >>
> >> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
> >>> Applied. Thanks!
> >>>
> >>> Alex
> >>>
> >>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
> >>> <Christian.Koenig@amd.com> wrote:
> >>>> Reviewed-by: Christian König <christian.koenig@amd.com>
> >>>> ________________________________
> >>>> Von: Daniel Gomez <daniel@qtec.com>
> >>>> Gesendet: Donnerstag, 18. März 2021 09:32
> >>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
> >>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
> >>>>
> >>>> If userptr pages have been pinned but not bounded,
> >>>> they remain uncleared.
> >>>>
> >>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
> >>>> ---
> >>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
> >>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> index e8c66d10478f..bbcc6264d48f 100644
> >>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
> >>>> struct radeon_ttm_tt *gtt = (void *)ttm;
> >>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
> >>>>
> >>>> + if (gtt->userptr)
> >>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>> +
> >>>> if (!gtt->bound)
> >>>> return;
> >>>>
> >>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
> >>>>
> >>>> - if (gtt->userptr)
> >>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>> gtt->bound = false;
> >>>> }
> >>>>
> >>>> --
> >>>> 2.30.2
> >>>>
> >>>> _______________________________________________
> >>>> dri-devel mailing list
> >>>> dri-devel@lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >>> _______________________________________________
> >>> dri-devel mailing list
> >>> dri-devel@lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
@ 2021-03-22 10:49 ` Daniel Gomez
0 siblings, 0 replies; 31+ messages in thread
From: Daniel Gomez @ 2021-03-22 10:49 UTC (permalink / raw)
To: Christian König
Cc: David Airlie, Felix Kuehling, linux-kernel, amd-gfx, dri-devel,
Deucher, Alexander, Koenig, Christian, dagmcr
On Mon, 22 Mar 2021 at 11:34, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Hi Daniel,
>
> Am 22.03.21 um 10:38 schrieb Daniel Gomez:
> > On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
> >> This caused a regression in kfdtest in a large-buffer stress test after
> >> memory allocation for user pages fails:
> > I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
> > not this one.
> > Just some background for the mem leak patch if helps to understand this:
> > The leak was introduce here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb
> > where the bound status was introduced for all drm drivers including
> > radeon and amdgpu. So this patch just reverts the logic to the
> > original code but keeping the bound status. In my case, the binding
> > code allocates the user pages memory and returns without bounding (at
> > amdgpu_gtt_mgr_has_gart_addr). So,
> > when the unbinding happens, the memory needs to be cleared to prevent the leak.
>
> Ah, now I understand what's happening here. Daniel your patch is not
> really correct.
>
> The problem is rather that we don't set the tt object to bound if it
> doesn't have a GTT address.
Okay, I understand.
>
> Going to provide a patch for this.
Looking forward to your patch. Thanks Christian!
>
> Regards,
> Christian.
>
> >
> >> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
> >> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
> >> [17359.551494] #PF: supervisor read access in kernel mode
> >> [17359.557375] #PF: error_code(0x0000) - not-present page
> >> [17359.563247] PGD 0 P4D 0
> >> [17359.566514] Oops: 0000 [#1] SMP PTI
> >> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
> >> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
> >> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> >> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> >> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> >> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> >> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> >> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> >> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> >> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> >> [17359.682883] Call Trace:
> >> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
> >> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
> >> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
> >> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
> >> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
> >> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
> >> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
> >> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
> >> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
> >> [17359.738796] do_syscall_64+0x2d/0x40
> >> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> [17359.749205] RIP: 0033:0x7febb083b6d7
> >> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
> >> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
> >> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
> >> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
> >> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
> >> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
> >> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
> >> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
> >> [17359.837776] CR2: 0000000000000000
> >> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
> >> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> >> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> >> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> >> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> >> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> >> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> >> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> >> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> > From what I understand, the init_user_pages fails (returns EBUSY) and
> > the code goes to allocate_init_user_pages_failed where the unbind and
> > the userptr clear occurs.
> > Can we prevent this if we save the bounding status + userptr alloc? so
> > the function amdgpu_ttm_backend_unbind returns without trying to clear
> > the userptr memory?
> >
> > Something like:
> >
> > amdgpu_ttm_backend_bind:
> > if (gtt->userptr) {
> > r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
> > if (r) ...
> > gtt->sg_table = true;
> > }
> >
> > amdgpu_ttm_backend_unbind:
> > if (gtt->sg_table) {
> > if (gtt->user_ptr) ...
> > }
> >
> > If you agree, I'll send a v2 patch. Otherwise, maybe we could return
> > within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
> >
> > Any other ideas?
> >
> > Regards,
> > Daniel
> >
> >> Reverting this patch fixes the problem for me.
> >>
> >> Regards,
> >> Felix
> >>
> >> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
> >>> Applied. Thanks!
> >>>
> >>> Alex
> >>>
> >>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
> >>> <Christian.Koenig@amd.com> wrote:
> >>>> Reviewed-by: Christian König <christian.koenig@amd.com>
> >>>> ________________________________
> >>>> Von: Daniel Gomez <daniel@qtec.com>
> >>>> Gesendet: Donnerstag, 18. März 2021 09:32
> >>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
> >>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
> >>>>
> >>>> If userptr pages have been pinned but not bounded,
> >>>> they remain uncleared.
> >>>>
> >>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
> >>>> ---
> >>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
> >>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> index e8c66d10478f..bbcc6264d48f 100644
> >>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
> >>>> struct radeon_ttm_tt *gtt = (void *)ttm;
> >>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
> >>>>
> >>>> + if (gtt->userptr)
> >>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>> +
> >>>> if (!gtt->bound)
> >>>> return;
> >>>>
> >>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
> >>>>
> >>>> - if (gtt->userptr)
> >>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>> gtt->bound = false;
> >>>> }
> >>>>
> >>>> --
> >>>> 2.30.2
> >>>>
> >>>> _______________________________________________
> >>>> dri-devel mailing list
> >>>> dri-devel@lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >>> _______________________________________________
> >>> dri-devel mailing list
> >>> dri-devel@lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
2021-03-22 10:34 ` Christian König
(?)
@ 2021-04-06 20:55 ` Alex Deucher
-1 siblings, 0 replies; 31+ messages in thread
From: Alex Deucher @ 2021-04-06 20:55 UTC (permalink / raw)
To: Christian König
Cc: Daniel Gomez, Felix Kuehling, David Airlie, linux-kernel,
dri-devel, Deucher, Alexander, amd-gfx, Koenig, Christian,
dagmcr
On Mon, Mar 22, 2021 at 6:34 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Hi Daniel,
>
> Am 22.03.21 um 10:38 schrieb Daniel Gomez:
> > On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
> >> This caused a regression in kfdtest in a large-buffer stress test after
> >> memory allocation for user pages fails:
> > I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
> > not this one.
> > Just some background for the mem leak patch if helps to understand this:
> > The leak was introduce here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb
> > where the bound status was introduced for all drm drivers including
> > radeon and amdgpu. So this patch just reverts the logic to the
> > original code but keeping the bound status. In my case, the binding
> > code allocates the user pages memory and returns without bounding (at
> > amdgpu_gtt_mgr_has_gart_addr). So,
> > when the unbinding happens, the memory needs to be cleared to prevent the leak.
>
> Ah, now I understand what's happening here. Daniel your patch is not
> really correct.
>
> The problem is rather that we don't set the tt object to bound if it
> doesn't have a GTT address.
>
> Going to provide a patch for this.
Did this patch ever land?
Alex
>
> Regards,
> Christian.
>
> >
> >> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
> >> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
> >> [17359.551494] #PF: supervisor read access in kernel mode
> >> [17359.557375] #PF: error_code(0x0000) - not-present page
> >> [17359.563247] PGD 0 P4D 0
> >> [17359.566514] Oops: 0000 [#1] SMP PTI
> >> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
> >> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
> >> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> >> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> >> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> >> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> >> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> >> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> >> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> >> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> >> [17359.682883] Call Trace:
> >> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
> >> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
> >> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
> >> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
> >> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
> >> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
> >> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
> >> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
> >> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
> >> [17359.738796] do_syscall_64+0x2d/0x40
> >> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> [17359.749205] RIP: 0033:0x7febb083b6d7
> >> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
> >> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
> >> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
> >> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
> >> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
> >> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
> >> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
> >> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
> >> [17359.837776] CR2: 0000000000000000
> >> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
> >> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> >> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> >> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> >> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> >> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> >> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> >> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> >> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> > From what I understand, the init_user_pages fails (returns EBUSY) and
> > the code goes to allocate_init_user_pages_failed where the unbind and
> > the userptr clear occurs.
> > Can we prevent this if we save the bounding status + userptr alloc? so
> > the function amdgpu_ttm_backend_unbind returns without trying to clear
> > the userptr memory?
> >
> > Something like:
> >
> > amdgpu_ttm_backend_bind:
> > if (gtt->userptr) {
> > r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
> > if (r) ...
> > gtt->sg_table = true;
> > }
> >
> > amdgpu_ttm_backend_unbind:
> > if (gtt->sg_table) {
> > if (gtt->user_ptr) ...
> > }
> >
> > If you agree, I'll send a v2 patch. Otherwise, maybe we could return
> > within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
> >
> > Any other ideas?
> >
> > Regards,
> > Daniel
> >
> >> Reverting this patch fixes the problem for me.
> >>
> >> Regards,
> >> Felix
> >>
> >> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
> >>> Applied. Thanks!
> >>>
> >>> Alex
> >>>
> >>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
> >>> <Christian.Koenig@amd.com> wrote:
> >>>> Reviewed-by: Christian König <christian.koenig@amd.com>
> >>>> ________________________________
> >>>> Von: Daniel Gomez <daniel@qtec.com>
> >>>> Gesendet: Donnerstag, 18. März 2021 09:32
> >>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
> >>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
> >>>>
> >>>> If userptr pages have been pinned but not bounded,
> >>>> they remain uncleared.
> >>>>
> >>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
> >>>> ---
> >>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
> >>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> index e8c66d10478f..bbcc6264d48f 100644
> >>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
> >>>> struct radeon_ttm_tt *gtt = (void *)ttm;
> >>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
> >>>>
> >>>> + if (gtt->userptr)
> >>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>> +
> >>>> if (!gtt->bound)
> >>>> return;
> >>>>
> >>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
> >>>>
> >>>> - if (gtt->userptr)
> >>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>> gtt->bound = false;
> >>>> }
> >>>>
> >>>> --
> >>>> 2.30.2
> >>>>
> >>>> _______________________________________________
> >>>> dri-devel mailing list
> >>>> dri-devel@lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >>> _______________________________________________
> >>> dri-devel mailing list
> >>> dri-devel@lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
@ 2021-04-06 20:55 ` Alex Deucher
0 siblings, 0 replies; 31+ messages in thread
From: Alex Deucher @ 2021-04-06 20:55 UTC (permalink / raw)
To: Christian König
Cc: David Airlie, Felix Kuehling, linux-kernel, dri-devel, amd-gfx,
Deucher, Alexander, Daniel Gomez, Koenig, Christian, dagmcr
On Mon, Mar 22, 2021 at 6:34 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Hi Daniel,
>
> Am 22.03.21 um 10:38 schrieb Daniel Gomez:
> > On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
> >> This caused a regression in kfdtest in a large-buffer stress test after
> >> memory allocation for user pages fails:
> > I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
> > not this one.
> > Just some background for the mem leak patch if helps to understand this:
> > The leak was introduce here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb
> > where the bound status was introduced for all drm drivers including
> > radeon and amdgpu. So this patch just reverts the logic to the
> > original code but keeping the bound status. In my case, the binding
> > code allocates the user pages memory and returns without bounding (at
> > amdgpu_gtt_mgr_has_gart_addr). So,
> > when the unbinding happens, the memory needs to be cleared to prevent the leak.
>
> Ah, now I understand what's happening here. Daniel your patch is not
> really correct.
>
> The problem is rather that we don't set the tt object to bound if it
> doesn't have a GTT address.
>
> Going to provide a patch for this.
Did this patch ever land?
Alex
>
> Regards,
> Christian.
>
> >
> >> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
> >> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
> >> [17359.551494] #PF: supervisor read access in kernel mode
> >> [17359.557375] #PF: error_code(0x0000) - not-present page
> >> [17359.563247] PGD 0 P4D 0
> >> [17359.566514] Oops: 0000 [#1] SMP PTI
> >> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
> >> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
> >> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> >> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> >> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> >> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> >> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> >> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> >> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> >> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> >> [17359.682883] Call Trace:
> >> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
> >> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
> >> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
> >> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
> >> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
> >> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
> >> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
> >> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
> >> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
> >> [17359.738796] do_syscall_64+0x2d/0x40
> >> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> [17359.749205] RIP: 0033:0x7febb083b6d7
> >> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
> >> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
> >> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
> >> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
> >> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
> >> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
> >> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
> >> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
> >> [17359.837776] CR2: 0000000000000000
> >> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
> >> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> >> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> >> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> >> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> >> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> >> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> >> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> >> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> > From what I understand, the init_user_pages fails (returns EBUSY) and
> > the code goes to allocate_init_user_pages_failed where the unbind and
> > the userptr clear occurs.
> > Can we prevent this if we save the bounding status + userptr alloc? so
> > the function amdgpu_ttm_backend_unbind returns without trying to clear
> > the userptr memory?
> >
> > Something like:
> >
> > amdgpu_ttm_backend_bind:
> > if (gtt->userptr) {
> > r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
> > if (r) ...
> > gtt->sg_table = true;
> > }
> >
> > amdgpu_ttm_backend_unbind:
> > if (gtt->sg_table) {
> > if (gtt->user_ptr) ...
> > }
> >
> > If you agree, I'll send a v2 patch. Otherwise, maybe we could return
> > within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
> >
> > Any other ideas?
> >
> > Regards,
> > Daniel
> >
> >> Reverting this patch fixes the problem for me.
> >>
> >> Regards,
> >> Felix
> >>
> >> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
> >>> Applied. Thanks!
> >>>
> >>> Alex
> >>>
> >>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
> >>> <Christian.Koenig@amd.com> wrote:
> >>>> Reviewed-by: Christian König <christian.koenig@amd.com>
> >>>> ________________________________
> >>>> Von: Daniel Gomez <daniel@qtec.com>
> >>>> Gesendet: Donnerstag, 18. März 2021 09:32
> >>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
> >>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
> >>>>
> >>>> If userptr pages have been pinned but not bounded,
> >>>> they remain uncleared.
> >>>>
> >>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
> >>>> ---
> >>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
> >>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> index e8c66d10478f..bbcc6264d48f 100644
> >>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
> >>>> struct radeon_ttm_tt *gtt = (void *)ttm;
> >>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
> >>>>
> >>>> + if (gtt->userptr)
> >>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>> +
> >>>> if (!gtt->bound)
> >>>> return;
> >>>>
> >>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
> >>>>
> >>>> - if (gtt->userptr)
> >>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>> gtt->bound = false;
> >>>> }
> >>>>
> >>>> --
> >>>> 2.30.2
> >>>>
> >>>> _______________________________________________
> >>>> dri-devel mailing list
> >>>> dri-devel@lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >>> _______________________________________________
> >>> dri-devel mailing list
> >>> dri-devel@lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
@ 2021-04-06 20:55 ` Alex Deucher
0 siblings, 0 replies; 31+ messages in thread
From: Alex Deucher @ 2021-04-06 20:55 UTC (permalink / raw)
To: Christian König
Cc: David Airlie, Felix Kuehling, linux-kernel, dri-devel, amd-gfx,
Deucher, Alexander, Daniel Gomez, Koenig, Christian, dagmcr
On Mon, Mar 22, 2021 at 6:34 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Hi Daniel,
>
> Am 22.03.21 um 10:38 schrieb Daniel Gomez:
> > On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
> >> This caused a regression in kfdtest in a large-buffer stress test after
> >> memory allocation for user pages fails:
> > I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
> > not this one.
> > Just some background for the mem leak patch if helps to understand this:
> > The leak was introduce here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb
> > where the bound status was introduced for all drm drivers including
> > radeon and amdgpu. So this patch just reverts the logic to the
> > original code but keeping the bound status. In my case, the binding
> > code allocates the user pages memory and returns without bounding (at
> > amdgpu_gtt_mgr_has_gart_addr). So,
> > when the unbinding happens, the memory needs to be cleared to prevent the leak.
>
> Ah, now I understand what's happening here. Daniel your patch is not
> really correct.
>
> The problem is rather that we don't set the tt object to bound if it
> doesn't have a GTT address.
>
> Going to provide a patch for this.
Did this patch ever land?
Alex
>
> Regards,
> Christian.
>
> >
> >> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
> >> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
> >> [17359.551494] #PF: supervisor read access in kernel mode
> >> [17359.557375] #PF: error_code(0x0000) - not-present page
> >> [17359.563247] PGD 0 P4D 0
> >> [17359.566514] Oops: 0000 [#1] SMP PTI
> >> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
> >> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
> >> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> >> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> >> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> >> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> >> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> >> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> >> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> >> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> >> [17359.682883] Call Trace:
> >> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
> >> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
> >> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
> >> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
> >> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
> >> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
> >> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
> >> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
> >> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
> >> [17359.738796] do_syscall_64+0x2d/0x40
> >> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> [17359.749205] RIP: 0033:0x7febb083b6d7
> >> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
> >> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
> >> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
> >> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
> >> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
> >> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
> >> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
> >> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
> >> [17359.837776] CR2: 0000000000000000
> >> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
> >> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> >> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> >> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> >> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> >> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> >> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> >> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> >> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> > From what I understand, the init_user_pages fails (returns EBUSY) and
> > the code goes to allocate_init_user_pages_failed where the unbind and
> > the userptr clear occurs.
> > Can we prevent this if we save the bounding status + userptr alloc? so
> > the function amdgpu_ttm_backend_unbind returns without trying to clear
> > the userptr memory?
> >
> > Something like:
> >
> > amdgpu_ttm_backend_bind:
> > if (gtt->userptr) {
> > r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
> > if (r) ...
> > gtt->sg_table = true;
> > }
> >
> > amdgpu_ttm_backend_unbind:
> > if (gtt->sg_table) {
> > if (gtt->user_ptr) ...
> > }
> >
> > If you agree, I'll send a v2 patch. Otherwise, maybe we could return
> > within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
> >
> > Any other ideas?
> >
> > Regards,
> > Daniel
> >
> >> Reverting this patch fixes the problem for me.
> >>
> >> Regards,
> >> Felix
> >>
> >> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
> >>> Applied. Thanks!
> >>>
> >>> Alex
> >>>
> >>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
> >>> <Christian.Koenig@amd.com> wrote:
> >>>> Reviewed-by: Christian König <christian.koenig@amd.com>
> >>>> ________________________________
> >>>> Von: Daniel Gomez <daniel@qtec.com>
> >>>> Gesendet: Donnerstag, 18. März 2021 09:32
> >>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
> >>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
> >>>>
> >>>> If userptr pages have been pinned but not bounded,
> >>>> they remain uncleared.
> >>>>
> >>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
> >>>> ---
> >>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
> >>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> index e8c66d10478f..bbcc6264d48f 100644
> >>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
> >>>> struct radeon_ttm_tt *gtt = (void *)ttm;
> >>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
> >>>>
> >>>> + if (gtt->userptr)
> >>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>> +
> >>>> if (!gtt->bound)
> >>>> return;
> >>>>
> >>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
> >>>>
> >>>> - if (gtt->userptr)
> >>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>> gtt->bound = false;
> >>>> }
> >>>>
> >>>> --
> >>>> 2.30.2
> >>>>
> >>>> _______________________________________________
> >>>> dri-devel mailing list
> >>>> dri-devel@lists.freedesktop.org
> >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> >>> _______________________________________________
> >>> dri-devel mailing list
> >>> dri-devel@lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > _______________________________________________
> > amd-gfx mailing list
> > amd-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
2021-04-06 20:55 ` Alex Deucher
(?)
@ 2021-04-07 7:47 ` Daniel Gomez
-1 siblings, 0 replies; 31+ messages in thread
From: Daniel Gomez @ 2021-04-07 7:47 UTC (permalink / raw)
To: Alex Deucher
Cc: Christian König, Felix Kuehling, David Airlie, linux-kernel,
dri-devel, Deucher, Alexander, amd-gfx, Koenig, Christian,
dagmcr
On Tue, 6 Apr 2021 at 22:56, Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Mon, Mar 22, 2021 at 6:34 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
> >
> > Hi Daniel,
> >
> > Am 22.03.21 um 10:38 schrieb Daniel Gomez:
> > > On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
> > >> This caused a regression in kfdtest in a large-buffer stress test after
> > >> memory allocation for user pages fails:
> > > I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
> > > not this one.
> > > Just some background for the mem leak patch if helps to understand this:
> > > The leak was introduce here:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb
> > > where the bound status was introduced for all drm drivers including
> > > radeon and amdgpu. So this patch just reverts the logic to the
> > > original code but keeping the bound status. In my case, the binding
> > > code allocates the user pages memory and returns without bounding (at
> > > amdgpu_gtt_mgr_has_gart_addr). So,
> > > when the unbinding happens, the memory needs to be cleared to prevent the leak.
> >
> > Ah, now I understand what's happening here. Daniel your patch is not
> > really correct.
> >
> > The problem is rather that we don't set the tt object to bound if it
> > doesn't have a GTT address.
> >
> > Going to provide a patch for this.
>
> Did this patch ever land?
I don't think so but I might send a v2 following Christian's comment
if you guys agree.
Also, the patch here is for radeon but the pagefault issue reported by
Felix is affected by the amdgpu one:
radeon patch: drm/radeon/ttm: Fix memory leak userptr pages
https://patchwork.kernel.org/project/dri-devel/patch/20210318083236.43578-1-daniel@qtec.com/
amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages
https://patchwork.kernel.org/project/dri-devel/patch/20210317160840.36019-1-daniel@qtec.com/
I assume both need to be fixed with the same approach.
Daniel
>
> Alex
>
> >
> > Regards,
> > Christian.
> >
> > >
> > >> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
> > >> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > >> [17359.551494] #PF: supervisor read access in kernel mode
> > >> [17359.557375] #PF: error_code(0x0000) - not-present page
> > >> [17359.563247] PGD 0 P4D 0
> > >> [17359.566514] Oops: 0000 [#1] SMP PTI
> > >> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
> > >> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
> > >> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> > >> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> > >> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> > >> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> > >> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> > >> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> > >> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> > >> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> > >> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> > >> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> > >> [17359.682883] Call Trace:
> > >> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
> > >> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
> > >> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
> > >> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
> > >> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
> > >> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
> > >> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
> > >> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
> > >> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
> > >> [17359.738796] do_syscall_64+0x2d/0x40
> > >> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >> [17359.749205] RIP: 0033:0x7febb083b6d7
> > >> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
> > >> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
> > >> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
> > >> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
> > >> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
> > >> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
> > >> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
> > >> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
> > >> [17359.837776] CR2: 0000000000000000
> > >> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
> > >> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> > >> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> > >> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> > >> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> > >> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> > >> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> > >> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> > >> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> > >> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> > >> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> > > From what I understand, the init_user_pages fails (returns EBUSY) and
> > > the code goes to allocate_init_user_pages_failed where the unbind and
> > > the userptr clear occurs.
> > > Can we prevent this if we save the bounding status + userptr alloc? so
> > > the function amdgpu_ttm_backend_unbind returns without trying to clear
> > > the userptr memory?
> > >
> > > Something like:
> > >
> > > amdgpu_ttm_backend_bind:
> > > if (gtt->userptr) {
> > > r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
> > > if (r) ...
> > > gtt->sg_table = true;
> > > }
> > >
> > > amdgpu_ttm_backend_unbind:
> > > if (gtt->sg_table) {
> > > if (gtt->user_ptr) ...
> > > }
> > >
> > > If you agree, I'll send a v2 patch. Otherwise, maybe we could return
> > > within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
> > >
> > > Any other ideas?
> > >
> > > Regards,
> > > Daniel
> > >
> > >> Reverting this patch fixes the problem for me.
> > >>
> > >> Regards,
> > >> Felix
> > >>
> > >> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
> > >>> Applied. Thanks!
> > >>>
> > >>> Alex
> > >>>
> > >>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
> > >>> <Christian.Koenig@amd.com> wrote:
> > >>>> Reviewed-by: Christian König <christian.koenig@amd.com>
> > >>>> ________________________________
> > >>>> Von: Daniel Gomez <daniel@qtec.com>
> > >>>> Gesendet: Donnerstag, 18. März 2021 09:32
> > >>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
> > >>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
> > >>>>
> > >>>> If userptr pages have been pinned but not bounded,
> > >>>> they remain uncleared.
> > >>>>
> > >>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
> > >>>> ---
> > >>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
> > >>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> > >>>>
> > >>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> > >>>> index e8c66d10478f..bbcc6264d48f 100644
> > >>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> > >>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> > >>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
> > >>>> struct radeon_ttm_tt *gtt = (void *)ttm;
> > >>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
> > >>>>
> > >>>> + if (gtt->userptr)
> > >>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
> > >>>> +
> > >>>> if (!gtt->bound)
> > >>>> return;
> > >>>>
> > >>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
> > >>>>
> > >>>> - if (gtt->userptr)
> > >>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
> > >>>> gtt->bound = false;
> > >>>> }
> > >>>>
> > >>>> --
> > >>>> 2.30.2
> > >>>>
> > >>>> _______________________________________________
> > >>>> dri-devel mailing list
> > >>>> dri-devel@lists.freedesktop.org
> > >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > >>> _______________________________________________
> > >>> dri-devel mailing list
> > >>> dri-devel@lists.freedesktop.org
> > >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > _______________________________________________
> > > amd-gfx mailing list
> > > amd-gfx@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
@ 2021-04-07 7:47 ` Daniel Gomez
0 siblings, 0 replies; 31+ messages in thread
From: Daniel Gomez @ 2021-04-07 7:47 UTC (permalink / raw)
To: Alex Deucher
Cc: David Airlie, Christian König, Felix Kuehling, linux-kernel,
dri-devel, amd-gfx, Deucher, Alexander, Koenig, Christian,
dagmcr
On Tue, 6 Apr 2021 at 22:56, Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Mon, Mar 22, 2021 at 6:34 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
> >
> > Hi Daniel,
> >
> > Am 22.03.21 um 10:38 schrieb Daniel Gomez:
> > > On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
> > >> This caused a regression in kfdtest in a large-buffer stress test after
> > >> memory allocation for user pages fails:
> > > I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
> > > not this one.
> > > Just some background for the mem leak patch if helps to understand this:
> > > The leak was introduce here:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb
> > > where the bound status was introduced for all drm drivers including
> > > radeon and amdgpu. So this patch just reverts the logic to the
> > > original code but keeping the bound status. In my case, the binding
> > > code allocates the user pages memory and returns without bounding (at
> > > amdgpu_gtt_mgr_has_gart_addr). So,
> > > when the unbinding happens, the memory needs to be cleared to prevent the leak.
> >
> > Ah, now I understand what's happening here. Daniel your patch is not
> > really correct.
> >
> > The problem is rather that we don't set the tt object to bound if it
> > doesn't have a GTT address.
> >
> > Going to provide a patch for this.
>
> Did this patch ever land?
I don't think so but I might send a v2 following Christian's comment
if you guys agree.
Also, the patch here is for radeon but the pagefault issue reported by
Felix is affected by the amdgpu one:
radeon patch: drm/radeon/ttm: Fix memory leak userptr pages
https://patchwork.kernel.org/project/dri-devel/patch/20210318083236.43578-1-daniel@qtec.com/
amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages
https://patchwork.kernel.org/project/dri-devel/patch/20210317160840.36019-1-daniel@qtec.com/
I assume both need to be fixed with the same approach.
Daniel
>
> Alex
>
> >
> > Regards,
> > Christian.
> >
> > >
> > >> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
> > >> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > >> [17359.551494] #PF: supervisor read access in kernel mode
> > >> [17359.557375] #PF: error_code(0x0000) - not-present page
> > >> [17359.563247] PGD 0 P4D 0
> > >> [17359.566514] Oops: 0000 [#1] SMP PTI
> > >> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
> > >> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
> > >> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> > >> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> > >> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> > >> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> > >> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> > >> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> > >> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> > >> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> > >> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> > >> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> > >> [17359.682883] Call Trace:
> > >> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
> > >> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
> > >> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
> > >> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
> > >> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
> > >> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
> > >> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
> > >> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
> > >> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
> > >> [17359.738796] do_syscall_64+0x2d/0x40
> > >> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >> [17359.749205] RIP: 0033:0x7febb083b6d7
> > >> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
> > >> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
> > >> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
> > >> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
> > >> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
> > >> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
> > >> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
> > >> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
> > >> [17359.837776] CR2: 0000000000000000
> > >> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
> > >> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> > >> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> > >> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> > >> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> > >> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> > >> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> > >> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> > >> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> > >> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> > >> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> > > From what I understand, the init_user_pages fails (returns EBUSY) and
> > > the code goes to allocate_init_user_pages_failed where the unbind and
> > > the userptr clear occurs.
> > > Can we prevent this if we save the bounding status + userptr alloc? so
> > > the function amdgpu_ttm_backend_unbind returns without trying to clear
> > > the userptr memory?
> > >
> > > Something like:
> > >
> > > amdgpu_ttm_backend_bind:
> > > if (gtt->userptr) {
> > > r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
> > > if (r) ...
> > > gtt->sg_table = true;
> > > }
> > >
> > > amdgpu_ttm_backend_unbind:
> > > if (gtt->sg_table) {
> > > if (gtt->user_ptr) ...
> > > }
> > >
> > > If you agree, I'll send a v2 patch. Otherwise, maybe we could return
> > > within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
> > >
> > > Any other ideas?
> > >
> > > Regards,
> > > Daniel
> > >
> > >> Reverting this patch fixes the problem for me.
> > >>
> > >> Regards,
> > >> Felix
> > >>
> > >> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
> > >>> Applied. Thanks!
> > >>>
> > >>> Alex
> > >>>
> > >>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
> > >>> <Christian.Koenig@amd.com> wrote:
> > >>>> Reviewed-by: Christian König <christian.koenig@amd.com>
> > >>>> ________________________________
> > >>>> Von: Daniel Gomez <daniel@qtec.com>
> > >>>> Gesendet: Donnerstag, 18. März 2021 09:32
> > >>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
> > >>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
> > >>>>
> > >>>> If userptr pages have been pinned but not bounded,
> > >>>> they remain uncleared.
> > >>>>
> > >>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
> > >>>> ---
> > >>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
> > >>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> > >>>>
> > >>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> > >>>> index e8c66d10478f..bbcc6264d48f 100644
> > >>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> > >>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> > >>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
> > >>>> struct radeon_ttm_tt *gtt = (void *)ttm;
> > >>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
> > >>>>
> > >>>> + if (gtt->userptr)
> > >>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
> > >>>> +
> > >>>> if (!gtt->bound)
> > >>>> return;
> > >>>>
> > >>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
> > >>>>
> > >>>> - if (gtt->userptr)
> > >>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
> > >>>> gtt->bound = false;
> > >>>> }
> > >>>>
> > >>>> --
> > >>>> 2.30.2
> > >>>>
> > >>>> _______________________________________________
> > >>>> dri-devel mailing list
> > >>>> dri-devel@lists.freedesktop.org
> > >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > >>> _______________________________________________
> > >>> dri-devel mailing list
> > >>> dri-devel@lists.freedesktop.org
> > >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > _______________________________________________
> > > amd-gfx mailing list
> > > amd-gfx@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
@ 2021-04-07 7:47 ` Daniel Gomez
0 siblings, 0 replies; 31+ messages in thread
From: Daniel Gomez @ 2021-04-07 7:47 UTC (permalink / raw)
To: Alex Deucher
Cc: David Airlie, Christian König, Felix Kuehling, linux-kernel,
dri-devel, amd-gfx, Deucher, Alexander, Koenig, Christian,
dagmcr
On Tue, 6 Apr 2021 at 22:56, Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Mon, Mar 22, 2021 at 6:34 AM Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
> >
> > Hi Daniel,
> >
> > Am 22.03.21 um 10:38 schrieb Daniel Gomez:
> > > On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
> > >> This caused a regression in kfdtest in a large-buffer stress test after
> > >> memory allocation for user pages fails:
> > > I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
> > > not this one.
> > > Just some background for the mem leak patch if helps to understand this:
> > > The leak was introduce here:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb
> > > where the bound status was introduced for all drm drivers including
> > > radeon and amdgpu. So this patch just reverts the logic to the
> > > original code but keeping the bound status. In my case, the binding
> > > code allocates the user pages memory and returns without bounding (at
> > > amdgpu_gtt_mgr_has_gart_addr). So,
> > > when the unbinding happens, the memory needs to be cleared to prevent the leak.
> >
> > Ah, now I understand what's happening here. Daniel your patch is not
> > really correct.
> >
> > The problem is rather that we don't set the tt object to bound if it
> > doesn't have a GTT address.
> >
> > Going to provide a patch for this.
>
> Did this patch ever land?
I don't think so but I might send a v2 following Christian's comment
if you guys agree.
Also, the patch here is for radeon but the pagefault issue reported by
Felix is affected by the amdgpu one:
radeon patch: drm/radeon/ttm: Fix memory leak userptr pages
https://patchwork.kernel.org/project/dri-devel/patch/20210318083236.43578-1-daniel@qtec.com/
amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages
https://patchwork.kernel.org/project/dri-devel/patch/20210317160840.36019-1-daniel@qtec.com/
I assume both need to be fixed with the same approach.
Daniel
>
> Alex
>
> >
> > Regards,
> > Christian.
> >
> > >
> > >> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
> > >> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > >> [17359.551494] #PF: supervisor read access in kernel mode
> > >> [17359.557375] #PF: error_code(0x0000) - not-present page
> > >> [17359.563247] PGD 0 P4D 0
> > >> [17359.566514] Oops: 0000 [#1] SMP PTI
> > >> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
> > >> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
> > >> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> > >> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> > >> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> > >> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> > >> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> > >> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> > >> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> > >> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> > >> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> > >> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> > >> [17359.682883] Call Trace:
> > >> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
> > >> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
> > >> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
> > >> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
> > >> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
> > >> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
> > >> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
> > >> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
> > >> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
> > >> [17359.738796] do_syscall_64+0x2d/0x40
> > >> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >> [17359.749205] RIP: 0033:0x7febb083b6d7
> > >> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
> > >> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
> > >> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
> > >> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
> > >> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
> > >> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
> > >> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
> > >> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
> > >> [17359.837776] CR2: 0000000000000000
> > >> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
> > >> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
> > >> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> > >> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> > >> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
> > >> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
> > >> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
> > >> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
> > >> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
> > >> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
> > >> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
> > > From what I understand, the init_user_pages fails (returns EBUSY) and
> > > the code goes to allocate_init_user_pages_failed where the unbind and
> > > the userptr clear occurs.
> > > Can we prevent this if we save the bounding status + userptr alloc? so
> > > the function amdgpu_ttm_backend_unbind returns without trying to clear
> > > the userptr memory?
> > >
> > > Something like:
> > >
> > > amdgpu_ttm_backend_bind:
> > > if (gtt->userptr) {
> > > r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
> > > if (r) ...
> > > gtt->sg_table = true;
> > > }
> > >
> > > amdgpu_ttm_backend_unbind:
> > > if (gtt->sg_table) {
> > > if (gtt->user_ptr) ...
> > > }
> > >
> > > If you agree, I'll send a v2 patch. Otherwise, maybe we could return
> > > within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
> > >
> > > Any other ideas?
> > >
> > > Regards,
> > > Daniel
> > >
> > >> Reverting this patch fixes the problem for me.
> > >>
> > >> Regards,
> > >> Felix
> > >>
> > >> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
> > >>> Applied. Thanks!
> > >>>
> > >>> Alex
> > >>>
> > >>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
> > >>> <Christian.Koenig@amd.com> wrote:
> > >>>> Reviewed-by: Christian König <christian.koenig@amd.com>
> > >>>> ________________________________
> > >>>> Von: Daniel Gomez <daniel@qtec.com>
> > >>>> Gesendet: Donnerstag, 18. März 2021 09:32
> > >>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
> > >>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
> > >>>>
> > >>>> If userptr pages have been pinned but not bounded,
> > >>>> they remain uncleared.
> > >>>>
> > >>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
> > >>>> ---
> > >>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
> > >>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> > >>>>
> > >>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> > >>>> index e8c66d10478f..bbcc6264d48f 100644
> > >>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> > >>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> > >>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
> > >>>> struct radeon_ttm_tt *gtt = (void *)ttm;
> > >>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
> > >>>>
> > >>>> + if (gtt->userptr)
> > >>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
> > >>>> +
> > >>>> if (!gtt->bound)
> > >>>> return;
> > >>>>
> > >>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
> > >>>>
> > >>>> - if (gtt->userptr)
> > >>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
> > >>>> gtt->bound = false;
> > >>>> }
> > >>>>
> > >>>> --
> > >>>> 2.30.2
> > >>>>
> > >>>> _______________________________________________
> > >>>> dri-devel mailing list
> > >>>> dri-devel@lists.freedesktop.org
> > >>>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > >>> _______________________________________________
> > >>> dri-devel mailing list
> > >>> dri-devel@lists.freedesktop.org
> > >>> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> > > _______________________________________________
> > > amd-gfx mailing list
> > > amd-gfx@lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
2021-04-07 7:47 ` Daniel Gomez
(?)
@ 2021-04-07 9:27 ` Christian König
-1 siblings, 0 replies; 31+ messages in thread
From: Christian König @ 2021-04-07 9:27 UTC (permalink / raw)
To: Daniel Gomez, Alex Deucher
Cc: Christian König, Felix Kuehling, David Airlie, linux-kernel,
dri-devel, Deucher, Alexander, amd-gfx, dagmcr
Am 07.04.21 um 09:47 schrieb Daniel Gomez:
> On Tue, 6 Apr 2021 at 22:56, Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Mon, Mar 22, 2021 at 6:34 AM Christian König
>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>> Hi Daniel,
>>>
>>> Am 22.03.21 um 10:38 schrieb Daniel Gomez:
>>>> On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
>>>>> This caused a regression in kfdtest in a large-buffer stress test after
>>>>> memory allocation for user pages fails:
>>>> I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
>>>> not this one.
>>>> Just some background for the mem leak patch if helps to understand this:
>>>> The leak was introduce here:
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommit%2F%3Fid%3D0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=%2FeOQf12NBkC3YGZ7QW66%2FT%2FpyM3DjEb9IMbqUvISMfo%3D&reserved=0
>>>> where the bound status was introduced for all drm drivers including
>>>> radeon and amdgpu. So this patch just reverts the logic to the
>>>> original code but keeping the bound status. In my case, the binding
>>>> code allocates the user pages memory and returns without bounding (at
>>>> amdgpu_gtt_mgr_has_gart_addr). So,
>>>> when the unbinding happens, the memory needs to be cleared to prevent the leak.
>>> Ah, now I understand what's happening here. Daniel your patch is not
>>> really correct.
>>>
>>> The problem is rather that we don't set the tt object to bound if it
>>> doesn't have a GTT address.
>>>
>>> Going to provide a patch for this.
>> Did this patch ever land?
> I don't think so but I might send a v2 following Christian's comment
> if you guys agree.
Somebody else already provided a patch which I reviewed, but I'm not
sure if that landed either.
> Also, the patch here is for radeon but the pagefault issue reported by
> Felix is affected by the amdgpu one:
>
> radeon patch: drm/radeon/ttm: Fix memory leak userptr pages
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210318083236.43578-1-daniel%40qtec.com%2F&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=HSMK%2FqYz%2Bzz9qbKc%2FITUWluBDeaW9YrgyH8p0L640%2F0%3D&reserved=0
>
> amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210317160840.36019-1-daniel%40qtec.com%2F&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=UsUZ4YjCSjHmzlPB07oTaGrsntTrQSwlGk%2BxUjwDiag%3D&reserved=0
>
> I assume both need to be fixed with the same approach.
Yes correct. Let me double check where that fix went.
Thanks,
Christian.
>
> Daniel
>> Alex
>>
>>> Regards,
>>> Christian.
>>>
>>>>> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
>>>>> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>>> [17359.551494] #PF: supervisor read access in kernel mode
>>>>> [17359.557375] #PF: error_code(0x0000) - not-present page
>>>>> [17359.563247] PGD 0 P4D 0
>>>>> [17359.566514] Oops: 0000 [#1] SMP PTI
>>>>> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
>>>>> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
>>>>> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
>>>>> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
>>>>> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
>>>>> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
>>>>> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
>>>>> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
>>>>> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
>>>>> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
>>>>> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
>>>>> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
>>>>> [17359.682883] Call Trace:
>>>>> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
>>>>> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
>>>>> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
>>>>> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
>>>>> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
>>>>> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
>>>>> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
>>>>> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
>>>>> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
>>>>> [17359.738796] do_syscall_64+0x2d/0x40
>>>>> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>> [17359.749205] RIP: 0033:0x7febb083b6d7
>>>>> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
>>>>> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
>>>>> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
>>>>> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
>>>>> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
>>>>> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
>>>>> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
>>>>> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
>>>>> [17359.837776] CR2: 0000000000000000
>>>>> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
>>>>> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
>>>>> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
>>>>> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
>>>>> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
>>>>> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
>>>>> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
>>>>> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
>>>>> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
>>>>> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
>>>>> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
>>>> From what I understand, the init_user_pages fails (returns EBUSY) and
>>>> the code goes to allocate_init_user_pages_failed where the unbind and
>>>> the userptr clear occurs.
>>>> Can we prevent this if we save the bounding status + userptr alloc? so
>>>> the function amdgpu_ttm_backend_unbind returns without trying to clear
>>>> the userptr memory?
>>>>
>>>> Something like:
>>>>
>>>> amdgpu_ttm_backend_bind:
>>>> if (gtt->userptr) {
>>>> r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
>>>> if (r) ...
>>>> gtt->sg_table = true;
>>>> }
>>>>
>>>> amdgpu_ttm_backend_unbind:
>>>> if (gtt->sg_table) {
>>>> if (gtt->user_ptr) ...
>>>> }
>>>>
>>>> If you agree, I'll send a v2 patch. Otherwise, maybe we could return
>>>> within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
>>>>
>>>> Any other ideas?
>>>>
>>>> Regards,
>>>> Daniel
>>>>
>>>>> Reverting this patch fixes the problem for me.
>>>>>
>>>>> Regards,
>>>>> Felix
>>>>>
>>>>> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
>>>>>> Applied. Thanks!
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
>>>>>> <Christian.Koenig@amd.com> wrote:
>>>>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>>>>> ________________________________
>>>>>>> Von: Daniel Gomez <daniel@qtec.com>
>>>>>>> Gesendet: Donnerstag, 18. März 2021 09:32
>>>>>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
>>>>>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
>>>>>>>
>>>>>>> If userptr pages have been pinned but not bounded,
>>>>>>> they remain uncleared.
>>>>>>>
>>>>>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
>>>>>>> index e8c66d10478f..bbcc6264d48f 100644
>>>>>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
>>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
>>>>>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
>>>>>>> struct radeon_ttm_tt *gtt = (void *)ttm;
>>>>>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
>>>>>>>
>>>>>>> + if (gtt->userptr)
>>>>>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
>>>>>>> +
>>>>>>> if (!gtt->bound)
>>>>>>> return;
>>>>>>>
>>>>>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
>>>>>>>
>>>>>>> - if (gtt->userptr)
>>>>>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
>>>>>>> gtt->bound = false;
>>>>>>> }
>>>>>>>
>>>>>>> --
>>>>>>> 2.30.2
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> dri-devel mailing list
>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=aU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEqrs%3D&reserved=0
>>>>>> _______________________________________________
>>>>>> dri-devel mailing list
>>>>>> dri-devel@lists.freedesktop.org
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=aU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEqrs%3D&reserved=0
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=USmYbhfkSfPcE1npvsMfRwBr9Ijresh1fH4cAeNEr2M%3D&reserved=0
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
@ 2021-04-07 9:27 ` Christian König
0 siblings, 0 replies; 31+ messages in thread
From: Christian König @ 2021-04-07 9:27 UTC (permalink / raw)
To: Daniel Gomez, Alex Deucher
Cc: David Airlie, Christian König, Felix Kuehling, linux-kernel,
dri-devel, amd-gfx, Deucher, Alexander, dagmcr
Am 07.04.21 um 09:47 schrieb Daniel Gomez:
> On Tue, 6 Apr 2021 at 22:56, Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Mon, Mar 22, 2021 at 6:34 AM Christian König
>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>> Hi Daniel,
>>>
>>> Am 22.03.21 um 10:38 schrieb Daniel Gomez:
>>>> On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
>>>>> This caused a regression in kfdtest in a large-buffer stress test after
>>>>> memory allocation for user pages fails:
>>>> I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
>>>> not this one.
>>>> Just some background for the mem leak patch if helps to understand this:
>>>> The leak was introduce here:
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommit%2F%3Fid%3D0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=%2FeOQf12NBkC3YGZ7QW66%2FT%2FpyM3DjEb9IMbqUvISMfo%3D&reserved=0
>>>> where the bound status was introduced for all drm drivers including
>>>> radeon and amdgpu. So this patch just reverts the logic to the
>>>> original code but keeping the bound status. In my case, the binding
>>>> code allocates the user pages memory and returns without bounding (at
>>>> amdgpu_gtt_mgr_has_gart_addr). So,
>>>> when the unbinding happens, the memory needs to be cleared to prevent the leak.
>>> Ah, now I understand what's happening here. Daniel your patch is not
>>> really correct.
>>>
>>> The problem is rather that we don't set the tt object to bound if it
>>> doesn't have a GTT address.
>>>
>>> Going to provide a patch for this.
>> Did this patch ever land?
> I don't think so but I might send a v2 following Christian's comment
> if you guys agree.
Somebody else already provided a patch which I reviewed, but I'm not
sure if that landed either.
> Also, the patch here is for radeon but the pagefault issue reported by
> Felix is affected by the amdgpu one:
>
> radeon patch: drm/radeon/ttm: Fix memory leak userptr pages
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210318083236.43578-1-daniel%40qtec.com%2F&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=HSMK%2FqYz%2Bzz9qbKc%2FITUWluBDeaW9YrgyH8p0L640%2F0%3D&reserved=0
>
> amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210317160840.36019-1-daniel%40qtec.com%2F&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=UsUZ4YjCSjHmzlPB07oTaGrsntTrQSwlGk%2BxUjwDiag%3D&reserved=0
>
> I assume both need to be fixed with the same approach.
Yes correct. Let me double check where that fix went.
Thanks,
Christian.
>
> Daniel
>> Alex
>>
>>> Regards,
>>> Christian.
>>>
>>>>> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
>>>>> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>>> [17359.551494] #PF: supervisor read access in kernel mode
>>>>> [17359.557375] #PF: error_code(0x0000) - not-present page
>>>>> [17359.563247] PGD 0 P4D 0
>>>>> [17359.566514] Oops: 0000 [#1] SMP PTI
>>>>> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
>>>>> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
>>>>> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
>>>>> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
>>>>> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
>>>>> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
>>>>> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
>>>>> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
>>>>> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
>>>>> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
>>>>> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
>>>>> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
>>>>> [17359.682883] Call Trace:
>>>>> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
>>>>> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
>>>>> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
>>>>> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
>>>>> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
>>>>> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
>>>>> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
>>>>> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
>>>>> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
>>>>> [17359.738796] do_syscall_64+0x2d/0x40
>>>>> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>> [17359.749205] RIP: 0033:0x7febb083b6d7
>>>>> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
>>>>> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
>>>>> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
>>>>> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
>>>>> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
>>>>> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
>>>>> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
>>>>> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
>>>>> [17359.837776] CR2: 0000000000000000
>>>>> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
>>>>> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
>>>>> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
>>>>> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
>>>>> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
>>>>> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
>>>>> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
>>>>> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
>>>>> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
>>>>> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
>>>>> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
>>>> From what I understand, the init_user_pages fails (returns EBUSY) and
>>>> the code goes to allocate_init_user_pages_failed where the unbind and
>>>> the userptr clear occurs.
>>>> Can we prevent this if we save the bounding status + userptr alloc? so
>>>> the function amdgpu_ttm_backend_unbind returns without trying to clear
>>>> the userptr memory?
>>>>
>>>> Something like:
>>>>
>>>> amdgpu_ttm_backend_bind:
>>>> if (gtt->userptr) {
>>>> r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
>>>> if (r) ...
>>>> gtt->sg_table = true;
>>>> }
>>>>
>>>> amdgpu_ttm_backend_unbind:
>>>> if (gtt->sg_table) {
>>>> if (gtt->user_ptr) ...
>>>> }
>>>>
>>>> If you agree, I'll send a v2 patch. Otherwise, maybe we could return
>>>> within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
>>>>
>>>> Any other ideas?
>>>>
>>>> Regards,
>>>> Daniel
>>>>
>>>>> Reverting this patch fixes the problem for me.
>>>>>
>>>>> Regards,
>>>>> Felix
>>>>>
>>>>> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
>>>>>> Applied. Thanks!
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
>>>>>> <Christian.Koenig@amd.com> wrote:
>>>>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>>>>> ________________________________
>>>>>>> Von: Daniel Gomez <daniel@qtec.com>
>>>>>>> Gesendet: Donnerstag, 18. März 2021 09:32
>>>>>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
>>>>>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
>>>>>>>
>>>>>>> If userptr pages have been pinned but not bounded,
>>>>>>> they remain uncleared.
>>>>>>>
>>>>>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
>>>>>>> index e8c66d10478f..bbcc6264d48f 100644
>>>>>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
>>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
>>>>>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
>>>>>>> struct radeon_ttm_tt *gtt = (void *)ttm;
>>>>>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
>>>>>>>
>>>>>>> + if (gtt->userptr)
>>>>>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
>>>>>>> +
>>>>>>> if (!gtt->bound)
>>>>>>> return;
>>>>>>>
>>>>>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
>>>>>>>
>>>>>>> - if (gtt->userptr)
>>>>>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
>>>>>>> gtt->bound = false;
>>>>>>> }
>>>>>>>
>>>>>>> --
>>>>>>> 2.30.2
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> dri-devel mailing list
>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=aU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEqrs%3D&reserved=0
>>>>>> _______________________________________________
>>>>>> dri-devel mailing list
>>>>>> dri-devel@lists.freedesktop.org
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=aU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEqrs%3D&reserved=0
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=USmYbhfkSfPcE1npvsMfRwBr9Ijresh1fH4cAeNEr2M%3D&reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
@ 2021-04-07 9:27 ` Christian König
0 siblings, 0 replies; 31+ messages in thread
From: Christian König @ 2021-04-07 9:27 UTC (permalink / raw)
To: Daniel Gomez, Alex Deucher
Cc: David Airlie, Christian König, Felix Kuehling, linux-kernel,
dri-devel, amd-gfx, Deucher, Alexander, dagmcr
Am 07.04.21 um 09:47 schrieb Daniel Gomez:
> On Tue, 6 Apr 2021 at 22:56, Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Mon, Mar 22, 2021 at 6:34 AM Christian König
>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>> Hi Daniel,
>>>
>>> Am 22.03.21 um 10:38 schrieb Daniel Gomez:
>>>> On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com> wrote:
>>>>> This caused a regression in kfdtest in a large-buffer stress test after
>>>>> memory allocation for user pages fails:
>>>> I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
>>>> not this one.
>>>> Just some background for the mem leak patch if helps to understand this:
>>>> The leak was introduce here:
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommit%2F%3Fid%3D0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=%2FeOQf12NBkC3YGZ7QW66%2FT%2FpyM3DjEb9IMbqUvISMfo%3D&reserved=0
>>>> where the bound status was introduced for all drm drivers including
>>>> radeon and amdgpu. So this patch just reverts the logic to the
>>>> original code but keeping the bound status. In my case, the binding
>>>> code allocates the user pages memory and returns without bounding (at
>>>> amdgpu_gtt_mgr_has_gart_addr). So,
>>>> when the unbinding happens, the memory needs to be cleared to prevent the leak.
>>> Ah, now I understand what's happening here. Daniel your patch is not
>>> really correct.
>>>
>>> The problem is rather that we don't set the tt object to bound if it
>>> doesn't have a GTT address.
>>>
>>> Going to provide a patch for this.
>> Did this patch ever land?
> I don't think so but I might send a v2 following Christian's comment
> if you guys agree.
Somebody else already provided a patch which I reviewed, but I'm not
sure if that landed either.
> Also, the patch here is for radeon but the pagefault issue reported by
> Felix is affected by the amdgpu one:
>
> radeon patch: drm/radeon/ttm: Fix memory leak userptr pages
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210318083236.43578-1-daniel%40qtec.com%2F&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=HSMK%2FqYz%2Bzz9qbKc%2FITUWluBDeaW9YrgyH8p0L640%2F0%3D&reserved=0
>
> amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210317160840.36019-1-daniel%40qtec.com%2F&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=UsUZ4YjCSjHmzlPB07oTaGrsntTrQSwlGk%2BxUjwDiag%3D&reserved=0
>
> I assume both need to be fixed with the same approach.
Yes correct. Let me double check where that fix went.
Thanks,
Christian.
>
> Daniel
>> Alex
>>
>>> Regards,
>>> Christian.
>>>
>>>>> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
>>>>> [17359.543746] BUG: kernel NULL pointer dereference, address: 0000000000000000
>>>>> [17359.551494] #PF: supervisor read access in kernel mode
>>>>> [17359.557375] #PF: error_code(0x0000) - not-present page
>>>>> [17359.563247] PGD 0 P4D 0
>>>>> [17359.566514] Oops: 0000 [#1] SMP PTI
>>>>> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted 5.11.0-kfd-fkuehlin #193
>>>>> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS 3201 06/17/2016
>>>>> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
>>>>> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
>>>>> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
>>>>> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
>>>>> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
>>>>> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
>>>>> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
>>>>> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
>>>>> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
>>>>> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
>>>>> [17359.682883] Call Trace:
>>>>> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
>>>>> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
>>>>> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
>>>>> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
>>>>> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 [amdgpu]
>>>>> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
>>>>> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
>>>>> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
>>>>> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
>>>>> [17359.738796] do_syscall_64+0x2d/0x40
>>>>> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>> [17359.749205] RIP: 0033:0x7febb083b6d7
>>>>> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
>>>>> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
>>>>> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007febb083b6d7
>>>>> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: 0000000000000003
>>>>> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: 00000000c4000004
>>>>> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: 0000559416e4e2aa
>>>>> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: 0000000000000000
>>>>> [17359.822140] Modules linked in: ip6table_filter ip6_tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 gpu_sched ip_tables x_tables
>>>>> [17359.837776] CR2: 0000000000000000
>>>>> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
>>>>> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 [amdgpu]
>>>>> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
>>>>> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
>>>>> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: 0000000000000000
>>>>> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: ffff950eadec5e80
>>>>> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: 0000000000000000
>>>>> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: ffff950c03377800
>>>>> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: 0000000000000000
>>>>> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000) knlGS:0000000000000000
>>>>> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: 00000000001706e0
>>>> From what I understand, the init_user_pages fails (returns EBUSY) and
>>>> the code goes to allocate_init_user_pages_failed where the unbind and
>>>> the userptr clear occurs.
>>>> Can we prevent this if we save the bounding status + userptr alloc? so
>>>> the function amdgpu_ttm_backend_unbind returns without trying to clear
>>>> the userptr memory?
>>>>
>>>> Something like:
>>>>
>>>> amdgpu_ttm_backend_bind:
>>>> if (gtt->userptr) {
>>>> r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
>>>> if (r) ...
>>>> gtt->sg_table = true;
>>>> }
>>>>
>>>> amdgpu_ttm_backend_unbind:
>>>> if (gtt->sg_table) {
>>>> if (gtt->user_ptr) ...
>>>> }
>>>>
>>>> If you agree, I'll send a v2 patch. Otherwise, maybe we could return
>>>> within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
>>>>
>>>> Any other ideas?
>>>>
>>>> Regards,
>>>> Daniel
>>>>
>>>>> Reverting this patch fixes the problem for me.
>>>>>
>>>>> Regards,
>>>>> Felix
>>>>>
>>>>> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
>>>>>> Applied. Thanks!
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
>>>>>> <Christian.Koenig@amd.com> wrote:
>>>>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
>>>>>>> ________________________________
>>>>>>> Von: Daniel Gomez <daniel@qtec.com>
>>>>>>> Gesendet: Donnerstag, 18. März 2021 09:32
>>>>>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org>
>>>>>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
>>>>>>>
>>>>>>> If userptr pages have been pinned but not bounded,
>>>>>>> they remain uncleared.
>>>>>>>
>>>>>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
>>>>>>> index e8c66d10478f..bbcc6264d48f 100644
>>>>>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
>>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
>>>>>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
>>>>>>> struct radeon_ttm_tt *gtt = (void *)ttm;
>>>>>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
>>>>>>>
>>>>>>> + if (gtt->userptr)
>>>>>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
>>>>>>> +
>>>>>>> if (!gtt->bound)
>>>>>>> return;
>>>>>>>
>>>>>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
>>>>>>>
>>>>>>> - if (gtt->userptr)
>>>>>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
>>>>>>> gtt->bound = false;
>>>>>>> }
>>>>>>>
>>>>>>> --
>>>>>>> 2.30.2
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> dri-devel mailing list
>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=aU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEqrs%3D&reserved=0
>>>>>> _______________________________________________
>>>>>> dri-devel mailing list
>>>>>> dri-devel@lists.freedesktop.org
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=aU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEqrs%3D&reserved=0
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=USmYbhfkSfPcE1npvsMfRwBr9Ijresh1fH4cAeNEr2M%3D&reserved=0
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
2021-04-07 9:27 ` Christian König
@ 2021-05-08 9:30 ` Daniel Gomez
-1 siblings, 0 replies; 31+ messages in thread
From: Daniel Gomez @ 2021-05-08 9:30 UTC (permalink / raw)
To: Christian König
Cc: dri-devel, David Airlie, Christian König, Felix Kuehling,
linux-kernel, amd-gfx, Deucher, Alexander, dagmcr
[-- Attachment #1: Type: text/plain, Size: 14112 bytes --]
Hi guys,
On Wed, 7 Apr 2021 at 11:27, Christian König <christian.koenig@amd.com>
wrote:
> Am 07.04.21 um 09:47 schrieb Daniel Gomez:
> > On Tue, 6 Apr 2021 at 22:56, Alex Deucher <alexdeucher@gmail.com> wrote:
> >> On Mon, Mar 22, 2021 at 6:34 AM Christian König
> >> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>> Hi Daniel,
> >>>
> >>> Am 22.03.21 um 10:38 schrieb Daniel Gomez:
> >>>> On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com>
> wrote:
> >>>>> This caused a regression in kfdtest in a large-buffer stress test
> after
> >>>>> memory allocation for user pages fails:
> >>>> I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
> >>>> not this one.
> >>>> Just some background for the mem leak patch if helps to understand
> this:
> >>>> The leak was introduce here:
> >>>>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommit%2F%3Fid%3D0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=%2FeOQf12NBkC3YGZ7QW66%2FT%2FpyM3DjEb9IMbqUvISMfo%3D&reserved=0
> >>>> where the bound status was introduced for all drm drivers including
> >>>> radeon and amdgpu. So this patch just reverts the logic to the
> >>>> original code but keeping the bound status. In my case, the binding
> >>>> code allocates the user pages memory and returns without bounding (at
> >>>> amdgpu_gtt_mgr_has_gart_addr). So,
> >>>> when the unbinding happens, the memory needs to be cleared to prevent
> the leak.
> >>> Ah, now I understand what's happening here. Daniel your patch is not
> >>> really correct.
> >>>
> >>> The problem is rather that we don't set the tt object to bound if it
> >>> doesn't have a GTT address.
> >>>
> >>> Going to provide a patch for this.
> >> Did this patch ever land?
> > I don't think so but I might send a v2 following Christian's comment
> > if you guys agree.
>
> Somebody else already provided a patch which I reviewed, but I'm not
> sure if that landed either.
>
> > Also, the patch here is for radeon but the pagefault issue reported by
> > Felix is affected by the amdgpu one:
> >
> > radeon patch: drm/radeon/ttm: Fix memory leak userptr pages
> >
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210318083236.43578-1-daniel%40qtec.com%2F&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=HSMK%2FqYz%2Bzz9qbKc%2FITUWluBDeaW9YrgyH8p0L640%2F0%3D&reserved=0
> >
> > amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages
> >
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210317160840.36019-1-daniel%40qtec.com%2F&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=UsUZ4YjCSjHmzlPB07oTaGrsntTrQSwlGk%2BxUjwDiag%3D&reserved=0
> >
> > I assume both need to be fixed with the same approach.
>
> Yes correct. Let me double check where that fix went.
This patch (actually, the memory leak fix for amdgpu not radeon) has landed
in mainline and has been back-ported to the stable branches. I just want to
verify with you if that’s okay and the NULL pointer issue reported by Felix
is fixed by this other patch:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c?id=3c3dc654333f6389803cdcaf03912e94173ae510
Thanks,
Daniel
>
>
> Thanks,
> Christian.
>
> >
> > Daniel
> >> Alex
> >>
> >>> Regards,
> >>> Christian.
> >>>
> >>>>> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
> >>>>> [17359.543746] BUG: kernel NULL pointer dereference, address:
> 0000000000000000
> >>>>> [17359.551494] #PF: supervisor read access in kernel mode
> >>>>> [17359.557375] #PF: error_code(0x0000) - not-present page
> >>>>> [17359.563247] PGD 0 P4D 0
> >>>>> [17359.566514] Oops: 0000 [#1] SMP PTI
> >>>>> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted
> 5.11.0-kfd-fkuehlin #193
> >>>>> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS
> 3201 06/17/2016
> >>>>> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110
> [amdgpu]
> >>>>> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8
> 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e
> 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >>>>> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >>>>> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX:
> 0000000000000000
> >>>>> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI:
> ffff950eadec5e80
> >>>>> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09:
> 0000000000000000
> >>>>> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12:
> ffff950c03377800
> >>>>> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15:
> 0000000000000000
> >>>>> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000)
> knlGS:0000000000000000
> >>>>> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4:
> 00000000001706e0
> >>>>> [17359.682883] Call Trace:
> >>>>> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
> >>>>> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
> >>>>> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
> >>>>> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
> >>>>> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910
> [amdgpu]
> >>>>> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
> >>>>> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
> >>>>> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
> >>>>> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
> >>>>> [17359.738796] do_syscall_64+0x2d/0x40
> >>>>> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>>>> [17359.749205] RIP: 0033:0x7febb083b6d7
> >>>>> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00
> 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
> >>>>> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000010
> >>>>> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX:
> 00007febb083b6d7
> >>>>> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI:
> 0000000000000003
> >>>>> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09:
> 00000000c4000004
> >>>>> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12:
> 0000559416e4e2aa
> >>>>> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15:
> 0000000000000000
> >>>>> [17359.822140] Modules linked in: ip6table_filter ip6_tables
> iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2
> gpu_sched ip_tables x_tables
> >>>>> [17359.837776] CR2: 0000000000000000
> >>>>> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
> >>>>> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110
> [amdgpu]
> >>>>> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8
> 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e
> 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >>>>> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >>>>> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX:
> 0000000000000000
> >>>>> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI:
> ffff950eadec5e80
> >>>>> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09:
> 0000000000000000
> >>>>> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12:
> ffff950c03377800
> >>>>> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15:
> 0000000000000000
> >>>>> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000)
> knlGS:0000000000000000
> >>>>> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4:
> 00000000001706e0
> >>>> From what I understand, the init_user_pages fails (returns EBUSY)
> and
> >>>> the code goes to allocate_init_user_pages_failed where the unbind and
> >>>> the userptr clear occurs.
> >>>> Can we prevent this if we save the bounding status + userptr alloc? so
> >>>> the function amdgpu_ttm_backend_unbind returns without trying to clear
> >>>> the userptr memory?
> >>>>
> >>>> Something like:
> >>>>
> >>>> amdgpu_ttm_backend_bind:
> >>>> if (gtt->userptr) {
> >>>> r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
> >>>> if (r) ...
> >>>> gtt->sg_table = true;
> >>>> }
> >>>>
> >>>> amdgpu_ttm_backend_unbind:
> >>>> if (gtt->sg_table) {
> >>>> if (gtt->user_ptr) ...
> >>>> }
> >>>>
> >>>> If you agree, I'll send a v2 patch. Otherwise, maybe we could return
> >>>> within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
> >>>>
> >>>> Any other ideas?
> >>>>
> >>>> Regards,
> >>>> Daniel
> >>>>
> >>>>> Reverting this patch fixes the problem for me.
> >>>>>
> >>>>> Regards,
> >>>>> Felix
> >>>>>
> >>>>> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
> >>>>>> Applied. Thanks!
> >>>>>>
> >>>>>> Alex
> >>>>>>
> >>>>>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
> >>>>>> <Christian.Koenig@amd.com> wrote:
> >>>>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
> >>>>>>> ________________________________
> >>>>>>> Von: Daniel Gomez <daniel@qtec.com>
> >>>>>>> Gesendet: Donnerstag, 18. März 2021 09:32
> >>>>>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <
> daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig,
> Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>;
> Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <
> amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <
> dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <
> linux-kernel@vger.kernel.org>
> >>>>>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
> >>>>>>>
> >>>>>>> If userptr pages have been pinned but not bounded,
> >>>>>>> they remain uncleared.
> >>>>>>>
> >>>>>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
> >>>>>>> ---
> >>>>>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
> >>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c
> b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>>>>> index e8c66d10478f..bbcc6264d48f 100644
> >>>>>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>>>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct
> ttm_bo_device *bdev, struct ttm_tt
> >>>>>>> struct radeon_ttm_tt *gtt = (void *)ttm;
> >>>>>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
> >>>>>>>
> >>>>>>> + if (gtt->userptr)
> >>>>>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>>>>> +
> >>>>>>> if (!gtt->bound)
> >>>>>>> return;
> >>>>>>>
> >>>>>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
> >>>>>>>
> >>>>>>> - if (gtt->userptr)
> >>>>>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>>>>> gtt->bound = false;
> >>>>>>> }
> >>>>>>>
> >>>>>>> --
> >>>>>>> 2.30.2
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> dri-devel mailing list
> >>>>>>> dri-devel@lists.freedesktop.org
> >>>>>>>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=aU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEqrs%3D&reserved=0
> >>>>>> _______________________________________________
> >>>>>> dri-devel mailing list
> >>>>>> dri-devel@lists.freedesktop.org
> >>>>>>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=aU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEqrs%3D&reserved=0
> >>>> _______________________________________________
> >>>> amd-gfx mailing list
> >>>> amd-gfx@lists.freedesktop.org
> >>>>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=USmYbhfkSfPcE1npvsMfRwBr9Ijresh1fH4cAeNEr2M%3D&reserved=0
>
>
[-- Attachment #2: Type: text/html, Size: 22899 bytes --]
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
@ 2021-05-08 9:30 ` Daniel Gomez
0 siblings, 0 replies; 31+ messages in thread
From: Daniel Gomez @ 2021-05-08 9:30 UTC (permalink / raw)
To: Christian König
Cc: dri-devel, David Airlie, Christian König, Felix Kuehling,
linux-kernel, amd-gfx, Deucher, Alexander, dagmcr, Alex Deucher
[-- Attachment #1.1: Type: text/plain, Size: 14112 bytes --]
Hi guys,
On Wed, 7 Apr 2021 at 11:27, Christian König <christian.koenig@amd.com>
wrote:
> Am 07.04.21 um 09:47 schrieb Daniel Gomez:
> > On Tue, 6 Apr 2021 at 22:56, Alex Deucher <alexdeucher@gmail.com> wrote:
> >> On Mon, Mar 22, 2021 at 6:34 AM Christian König
> >> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>> Hi Daniel,
> >>>
> >>> Am 22.03.21 um 10:38 schrieb Daniel Gomez:
> >>>> On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com>
> wrote:
> >>>>> This caused a regression in kfdtest in a large-buffer stress test
> after
> >>>>> memory allocation for user pages fails:
> >>>> I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and
> >>>> not this one.
> >>>> Just some background for the mem leak patch if helps to understand
> this:
> >>>> The leak was introduce here:
> >>>>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommit%2F%3Fid%3D0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=%2FeOQf12NBkC3YGZ7QW66%2FT%2FpyM3DjEb9IMbqUvISMfo%3D&reserved=0
> >>>> where the bound status was introduced for all drm drivers including
> >>>> radeon and amdgpu. So this patch just reverts the logic to the
> >>>> original code but keeping the bound status. In my case, the binding
> >>>> code allocates the user pages memory and returns without bounding (at
> >>>> amdgpu_gtt_mgr_has_gart_addr). So,
> >>>> when the unbinding happens, the memory needs to be cleared to prevent
> the leak.
> >>> Ah, now I understand what's happening here. Daniel your patch is not
> >>> really correct.
> >>>
> >>> The problem is rather that we don't set the tt object to bound if it
> >>> doesn't have a GTT address.
> >>>
> >>> Going to provide a patch for this.
> >> Did this patch ever land?
> > I don't think so but I might send a v2 following Christian's comment
> > if you guys agree.
>
> Somebody else already provided a patch which I reviewed, but I'm not
> sure if that landed either.
>
> > Also, the patch here is for radeon but the pagefault issue reported by
> > Felix is affected by the amdgpu one:
> >
> > radeon patch: drm/radeon/ttm: Fix memory leak userptr pages
> >
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210318083236.43578-1-daniel%40qtec.com%2F&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=HSMK%2FqYz%2Bzz9qbKc%2FITUWluBDeaW9YrgyH8p0L640%2F0%3D&reserved=0
> >
> > amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages
> >
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210317160840.36019-1-daniel%40qtec.com%2F&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=UsUZ4YjCSjHmzlPB07oTaGrsntTrQSwlGk%2BxUjwDiag%3D&reserved=0
> >
> > I assume both need to be fixed with the same approach.
>
> Yes correct. Let me double check where that fix went.
This patch (actually, the memory leak fix for amdgpu not radeon) has landed
in mainline and has been back-ported to the stable branches. I just want to
verify with you if that’s okay and the NULL pointer issue reported by Felix
is fixed by this other patch:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c?id=3c3dc654333f6389803cdcaf03912e94173ae510
Thanks,
Daniel
>
>
> Thanks,
> Christian.
>
> >
> > Daniel
> >> Alex
> >>
> >>> Regards,
> >>> Christian.
> >>>
> >>>>> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -16
> >>>>> [17359.543746] BUG: kernel NULL pointer dereference, address:
> 0000000000000000
> >>>>> [17359.551494] #PF: supervisor read access in kernel mode
> >>>>> [17359.557375] #PF: error_code(0x0000) - not-present page
> >>>>> [17359.563247] PGD 0 P4D 0
> >>>>> [17359.566514] Oops: 0000 [#1] SMP PTI
> >>>>> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted
> 5.11.0-kfd-fkuehlin #193
> >>>>> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIOS
> 3201 06/17/2016
> >>>>> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110
> [amdgpu]
> >>>>> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8
> 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e
> 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >>>>> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >>>>> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX:
> 0000000000000000
> >>>>> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI:
> ffff950eadec5e80
> >>>>> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09:
> 0000000000000000
> >>>>> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12:
> ffff950c03377800
> >>>>> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15:
> 0000000000000000
> >>>>> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000)
> knlGS:0000000000000000
> >>>>> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4:
> 00000000001706e0
> >>>>> [17359.682883] Call Trace:
> >>>>> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu]
> >>>>> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm]
> >>>>> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm]
> >>>>> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu]
> >>>>> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910
> [amdgpu]
> >>>>> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu]
> >>>>> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu]
> >>>>> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu]
> >>>>> [17359.734152] __x64_sys_ioctl+0x8b/0xd0
> >>>>> [17359.738796] do_syscall_64+0x2d/0x40
> >>>>> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>>>> [17359.749205] RIP: 0033:0x7febb083b6d7
> >>>>> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 00
> 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
> 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48
> >>>>> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX:
> 0000000000000010
> >>>>> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX:
> 00007febb083b6d7
> >>>>> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI:
> 0000000000000003
> >>>>> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09:
> 00000000c4000004
> >>>>> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12:
> 0000559416e4e2aa
> >>>>> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15:
> 0000000000000000
> >>>>> [17359.822140] Modules linked in: ip6table_filter ip6_tables
> iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2
> gpu_sched ip_tables x_tables
> >>>>> [17359.837776] CR2: 0000000000000000
> >>>>> [17359.841888] ---[ end trace a6f27d64475b28c8 ]---
> >>>>> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110
> [amdgpu]
> >>>>> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e8
> 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4e
> 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45
> >>>>> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206
> >>>>> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX:
> 0000000000000000
> >>>>> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI:
> ffff950eadec5e80
> >>>>> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09:
> 0000000000000000
> >>>>> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12:
> ffff950c03377800
> >>>>> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15:
> 0000000000000000
> >>>>> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000)
> knlGS:0000000000000000
> >>>>> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4:
> 00000000001706e0
> >>>> From what I understand, the init_user_pages fails (returns EBUSY)
> and
> >>>> the code goes to allocate_init_user_pages_failed where the unbind and
> >>>> the userptr clear occurs.
> >>>> Can we prevent this if we save the bounding status + userptr alloc? so
> >>>> the function amdgpu_ttm_backend_unbind returns without trying to clear
> >>>> the userptr memory?
> >>>>
> >>>> Something like:
> >>>>
> >>>> amdgpu_ttm_backend_bind:
> >>>> if (gtt->userptr) {
> >>>> r = amdgpu_ttm_tt_pin_userptr(bdev, ttm);
> >>>> if (r) ...
> >>>> gtt->sg_table = true;
> >>>> }
> >>>>
> >>>> amdgpu_ttm_backend_unbind:
> >>>> if (gtt->sg_table) {
> >>>> if (gtt->user_ptr) ...
> >>>> }
> >>>>
> >>>> If you agree, I'll send a v2 patch. Otherwise, maybe we could return
> >>>> within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated.
> >>>>
> >>>> Any other ideas?
> >>>>
> >>>> Regards,
> >>>> Daniel
> >>>>
> >>>>> Reverting this patch fixes the problem for me.
> >>>>>
> >>>>> Regards,
> >>>>> Felix
> >>>>>
> >>>>> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
> >>>>>> Applied. Thanks!
> >>>>>>
> >>>>>> Alex
> >>>>>>
> >>>>>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian
> >>>>>> <Christian.Koenig@amd.com> wrote:
> >>>>>>> Reviewed-by: Christian König <christian.koenig@amd.com>
> >>>>>>> ________________________________
> >>>>>>> Von: Daniel Gomez <daniel@qtec.com>
> >>>>>>> Gesendet: Donnerstag, 18. März 2021 09:32
> >>>>>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <
> daniel@qtec.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Koenig,
> Christian <Christian.Koenig@amd.com>; David Airlie <airlied@linux.ie>;
> Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <
> amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org <
> dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org <
> linux-kernel@vger.kernel.org>
> >>>>>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages
> >>>>>>>
> >>>>>>> If userptr pages have been pinned but not bounded,
> >>>>>>> they remain uncleared.
> >>>>>>>
> >>>>>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
> >>>>>>> ---
> >>>>>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++--
> >>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c
> b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>>>>> index e8c66d10478f..bbcc6264d48f 100644
> >>>>>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> >>>>>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struct
> ttm_bo_device *bdev, struct ttm_tt
> >>>>>>> struct radeon_ttm_tt *gtt = (void *)ttm;
> >>>>>>> struct radeon_device *rdev = radeon_get_rdev(bdev);
> >>>>>>>
> >>>>>>> + if (gtt->userptr)
> >>>>>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>>>>> +
> >>>>>>> if (!gtt->bound)
> >>>>>>> return;
> >>>>>>>
> >>>>>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
> >>>>>>>
> >>>>>>> - if (gtt->userptr)
> >>>>>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm);
> >>>>>>> gtt->bound = false;
> >>>>>>> }
> >>>>>>>
> >>>>>>> --
> >>>>>>> 2.30.2
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> dri-devel mailing list
> >>>>>>> dri-devel@lists.freedesktop.org
> >>>>>>>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=aU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEqrs%3D&reserved=0
> >>>>>> _______________________________________________
> >>>>>> dri-devel mailing list
> >>>>>> dri-devel@lists.freedesktop.org
> >>>>>>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=aU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEqrs%3D&reserved=0
> >>>> _______________________________________________
> >>>> amd-gfx mailing list
> >>>> amd-gfx@lists.freedesktop.org
> >>>>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=USmYbhfkSfPcE1npvsMfRwBr9Ijresh1fH4cAeNEr2M%3D&reserved=0
>
>
[-- Attachment #1.2: Type: text/html, Size: 22899 bytes --]
[-- Attachment #2: Type: text/plain, Size: 154 bytes --]
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
^ permalink raw reply [flat|nested] 31+ messages in thread