From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.6 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE,INCLUDES_CR_TRAILER, INCLUDES_PATCH,MAILING_LIST_MULTI,MENTIONS_GIT_HOSTING,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 64432C433ED for ; Sat, 8 May 2021 09:31:02 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id E6A0B61426 for ; Sat, 8 May 2021 09:31:01 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E6A0B61426 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=qtec.com Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=dri-devel-bounces@lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 8FE446E875; Sat, 8 May 2021 09:31:00 +0000 (UTC) Received: from mail-lf1-x131.google.com (mail-lf1-x131.google.com [IPv6:2a00:1450:4864:20::131]) by gabe.freedesktop.org (Postfix) with ESMTPS id 2022A6E870 for ; Sat, 8 May 2021 09:30:59 +0000 (UTC) Received: by mail-lf1-x131.google.com with SMTP id t11so16171105lfl.11 for ; Sat, 08 May 2021 02:30:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=qtec.com; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=eWJyRyeQ5MxvA7qcuiyWtFi5pFj6BXtkfVg0lxOlM48=; b=ONor3CYJ+G3cplIPzRuAHbRye/FwGNd69CMsXxLB7DF1f7B+iE7S7GcA997J3Y/9zv 2Q6LVICRameG0dNDL9S1kiP45h7RO37u7mWg+QCzR6flp+jJg4iysGe40fU2ubu+MOs8 l9CKAgwtYsJ8EmZFBe/8BpGNSatd1N/YDoS6/pmJwwMMa1px5rwKLILN4s97Z0VWo8lE tWpQGoChoQvXPXbjdpardWnm2H2H4VpXwFFjImjq++ugoYURb/LGFyyYhkc31Db7AJk7 Nq4ll+ugY9Y+1YYSSZ5QQwQTYKwfEx5O8Q1AEkZhvfQ7pB+UXwBfy2iniKi14+MehD2b x4oQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=eWJyRyeQ5MxvA7qcuiyWtFi5pFj6BXtkfVg0lxOlM48=; b=bBRd15X7FE5gJa7rPxvHQ1wyl6xH5vQNrwXSswMc0PQRRsAIUtMJrEH0o0tGLiVyyY uTq05OMuy7DKsZzDySZkw9VIhqhtIwVn76nEZgNurpf0j7UlGMQBG2D2ib16PLk6z7gH /FgmkaKOWbVA4xx29kw3FmDdf69PeeEbOGryLl1Z0UJsCLU1YzFqMUy4N1UmMyrCV2Gy 1ffX4yS4b7tIUYybXjNwoJlK7zVXHzkarqxdibnt9Ba27uhilF47HbMzur3PcKkoOEjk IgKkUedw5nsKP/I48xObNw0kMQX0T6dk0uaxLDz/rK8A/HfH7fYBKItp+k2lh5/x1bWF Yx3w== X-Gm-Message-State: AOAM532wYPaWlP2EPN+bqqh3p+MikjVm2j5a0si9JpxEsAkRWug7uSsQ OhwxnOWaUySJXrOuniY2dYxs82zDifdFiorTBaNQXg== X-Google-Smtp-Source: ABdhPJwcMrHQNEMqUJXEnpuVQuWQ1t9VzK8eBl/rXycSqJgd1R1UV/10l4P7xu9ewe9JGqzRWaWkrKxYz76Hoa3hH1A= X-Received: by 2002:a05:6512:304c:: with SMTP id b12mr9280009lfb.3.1620466257288; Sat, 08 May 2021 02:30:57 -0700 (PDT) MIME-Version: 1.0 References: <20210318083236.43578-1-daniel@qtec.com> <375f0915-83b3-c729-b95f-939d828d24d0@amd.com> <605b85ab-57ce-86ed-1899-8874450dd318@amd.com> In-Reply-To: <605b85ab-57ce-86ed-1899-8874450dd318@amd.com> From: Daniel Gomez Date: Sat, 8 May 2021 11:30:46 +0200 Message-ID: Subject: Re: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages To: =?UTF-8?Q?Christian_K=C3=B6nig?= Content-Type: multipart/alternative; boundary="000000000000a2f2b305c1ce3317" X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: "dri-devel@lists.freedesktop.org" , David Airlie , =?UTF-8?Q?Christian_K=C3=B6nig?= , Felix Kuehling , "linux-kernel@vger.kernel.org" , "amd-gfx@lists.freedesktop.org" , "Deucher, Alexander" , "dagmcr@gmail.com" Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" --000000000000a2f2b305c1ce3317 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi guys, On Wed, 7 Apr 2021 at 11:27, Christian K=C3=B6nig wrote: > Am 07.04.21 um 09:47 schrieb Daniel Gomez: > > On Tue, 6 Apr 2021 at 22:56, Alex Deucher wrote= : > >> On Mon, Mar 22, 2021 at 6:34 AM Christian K=C3=B6nig > >> wrote: > >>> Hi Daniel, > >>> > >>> Am 22.03.21 um 10:38 schrieb Daniel Gomez: > >>>> On Fri, 19 Mar 2021 at 21:29, Felix Kuehling > wrote: > >>>>> This caused a regression in kfdtest in a large-buffer stress test > after > >>>>> memory allocation for user pages fails: > >>>> I'm sorry to hear that. BTW, I guess you meant amdgpu leak patch and > >>>> not this one. > >>>> Just some background for the mem leak patch if helps to understand > this: > >>>> The leak was introduce here: > >>>> > https://nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%2Fgit.k= ernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fcommi= t%2F%3Fid%3D0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb&data=3D04%7C01%7CC= hristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe488= 4e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8e= yJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&am= p;sdata=3D%2FeOQf12NBkC3YGZ7QW66%2FT%2FpyM3DjEb9IMbqUvISMfo%3D&reserved= =3D0 > >>>> where the bound status was introduced for all drm drivers including > >>>> radeon and amdgpu. So this patch just reverts the logic to the > >>>> original code but keeping the bound status. In my case, the binding > >>>> code allocates the user pages memory and returns without bounding (a= t > >>>> amdgpu_gtt_mgr_has_gart_addr). So, > >>>> when the unbinding happens, the memory needs to be cleared to preven= t > the leak. > >>> Ah, now I understand what's happening here. Daniel your patch is not > >>> really correct. > >>> > >>> The problem is rather that we don't set the tt object to bound if it > >>> doesn't have a GTT address. > >>> > >>> Going to provide a patch for this. > >> Did this patch ever land? > > I don't think so but I might send a v2 following Christian's comment > > if you guys agree. > > Somebody else already provided a patch which I reviewed, but I'm not > sure if that landed either. > > > Also, the patch here is for radeon but the pagefault issue reported by > > Felix is affected by the amdgpu one: > > > > radeon patch: drm/radeon/ttm: Fix memory leak userptr pages > > > https://nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%2Fpatch= work.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210318083236.43578-1-dani= el%40qtec.com%2F&data=3D04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f= 02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C63753= 3784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiL= CJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=3DHSMK%2FqYz%2Bzz9qbKc%2FITU= WluBDeaW9YrgyH8p0L640%2F0%3D&reserved=3D0 > > > > amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages > > > https://nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%2Fpatch= work.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F20210317160840.36019-1-dani= el%40qtec.com%2F&data=3D04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f= 02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C63753= 3784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiL= CJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=3DUsUZ4YjCSjHmzlPB07oTaGrsnt= TrQSwlGk%2BxUjwDiag%3D&reserved=3D0 > > > > I assume both need to be fixed with the same approach. > > Yes correct. Let me double check where that fix went. This patch (actually, the memory leak fix for amdgpu not radeon) has landed in mainline and has been back-ported to the stable branches. I just want to verify with you if that=E2=80=99s okay and the NULL pointer issue reported = by Felix is fixed by this other patch: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/d= rivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c?id=3D3c3dc654333f6389803cdcaf03912e9= 4173ae510 Thanks, Daniel > > > Thanks, > Christian. > > > > > Daniel > >> Alex > >> > >>> Regards, > >>> Christian. > >>> > >>>>> [17359.536303] amdgpu: init_user_pages: Failed to get user pages: -= 16 > >>>>> [17359.543746] BUG: kernel NULL pointer dereference, address: > 0000000000000000 > >>>>> [17359.551494] #PF: supervisor read access in kernel mode > >>>>> [17359.557375] #PF: error_code(0x0000) - not-present page > >>>>> [17359.563247] PGD 0 P4D 0 > >>>>> [17359.566514] Oops: 0000 [#1] SMP PTI > >>>>> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tainted > 5.11.0-kfd-fkuehlin #193 > >>>>> [17359.578760] Hardware name: ASUS All Series/X99-E WS/USB 3.1, BIO= S > 3201 06/17/2016 > >>>>> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 > [amdgpu] > >>>>> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e= 8 > 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4= e > 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45 > >>>>> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206 > >>>>> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: > 0000000000000000 > >>>>> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: > ffff950eadec5e80 > >>>>> [17359.636084] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: > 0000000000000000 > >>>>> [17359.643958] R10: 0000000000000246 R11: 0000000000000001 R12: > ffff950c03377800 > >>>>> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: > 0000000000000000 > >>>>> [17359.659701] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000= ) > knlGS:0000000000000000 > >>>>> [17359.668528] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>>>> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: > 00000000001706e0 > >>>>> [17359.682883] Call Trace: > >>>>> [17359.686063] amdgpu_ttm_backend_destroy+0x12/0x70 [amdgpu] > >>>>> [17359.692349] ttm_bo_cleanup_memtype_use+0x37/0x60 [ttm] > >>>>> [17359.698307] ttm_bo_release+0x278/0x5e0 [ttm] > >>>>> [17359.703385] amdgpu_bo_unref+0x1a/0x30 [amdgpu] > >>>>> [17359.708701] amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7e5/0x910 > [amdgpu] > >>>>> [17359.716307] kfd_ioctl_alloc_memory_of_gpu+0x11a/0x220 [amdgpu] > >>>>> [17359.723036] kfd_ioctl+0x223/0x400 [amdgpu] > >>>>> [17359.728017] ? kfd_dev_is_large_bar+0x90/0x90 [amdgpu] > >>>>> [17359.734152] __x64_sys_ioctl+0x8b/0xd0 > >>>>> [17359.738796] do_syscall_64+0x2d/0x40 > >>>>> [17359.743259] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > >>>>> [17359.749205] RIP: 0033:0x7febb083b6d7 > >>>>> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 c7 00 26 00 0= 0 > 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0= f > 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f7 d8 64 89 01 48 > >>>>> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 00000202 ORIG_RAX= : > 0000000000000010 > >>>>> [17359.782668] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: > 00007febb083b6d7 > >>>>> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c0284b16 RDI: > 0000000000000003 > >>>>> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb5522dd0 R09: > 00000000c4000004 > >>>>> [17359.806352] R10: 0000000000000000 R11: 0000000000000202 R12: > 0000559416e4e2aa > >>>>> [17359.814251] R13: 0000000000000000 R14: 0000000000000021 R15: > 0000000000000000 > >>>>> [17359.822140] Modules linked in: ip6table_filter ip6_tables > iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_v2 > gpu_sched ip_tables x_tables > >>>>> [17359.837776] CR2: 0000000000000000 > >>>>> [17359.841888] ---[ end trace a6f27d64475b28c8 ]--- > >>>>> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x52/0x110 > [amdgpu] > >>>>> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd 80 a1 ff ff e= 8 > 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 8b 46 10 8b 4= e > 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 ff ff 83 e1 01 45 > >>>>> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 00010206 > >>>>> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4edf00 RCX: > 0000000000000000 > >>>>> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4edf00 RDI: > ffff950eadec5e80 > >>>>> [17359.897008] RBP: ffff950eadec5e80 R08: 0000000000000000 R09: > 0000000000000000 > >>>>> [17359.905020] R10: 0000000000000246 R11: 0000000000000001 R12: > ffff950c03377800 > >>>>> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c03377858 R15: > 0000000000000000 > >>>>> [17359.921050] FS: 00007febb20cb740(0000) GS:ffff950ebfc00000(0000= ) > knlGS:0000000000000000 > >>>>> [17359.930047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >>>>> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700e005 CR4: > 00000000001706e0 > >>>> From what I understand, the init_user_pages fails (returns EBUSY) > and > >>>> the code goes to allocate_init_user_pages_failed where the unbind an= d > >>>> the userptr clear occurs. > >>>> Can we prevent this if we save the bounding status + userptr alloc? = so > >>>> the function amdgpu_ttm_backend_unbind returns without trying to cle= ar > >>>> the userptr memory? > >>>> > >>>> Something like: > >>>> > >>>> amdgpu_ttm_backend_bind: > >>>> if (gtt->userptr) { > >>>> r =3D amdgpu_ttm_tt_pin_userptr(bdev, ttm); > >>>> if (r) ... > >>>> gtt->sg_table =3D true; > >>>> } > >>>> > >>>> amdgpu_ttm_backend_unbind: > >>>> if (gtt->sg_table) { > >>>> if (gtt->user_ptr) ... > >>>> } > >>>> > >>>> If you agree, I'll send a v2 patch. Otherwise, maybe we could return > >>>> within amdgpu_ttm_tt_unpin_userptr if memory hasn't been allocated. > >>>> > >>>> Any other ideas? > >>>> > >>>> Regards, > >>>> Daniel > >>>> > >>>>> Reverting this patch fixes the problem for me. > >>>>> > >>>>> Regards, > >>>>> Felix > >>>>> > >>>>> On 2021-03-18 10:57 p.m., Alex Deucher wrote: > >>>>>> Applied. Thanks! > >>>>>> > >>>>>> Alex > >>>>>> > >>>>>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian > >>>>>> wrote: > >>>>>>> Reviewed-by: Christian K=C3=B6nig > >>>>>>> ________________________________ > >>>>>>> Von: Daniel Gomez > >>>>>>> Gesendet: Donnerstag, 18. M=C3=A4rz 2021 09:32 > >>>>>>> Cc: dagmcr@gmail.com ; Daniel Gomez < > daniel@qtec.com>; Deucher, Alexander ; Koenig, > Christian ; David Airlie ; > Daniel Vetter ; amd-gfx@lists.freedesktop.org < > amd-gfx@lists.freedesktop.org>; dri-devel@lists.freedesktop.org < > dri-devel@lists.freedesktop.org>; linux-kernel@vger.kernel.org < > linux-kernel@vger.kernel.org> > >>>>>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory leak userptr pages > >>>>>>> > >>>>>>> If userptr pages have been pinned but not bounded, > >>>>>>> they remain uncleared. > >>>>>>> > >>>>>>> Signed-off-by: Daniel Gomez > >>>>>>> --- > >>>>>>> drivers/gpu/drm/radeon/radeon_ttm.c | 5 +++-- > >>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-) > >>>>>>> > >>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c > b/drivers/gpu/drm/radeon/radeon_ttm.c > >>>>>>> index e8c66d10478f..bbcc6264d48f 100644 > >>>>>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c > >>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c > >>>>>>> @@ -485,13 +485,14 @@ static void radeon_ttm_backend_unbind(struc= t > ttm_bo_device *bdev, struct ttm_tt > >>>>>>> struct radeon_ttm_tt *gtt =3D (void *)ttm; > >>>>>>> struct radeon_device *rdev =3D radeon_get_rdev(bdev); > >>>>>>> > >>>>>>> + if (gtt->userptr) > >>>>>>> + radeon_ttm_tt_unpin_userptr(bdev, ttm); > >>>>>>> + > >>>>>>> if (!gtt->bound) > >>>>>>> return; > >>>>>>> > >>>>>>> radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages)= ; > >>>>>>> > >>>>>>> - if (gtt->userptr) > >>>>>>> - radeon_ttm_tt_unpin_userptr(bdev, ttm); > >>>>>>> gtt->bound =3D false; > >>>>>>> } > >>>>>>> > >>>>>>> -- > >>>>>>> 2.30.2 > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> dri-devel mailing list > >>>>>>> dri-devel@lists.freedesktop.org > >>>>>>> > https://nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%2Flists= .freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=3D04%7C01%7CChri= stian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e6= 08e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJW= IjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&s= data=3DaU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEqrs%3D&reserved=3D= 0 > >>>>>> _______________________________________________ > >>>>>> dri-devel mailing list > >>>>>> dri-devel@lists.freedesktop.org > >>>>>> > https://nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%2Flists= .freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=3D04%7C01%7CChri= stian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e6= 08e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJW= IjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&s= data=3DaU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEqrs%3D&reserved=3D= 0 > >>>> _______________________________________________ > >>>> amd-gfx mailing list > >>>> amd-gfx@lists.freedesktop.org > >>>> > https://nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%2Flists= .freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&data=3D04%7C01%7CChrist= ian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608= e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIj= oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sda= ta=3DUSmYbhfkSfPcE1npvsMfRwBr9Ijresh1fH4cAeNEr2M%3D&reserved=3D0 > > --000000000000a2f2b305c1ce3317 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi guys,


On Wed, 7 Apr 20= 21 at 11:27, Christian K=C3=B6nig <christian.koenig@amd.com> wrote:
= Am 07.04.21 um 09:47 schrieb Daniel Gomez:
> On Tue, 6 Apr 2021 at 22:56, Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Mon, Mar 22, 2021 at 6:34 AM Christian K=C3=B6nig
>> <ckoenig.leichtzumerken@gmail.com> wrote:
>>> Hi Daniel,
>>>
>>> Am 22.03.21 um 10:38 schrieb Daniel Gomez:
>>>> On Fri, 19 Mar 2021 at 21:29, Felix Kuehling <felix.kuehling@amd.com= > wrote:
>>>>> This caused a regression in kfdtest in a large-buffer = stress test after
>>>>> memory allocation for user pages fails:
>>>> I'm sorry to hear that. BTW, I guess you meant amdgpu = leak patch and
>>>> not this one.
>>>> Just some background for the mem leak patch if helps to un= derstand this:
>>>> The leak was introduce here:
>>>> https://nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%2Fg= it.kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2Fc= ommit%2F%3Fid%3D0b988ca1c7c4c73983b4ea96ef7c2af2263c87eb&amp;data=3D04%= 7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8= 961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpb= GZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7= C1000&amp;sdata=3D%2FeOQf12NBkC3YGZ7QW66%2FT%2FpyM3DjEb9IMbqUvISMfo%3D&= amp;amp;reserved=3D0
>>>> where the bound status was introduced for all drm drivers = including
>>>> radeon and amdgpu. So this patch just reverts the logic to= the
>>>> original code but keeping the bound status. In my case, th= e binding
>>>> code allocates the user pages memory and returns without b= ounding (at
>>>> amdgpu_gtt_mgr_has_gart_addr). So,
>>>> when the unbinding happens, the memory needs to be cleared= to prevent the leak.
>>> Ah, now I understand what's happening here. Daniel your pa= tch is not
>>> really correct.
>>>
>>> The problem is rather that we don't set the tt object to b= ound if it
>>> doesn't have a GTT address.
>>>
>>> Going to provide a patch for this.
>> Did this patch ever land?
> I don't think so but I might send a v2 following Christian's c= omment
> if you guys agree.

Somebody else already provided a patch which I reviewed, but I'm not sure if that landed either.

> Also, the patch here is for radeon but the pagefault issue reported by=
> Felix is affected by the amdgpu one:
>
> radeon patch: drm/radeon/ttm: Fix memory leak userptr pages
> https://nam11.safelinks.protection.outlo= ok.com/?url=3Dhttps%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpa= tch%2F20210318083236.43578-1-daniel%40qtec.com%2F&amp;data=3D04%7C01%7C= Christian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe48= 84e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8= eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&a= mp;amp;sdata=3DHSMK%2FqYz%2Bzz9qbKc%2FITUWluBDeaW9YrgyH8p0L640%2F0%3D&a= mp;reserved=3D0
>
> amdgpu patch: drm/amdgpu/ttm: Fix memory leak userptr pages
> https://nam11.safelinks.protection.outlook.com= /?url=3Dhttps%3A%2F%2Fpatchwork.kernel.org%2Fproject%2Fdri-devel%2Fpatch%2F= 20210317160840.36019-1-daniel%40qtec.com%2F&amp;data=3D04%7C01%7CChrist= ian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608= e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIj= oiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp= ;sdata=3DUsUZ4YjCSjHmzlPB07oTaGrsntTrQSwlGk%2BxUjwDiag%3D&amp;reserved= =3D0
>
> I assume both need to be fixed with the same approach.

Yes correct. Let me double check where that fix went.

This patch (actually, the memory leak= fix for amdgpu not radeon) has landed in mainline and has been back-ported= to the stable branches. I just want to verify with you if that=E2=80=99s o= kay and the NULL pointer issue reported by Felix is fixed by this other pat= ch:

Thanks,
Daniel


Thanks,
Christian.

>
> Daniel
>> Alex
>>
>>> Regards,
>>> Christian.
>>>
>>>>> [17359.536303] amdgpu: init_user_pages: Failed to get = user pages: -16
>>>>> [17359.543746] BUG: kernel NULL pointer dereference, a= ddress: 0000000000000000
>>>>> [17359.551494] #PF: supervisor read access in kernel m= ode
>>>>> [17359.557375] #PF: error_code(0x0000) - not-present p= age
>>>>> [17359.563247] PGD 0 P4D 0
>>>>> [17359.566514] Oops: 0000 [#1] SMP PTI
>>>>> [17359.570728] CPU: 8 PID: 5944 Comm: kfdtest Not tain= ted 5.11.0-kfd-fkuehlin #193
>>>>> [17359.578760] Hardware name: ASUS All Series/X99-E WS= /USB 3.1, BIOS 3201 06/17/2016
>>>>> [17359.586971] RIP: 0010:amdgpu_ttm_backend_unbind+0x5= 2/0x110 [amdgpu]
>>>>> [17359.594075] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd = 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 = 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 f= f ff 83 e1 01 45
>>>>> [17359.614340] RSP: 0018:ffffa4764971fc98 EFLAGS: 0001= 0206
>>>>> [17359.620315] RAX: 0000000000000000 RBX: ffff950e8d4e= df00 RCX: 0000000000000000
>>>>> [17359.628204] RDX: 0000000000000000 RSI: ffff950e8d4e= df00 RDI: ffff950eadec5e80
>>>>> [17359.636084] RBP: ffff950eadec5e80 R08: 000000000000= 0000 R09: 0000000000000000
>>>>> [17359.643958] R10: 0000000000000246 R11: 000000000000= 0001 R12: ffff950c03377800
>>>>> [17359.651833] R13: ffff950eadec5e80 R14: ffff950c0337= 7858 R15: 0000000000000000
>>>>> [17359.659701] FS:=C2=A0 00007febb20cb740(0000) GS:fff= f950ebfc00000(0000) knlGS:0000000000000000
>>>>> [17359.668528] CS:=C2=A0 0010 DS: 0000 ES: 0000 CR0: 0= 000000080050033
>>>>> [17359.675012] CR2: 0000000000000000 CR3: 00000006d700= e005 CR4: 00000000001706e0
>>>>> [17359.682883] Call Trace:
>>>>> [17359.686063]=C2=A0 amdgpu_ttm_backend_destroy+0x12/0= x70 [amdgpu]
>>>>> [17359.692349]=C2=A0 ttm_bo_cleanup_memtype_use+0x37/0= x60 [ttm]
>>>>> [17359.698307]=C2=A0 ttm_bo_release+0x278/0x5e0 [ttm]<= br> >>>>> [17359.703385]=C2=A0 amdgpu_bo_unref+0x1a/0x30 [amdgpu= ]
>>>>> [17359.708701]=C2=A0 amdgpu_amdkfd_gpuvm_alloc_memory_= of_gpu+0x7e5/0x910 [amdgpu]
>>>>> [17359.716307]=C2=A0 kfd_ioctl_alloc_memory_of_gpu+0x1= 1a/0x220 [amdgpu]
>>>>> [17359.723036]=C2=A0 kfd_ioctl+0x223/0x400 [amdgpu] >>>>> [17359.728017]=C2=A0 ? kfd_dev_is_large_bar+0x90/0x90 = [amdgpu]
>>>>> [17359.734152]=C2=A0 __x64_sys_ioctl+0x8b/0xd0
>>>>> [17359.738796]=C2=A0 do_syscall_64+0x2d/0x40
>>>>> [17359.743259]=C2=A0 entry_SYSCALL_64_after_hwframe+0x= 44/0xa9
>>>>> [17359.749205] RIP: 0033:0x7febb083b6d7
>>>>> [17359.753681] Code: b3 66 90 48 8b 05 b1 47 2d 00 64 = c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 = 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 47 2d 00 f= 7 d8 64 89 01 48
>>>>> [17359.774340] RSP: 002b:00007ffdb5522cd8 EFLAGS: 0000= 0202 ORIG_RAX: 0000000000000010
>>>>> [17359.782668] RAX: ffffffffffffffda RBX: 000000000000= 0001 RCX: 00007febb083b6d7
>>>>> [17359.790566] RDX: 00007ffdb5522d60 RSI: 00000000c028= 4b16 RDI: 0000000000000003
>>>>> [17359.798459] RBP: 00007ffdb5522d10 R08: 00007ffdb552= 2dd0 R09: 00000000c4000004
>>>>> [17359.806352] R10: 0000000000000000 R11: 000000000000= 0202 R12: 0000559416e4e2aa
>>>>> [17359.814251] R13: 0000000000000000 R14: 000000000000= 0021 R15: 0000000000000000
>>>>> [17359.822140] Modules linked in: ip6table_filter ip6_= tables iptable_filter amdgpu x86_pkg_temp_thermal drm_ttm_helper ttm iommu_= v2 gpu_sched ip_tables x_tables
>>>>> [17359.837776] CR2: 0000000000000000
>>>>> [17359.841888] ---[ end trace a6f27d64475b28c8 ]--- >>>>> [17359.847318] RIP: 0010:amdgpu_ttm_backend_unbind+0x5= 2/0x110 [amdgpu]
>>>>> [17359.854479] Code: 48 39 c6 74 1b 8b 53 0c 48 8d bd = 80 a1 ff ff e8 24 62 00 00 85 c0 0f 85 ab 00 00 00 c6 43 54 00 5b 5d c3 48 = 8b 46 10 8b 4e 50 <48> 8b 30 48 85 f6 74 ba 8b 50 0c 48 8b bf 80 a1 f= f ff 83 e1 01 45
>>>>> [17359.874929] RSP: 0018:ffffa4764971fc98 EFLAGS: 0001= 0206
>>>>> [17359.881014] RAX: 0000000000000000 RBX: ffff950e8d4e= df00 RCX: 0000000000000000
>>>>> [17359.889007] RDX: 0000000000000000 RSI: ffff950e8d4e= df00 RDI: ffff950eadec5e80
>>>>> [17359.897008] RBP: ffff950eadec5e80 R08: 000000000000= 0000 R09: 0000000000000000
>>>>> [17359.905020] R10: 0000000000000246 R11: 000000000000= 0001 R12: ffff950c03377800
>>>>> [17359.913034] R13: ffff950eadec5e80 R14: ffff950c0337= 7858 R15: 0000000000000000
>>>>> [17359.921050] FS:=C2=A0 00007febb20cb740(0000) GS:fff= f950ebfc00000(0000) knlGS:0000000000000000
>>>>> [17359.930047] CS:=C2=A0 0010 DS: 0000 ES: 0000 CR0: 0= 000000080050033
>>>>> [17359.936674] CR2: 0000000000000000 CR3: 00000006d700= e005 CR4: 00000000001706e0
>>>>=C2=A0 =C2=A0From what I understand, the init_user_pages fa= ils (returns EBUSY) and
>>>> the code goes to allocate_init_user_pages_failed where the= unbind and
>>>> the userptr clear occurs.
>>>> Can we prevent this if we save the bounding status + userp= tr alloc? so
>>>> the function amdgpu_ttm_backend_unbind returns without try= ing to clear
>>>> the userptr memory?
>>>>
>>>> Something like:
>>>>
>>>> amdgpu_ttm_backend_bind:
>>>>=C2=A0 =C2=A0 =C2=A0 =C2=A0if (gtt->userptr) {
>>>>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0r =3D amdgpu_ttm_t= t_pin_userptr(bdev, ttm);
>>>>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (r) ...
>>>>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 gtt->sg_table =3D tru= e;
>>>>=C2=A0 =C2=A0 =C2=A0 }
>>>>
>>>> amdgpu_ttm_backend_unbind:
>>>> if (gtt->sg_table) {
>>>>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0if (gtt->user_p= tr) ...
>>>> }
>>>>
>>>> If you agree, I'll send a v2 patch. Otherwise, maybe w= e could return
>>>> within amdgpu_ttm_tt_unpin_userptr if memory hasn't be= en allocated.
>>>>
>>>> Any other ideas?
>>>>
>>>> Regards,
>>>> Daniel
>>>>
>>>>> Reverting this patch fixes the problem for me.
>>>>>
>>>>> Regards,
>>>>>=C2=A0 =C2=A0 =C2=A0 Felix
>>>>>
>>>>> On 2021-03-18 10:57 p.m., Alex Deucher wrote:
>>>>>> Applied.=C2=A0 Thanks!
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> On Thu, Mar 18, 2021 at 5:00 AM Koenig, Christian<= br> >>>>>> <Christian.Koenig@amd.com> wrote:
>>>>>>> Reviewed-by: Christian K=C3=B6nig <christian.koenig@amd= .com>
>>>>>>> ________________________________
>>>>>>> Von: Daniel Gomez <daniel@qtec.com>
>>>>>>> Gesendet: Donnerstag, 18. M=C3=A4rz 2021 09:32=
>>>>>>> Cc: dagmcr@gmail.com <dagmcr@gmail.com>; Daniel Gomez <daniel@qtec.com>; Deucher, Alex= ander <Al= exander.Deucher@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>;= David Airlie <air= lied@linux.ie>; Daniel Vetter <daniel@ffwll.ch>; amd-gfx@lists.freedesktop.org <<= a href=3D"mailto:amd-gfx@lists.freedesktop.org" target=3D"_blank">amd-gfx@l= ists.freedesktop.org>; dri-devel@lists.freedesktop.org <dri-devel@lists.f= reedesktop.org>; linux-kernel@vger.kernel.org <linux-kernel@vger.kernel.org= >
>>>>>>> Betreff: [PATCH] drm/radeon/ttm: Fix memory le= ak userptr pages
>>>>>>>
>>>>>>> If userptr pages have been pinned but not boun= ded,
>>>>>>> they remain uncleared.
>>>>>>>
>>>>>>> Signed-off-by: Daniel Gomez <daniel@qtec.com>
>>>>>>> ---
>>>>>>>=C2=A0 =C2=A0 =C2=A0drivers/gpu/drm/radeon/rade= on_ttm.c | 5 +++--
>>>>>>>=C2=A0 =C2=A0 =C2=A01 file changed, 3 insertion= s(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/radeon/radeon_ttm= .c b/drivers/gpu/drm/radeon/radeon_ttm.c
>>>>>>> index e8c66d10478f..bbcc6264d48f 100644
>>>>>>> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
>>>>>>> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
>>>>>>> @@ -485,13 +485,14 @@ static void radeon_ttm_b= ackend_unbind(struct ttm_bo_device *bdev, struct ttm_tt
>>>>>>>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0struct radeon_ttm_tt *gtt =3D (void *)ttm;
>>>>>>>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0struct radeon_device *rdev =3D radeon_get_rdev(bdev);
>>>>>>>
>>>>>>> +=C2=A0 =C2=A0 =C2=A0 =C2=A0if (gtt->userpt= r)
>>>>>>> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0radeon_ttm_tt_unpin_userptr(bdev, ttm);
>>>>>>> +
>>>>>>>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0if (!gtt->bound)
>>>>>>>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0return;
>>>>>>>
>>>>>>>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0radeon_gart_unbind(rdev, gtt->offset, ttm->num_pages);
>>>>>>>
>>>>>>> -=C2=A0 =C2=A0 =C2=A0 =C2=A0if (gtt->userpt= r)
>>>>>>> -=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0radeon_ttm_tt_unpin_userptr(bdev, ttm);
>>>>>>>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0gtt->bound =3D false;
>>>>>>>=C2=A0 =C2=A0 =C2=A0}
>>>>>>>
>>>>>>> --
>>>>>>> 2.30.2
>>>>>>>
>>>>>>> ______________________________________________= _
>>>>>>> dri-devel mailing list
>>>>>>> dri-devel@lists.freedesktop.org
>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=3Dhttps%= 3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&amp;data= =3D04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%= 7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7= CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn= 0%3D%7C1000&amp;sdata=3DaU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEq= rs%3D&amp;reserved=3D0
>>>>>> _______________________________________________ >>>>>> dri-devel mailing list
>>>>>> dri-devel@lists.freedesktop.org
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%= 2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&amp;data=3D= 04%7C01%7CChristian.Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3= dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTW= FpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3= D%7C1000&amp;sdata=3DaU9ZAl66EOLKphjWFPXJPR%2BTM%2BZeeMv%2BVJC6vliEqrs%= 3D&amp;reserved=3D0
>>>> _______________________________________________
>>>> amd-gfx mailing list
>>>> amd-gfx@lists.freedesktop.org
>>>> https= ://nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%2Flists.freede= sktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=3D04%7C01%7CChristian= .Koenig%40amd.com%7C65d21b6f02da409ac7b508d8f9997367%7C3dd8961fe4884e608e11= a82d994e183d%7C0%7C0%7C637533784761980218%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiM= C4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sd= ata=3DUSmYbhfkSfPcE1npvsMfRwBr9Ijresh1fH4cAeNEr2M%3D&amp;reserved=3D0

--000000000000a2f2b305c1ce3317--