* [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-02 3:18 ` Qu Huang 0 siblings, 0 replies; 20+ messages in thread From: Qu Huang @ 2021-04-02 3:18 UTC (permalink / raw) To: alexander.deucher, christian.koenig, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: amd-gfx, dri-devel, linux-kernel, linux-media, linaro-mm-sig, jinsdb Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(), the bo->base.resv lock may be held by ttm_mem_evict_first(), and the VRAM mem will be evicted, mem region was replaced by Gtt mem region. amdgpu_bo_release_notify() will then hold the bo->base.resv lock, and SDMA will get an invalid address in amdgpu_fill_buffer(), resulting in a VMFAULT or memory corruption. To avoid it, we have to hold bo->base.resv lock first, and check whether the mem.mem_type is TTM_PL_VRAM. Signed-off-by: Qu Huang <jinsdb@126.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 4b29b82..8018574 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo) if (bo->base.resv == &bo->base._resv) amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) return; dma_resv_lock(bo->base.resv, NULL); + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { + dma_resv_unlock(bo->base.resv); + return; + } + r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence); if (!WARN_ON(r)) { amdgpu_bo_fence(abo, fence, false); -- 1.8.3.1 ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-02 3:18 ` Qu Huang 0 siblings, 0 replies; 20+ messages in thread From: Qu Huang @ 2021-04-02 3:18 UTC (permalink / raw) To: alexander.deucher, christian.koenig, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: jinsdb, linux-kernel, dri-devel, linaro-mm-sig, amd-gfx, linux-media Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(), the bo->base.resv lock may be held by ttm_mem_evict_first(), and the VRAM mem will be evicted, mem region was replaced by Gtt mem region. amdgpu_bo_release_notify() will then hold the bo->base.resv lock, and SDMA will get an invalid address in amdgpu_fill_buffer(), resulting in a VMFAULT or memory corruption. To avoid it, we have to hold bo->base.resv lock first, and check whether the mem.mem_type is TTM_PL_VRAM. Signed-off-by: Qu Huang <jinsdb@126.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 4b29b82..8018574 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo) if (bo->base.resv == &bo->base._resv) amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) return; dma_resv_lock(bo->base.resv, NULL); + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { + dma_resv_unlock(bo->base.resv); + return; + } + r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence); if (!WARN_ON(r)) { amdgpu_bo_fence(abo, fence, false); -- 1.8.3.1 _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-02 3:18 ` Qu Huang 0 siblings, 0 replies; 20+ messages in thread From: Qu Huang @ 2021-04-02 3:18 UTC (permalink / raw) To: alexander.deucher, christian.koenig, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: jinsdb, linux-kernel, dri-devel, linaro-mm-sig, amd-gfx, linux-media Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(), the bo->base.resv lock may be held by ttm_mem_evict_first(), and the VRAM mem will be evicted, mem region was replaced by Gtt mem region. amdgpu_bo_release_notify() will then hold the bo->base.resv lock, and SDMA will get an invalid address in amdgpu_fill_buffer(), resulting in a VMFAULT or memory corruption. To avoid it, we have to hold bo->base.resv lock first, and check whether the mem.mem_type is TTM_PL_VRAM. Signed-off-by: Qu Huang <jinsdb@126.com> --- drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c index 4b29b82..8018574 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo) if (bo->base.resv == &bo->base._resv) amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) return; dma_resv_lock(bo->base.resv, NULL); + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { + dma_resv_unlock(bo->base.resv); + return; + } + r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence); if (!WARN_ON(r)) { amdgpu_bo_fence(abo, fence, false); -- 1.8.3.1 _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access 2021-04-02 3:18 ` Qu Huang (?) @ 2021-04-02 16:25 ` Christian König -1 siblings, 0 replies; 20+ messages in thread From: Christian König @ 2021-04-02 16:25 UTC (permalink / raw) To: Qu Huang, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: amd-gfx, dri-devel, linux-kernel, linux-media, linaro-mm-sig Hi Qu, Am 02.04.21 um 05:18 schrieb Qu Huang: > Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(), > the bo->base.resv lock may be held by ttm_mem_evict_first(), That can't happen since when bo_release_notify is called the BO has not more references and is therefore deleted. And we never evict a deleted BO, we just wait for it to become idle. Regards, Christian. > and the VRAM mem will be evicted, mem region was replaced > by Gtt mem region. amdgpu_bo_release_notify() will then > hold the bo->base.resv lock, and SDMA will get an invalid > address in amdgpu_fill_buffer(), resulting in a VMFAULT > or memory corruption. > > To avoid it, we have to hold bo->base.resv lock first, and > check whether the mem.mem_type is TTM_PL_VRAM. > > Signed-off-by: Qu Huang <jinsdb@126.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c > index 4b29b82..8018574 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c > @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo) > if (bo->base.resv == &bo->base._resv) > amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); > > - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || > - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) > + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) > return; > > dma_resv_lock(bo->base.resv, NULL); > > + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { > + dma_resv_unlock(bo->base.resv); > + return; > + } > + > r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence); > if (!WARN_ON(r)) { > amdgpu_bo_fence(abo, fence, false); > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-02 16:25 ` Christian König 0 siblings, 0 replies; 20+ messages in thread From: Christian König @ 2021-04-02 16:25 UTC (permalink / raw) To: Qu Huang, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel Hi Qu, Am 02.04.21 um 05:18 schrieb Qu Huang: > Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(), > the bo->base.resv lock may be held by ttm_mem_evict_first(), That can't happen since when bo_release_notify is called the BO has not more references and is therefore deleted. And we never evict a deleted BO, we just wait for it to become idle. Regards, Christian. > and the VRAM mem will be evicted, mem region was replaced > by Gtt mem region. amdgpu_bo_release_notify() will then > hold the bo->base.resv lock, and SDMA will get an invalid > address in amdgpu_fill_buffer(), resulting in a VMFAULT > or memory corruption. > > To avoid it, we have to hold bo->base.resv lock first, and > check whether the mem.mem_type is TTM_PL_VRAM. > > Signed-off-by: Qu Huang <jinsdb@126.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c > index 4b29b82..8018574 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c > @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo) > if (bo->base.resv == &bo->base._resv) > amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); > > - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || > - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) > + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) > return; > > dma_resv_lock(bo->base.resv, NULL); > > + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { > + dma_resv_unlock(bo->base.resv); > + return; > + } > + > r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence); > if (!WARN_ON(r)) { > amdgpu_bo_fence(abo, fence, false); > -- > 1.8.3.1 > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-02 16:25 ` Christian König 0 siblings, 0 replies; 20+ messages in thread From: Christian König @ 2021-04-02 16:25 UTC (permalink / raw) To: Qu Huang, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel Hi Qu, Am 02.04.21 um 05:18 schrieb Qu Huang: > Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(), > the bo->base.resv lock may be held by ttm_mem_evict_first(), That can't happen since when bo_release_notify is called the BO has not more references and is therefore deleted. And we never evict a deleted BO, we just wait for it to become idle. Regards, Christian. > and the VRAM mem will be evicted, mem region was replaced > by Gtt mem region. amdgpu_bo_release_notify() will then > hold the bo->base.resv lock, and SDMA will get an invalid > address in amdgpu_fill_buffer(), resulting in a VMFAULT > or memory corruption. > > To avoid it, we have to hold bo->base.resv lock first, and > check whether the mem.mem_type is TTM_PL_VRAM. > > Signed-off-by: Qu Huang <jinsdb@126.com> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c > index 4b29b82..8018574 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c > @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo) > if (bo->base.resv == &bo->base._resv) > amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); > > - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || > - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) > + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) > return; > > dma_resv_lock(bo->base.resv, NULL); > > + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { > + dma_resv_unlock(bo->base.resv); > + return; > + } > + > r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence); > if (!WARN_ON(r)) { > amdgpu_bo_fence(abo, fence, false); > -- > 1.8.3.1 > _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access 2021-04-02 16:25 ` Christian König @ 2021-04-03 3:08 ` Qu Huang -1 siblings, 0 replies; 20+ messages in thread From: Qu Huang @ 2021-04-03 3:08 UTC (permalink / raw) To: Christian König, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel Hi Christian, On 2021/4/3 0:25, Christian König wrote: > Hi Qu, > > Am 02.04.21 um 05:18 schrieb Qu Huang: >> Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(), >> the bo->base.resv lock may be held by ttm_mem_evict_first(), > > That can't happen since when bo_release_notify is called the BO has not > more references and is therefore deleted. > > And we never evict a deleted BO, we just wait for it to become idle. Yes, the bo reference counter return to zero will enter ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify()) first happen, and then test if a reservation object's fences have been signaled, and then mark bo as deleted and remove bo from the LRU list. When ttm_bo_release() and ttm_mem_evict_first() is concurrent, the Bo has not been removed from the LRU list and is not marked as deleted, this will happen. As a test, when we use CPU memset instead of SDMA fill in amdgpu_bo_release_notify(), the result is page fault: PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 [exception RIP: memset+31] RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: 0000060b00200000 RDX: 0000000000000000 RSI: 00000000000000be RDI: ffffab807f000000 RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: ffffab7c80000000 R10: 000000000000bcba R11: 00000000000001ba R12: ffff8e79ebaa4050 R13: ffffab7c80000000 R14: 0000000000022600 R15: ffff8e8136e04100 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 [amdgpu] #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb Regards, Qu. > > Regards, > Christian. > >> and the VRAM mem will be evicted, mem region was replaced >> by Gtt mem region. amdgpu_bo_release_notify() will then >> hold the bo->base.resv lock, and SDMA will get an invalid >> address in amdgpu_fill_buffer(), resulting in a VMFAULT >> or memory corruption. >> >> To avoid it, we have to hold bo->base.resv lock first, and >> check whether the mem.mem_type is TTM_PL_VRAM. >> >> Signed-off-by: Qu Huang <jinsdb@126.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >> 1 file changed, 6 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> index 4b29b82..8018574 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >> ttm_buffer_object *bo) >> if (bo->base.resv == &bo->base._resv) >> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >> >> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >> return; >> >> dma_resv_lock(bo->base.resv, NULL); >> >> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >> + dma_resv_unlock(bo->base.resv); >> + return; >> + } >> + >> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence); >> if (!WARN_ON(r)) { >> amdgpu_bo_fence(abo, fence, false); >> -- >> 1.8.3.1 >> _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-03 3:08 ` Qu Huang 0 siblings, 0 replies; 20+ messages in thread From: Qu Huang @ 2021-04-03 3:08 UTC (permalink / raw) To: Christian König, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: linaro-mm-sig, linux-media, dri-devel, amd-gfx, linux-kernel Hi Christian, On 2021/4/3 0:25, Christian König wrote: > Hi Qu, > > Am 02.04.21 um 05:18 schrieb Qu Huang: >> Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(), >> the bo->base.resv lock may be held by ttm_mem_evict_first(), > > That can't happen since when bo_release_notify is called the BO has not > more references and is therefore deleted. > > And we never evict a deleted BO, we just wait for it to become idle. Yes, the bo reference counter return to zero will enter ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify()) first happen, and then test if a reservation object's fences have been signaled, and then mark bo as deleted and remove bo from the LRU list. When ttm_bo_release() and ttm_mem_evict_first() is concurrent, the Bo has not been removed from the LRU list and is not marked as deleted, this will happen. As a test, when we use CPU memset instead of SDMA fill in amdgpu_bo_release_notify(), the result is page fault: PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 [exception RIP: memset+31] RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: 0000060b00200000 RDX: 0000000000000000 RSI: 00000000000000be RDI: ffffab807f000000 RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: ffffab7c80000000 R10: 000000000000bcba R11: 00000000000001ba R12: ffff8e79ebaa4050 R13: ffffab7c80000000 R14: 0000000000022600 R15: ffff8e8136e04100 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 [amdgpu] #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb Regards, Qu. > > Regards, > Christian. > >> and the VRAM mem will be evicted, mem region was replaced >> by Gtt mem region. amdgpu_bo_release_notify() will then >> hold the bo->base.resv lock, and SDMA will get an invalid >> address in amdgpu_fill_buffer(), resulting in a VMFAULT >> or memory corruption. >> >> To avoid it, we have to hold bo->base.resv lock first, and >> check whether the mem.mem_type is TTM_PL_VRAM. >> >> Signed-off-by: Qu Huang <jinsdb@126.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >> 1 file changed, 6 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> index 4b29b82..8018574 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >> ttm_buffer_object *bo) >> if (bo->base.resv == &bo->base._resv) >> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >> >> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >> return; >> >> dma_resv_lock(bo->base.resv, NULL); >> >> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >> + dma_resv_unlock(bo->base.resv); >> + return; >> + } >> + >> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence); >> if (!WARN_ON(r)) { >> amdgpu_bo_fence(abo, fence, false); >> -- >> 1.8.3.1 >> _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access 2021-04-02 16:25 ` Christian König (?) @ 2021-04-03 5:08 ` Qu Huang -1 siblings, 0 replies; 20+ messages in thread From: Qu Huang @ 2021-04-03 5:08 UTC (permalink / raw) To: Christian König, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: amd-gfx, dri-devel, linux-kernel, linux-media Hi Christian, On 2021/4/3 0:25, Christian König wrote: > Hi Qu, > > Am 02.04.21 um 05:18 schrieb Qu Huang: >> Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(), >> the bo->base.resv lock may be held by ttm_mem_evict_first(), > > That can't happen since when bo_release_notify is called the BO has not > more references and is therefore deleted. > > And we never evict a deleted BO, we just wait for it to become idle. > Yes, the bo reference counter return to zero will enter ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify()) first happen, and then test if a reservation object's fences have been signaled, and then mark bo as deleted and remove bo from the LRU list. When ttm_bo_release() and ttm_mem_evict_first() is concurrent, the Bo has not been removed from the LRU list and is not marked as deleted, this will happen. As a test, when we use CPU memset instead of SDMA fill in amdgpu_bo_release_notify(), the result is page fault: PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 [exception RIP: memset+31] RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: 0000060b00200000 RDX: 0000000000000000 RSI: 00000000000000be RDI: ffffab807f000000 RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: ffffab7c80000000 R10: 000000000000bcba R11: 00000000000001ba R12: ffff8e79ebaa4050 R13: ffffab7c80000000 R14: 0000000000022600 R15: ffff8e8136e04100 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 [amdgpu] #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb Regards, Qu. > Regards, > Christian. > >> and the VRAM mem will be evicted, mem region was replaced >> by Gtt mem region. amdgpu_bo_release_notify() will then >> hold the bo->base.resv lock, and SDMA will get an invalid >> address in amdgpu_fill_buffer(), resulting in a VMFAULT >> or memory corruption. >> >> To avoid it, we have to hold bo->base.resv lock first, and >> check whether the mem.mem_type is TTM_PL_VRAM. >> >> Signed-off-by: Qu Huang <jinsdb@126.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >> 1 file changed, 6 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> index 4b29b82..8018574 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >> ttm_buffer_object *bo) >> if (bo->base.resv == &bo->base._resv) >> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >> >> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >> return; >> >> dma_resv_lock(bo->base.resv, NULL); >> >> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >> + dma_resv_unlock(bo->base.resv); >> + return; >> + } >> + >> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence); >> if (!WARN_ON(r)) { >> amdgpu_bo_fence(abo, fence, false); >> -- >> 1.8.3.1 >> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-03 5:08 ` Qu Huang 0 siblings, 0 replies; 20+ messages in thread From: Qu Huang @ 2021-04-03 5:08 UTC (permalink / raw) To: Christian König, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: linux-media, dri-devel, amd-gfx, linux-kernel Hi Christian, On 2021/4/3 0:25, Christian König wrote: > Hi Qu, > > Am 02.04.21 um 05:18 schrieb Qu Huang: >> Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(), >> the bo->base.resv lock may be held by ttm_mem_evict_first(), > > That can't happen since when bo_release_notify is called the BO has not > more references and is therefore deleted. > > And we never evict a deleted BO, we just wait for it to become idle. > Yes, the bo reference counter return to zero will enter ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify()) first happen, and then test if a reservation object's fences have been signaled, and then mark bo as deleted and remove bo from the LRU list. When ttm_bo_release() and ttm_mem_evict_first() is concurrent, the Bo has not been removed from the LRU list and is not marked as deleted, this will happen. As a test, when we use CPU memset instead of SDMA fill in amdgpu_bo_release_notify(), the result is page fault: PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 [exception RIP: memset+31] RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: 0000060b00200000 RDX: 0000000000000000 RSI: 00000000000000be RDI: ffffab807f000000 RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: ffffab7c80000000 R10: 000000000000bcba R11: 00000000000001ba R12: ffff8e79ebaa4050 R13: ffffab7c80000000 R14: 0000000000022600 R15: ffff8e8136e04100 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 [amdgpu] #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb Regards, Qu. > Regards, > Christian. > >> and the VRAM mem will be evicted, mem region was replaced >> by Gtt mem region. amdgpu_bo_release_notify() will then >> hold the bo->base.resv lock, and SDMA will get an invalid >> address in amdgpu_fill_buffer(), resulting in a VMFAULT >> or memory corruption. >> >> To avoid it, we have to hold bo->base.resv lock first, and >> check whether the mem.mem_type is TTM_PL_VRAM. >> >> Signed-off-by: Qu Huang <jinsdb@126.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >> 1 file changed, 6 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> index 4b29b82..8018574 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >> ttm_buffer_object *bo) >> if (bo->base.resv == &bo->base._resv) >> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >> >> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >> return; >> >> dma_resv_lock(bo->base.resv, NULL); >> >> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >> + dma_resv_unlock(bo->base.resv); >> + return; >> + } >> + >> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence); >> if (!WARN_ON(r)) { >> amdgpu_bo_fence(abo, fence, false); >> -- >> 1.8.3.1 >> _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-03 5:08 ` Qu Huang 0 siblings, 0 replies; 20+ messages in thread From: Qu Huang @ 2021-04-03 5:08 UTC (permalink / raw) To: Christian König, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: linux-media, dri-devel, amd-gfx, linux-kernel Hi Christian, On 2021/4/3 0:25, Christian König wrote: > Hi Qu, > > Am 02.04.21 um 05:18 schrieb Qu Huang: >> Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(), >> the bo->base.resv lock may be held by ttm_mem_evict_first(), > > That can't happen since when bo_release_notify is called the BO has not > more references and is therefore deleted. > > And we never evict a deleted BO, we just wait for it to become idle. > Yes, the bo reference counter return to zero will enter ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify()) first happen, and then test if a reservation object's fences have been signaled, and then mark bo as deleted and remove bo from the LRU list. When ttm_bo_release() and ttm_mem_evict_first() is concurrent, the Bo has not been removed from the LRU list and is not marked as deleted, this will happen. As a test, when we use CPU memset instead of SDMA fill in amdgpu_bo_release_notify(), the result is page fault: PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 [exception RIP: memset+31] RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: 0000060b00200000 RDX: 0000000000000000 RSI: 00000000000000be RDI: ffffab807f000000 RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: ffffab7c80000000 R10: 000000000000bcba R11: 00000000000001ba R12: ffff8e79ebaa4050 R13: ffffab7c80000000 R14: 0000000000022600 R15: ffff8e8136e04100 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 [amdgpu] #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb Regards, Qu. > Regards, > Christian. > >> and the VRAM mem will be evicted, mem region was replaced >> by Gtt mem region. amdgpu_bo_release_notify() will then >> hold the bo->base.resv lock, and SDMA will get an invalid >> address in amdgpu_fill_buffer(), resulting in a VMFAULT >> or memory corruption. >> >> To avoid it, we have to hold bo->base.resv lock first, and >> check whether the mem.mem_type is TTM_PL_VRAM. >> >> Signed-off-by: Qu Huang <jinsdb@126.com> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >> 1 file changed, 6 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> index 4b29b82..8018574 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >> ttm_buffer_object *bo) >> if (bo->base.resv == &bo->base._resv) >> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >> >> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >> return; >> >> dma_resv_lock(bo->base.resv, NULL); >> >> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >> + dma_resv_unlock(bo->base.resv); >> + return; >> + } >> + >> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence); >> if (!WARN_ON(r)) { >> amdgpu_bo_fence(abo, fence, false); >> -- >> 1.8.3.1 >> _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access 2021-04-03 5:08 ` Qu Huang (?) @ 2021-04-03 8:49 ` Christian König -1 siblings, 0 replies; 20+ messages in thread From: Christian König @ 2021-04-03 8:49 UTC (permalink / raw) To: Qu Huang, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: amd-gfx, dri-devel, linux-kernel, linux-media Hi Qu, Am 03.04.21 um 07:08 schrieb Qu Huang: > Hi Christian, > > On 2021/4/3 0:25, Christian König wrote: >> Hi Qu, >> >> Am 02.04.21 um 05:18 schrieb Qu Huang: >>> Before dma_resv_lock(bo->base.resv, NULL) in >>> amdgpu_bo_release_notify(), >>> the bo->base.resv lock may be held by ttm_mem_evict_first(), >> >> That can't happen since when bo_release_notify is called the BO has not >> more references and is therefore deleted. >> >> And we never evict a deleted BO, we just wait for it to become idle. >> > Yes, the bo reference counter return to zero will enter > ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify()) > first happen, and then test if a reservation object's fences have been > signaled, and then mark bo as deleted and remove bo from the LRU list. > > When ttm_bo_release() and ttm_mem_evict_first() is concurrent, > the Bo has not been removed from the LRU list and is not marked as > deleted, this will happen. Not sure on which code base you are, but I don't see how this can happen. ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and ttm_bo_release() is only called when the BO reference count becomes zero. So ttm_mem_evict_first() will see that this BO is about to be destroyed and skips it. > > As a test, when we use CPU memset instead of SDMA fill in > amdgpu_bo_release_notify(), the result is page fault: > > PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" > #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 > #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 > #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 > #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 > #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 > #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d > #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae > #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 > #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 > #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 > [exception RIP: memset+31] > RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 > RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: 0000060b00200000 > RDX: 0000000000000000 RSI: 00000000000000be RDI: ffffab807f000000 > RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: ffffab7c80000000 > R10: 000000000000bcba R11: 00000000000001ba R12: ffff8e79ebaa4050 > R13: ffffab7c80000000 R14: 0000000000022600 R15: ffff8e8136e04100 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 > [amdgpu] > #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] > #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] > #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] > #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 > #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 > #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 > #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 > #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf > #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 > #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb Well that might be perfectly expected. VRAM is not necessarily CPU accessible. Regards, Christian. > > Regards, > Qu. > > >> Regards, >> Christian. >> >>> and the VRAM mem will be evicted, mem region was replaced >>> by Gtt mem region. amdgpu_bo_release_notify() will then >>> hold the bo->base.resv lock, and SDMA will get an invalid >>> address in amdgpu_fill_buffer(), resulting in a VMFAULT >>> or memory corruption. >>> >>> To avoid it, we have to hold bo->base.resv lock first, and >>> check whether the mem.mem_type is TTM_PL_VRAM. >>> >>> Signed-off-by: Qu Huang <jinsdb@126.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >>> 1 file changed, 6 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>> index 4b29b82..8018574 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >>> ttm_buffer_object *bo) >>> if (bo->base.resv == &bo->base._resv) >>> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >>> >>> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >>> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>> return; >>> >>> dma_resv_lock(bo->base.resv, NULL); >>> >>> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >>> + dma_resv_unlock(bo->base.resv); >>> + return; >>> + } >>> + >>> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, >>> &fence); >>> if (!WARN_ON(r)) { >>> amdgpu_bo_fence(abo, fence, false); >>> -- >>> 1.8.3.1 >>> > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-03 8:49 ` Christian König 0 siblings, 0 replies; 20+ messages in thread From: Christian König @ 2021-04-03 8:49 UTC (permalink / raw) To: Qu Huang, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: linux-media, dri-devel, amd-gfx, linux-kernel Hi Qu, Am 03.04.21 um 07:08 schrieb Qu Huang: > Hi Christian, > > On 2021/4/3 0:25, Christian König wrote: >> Hi Qu, >> >> Am 02.04.21 um 05:18 schrieb Qu Huang: >>> Before dma_resv_lock(bo->base.resv, NULL) in >>> amdgpu_bo_release_notify(), >>> the bo->base.resv lock may be held by ttm_mem_evict_first(), >> >> That can't happen since when bo_release_notify is called the BO has not >> more references and is therefore deleted. >> >> And we never evict a deleted BO, we just wait for it to become idle. >> > Yes, the bo reference counter return to zero will enter > ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify()) > first happen, and then test if a reservation object's fences have been > signaled, and then mark bo as deleted and remove bo from the LRU list. > > When ttm_bo_release() and ttm_mem_evict_first() is concurrent, > the Bo has not been removed from the LRU list and is not marked as > deleted, this will happen. Not sure on which code base you are, but I don't see how this can happen. ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and ttm_bo_release() is only called when the BO reference count becomes zero. So ttm_mem_evict_first() will see that this BO is about to be destroyed and skips it. > > As a test, when we use CPU memset instead of SDMA fill in > amdgpu_bo_release_notify(), the result is page fault: > > PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" > #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 > #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 > #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 > #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 > #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 > #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d > #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae > #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 > #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 > #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 > [exception RIP: memset+31] > RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 > RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: 0000060b00200000 > RDX: 0000000000000000 RSI: 00000000000000be RDI: ffffab807f000000 > RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: ffffab7c80000000 > R10: 000000000000bcba R11: 00000000000001ba R12: ffff8e79ebaa4050 > R13: ffffab7c80000000 R14: 0000000000022600 R15: ffff8e8136e04100 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 > [amdgpu] > #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] > #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] > #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] > #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 > #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 > #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 > #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 > #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf > #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 > #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb Well that might be perfectly expected. VRAM is not necessarily CPU accessible. Regards, Christian. > > Regards, > Qu. > > >> Regards, >> Christian. >> >>> and the VRAM mem will be evicted, mem region was replaced >>> by Gtt mem region. amdgpu_bo_release_notify() will then >>> hold the bo->base.resv lock, and SDMA will get an invalid >>> address in amdgpu_fill_buffer(), resulting in a VMFAULT >>> or memory corruption. >>> >>> To avoid it, we have to hold bo->base.resv lock first, and >>> check whether the mem.mem_type is TTM_PL_VRAM. >>> >>> Signed-off-by: Qu Huang <jinsdb@126.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >>> 1 file changed, 6 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>> index 4b29b82..8018574 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >>> ttm_buffer_object *bo) >>> if (bo->base.resv == &bo->base._resv) >>> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >>> >>> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >>> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>> return; >>> >>> dma_resv_lock(bo->base.resv, NULL); >>> >>> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >>> + dma_resv_unlock(bo->base.resv); >>> + return; >>> + } >>> + >>> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, >>> &fence); >>> if (!WARN_ON(r)) { >>> amdgpu_bo_fence(abo, fence, false); >>> -- >>> 1.8.3.1 >>> > _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-03 8:49 ` Christian König 0 siblings, 0 replies; 20+ messages in thread From: Christian König @ 2021-04-03 8:49 UTC (permalink / raw) To: Qu Huang, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: linux-media, dri-devel, amd-gfx, linux-kernel Hi Qu, Am 03.04.21 um 07:08 schrieb Qu Huang: > Hi Christian, > > On 2021/4/3 0:25, Christian König wrote: >> Hi Qu, >> >> Am 02.04.21 um 05:18 schrieb Qu Huang: >>> Before dma_resv_lock(bo->base.resv, NULL) in >>> amdgpu_bo_release_notify(), >>> the bo->base.resv lock may be held by ttm_mem_evict_first(), >> >> That can't happen since when bo_release_notify is called the BO has not >> more references and is therefore deleted. >> >> And we never evict a deleted BO, we just wait for it to become idle. >> > Yes, the bo reference counter return to zero will enter > ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify()) > first happen, and then test if a reservation object's fences have been > signaled, and then mark bo as deleted and remove bo from the LRU list. > > When ttm_bo_release() and ttm_mem_evict_first() is concurrent, > the Bo has not been removed from the LRU list and is not marked as > deleted, this will happen. Not sure on which code base you are, but I don't see how this can happen. ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and ttm_bo_release() is only called when the BO reference count becomes zero. So ttm_mem_evict_first() will see that this BO is about to be destroyed and skips it. > > As a test, when we use CPU memset instead of SDMA fill in > amdgpu_bo_release_notify(), the result is page fault: > > PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" > #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 > #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 > #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 > #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 > #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 > #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d > #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae > #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 > #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 > #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 > [exception RIP: memset+31] > RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 > RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: 0000060b00200000 > RDX: 0000000000000000 RSI: 00000000000000be RDI: ffffab807f000000 > RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: ffffab7c80000000 > R10: 000000000000bcba R11: 00000000000001ba R12: ffff8e79ebaa4050 > R13: ffffab7c80000000 R14: 0000000000022600 R15: ffff8e8136e04100 > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 > #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 > [amdgpu] > #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] > #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] > #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] > #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 > #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 > #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 > #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 > #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf > #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 > #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb Well that might be perfectly expected. VRAM is not necessarily CPU accessible. Regards, Christian. > > Regards, > Qu. > > >> Regards, >> Christian. >> >>> and the VRAM mem will be evicted, mem region was replaced >>> by Gtt mem region. amdgpu_bo_release_notify() will then >>> hold the bo->base.resv lock, and SDMA will get an invalid >>> address in amdgpu_fill_buffer(), resulting in a VMFAULT >>> or memory corruption. >>> >>> To avoid it, we have to hold bo->base.resv lock first, and >>> check whether the mem.mem_type is TTM_PL_VRAM. >>> >>> Signed-off-by: Qu Huang <jinsdb@126.com> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >>> 1 file changed, 6 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>> index 4b29b82..8018574 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >>> ttm_buffer_object *bo) >>> if (bo->base.resv == &bo->base._resv) >>> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >>> >>> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >>> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>> return; >>> >>> dma_resv_lock(bo->base.resv, NULL); >>> >>> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >>> + dma_resv_unlock(bo->base.resv); >>> + return; >>> + } >>> + >>> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, >>> &fence); >>> if (!WARN_ON(r)) { >>> amdgpu_bo_fence(abo, fence, false); >>> -- >>> 1.8.3.1 >>> > _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access 2021-04-03 8:49 ` Christian König (?) @ 2021-04-06 6:04 ` Qu Huang -1 siblings, 0 replies; 20+ messages in thread From: Qu Huang @ 2021-04-06 6:04 UTC (permalink / raw) To: Christian König, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: amd-gfx, dri-devel, linux-kernel, linux-media Hi Christian, On 2021/4/3 16:49, Christian König wrote: > Hi Qu, > > Am 03.04.21 um 07:08 schrieb Qu Huang: >> Hi Christian, >> >> On 2021/4/3 0:25, Christian König wrote: >>> Hi Qu, >>> >>> Am 02.04.21 um 05:18 schrieb Qu Huang: >>>> Before dma_resv_lock(bo->base.resv, NULL) in >>>> amdgpu_bo_release_notify(), >>>> the bo->base.resv lock may be held by ttm_mem_evict_first(), >>> >>> That can't happen since when bo_release_notify is called the BO has not >>> more references and is therefore deleted. >>> >>> And we never evict a deleted BO, we just wait for it to become idle. >>> >> Yes, the bo reference counter return to zero will enter >> ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify()) >> first happen, and then test if a reservation object's fences have been >> signaled, and then mark bo as deleted and remove bo from the LRU list. >> >> When ttm_bo_release() and ttm_mem_evict_first() is concurrent, >> the Bo has not been removed from the LRU list and is not marked as >> deleted, this will happen. > > Not sure on which code base you are, but I don't see how this can happen. > > ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and > ttm_bo_release() is only called when the BO reference count becomes zero. > > So ttm_mem_evict_first() will see that this BO is about to be destroyed > and skips it. > Yes, you are right. My version of TTM is ROCM 3.3, so ttm_mem_evict_first() did not call ttm_bo_get_unless_zero(), check that ROCM 4.0 ttm doesn't have this issue. This is an oversight on my part. >> >> As a test, when we use CPU memset instead of SDMA fill in >> amdgpu_bo_release_notify(), the result is page fault: >> >> PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" >> #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 >> #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 >> #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 >> #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 >> #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 >> #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d >> #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae >> #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 >> #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 >> #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 >> [exception RIP: memset+31] >> RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 >> RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: 0000060b00200000 >> RDX: 0000000000000000 RSI: 00000000000000be RDI: ffffab807f000000 >> RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: ffffab7c80000000 >> R10: 000000000000bcba R11: 00000000000001ba R12: ffff8e79ebaa4050 >> R13: ffffab7c80000000 R14: 0000000000022600 R15: ffff8e8136e04100 >> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 >> #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 >> [amdgpu] >> #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] >> #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] >> #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] >> #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 >> #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 >> #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 >> #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 >> #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf >> #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 >> #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb > > Well that might be perfectly expected. VRAM is not necessarily CPU > accessible. > As a test,use CPU memset instead of SDMA fill, This is my code: void amdgpu_bo_release_notify(struct ttm_buffer_object *bo) { struct amdgpu_bo *abo; uint64_t num_pages; struct drm_mm_node *mm_node; struct amdgpu_device *adev; void __iomem *kaddr; if (!amdgpu_bo_is_amdgpu_bo(bo)) return; abo = ttm_to_amdgpu_bo(bo); num_pages = abo->tbo.num_pages; mm_node = abo->tbo.mem.mm_node; adev = amdgpu_ttm_adev(abo->tbo.bdev); kaddr = adev->mman.aper_base_kaddr; if (abo->kfd_bo) amdgpu_amdkfd_unreserve_memory_limit(abo); if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) return; dma_resv_lock(amdkcl_ttm_resvp(bo), NULL); while (num_pages && mm_node) { void *ptr = kaddr + (mm_node->start << PAGE_SHIFT); memset_io(ptr, AMDGPU_POISON & 0xff, mm_node->size <<PAGE_SHIFT); num_pages -= mm_node->size; ++mm_node; } dma_resv_unlock(amdkcl_ttm_resvp(bo)); } I have used the old version through oversight, so I am sorry for your trouble. Regards, Qu. > Regards, > Christian. > >> >> Regards, >> Qu. >> >> >>> Regards, >>> Christian. >>> >>>> and the VRAM mem will be evicted, mem region was replaced >>>> by Gtt mem region. amdgpu_bo_release_notify() will then >>>> hold the bo->base.resv lock, and SDMA will get an invalid >>>> address in amdgpu_fill_buffer(), resulting in a VMFAULT >>>> or memory corruption. >>>> >>>> To avoid it, we have to hold bo->base.resv lock first, and >>>> check whether the mem.mem_type is TTM_PL_VRAM. >>>> >>>> Signed-off-by: Qu Huang <jinsdb@126.com> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >>>> 1 file changed, 6 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>> index 4b29b82..8018574 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >>>> ttm_buffer_object *bo) >>>> if (bo->base.resv == &bo->base._resv) >>>> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >>>> >>>> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >>>> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>>> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>>> return; >>>> >>>> dma_resv_lock(bo->base.resv, NULL); >>>> >>>> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >>>> + dma_resv_unlock(bo->base.resv); >>>> + return; >>>> + } >>>> + >>>> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, >>>> &fence); >>>> if (!WARN_ON(r)) { >>>> amdgpu_bo_fence(abo, fence, false); >>>> -- >>>> 1.8.3.1 >>>> >> ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-06 6:04 ` Qu Huang 0 siblings, 0 replies; 20+ messages in thread From: Qu Huang @ 2021-04-06 6:04 UTC (permalink / raw) To: Christian König, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: linux-media, dri-devel, amd-gfx, linux-kernel Hi Christian, On 2021/4/3 16:49, Christian König wrote: > Hi Qu, > > Am 03.04.21 um 07:08 schrieb Qu Huang: >> Hi Christian, >> >> On 2021/4/3 0:25, Christian König wrote: >>> Hi Qu, >>> >>> Am 02.04.21 um 05:18 schrieb Qu Huang: >>>> Before dma_resv_lock(bo->base.resv, NULL) in >>>> amdgpu_bo_release_notify(), >>>> the bo->base.resv lock may be held by ttm_mem_evict_first(), >>> >>> That can't happen since when bo_release_notify is called the BO has not >>> more references and is therefore deleted. >>> >>> And we never evict a deleted BO, we just wait for it to become idle. >>> >> Yes, the bo reference counter return to zero will enter >> ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify()) >> first happen, and then test if a reservation object's fences have been >> signaled, and then mark bo as deleted and remove bo from the LRU list. >> >> When ttm_bo_release() and ttm_mem_evict_first() is concurrent, >> the Bo has not been removed from the LRU list and is not marked as >> deleted, this will happen. > > Not sure on which code base you are, but I don't see how this can happen. > > ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and > ttm_bo_release() is only called when the BO reference count becomes zero. > > So ttm_mem_evict_first() will see that this BO is about to be destroyed > and skips it. > Yes, you are right. My version of TTM is ROCM 3.3, so ttm_mem_evict_first() did not call ttm_bo_get_unless_zero(), check that ROCM 4.0 ttm doesn't have this issue. This is an oversight on my part. >> >> As a test, when we use CPU memset instead of SDMA fill in >> amdgpu_bo_release_notify(), the result is page fault: >> >> PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" >> #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 >> #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 >> #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 >> #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 >> #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 >> #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d >> #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae >> #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 >> #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 >> #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 >> [exception RIP: memset+31] >> RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 >> RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: 0000060b00200000 >> RDX: 0000000000000000 RSI: 00000000000000be RDI: ffffab807f000000 >> RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: ffffab7c80000000 >> R10: 000000000000bcba R11: 00000000000001ba R12: ffff8e79ebaa4050 >> R13: ffffab7c80000000 R14: 0000000000022600 R15: ffff8e8136e04100 >> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 >> #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 >> [amdgpu] >> #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] >> #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] >> #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] >> #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 >> #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 >> #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 >> #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 >> #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf >> #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 >> #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb > > Well that might be perfectly expected. VRAM is not necessarily CPU > accessible. > As a test,use CPU memset instead of SDMA fill, This is my code: void amdgpu_bo_release_notify(struct ttm_buffer_object *bo) { struct amdgpu_bo *abo; uint64_t num_pages; struct drm_mm_node *mm_node; struct amdgpu_device *adev; void __iomem *kaddr; if (!amdgpu_bo_is_amdgpu_bo(bo)) return; abo = ttm_to_amdgpu_bo(bo); num_pages = abo->tbo.num_pages; mm_node = abo->tbo.mem.mm_node; adev = amdgpu_ttm_adev(abo->tbo.bdev); kaddr = adev->mman.aper_base_kaddr; if (abo->kfd_bo) amdgpu_amdkfd_unreserve_memory_limit(abo); if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) return; dma_resv_lock(amdkcl_ttm_resvp(bo), NULL); while (num_pages && mm_node) { void *ptr = kaddr + (mm_node->start << PAGE_SHIFT); memset_io(ptr, AMDGPU_POISON & 0xff, mm_node->size <<PAGE_SHIFT); num_pages -= mm_node->size; ++mm_node; } dma_resv_unlock(amdkcl_ttm_resvp(bo)); } I have used the old version through oversight, so I am sorry for your trouble. Regards, Qu. > Regards, > Christian. > >> >> Regards, >> Qu. >> >> >>> Regards, >>> Christian. >>> >>>> and the VRAM mem will be evicted, mem region was replaced >>>> by Gtt mem region. amdgpu_bo_release_notify() will then >>>> hold the bo->base.resv lock, and SDMA will get an invalid >>>> address in amdgpu_fill_buffer(), resulting in a VMFAULT >>>> or memory corruption. >>>> >>>> To avoid it, we have to hold bo->base.resv lock first, and >>>> check whether the mem.mem_type is TTM_PL_VRAM. >>>> >>>> Signed-off-by: Qu Huang <jinsdb@126.com> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >>>> 1 file changed, 6 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>> index 4b29b82..8018574 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >>>> ttm_buffer_object *bo) >>>> if (bo->base.resv == &bo->base._resv) >>>> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >>>> >>>> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >>>> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>>> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>>> return; >>>> >>>> dma_resv_lock(bo->base.resv, NULL); >>>> >>>> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >>>> + dma_resv_unlock(bo->base.resv); >>>> + return; >>>> + } >>>> + >>>> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, >>>> &fence); >>>> if (!WARN_ON(r)) { >>>> amdgpu_bo_fence(abo, fence, false); >>>> -- >>>> 1.8.3.1 >>>> >> _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-06 6:04 ` Qu Huang 0 siblings, 0 replies; 20+ messages in thread From: Qu Huang @ 2021-04-06 6:04 UTC (permalink / raw) To: Christian König, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: linux-media, dri-devel, amd-gfx, linux-kernel Hi Christian, On 2021/4/3 16:49, Christian König wrote: > Hi Qu, > > Am 03.04.21 um 07:08 schrieb Qu Huang: >> Hi Christian, >> >> On 2021/4/3 0:25, Christian König wrote: >>> Hi Qu, >>> >>> Am 02.04.21 um 05:18 schrieb Qu Huang: >>>> Before dma_resv_lock(bo->base.resv, NULL) in >>>> amdgpu_bo_release_notify(), >>>> the bo->base.resv lock may be held by ttm_mem_evict_first(), >>> >>> That can't happen since when bo_release_notify is called the BO has not >>> more references and is therefore deleted. >>> >>> And we never evict a deleted BO, we just wait for it to become idle. >>> >> Yes, the bo reference counter return to zero will enter >> ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify()) >> first happen, and then test if a reservation object's fences have been >> signaled, and then mark bo as deleted and remove bo from the LRU list. >> >> When ttm_bo_release() and ttm_mem_evict_first() is concurrent, >> the Bo has not been removed from the LRU list and is not marked as >> deleted, this will happen. > > Not sure on which code base you are, but I don't see how this can happen. > > ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and > ttm_bo_release() is only called when the BO reference count becomes zero. > > So ttm_mem_evict_first() will see that this BO is about to be destroyed > and skips it. > Yes, you are right. My version of TTM is ROCM 3.3, so ttm_mem_evict_first() did not call ttm_bo_get_unless_zero(), check that ROCM 4.0 ttm doesn't have this issue. This is an oversight on my part. >> >> As a test, when we use CPU memset instead of SDMA fill in >> amdgpu_bo_release_notify(), the result is page fault: >> >> PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" >> #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 >> #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 >> #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 >> #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 >> #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 >> #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d >> #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae >> #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 >> #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 >> #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 >> [exception RIP: memset+31] >> RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 >> RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: 0000060b00200000 >> RDX: 0000000000000000 RSI: 00000000000000be RDI: ffffab807f000000 >> RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: ffffab7c80000000 >> R10: 000000000000bcba R11: 00000000000001ba R12: ffff8e79ebaa4050 >> R13: ffffab7c80000000 R14: 0000000000022600 R15: ffff8e8136e04100 >> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 >> #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 >> [amdgpu] >> #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] >> #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] >> #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] >> #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 >> #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 >> #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 >> #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 >> #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf >> #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 >> #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb > > Well that might be perfectly expected. VRAM is not necessarily CPU > accessible. > As a test,use CPU memset instead of SDMA fill, This is my code: void amdgpu_bo_release_notify(struct ttm_buffer_object *bo) { struct amdgpu_bo *abo; uint64_t num_pages; struct drm_mm_node *mm_node; struct amdgpu_device *adev; void __iomem *kaddr; if (!amdgpu_bo_is_amdgpu_bo(bo)) return; abo = ttm_to_amdgpu_bo(bo); num_pages = abo->tbo.num_pages; mm_node = abo->tbo.mem.mm_node; adev = amdgpu_ttm_adev(abo->tbo.bdev); kaddr = adev->mman.aper_base_kaddr; if (abo->kfd_bo) amdgpu_amdkfd_unreserve_memory_limit(abo); if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) return; dma_resv_lock(amdkcl_ttm_resvp(bo), NULL); while (num_pages && mm_node) { void *ptr = kaddr + (mm_node->start << PAGE_SHIFT); memset_io(ptr, AMDGPU_POISON & 0xff, mm_node->size <<PAGE_SHIFT); num_pages -= mm_node->size; ++mm_node; } dma_resv_unlock(amdkcl_ttm_resvp(bo)); } I have used the old version through oversight, so I am sorry for your trouble. Regards, Qu. > Regards, > Christian. > >> >> Regards, >> Qu. >> >> >>> Regards, >>> Christian. >>> >>>> and the VRAM mem will be evicted, mem region was replaced >>>> by Gtt mem region. amdgpu_bo_release_notify() will then >>>> hold the bo->base.resv lock, and SDMA will get an invalid >>>> address in amdgpu_fill_buffer(), resulting in a VMFAULT >>>> or memory corruption. >>>> >>>> To avoid it, we have to hold bo->base.resv lock first, and >>>> check whether the mem.mem_type is TTM_PL_VRAM. >>>> >>>> Signed-off-by: Qu Huang <jinsdb@126.com> >>>> --- >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >>>> 1 file changed, 6 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>> index 4b29b82..8018574 100644 >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >>>> ttm_buffer_object *bo) >>>> if (bo->base.resv == &bo->base._resv) >>>> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >>>> >>>> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >>>> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>>> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>>> return; >>>> >>>> dma_resv_lock(bo->base.resv, NULL); >>>> >>>> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >>>> + dma_resv_unlock(bo->base.resv); >>>> + return; >>>> + } >>>> + >>>> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, >>>> &fence); >>>> if (!WARN_ON(r)) { >>>> amdgpu_bo_fence(abo, fence, false); >>>> -- >>>> 1.8.3.1 >>>> >> _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access 2021-04-06 6:04 ` Qu Huang (?) @ 2021-04-06 13:44 ` Christian König -1 siblings, 0 replies; 20+ messages in thread From: Christian König @ 2021-04-06 13:44 UTC (permalink / raw) To: Qu Huang, Christian König, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: linux-media, dri-devel, amd-gfx, linux-kernel Hi Qu, Am 06.04.21 um 08:04 schrieb Qu Huang: > Hi Christian, > > On 2021/4/3 16:49, Christian König wrote: >> Hi Qu, >> >> Am 03.04.21 um 07:08 schrieb Qu Huang: >>> Hi Christian, >>> >>> On 2021/4/3 0:25, Christian König wrote: >>>> Hi Qu, >>>> >>>> Am 02.04.21 um 05:18 schrieb Qu Huang: >>>>> Before dma_resv_lock(bo->base.resv, NULL) in >>>>> amdgpu_bo_release_notify(), >>>>> the bo->base.resv lock may be held by ttm_mem_evict_first(), >>>> >>>> That can't happen since when bo_release_notify is called the BO has >>>> not >>>> more references and is therefore deleted. >>>> >>>> And we never evict a deleted BO, we just wait for it to become idle. >>>> >>> Yes, the bo reference counter return to zero will enter >>> ttm_bo_release(),but notify bo release (call >>> amdgpu_bo_release_notify()) >>> first happen, and then test if a reservation object's fences have been >>> signaled, and then mark bo as deleted and remove bo from the LRU list. >>> >>> When ttm_bo_release() and ttm_mem_evict_first() is concurrent, >>> the Bo has not been removed from the LRU list and is not marked as >>> deleted, this will happen. >> >> Not sure on which code base you are, but I don't see how this can >> happen. >> >> ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and >> ttm_bo_release() is only called when the BO reference count becomes >> zero. >> >> So ttm_mem_evict_first() will see that this BO is about to be destroyed >> and skips it. >> > > Yes, you are right. My version of TTM is ROCM 3.3, so > ttm_mem_evict_first() did not call ttm_bo_get_unless_zero(), check that > ROCM 4.0 ttm doesn't have this issue. This is an oversight on my part. > >>> >>> As a test, when we use CPU memset instead of SDMA fill in >>> amdgpu_bo_release_notify(), the result is page fault: >>> >>> PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" >>> #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 >>> #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 >>> #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 >>> #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 >>> #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 >>> #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d >>> #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae >>> #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 >>> #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 >>> #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 >>> [exception RIP: memset+31] >>> RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 >>> RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: >>> 0000060b00200000 >>> RDX: 0000000000000000 RSI: 00000000000000be RDI: >>> ffffab807f000000 >>> RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: >>> ffffab7c80000000 >>> R10: 000000000000bcba R11: 00000000000001ba R12: >>> ffff8e79ebaa4050 >>> R13: ffffab7c80000000 R14: 0000000000022600 R15: >>> ffff8e8136e04100 >>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 >>> #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 >>> [amdgpu] >>> #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] >>> #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] >>> #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] >>> #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 >>> #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 >>> #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 >>> #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 >>> #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf >>> #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 >>> #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb >> >> Well that might be perfectly expected. VRAM is not necessarily CPU >> accessible. >> > As a test,use CPU memset instead of SDMA fill, This is my code: > void amdgpu_bo_release_notify(struct ttm_buffer_object *bo) > { > struct amdgpu_bo *abo; > uint64_t num_pages; > struct drm_mm_node *mm_node; > struct amdgpu_device *adev; > void __iomem *kaddr; > > if (!amdgpu_bo_is_amdgpu_bo(bo)) > return; > > abo = ttm_to_amdgpu_bo(bo); > num_pages = abo->tbo.num_pages; > mm_node = abo->tbo.mem.mm_node; > adev = amdgpu_ttm_adev(abo->tbo.bdev); > kaddr = adev->mman.aper_base_kaddr; > > if (abo->kfd_bo) > amdgpu_amdkfd_unreserve_memory_limit(abo); > > if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || > !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) > return; > > dma_resv_lock(amdkcl_ttm_resvp(bo), NULL); > while (num_pages && mm_node) { > void *ptr = kaddr + (mm_node->start << PAGE_SHIFT); That might not work as expected. aper_base_kaddr can only point to a 256MiB window into VRAM, but VRAM itself is usually much larger. So your memset_io() might end up in nirvana if the BO is allocated outside of the window. > memset_io(ptr, AMDGPU_POISON & 0xff, mm_node->size <<PAGE_SHIFT); > num_pages -= mm_node->size; > ++mm_node; > } > dma_resv_unlock(amdkcl_ttm_resvp(bo)); > } > > > > > > I have used the old version through oversight, so I am sorry for your > trouble. No, problem. I was just wondering if I was missing something. Regards, Christian. > > > Regards, > Qu. > >> Regards, >> Christian. >> >>> >>> Regards, >>> Qu. >>> >>> >>>> Regards, >>>> Christian. >>>> >>>>> and the VRAM mem will be evicted, mem region was replaced >>>>> by Gtt mem region. amdgpu_bo_release_notify() will then >>>>> hold the bo->base.resv lock, and SDMA will get an invalid >>>>> address in amdgpu_fill_buffer(), resulting in a VMFAULT >>>>> or memory corruption. >>>>> >>>>> To avoid it, we have to hold bo->base.resv lock first, and >>>>> check whether the mem.mem_type is TTM_PL_VRAM. >>>>> >>>>> Signed-off-by: Qu Huang <jinsdb@126.com> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >>>>> 1 file changed, 6 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> index 4b29b82..8018574 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >>>>> ttm_buffer_object *bo) >>>>> if (bo->base.resv == &bo->base._resv) >>>>> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >>>>> >>>>> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >>>>> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>>>> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>>>> return; >>>>> >>>>> dma_resv_lock(bo->base.resv, NULL); >>>>> >>>>> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >>>>> + dma_resv_unlock(bo->base.resv); >>>>> + return; >>>>> + } >>>>> + >>>>> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, >>>>> &fence); >>>>> if (!WARN_ON(r)) { >>>>> amdgpu_bo_fence(abo, fence, false); >>>>> -- >>>>> 1.8.3.1 >>>>> >>> > > _______________________________________________ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-06 13:44 ` Christian König 0 siblings, 0 replies; 20+ messages in thread From: Christian König @ 2021-04-06 13:44 UTC (permalink / raw) To: Qu Huang, Christian König, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: linux-kernel, amd-gfx, dri-devel, linux-media Hi Qu, Am 06.04.21 um 08:04 schrieb Qu Huang: > Hi Christian, > > On 2021/4/3 16:49, Christian König wrote: >> Hi Qu, >> >> Am 03.04.21 um 07:08 schrieb Qu Huang: >>> Hi Christian, >>> >>> On 2021/4/3 0:25, Christian König wrote: >>>> Hi Qu, >>>> >>>> Am 02.04.21 um 05:18 schrieb Qu Huang: >>>>> Before dma_resv_lock(bo->base.resv, NULL) in >>>>> amdgpu_bo_release_notify(), >>>>> the bo->base.resv lock may be held by ttm_mem_evict_first(), >>>> >>>> That can't happen since when bo_release_notify is called the BO has >>>> not >>>> more references and is therefore deleted. >>>> >>>> And we never evict a deleted BO, we just wait for it to become idle. >>>> >>> Yes, the bo reference counter return to zero will enter >>> ttm_bo_release(),but notify bo release (call >>> amdgpu_bo_release_notify()) >>> first happen, and then test if a reservation object's fences have been >>> signaled, and then mark bo as deleted and remove bo from the LRU list. >>> >>> When ttm_bo_release() and ttm_mem_evict_first() is concurrent, >>> the Bo has not been removed from the LRU list and is not marked as >>> deleted, this will happen. >> >> Not sure on which code base you are, but I don't see how this can >> happen. >> >> ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and >> ttm_bo_release() is only called when the BO reference count becomes >> zero. >> >> So ttm_mem_evict_first() will see that this BO is about to be destroyed >> and skips it. >> > > Yes, you are right. My version of TTM is ROCM 3.3, so > ttm_mem_evict_first() did not call ttm_bo_get_unless_zero(), check that > ROCM 4.0 ttm doesn't have this issue. This is an oversight on my part. > >>> >>> As a test, when we use CPU memset instead of SDMA fill in >>> amdgpu_bo_release_notify(), the result is page fault: >>> >>> PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" >>> #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 >>> #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 >>> #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 >>> #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 >>> #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 >>> #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d >>> #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae >>> #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 >>> #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 >>> #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 >>> [exception RIP: memset+31] >>> RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 >>> RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: >>> 0000060b00200000 >>> RDX: 0000000000000000 RSI: 00000000000000be RDI: >>> ffffab807f000000 >>> RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: >>> ffffab7c80000000 >>> R10: 000000000000bcba R11: 00000000000001ba R12: >>> ffff8e79ebaa4050 >>> R13: ffffab7c80000000 R14: 0000000000022600 R15: >>> ffff8e8136e04100 >>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 >>> #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 >>> [amdgpu] >>> #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] >>> #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] >>> #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] >>> #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 >>> #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 >>> #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 >>> #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 >>> #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf >>> #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 >>> #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb >> >> Well that might be perfectly expected. VRAM is not necessarily CPU >> accessible. >> > As a test,use CPU memset instead of SDMA fill, This is my code: > void amdgpu_bo_release_notify(struct ttm_buffer_object *bo) > { > struct amdgpu_bo *abo; > uint64_t num_pages; > struct drm_mm_node *mm_node; > struct amdgpu_device *adev; > void __iomem *kaddr; > > if (!amdgpu_bo_is_amdgpu_bo(bo)) > return; > > abo = ttm_to_amdgpu_bo(bo); > num_pages = abo->tbo.num_pages; > mm_node = abo->tbo.mem.mm_node; > adev = amdgpu_ttm_adev(abo->tbo.bdev); > kaddr = adev->mman.aper_base_kaddr; > > if (abo->kfd_bo) > amdgpu_amdkfd_unreserve_memory_limit(abo); > > if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || > !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) > return; > > dma_resv_lock(amdkcl_ttm_resvp(bo), NULL); > while (num_pages && mm_node) { > void *ptr = kaddr + (mm_node->start << PAGE_SHIFT); That might not work as expected. aper_base_kaddr can only point to a 256MiB window into VRAM, but VRAM itself is usually much larger. So your memset_io() might end up in nirvana if the BO is allocated outside of the window. > memset_io(ptr, AMDGPU_POISON & 0xff, mm_node->size <<PAGE_SHIFT); > num_pages -= mm_node->size; > ++mm_node; > } > dma_resv_unlock(amdkcl_ttm_resvp(bo)); > } > > > > > > I have used the old version through oversight, so I am sorry for your > trouble. No, problem. I was just wondering if I was missing something. Regards, Christian. > > > Regards, > Qu. > >> Regards, >> Christian. >> >>> >>> Regards, >>> Qu. >>> >>> >>>> Regards, >>>> Christian. >>>> >>>>> and the VRAM mem will be evicted, mem region was replaced >>>>> by Gtt mem region. amdgpu_bo_release_notify() will then >>>>> hold the bo->base.resv lock, and SDMA will get an invalid >>>>> address in amdgpu_fill_buffer(), resulting in a VMFAULT >>>>> or memory corruption. >>>>> >>>>> To avoid it, we have to hold bo->base.resv lock first, and >>>>> check whether the mem.mem_type is TTM_PL_VRAM. >>>>> >>>>> Signed-off-by: Qu Huang <jinsdb@126.com> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >>>>> 1 file changed, 6 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> index 4b29b82..8018574 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >>>>> ttm_buffer_object *bo) >>>>> if (bo->base.resv == &bo->base._resv) >>>>> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >>>>> >>>>> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >>>>> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>>>> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>>>> return; >>>>> >>>>> dma_resv_lock(bo->base.resv, NULL); >>>>> >>>>> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >>>>> + dma_resv_unlock(bo->base.resv); >>>>> + return; >>>>> + } >>>>> + >>>>> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, >>>>> &fence); >>>>> if (!WARN_ON(r)) { >>>>> amdgpu_bo_fence(abo, fence, false); >>>>> -- >>>>> 1.8.3.1 >>>>> >>> > > _______________________________________________ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access @ 2021-04-06 13:44 ` Christian König 0 siblings, 0 replies; 20+ messages in thread From: Christian König @ 2021-04-06 13:44 UTC (permalink / raw) To: Qu Huang, Christian König, alexander.deucher, airlied, daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd Cc: linux-kernel, amd-gfx, dri-devel, linux-media Hi Qu, Am 06.04.21 um 08:04 schrieb Qu Huang: > Hi Christian, > > On 2021/4/3 16:49, Christian König wrote: >> Hi Qu, >> >> Am 03.04.21 um 07:08 schrieb Qu Huang: >>> Hi Christian, >>> >>> On 2021/4/3 0:25, Christian König wrote: >>>> Hi Qu, >>>> >>>> Am 02.04.21 um 05:18 schrieb Qu Huang: >>>>> Before dma_resv_lock(bo->base.resv, NULL) in >>>>> amdgpu_bo_release_notify(), >>>>> the bo->base.resv lock may be held by ttm_mem_evict_first(), >>>> >>>> That can't happen since when bo_release_notify is called the BO has >>>> not >>>> more references and is therefore deleted. >>>> >>>> And we never evict a deleted BO, we just wait for it to become idle. >>>> >>> Yes, the bo reference counter return to zero will enter >>> ttm_bo_release(),but notify bo release (call >>> amdgpu_bo_release_notify()) >>> first happen, and then test if a reservation object's fences have been >>> signaled, and then mark bo as deleted and remove bo from the LRU list. >>> >>> When ttm_bo_release() and ttm_mem_evict_first() is concurrent, >>> the Bo has not been removed from the LRU list and is not marked as >>> deleted, this will happen. >> >> Not sure on which code base you are, but I don't see how this can >> happen. >> >> ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and >> ttm_bo_release() is only called when the BO reference count becomes >> zero. >> >> So ttm_mem_evict_first() will see that this BO is about to be destroyed >> and skips it. >> > > Yes, you are right. My version of TTM is ROCM 3.3, so > ttm_mem_evict_first() did not call ttm_bo_get_unless_zero(), check that > ROCM 4.0 ttm doesn't have this issue. This is an oversight on my part. > >>> >>> As a test, when we use CPU memset instead of SDMA fill in >>> amdgpu_bo_release_notify(), the result is page fault: >>> >>> PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf" >>> #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784 >>> #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92 >>> #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80 >>> #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768 >>> #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6 >>> #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d >>> #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae >>> #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0 >>> #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925 >>> #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758 >>> [exception RIP: memset+31] >>> RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17 >>> RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: >>> 0000060b00200000 >>> RDX: 0000000000000000 RSI: 00000000000000be RDI: >>> ffffab807f000000 >>> RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: >>> ffffab7c80000000 >>> R10: 000000000000bcba R11: 00000000000001ba R12: >>> ffff8e79ebaa4050 >>> R13: ffffab7c80000000 R14: 0000000000022600 R15: >>> ffff8e8136e04100 >>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 >>> #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 >>> [amdgpu] >>> #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm] >>> #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm] >>> #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm] >>> #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115 >>> #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64 >>> #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7 >>> #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95 >>> #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf >>> #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544 >>> #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb >> >> Well that might be perfectly expected. VRAM is not necessarily CPU >> accessible. >> > As a test,use CPU memset instead of SDMA fill, This is my code: > void amdgpu_bo_release_notify(struct ttm_buffer_object *bo) > { > struct amdgpu_bo *abo; > uint64_t num_pages; > struct drm_mm_node *mm_node; > struct amdgpu_device *adev; > void __iomem *kaddr; > > if (!amdgpu_bo_is_amdgpu_bo(bo)) > return; > > abo = ttm_to_amdgpu_bo(bo); > num_pages = abo->tbo.num_pages; > mm_node = abo->tbo.mem.mm_node; > adev = amdgpu_ttm_adev(abo->tbo.bdev); > kaddr = adev->mman.aper_base_kaddr; > > if (abo->kfd_bo) > amdgpu_amdkfd_unreserve_memory_limit(abo); > > if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || > !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) > return; > > dma_resv_lock(amdkcl_ttm_resvp(bo), NULL); > while (num_pages && mm_node) { > void *ptr = kaddr + (mm_node->start << PAGE_SHIFT); That might not work as expected. aper_base_kaddr can only point to a 256MiB window into VRAM, but VRAM itself is usually much larger. So your memset_io() might end up in nirvana if the BO is allocated outside of the window. > memset_io(ptr, AMDGPU_POISON & 0xff, mm_node->size <<PAGE_SHIFT); > num_pages -= mm_node->size; > ++mm_node; > } > dma_resv_unlock(amdkcl_ttm_resvp(bo)); > } > > > > > > I have used the old version through oversight, so I am sorry for your > trouble. No, problem. I was just wondering if I was missing something. Regards, Christian. > > > Regards, > Qu. > >> Regards, >> Christian. >> >>> >>> Regards, >>> Qu. >>> >>> >>>> Regards, >>>> Christian. >>>> >>>>> and the VRAM mem will be evicted, mem region was replaced >>>>> by Gtt mem region. amdgpu_bo_release_notify() will then >>>>> hold the bo->base.resv lock, and SDMA will get an invalid >>>>> address in amdgpu_fill_buffer(), resulting in a VMFAULT >>>>> or memory corruption. >>>>> >>>>> To avoid it, we have to hold bo->base.resv lock first, and >>>>> check whether the mem.mem_type is TTM_PL_VRAM. >>>>> >>>>> Signed-off-by: Qu Huang <jinsdb@126.com> >>>>> --- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++-- >>>>> 1 file changed, 6 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> index 4b29b82..8018574 100644 >>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c >>>>> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct >>>>> ttm_buffer_object *bo) >>>>> if (bo->base.resv == &bo->base._resv) >>>>> amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo); >>>>> >>>>> - if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node || >>>>> - !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>>>> + if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE)) >>>>> return; >>>>> >>>>> dma_resv_lock(bo->base.resv, NULL); >>>>> >>>>> + if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) { >>>>> + dma_resv_unlock(bo->base.resv); >>>>> + return; >>>>> + } >>>>> + >>>>> r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, >>>>> &fence); >>>>> if (!WARN_ON(r)) { >>>>> amdgpu_bo_fence(abo, fence, false); >>>>> -- >>>>> 1.8.3.1 >>>>> >>> > > _______________________________________________ > amd-gfx mailing list > amd-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/amd-gfx _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2021-04-06 13:44 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-04-02 3:18 [PATCH] drm/amdgpu: Fix a potential sdma invalid access Qu Huang 2021-04-02 3:18 ` Qu Huang 2021-04-02 3:18 ` Qu Huang 2021-04-02 16:25 ` Christian König 2021-04-02 16:25 ` Christian König 2021-04-02 16:25 ` Christian König 2021-04-03 3:08 ` Qu Huang 2021-04-03 3:08 ` Qu Huang 2021-04-03 5:08 ` Qu Huang 2021-04-03 5:08 ` Qu Huang 2021-04-03 5:08 ` Qu Huang 2021-04-03 8:49 ` Christian König 2021-04-03 8:49 ` Christian König 2021-04-03 8:49 ` Christian König 2021-04-06 6:04 ` Qu Huang 2021-04-06 6:04 ` Qu Huang 2021-04-06 6:04 ` Qu Huang 2021-04-06 13:44 ` Christian König 2021-04-06 13:44 ` Christian König 2021-04-06 13:44 ` Christian König
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.