* [PATCH] drm/amdgpu: Fix a potential sdma invalid access
@ 2021-04-02  3:18 ` Qu Huang
  0 siblings, 0 replies; 20+ messages in thread
From: Qu Huang @ 2021-04-02  3:18 UTC (permalink / raw)
  To: alexander.deucher, christian.koenig, airlied, daniel,
	sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd
  Cc: amd-gfx, dri-devel, linux-kernel, linux-media, linaro-mm-sig, jinsdb

Before dma_resv_lock(bo->base.resv, NULL) is taken in amdgpu_bo_release_notify(),
the bo->base.resv lock may already be held by ttm_mem_evict_first(),
in which case the VRAM memory is evicted and the mem region is replaced
by a GTT mem region. amdgpu_bo_release_notify() then acquires the
bo->base.resv lock, and SDMA gets an invalid address in
amdgpu_fill_buffer(), resulting in a VMFAULT or memory corruption.

To avoid this, take the bo->base.resv lock first, and only then check
whether mem.mem_type is still TTM_PL_VRAM.

Signed-off-by: Qu Huang <jinsdb@126.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 4b29b82..8018574 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
 	if (bo->base.resv == &bo->base._resv)
 		amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);

-	if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
-	    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
+	if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
 		return;

 	dma_resv_lock(bo->base.resv, NULL);

+	if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
+		dma_resv_unlock(bo->base.resv);
+		return;
+	}
+
 	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
 	if (!WARN_ON(r)) {
 		amdgpu_bo_fence(abo, fence, false);
--
1.8.3.1
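
For readability, this is roughly how the tail of amdgpu_bo_release_notify()
reads with the change above applied (a sketch assembled from the diff, with the
unchanged surrounding code elided or summarized in comments):

void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
{
	/* ... abo, r and fence are set up as before ... */

	if (bo->base.resv == &bo->base._resv)
		amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);

	/* Only the flag is checked before taking the lock. */
	if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
		return;

	dma_resv_lock(bo->base.resv, NULL);

	/* The placement is re-checked under the lock, so an eviction that
	 * raced with us and moved the BO out of VRAM bails out here instead
	 * of handing a stale VRAM address to the SDMA fill below. */
	if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
		dma_resv_unlock(bo->base.resv);
		return;
	}

	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
	if (!WARN_ON(r)) {
		amdgpu_bo_fence(abo, fence, false);
		/* ... fence put as before ... */
	}

	dma_resv_unlock(bo->base.resv);
}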


* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access
  2021-04-02  3:18 ` Qu Huang
@ 2021-04-02 16:25   ` Christian König
  -1 siblings, 0 replies; 20+ messages in thread
From: Christian König @ 2021-04-02 16:25 UTC (permalink / raw)
  To: Qu Huang, alexander.deucher, airlied, daniel, sumit.semwal,
	airlied, ray.huang, Mihir.Patel, nirmoy.aiemd
  Cc: amd-gfx, dri-devel, linux-kernel, linux-media, linaro-mm-sig

Hi Qu,

On 02.04.21 at 05:18, Qu Huang wrote:
> Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(),
> the bo->base.resv lock may be held by ttm_mem_evict_first(),

That can't happen, since by the time bo_release_notify is called the BO has no
more references and is therefore deleted.

And we never evict a deleted BO; we just wait for it to become idle.

Regards,
Christian.

> and the VRAM mem will be evicted, mem region was replaced
> by Gtt mem region. amdgpu_bo_release_notify() will then
> hold the bo->base.resv lock, and SDMA will get an invalid
> address in amdgpu_fill_buffer(), resulting in a VMFAULT
> or memory corruption.
>
> To avoid it, we have to hold bo->base.resv lock first, and
> check whether the mem.mem_type is TTM_PL_VRAM.
>
> Signed-off-by: Qu Huang <jinsdb@126.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 4b29b82..8018574 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
>   	if (bo->base.resv == &bo->base._resv)
>   		amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);
>
> -	if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
> -	    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
> +	if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>   		return;
>
>   	dma_resv_lock(bo->base.resv, NULL);
>
> +	if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
> +		dma_resv_unlock(bo->base.resv);
> +		return;
> +	}
> +
>   	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
>   	if (!WARN_ON(r)) {
>   		amdgpu_bo_fence(abo, fence, false);
> --
> 1.8.3.1
>


* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access
  2021-04-02 16:25   ` Christian König
@ 2021-04-03  5:08     ` Qu Huang
  -1 siblings, 0 replies; 20+ messages in thread
From: Qu Huang @ 2021-04-03  5:08 UTC (permalink / raw)
  To: Christian König, alexander.deucher, airlied, daniel,
	sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd
  Cc: amd-gfx, dri-devel, linux-kernel, linux-media

Hi Christian,

On 2021/4/3 0:25, Christian König wrote:
> Hi Qu,
>
> On 02.04.21 at 05:18, Qu Huang wrote:
>> Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(),
>> the bo->base.resv lock may be held by ttm_mem_evict_first(),
>
> That can't happen since when bo_release_notify is called the BO has not
> more references and is therefore deleted.
>
> And we never evict a deleted BO, we just wait for it to become idle.
>
Yes, when the BO reference counter drops to zero we enter
ttm_bo_release(), but the release notification (the call to
amdgpu_bo_release_notify()) happens first; only after that do we test
whether the reservation object's fences have been signaled, mark the BO
as deleted, and remove it from the LRU list.

So when ttm_bo_release() and ttm_mem_evict_first() run concurrently, the
BO has not yet been removed from the LRU list and is not marked as
deleted, and this can happen.
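
For reference, the ordering described here corresponds roughly to the following
shape of the old ttm_bo_release() (a simplified sketch written for illustration
against that era's code; the exact helpers and names may differ):

static void ttm_bo_release(struct kref *kref)
{
	struct ttm_buffer_object *bo =
		container_of(kref, struct ttm_buffer_object, kref);

	/* 1. The driver is notified first; this is the point where
	 *    amdgpu_bo_release_notify() checks the placement and issues the
	 *    SDMA wipe. */
	if (bo->bdev->driver->release_notify)
		bo->bdev->driver->release_notify(bo);

	/* 2. Only afterwards are the reservation fences tested and the BO
	 *    marked as deleted and removed from the LRU list, so between
	 *    step 1 and step 2 an eviction pass can still find the BO on the
	 *    LRU and move it out of VRAM. */
	ttm_bo_cleanup_refs_or_queue(bo);
	kref_put(&bo->list_kref, ttm_bo_release_list);
}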

As a test, when we use CPU memset instead of SDMA fill in
amdgpu_bo_release_notify(), the result is a page fault:

PID: 5490   TASK: ffff8e8136e04100  CPU: 4   COMMAND: "gemmPerf"
   #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784
   #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92
   #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80
   #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768
   #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6
   #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d
   #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae
   #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0
   #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925
   #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758
      [exception RIP: memset+31]
      RIP: ffffffffb2b8668f  RSP: ffff8e79eaa17ce8  RFLAGS: 00010a17
      RAX: bebebebebebebebe  RBX: ffff8e747bff10c0  RCX: 0000060b00200000
      RDX: 0000000000000000  RSI: 00000000000000be  RDI: ffffab807f000000
      RBP: ffff8e79eaa17d10   R8: ffff8e79eaa14000   R9: ffffab7c80000000
      R10: 000000000000bcba  R11: 00000000000001ba  R12: ffff8e79ebaa4050
      R13: ffffab7c80000000  R14: 0000000000022600  R15: ffff8e8136e04100
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 [amdgpu]
#11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm]
#12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm]
#13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm]
#14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115
#15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64
#16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7
#17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95
#18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf
#19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544
#20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb

Regards,
Qu.


> Regards,
> Christian.
>
>> and the VRAM mem will be evicted, mem region was replaced
>> by Gtt mem region. amdgpu_bo_release_notify() will then
>> hold the bo->base.resv lock, and SDMA will get an invalid
>> address in amdgpu_fill_buffer(), resulting in a VMFAULT
>> or memory corruption.
>>
>> To avoid it, we have to hold bo->base.resv lock first, and
>> check whether the mem.mem_type is TTM_PL_VRAM.
>>
>> Signed-off-by: Qu Huang <jinsdb@126.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> index 4b29b82..8018574 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct
>> ttm_buffer_object *bo)
>>       if (bo->base.resv == &bo->base._resv)
>>           amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);
>>
>> -    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
>> -        !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>> +    if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>>           return;
>>
>>       dma_resv_lock(bo->base.resv, NULL);
>>
>> +    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
>> +        dma_resv_unlock(bo->base.resv);
>> +        return;
>> +    }
>> +
>>       r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
>>       if (!WARN_ON(r)) {
>>           amdgpu_bo_fence(abo, fence, false);
>> --
>> 1.8.3.1
>>


* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access
  2021-04-03  5:08     ` Qu Huang
@ 2021-04-03  8:49       ` Christian König
  -1 siblings, 0 replies; 20+ messages in thread
From: Christian König @ 2021-04-03  8:49 UTC (permalink / raw)
  To: Qu Huang, alexander.deucher, airlied, daniel, sumit.semwal,
	airlied, ray.huang, Mihir.Patel, nirmoy.aiemd
  Cc: amd-gfx, dri-devel, linux-kernel, linux-media

Hi Qu,

On 03.04.21 at 07:08, Qu Huang wrote:
> Hi Christian,
>
> On 2021/4/3 0:25, Christian König wrote:
>> Hi Qu,
>>
>> On 02.04.21 at 05:18, Qu Huang wrote:
>>> Before dma_resv_lock(bo->base.resv, NULL) in 
>>> amdgpu_bo_release_notify(),
>>> the bo->base.resv lock may be held by ttm_mem_evict_first(),
>>
>> That can't happen since when bo_release_notify is called the BO has not
>> more references and is therefore deleted.
>>
>> And we never evict a deleted BO, we just wait for it to become idle.
>>
> Yes, the bo reference counter return to zero will enter
> ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify())
> first happen, and then test if a reservation object's fences have been
> signaled, and then mark bo as deleted and remove bo from the LRU list.
>
> When ttm_bo_release() and ttm_mem_evict_first() is concurrent,
> the Bo has not been removed from the LRU list and is not marked as
> deleted, this will happen.

Not sure which code base you are on, but I don't see how this can happen.

ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and 
ttm_bo_release() is only called when the BO reference count becomes zero.

So ttm_mem_evict_first() will see that this BO is about to be destroyed
and will skip it.
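
For context, the guard being referred to looks roughly like this in the newer
eviction path (a simplified sketch of the LRU walk in ttm_mem_evict_first(),
not the exact upstream loop):

	list_for_each_entry(bo, &man->lru[i], lru) {
		if (!ttm_bo_evict_swapout_allowable(bo, ctx, &locked, &busy))
			continue;

		/* A BO whose refcount has already dropped to zero is in the
		 * middle of ttm_bo_release(); taking a reference fails here,
		 * so the eviction pass skips it instead of racing with the
		 * release path. */
		if (!ttm_bo_get_unless_zero(bo)) {
			if (locked)
				dma_resv_unlock(bo->base.resv);
			continue;
		}

		/* found an eviction candidate, stop searching */
		break;
	}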

>
> As a test, when we use CPU memset instead of SDMA fill in
> amdgpu_bo_release_notify(), the result is page fault:
>
> PID: 5490   TASK: ffff8e8136e04100  CPU: 4   COMMAND: "gemmPerf"
>   #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784
>   #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92
>   #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80
>   #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768
>   #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6
>   #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d
>   #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae
>   #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0
>   #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925
>   #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758
>      [exception RIP: memset+31]
>      RIP: ffffffffb2b8668f  RSP: ffff8e79eaa17ce8  RFLAGS: 00010a17
>      RAX: bebebebebebebebe  RBX: ffff8e747bff10c0  RCX: 0000060b00200000
>      RDX: 0000000000000000  RSI: 00000000000000be  RDI: ffffab807f000000
>      RBP: ffff8e79eaa17d10   R8: ffff8e79eaa14000   R9: ffffab7c80000000
>      R10: 000000000000bcba  R11: 00000000000001ba  R12: ffff8e79ebaa4050
>      R13: ffffab7c80000000  R14: 0000000000022600  R15: ffff8e8136e04100
>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 
> [amdgpu]
> #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm]
> #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm]
> #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm]
> #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115
> #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64
> #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7
> #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95
> #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf
> #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544
> #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb

Well that might be perfectly expected. VRAM is not necessarily CPU 
accessible.
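
(As an aside, only the part of VRAM covered by the CPU-visible aperture can be
touched with a CPU memset at all, so a CPU-based wipe would at least need a
bounds check along these lines; this is purely an illustrative sketch, not code
from the driver:)

	uint64_t offset = bo->mem.start << PAGE_SHIFT;
	uint64_t size = bo->mem.num_pages << PAGE_SHIFT;

	/* Illustrative only: anything beyond the visible window has no CPU
	 * mapping through the BAR aperture and cannot be cleared from here. */
	if (offset + size > adev->gmc.visible_vram_size)
		return;

	memset_io(adev->mman.aper_base_kaddr + offset, AMDGPU_POISON & 0xff, size);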

Regards,
Christian.

>
> Regards,
> Qu.
>
>
>> Regards,
>> Christian.
>>
>>> and the VRAM mem will be evicted, mem region was replaced
>>> by Gtt mem region. amdgpu_bo_release_notify() will then
>>> hold the bo->base.resv lock, and SDMA will get an invalid
>>> address in amdgpu_fill_buffer(), resulting in a VMFAULT
>>> or memory corruption.
>>>
>>> To avoid it, we have to hold bo->base.resv lock first, and
>>> check whether the mem.mem_type is TTM_PL_VRAM.
>>>
>>> Signed-off-by: Qu Huang <jinsdb@126.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> index 4b29b82..8018574 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct
>>> ttm_buffer_object *bo)
>>>       if (bo->base.resv == &bo->base._resv)
>>>           amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);
>>>
>>> -    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
>>> -        !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>>> +    if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>>>           return;
>>>
>>>       dma_resv_lock(bo->base.resv, NULL);
>>>
>>> +    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
>>> +        dma_resv_unlock(bo->base.resv);
>>> +        return;
>>> +    }
>>> +
>>>       r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, 
>>> &fence);
>>>       if (!WARN_ON(r)) {
>>>           amdgpu_bo_fence(abo, fence, false);
>>> -- 
>>> 1.8.3.1
>>>
>


* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access
  2021-04-03  8:49       ` Christian König
@ 2021-04-06  6:04         ` Qu Huang
  -1 siblings, 0 replies; 20+ messages in thread
From: Qu Huang @ 2021-04-06  6:04 UTC (permalink / raw)
  To: Christian König, alexander.deucher, airlied, daniel,
	sumit.semwal, airlied, ray.huang, Mihir.Patel, nirmoy.aiemd
  Cc: amd-gfx, dri-devel, linux-kernel, linux-media

Hi Christian,

On 2021/4/3 16:49, Christian König wrote:
> Hi Qu,
>
> On 03.04.21 at 07:08, Qu Huang wrote:
>> Hi Christian,
>>
>> On 2021/4/3 0:25, Christian König wrote:
>>> Hi Qu,
>>>
>>> On 02.04.21 at 05:18, Qu Huang wrote:
>>>> Before dma_resv_lock(bo->base.resv, NULL) in
>>>> amdgpu_bo_release_notify(),
>>>> the bo->base.resv lock may be held by ttm_mem_evict_first(),
>>>
>>> That can't happen since when bo_release_notify is called the BO has not
>>> more references and is therefore deleted.
>>>
>>> And we never evict a deleted BO, we just wait for it to become idle.
>>>
>> Yes, the bo reference counter return to zero will enter
>> ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify())
>> first happen, and then test if a reservation object's fences have been
>> signaled, and then mark bo as deleted and remove bo from the LRU list.
>>
>> When ttm_bo_release() and ttm_mem_evict_first() is concurrent,
>> the Bo has not been removed from the LRU list and is not marked as
>> deleted, this will happen.
>
> Not sure on which code base you are, but I don't see how this can happen.
>
> ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and
> ttm_bo_release() is only called when the BO reference count becomes zero.
>
> So ttm_mem_evict_first() will see that this BO is about to be destroyed
> and skips it.
>

Yes, you are right. My TTM version is from ROCm 3.3, where
ttm_mem_evict_first() did not yet call ttm_bo_get_unless_zero(); I checked
that the ROCm 4.0 TTM does not have this issue. This was an oversight on my part.
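
For reference, the guard in the newer TTM looks roughly like this (a
paraphrased, untested sketch of the upstream check, not the exact code;
"bo" and "locked" are the local variables of the LRU walk inside
ttm_mem_evict_first()):

	/* If the refcount has already dropped to zero the BO is inside
	 * ttm_bo_release() and must not be picked for eviction. */
	if (!ttm_bo_get_unless_zero(bo)) {
		if (locked)
			dma_resv_unlock(bo->base.resv);
		continue;
	}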

>>
>> As a test, when we use CPU memset instead of SDMA fill in
>> amdgpu_bo_release_notify(), the result is page fault:
>>
>> PID: 5490   TASK: ffff8e8136e04100  CPU: 4   COMMAND: "gemmPerf"
>>   #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784
>>   #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92
>>   #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80
>>   #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768
>>   #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6
>>   #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d
>>   #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae
>>   #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0
>>   #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925
>>   #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758
>>      [exception RIP: memset+31]
>>      RIP: ffffffffb2b8668f  RSP: ffff8e79eaa17ce8  RFLAGS: 00010a17
>>      RAX: bebebebebebebebe  RBX: ffff8e747bff10c0  RCX: 0000060b00200000
>>      RDX: 0000000000000000  RSI: 00000000000000be  RDI: ffffab807f000000
>>      RBP: ffff8e79eaa17d10   R8: ffff8e79eaa14000   R9: ffffab7c80000000
>>      R10: 000000000000bcba  R11: 00000000000001ba  R12: ffff8e79ebaa4050
>>      R13: ffffab7c80000000  R14: 0000000000022600  R15: ffff8e8136e04100
>>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>> #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1
>> [amdgpu]
>> #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm]
>> #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm]
>> #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm]
>> #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115
>> #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64
>> #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7
>> #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95
>> #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf
>> #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544
>> #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb
>
> Well that might be perfectly expected. VRAM is not necessarily CPU
> accessible.
>
As a test, I used a CPU memset instead of the SDMA fill. This is my code:
void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
{
	struct amdgpu_bo *abo;
	uint64_t num_pages;
	struct drm_mm_node *mm_node;
	struct amdgpu_device *adev;
	void __iomem *kaddr;

	if (!amdgpu_bo_is_amdgpu_bo(bo))
		return;

	abo = ttm_to_amdgpu_bo(bo);
	num_pages = abo->tbo.num_pages;
	mm_node = abo->tbo.mem.mm_node;
	adev = amdgpu_ttm_adev(abo->tbo.bdev);
	kaddr = adev->mman.aper_base_kaddr;

	if (abo->kfd_bo)
		amdgpu_amdkfd_unreserve_memory_limit(abo);

	if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
	    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
		return;

	dma_resv_lock(amdkcl_ttm_resvp(bo), NULL);
	while (num_pages && mm_node) {
		void *ptr = kaddr + (mm_node->start << PAGE_SHIFT);
		memset_io(ptr, AMDGPU_POISON & 0xff, mm_node->size << PAGE_SHIFT);
		num_pages -= mm_node->size;
		++mm_node;
	}
	dma_resv_unlock(amdkcl_ttm_resvp(bo));
}

I was using the old version by oversight, so I am sorry for the trouble.


Regards,
Qu.

> Regards,
> Christian.
>
>>
>> Regards,
>> Qu.
>>
>>
>>> Regards,
>>> Christian.
>>>
>>>> and the VRAM mem will be evicted, mem region was replaced
>>>> by Gtt mem region. amdgpu_bo_release_notify() will then
>>>> hold the bo->base.resv lock, and SDMA will get an invalid
>>>> address in amdgpu_fill_buffer(), resulting in a VMFAULT
>>>> or memory corruption.
>>>>
>>>> To avoid it, we have to hold bo->base.resv lock first, and
>>>> check whether the mem.mem_type is TTM_PL_VRAM.
>>>>
>>>> Signed-off-by: Qu Huang <jinsdb@126.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
>>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> index 4b29b82..8018574 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct
>>>> ttm_buffer_object *bo)
>>>>       if (bo->base.resv == &bo->base._resv)
>>>>           amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);
>>>>
>>>> -    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
>>>> -        !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>>>> +    if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>>>>           return;
>>>>
>>>>       dma_resv_lock(bo->base.resv, NULL);
>>>>
>>>> +    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
>>>> +        dma_resv_unlock(bo->base.resv);
>>>> +        return;
>>>> +    }
>>>> +
>>>>       r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv,
>>>> &fence);
>>>>       if (!WARN_ON(r)) {
>>>>           amdgpu_bo_fence(abo, fence, false);
>>>> --
>>>> 1.8.3.1
>>>>
>>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix a potential sdma invalid access
  2021-04-06  6:04         ` Qu Huang
@ 2021-04-06 13:44           ` Christian König
  -1 siblings, 0 replies; 20+ messages in thread
From: Christian König @ 2021-04-06 13:44 UTC (permalink / raw)
  To: Qu Huang, Christian König, alexander.deucher, airlied,
	daniel, sumit.semwal, airlied, ray.huang, Mihir.Patel,
	nirmoy.aiemd
  Cc: linux-media, dri-devel, amd-gfx, linux-kernel

Hi Qu,

Am 06.04.21 um 08:04 schrieb Qu Huang:
> Hi Christian,
>
> On 2021/4/3 16:49, Christian König wrote:
>> Hi Qu,
>>
>> Am 03.04.21 um 07:08 schrieb Qu Huang:
>>> Hi Christian,
>>>
>>> On 2021/4/3 0:25, Christian König wrote:
>>>> Hi Qu,
>>>>
>>>> Am 02.04.21 um 05:18 schrieb Qu Huang:
>>>>> Before dma_resv_lock(bo->base.resv, NULL) in
>>>>> amdgpu_bo_release_notify(),
>>>>> the bo->base.resv lock may be held by ttm_mem_evict_first(),
>>>>
>>>> That can't happen since when bo_release_notify is called the BO has 
>>>> not
>>>> more references and is therefore deleted.
>>>>
>>>> And we never evict a deleted BO, we just wait for it to become idle.
>>>>
>>> Yes, the bo reference counter return to zero will enter
>>> ttm_bo_release(),but notify bo release (call 
>>> amdgpu_bo_release_notify())
>>> first happen, and then test if a reservation object's fences have been
>>> signaled, and then mark bo as deleted and remove bo from the LRU list.
>>>
>>> When ttm_bo_release() and ttm_mem_evict_first() is concurrent,
>>> the Bo has not been removed from the LRU list and is not marked as
>>> deleted, this will happen.
>>
>> Not sure on which code base you are, but I don't see how this can 
>> happen.
>>
>> ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and
>> ttm_bo_release() is only called when the BO reference count becomes 
>> zero.
>>
>> So ttm_mem_evict_first() will see that this BO is about to be destroyed
>> and skips it.
>>
>
> Yes, you are right. My TTM version is from ROCm 3.3, where
> ttm_mem_evict_first() did not yet call ttm_bo_get_unless_zero(); I checked
> that the ROCm 4.0 TTM does not have this issue. This was an oversight on my part.
>
>>>
>>> As a test, when we use CPU memset instead of SDMA fill in
>>> amdgpu_bo_release_notify(), the result is page fault:
>>>
>>> PID: 5490   TASK: ffff8e8136e04100  CPU: 4   COMMAND: "gemmPerf"
>>>   #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784
>>>   #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92
>>>   #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80
>>>   #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768
>>>   #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6
>>>   #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d
>>>   #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae
>>>   #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0
>>>   #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925
>>>   #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758
>>>      [exception RIP: memset+31]
>>>      RIP: ffffffffb2b8668f  RSP: ffff8e79eaa17ce8  RFLAGS: 00010a17
>>>      RAX: bebebebebebebebe  RBX: ffff8e747bff10c0  RCX: 
>>> 0000060b00200000
>>>      RDX: 0000000000000000  RSI: 00000000000000be  RDI: 
>>> ffffab807f000000
>>>      RBP: ffff8e79eaa17d10   R8: ffff8e79eaa14000   R9: 
>>> ffffab7c80000000
>>>      R10: 000000000000bcba  R11: 00000000000001ba  R12: 
>>> ffff8e79ebaa4050
>>>      R13: ffffab7c80000000  R14: 0000000000022600  R15: 
>>> ffff8e8136e04100
>>>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>> #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1
>>> [amdgpu]
>>> #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm]
>>> #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm]
>>> #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm]
>>> #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115
>>> #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64
>>> #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7
>>> #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95
>>> #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf
>>> #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544
>>> #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb
>>
>> Well that might be perfectly expected. VRAM is not necessarily CPU
>> accessible.
>>
> As a test, I used a CPU memset instead of the SDMA fill. This is my code:
> void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
> {
>     struct amdgpu_bo *abo;
>     uint64_t num_pages;
>     struct drm_mm_node *mm_node;
>     struct amdgpu_device *adev;
>     void __iomem *kaddr;
>
>     if (!amdgpu_bo_is_amdgpu_bo(bo))
>         return;
>
>     abo = ttm_to_amdgpu_bo(bo);
>     num_pages = abo->tbo.num_pages;
>     mm_node = abo->tbo.mem.mm_node;
>     adev = amdgpu_ttm_adev(abo->tbo.bdev);
>     kaddr = adev->mman.aper_base_kaddr;
>
>     if (abo->kfd_bo)
>         amdgpu_amdkfd_unreserve_memory_limit(abo);
>
>     if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
>         !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>         return;
>
>     dma_resv_lock(amdkcl_ttm_resvp(bo), NULL);
>     while (num_pages && mm_node) {
>         void *ptr = kaddr + (mm_node->start << PAGE_SHIFT);

That might not work as expected.

aper_base_kaddr can only point to a 256MiB window into VRAM, but VRAM 
itself is usually much larger.

So your memset_io() might end up in nirvana if the BO is allocated 
outside of the window.
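
If you want to keep a CPU fallback for testing, it would at least need to
be limited to the CPU-visible part of VRAM. A rough, untested sketch
(assuming adev->gmc.visible_vram_size describes the window that
aper_base_kaddr maps):

	while (num_pages && mm_node) {
		uint64_t offset = (uint64_t)mm_node->start << PAGE_SHIFT;
		uint64_t size = (uint64_t)mm_node->size << PAGE_SHIFT;

		/* skip anything aper_base_kaddr cannot reach */
		if (offset + size <= adev->gmc.visible_vram_size)
			memset_io(kaddr + offset, AMDGPU_POISON & 0xff, size);

		num_pages -= mm_node->size;
		++mm_node;
	}

Even then it only covers what the aperture actually maps, so it stays a
debugging aid at best.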

> memset_io(ptr, AMDGPU_POISON & 0xff, mm_node->size << PAGE_SHIFT);
>         num_pages -= mm_node->size;
>         ++mm_node;
>     }
>     dma_resv_unlock(amdkcl_ttm_resvp(bo));
> }
>
>
>
>
>
> I was using the old version by oversight, so I am sorry for the trouble.

No problem. I was just wondering if I was missing something.

Regards,
Christian.

>
>
> Regards,
> Qu.
>
>> Regards,
>> Christian.
>>
>>>
>>> Regards,
>>> Qu.
>>>
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> and the VRAM mem will be evicted, mem region was replaced
>>>>> by Gtt mem region. amdgpu_bo_release_notify() will then
>>>>> hold the bo->base.resv lock, and SDMA will get an invalid
>>>>> address in amdgpu_fill_buffer(), resulting in a VMFAULT
>>>>> or memory corruption.
>>>>>
>>>>> To avoid it, we have to hold bo->base.resv lock first, and
>>>>> check whether the mem.mem_type is TTM_PL_VRAM.
>>>>>
>>>>> Signed-off-by: Qu Huang <jinsdb@126.com>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
>>>>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> index 4b29b82..8018574 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>>>>> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct
>>>>> ttm_buffer_object *bo)
>>>>>       if (bo->base.resv == &bo->base._resv)
>>>>>           amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);
>>>>>
>>>>> -    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
>>>>> -        !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>>>>> +    if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>>>>>           return;
>>>>>
>>>>>       dma_resv_lock(bo->base.resv, NULL);
>>>>>
>>>>> +    if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
>>>>> +        dma_resv_unlock(bo->base.resv);
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>>       r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv,
>>>>> &fence);
>>>>>       if (!WARN_ON(r)) {
>>>>>           amdgpu_bo_fence(abo, fence, false);
>>>>> -- 
>>>>> 1.8.3.1
>>>>>
>>>
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread

Thread overview: 20+ messages
2021-04-02  3:18 [PATCH] drm/amdgpu: Fix a potential sdma invalid access Qu Huang
2021-04-02  3:18 ` Qu Huang
2021-04-02  3:18 ` Qu Huang
2021-04-02 16:25 ` Christian König
2021-04-02 16:25   ` Christian König
2021-04-02 16:25   ` Christian König
2021-04-03  3:08   ` Qu Huang
2021-04-03  3:08     ` Qu Huang
2021-04-03  5:08   ` Qu Huang
2021-04-03  5:08     ` Qu Huang
2021-04-03  5:08     ` Qu Huang
2021-04-03  8:49     ` Christian König
2021-04-03  8:49       ` Christian König
2021-04-03  8:49       ` Christian König
2021-04-06  6:04       ` Qu Huang
2021-04-06  6:04         ` Qu Huang
2021-04-06  6:04         ` Qu Huang
2021-04-06 13:44         ` Christian König
2021-04-06 13:44           ` Christian König
2021-04-06 13:44           ` Christian König
