[PATCH] drm/amdgpu: fix NULL pointer dereference when run App with DRI

* [PATCH] drm/amdgpu: fix NULL pointer dereference when run App with DRI_PRIME=1
@ 2018-05-25  5:41 Junwei Zhang
       [not found] ` <1527226896-29270-1-git-send-email-Jerry.Zhang-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Junwei Zhang @ 2018-05-25  5:41 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Junwei Zhang

[  632.679861] BUG: unable to handle kernel NULL pointer dereference at (null)
[  632.679892] IP: drm_prime_sg_to_page_addr_arrays+0x52/0xb0 [drm]
<snip>
[  632.680011] Call Trace:
[  632.680082]  amdgpu_ttm_tt_populate+0x3e/0xa0 [amdgpu]
[  632.680092]  ttm_tt_populate.part.7+0x22/0x60 [amdttm]
[  632.680098]  amdttm_tt_bind+0x52/0x60 [amdttm]
[  632.680106]  ttm_bo_handle_move_mem+0x54b/0x5c0 [amdttm]
[  632.680112]  ? find_next_bit+0xb/0x10
[  632.680119]  amdttm_bo_validate+0x11d/0x130 [amdttm]
[  632.680176]  amdgpu_cs_bo_validate+0x9d/0x150 [amdgpu]
[  632.680232]  amdgpu_cs_validate+0x41/0x270 [amdgpu]
[  632.680288]  amdgpu_cs_list_validate+0xc7/0x1a0 [amdgpu]
[  632.680343]  amdgpu_cs_ioctl+0x1634/0x1c00 [amdgpu]
[  632.680401]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[  632.680416]  drm_ioctl_kernel+0x6b/0xb0 [drm]
[  632.680431]  drm_ioctl+0x3e4/0x450 [drm]
[  632.680485]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[  632.680537]  amdgpu_drm_ioctl+0x4c/0x80 [amdgpu]
[  632.680542]  do_vfs_ioctl+0xa4/0x600
[  632.680546]  ? SyS_futex+0x7f/0x180
[  632.680549]  SyS_ioctl+0x79/0x90
[  632.680554]  entry_SYSCALL_64_fastpath+0x24/0xab

Signed-off-by: Junwei Zhang <Jerry.Zhang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 57d4da6..b293809 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1212,7 +1212,7 @@ static struct ttm_tt *amdgpu_ttm_tt_create(struct ttm_buffer_object *bo,
 	gtt->ttm.ttm.func = &amdgpu_backend_func;
 
 	/* allocate space for the uninitialized page entries */
-	if (ttm_sg_tt_init(&gtt->ttm, bo, page_flags)) {
+	if (ttm_dma_tt_init(&gtt->ttm, bo, page_flags)) {
 		kfree(gtt);
 		return NULL;
 	}
-- 
1.9.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference when run App with DRI_PRIME=1
       [not found] ` <1527226896-29270-1-git-send-email-Jerry.Zhang-5C7GfCeVMHo@public.gmane.org>
@ 2018-05-25  6:44   ` Christian König
       [not found]     ` <f9495673-cb9a-cd75-a569-c4eb5b2e0c63-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2018-05-25  6:44 UTC (permalink / raw)
  To: Junwei Zhang, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

NAK, that probably just fixes the symptom, not the underlying problem.

Somebody is accessing the page array when it should never be accessed.

How did you manage to trigger this?

Regards,
Christian.



* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference when run App with DRI_PRIME=1
       [not found]     ` <f9495673-cb9a-cd75-a569-c4eb5b2e0c63-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-05-25  7:20       ` Zhang, Jerry (Junwei)
       [not found]         ` <5B07B933.4060701-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Zhang, Jerry (Junwei) @ 2018-05-25  7:20 UTC (permalink / raw)
  To: christian.koenig-5C7GfCeVMHo, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 05/25/2018 02:44 PM, Christian König wrote:
> NAK, that probably just fixed the symptom but not the underlying problem.
>
> Somebody is accessing the page array when it should never be accessed.

If prime imports were created as GTT BOs by default (currently they are CPU
BOs), this would happen right away at GTT sg BO creation rather than at the
next CS validation.

Since ttm_sg_tt_init() only allocates gtt->ttm.dma_address when an sg BO is
created, the access to ttm->pages fails during TTM populate.

The current error happens in TTM populate, called from CS validation; the sg
BO is imported from the exporter.
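
To make the allocation difference concrete, here is a minimal user-space sketch. It is not kernel code: struct model_tt and the model_* functions are hypothetical stand-ins, written on the assumption that the sg variant allocates only the DMA address array while the dma variant allocates both arrays. The populate model returns an error where the kernel path in the trace above dereferenced a NULL page array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the TTM page-table object. */
struct model_tt {
	size_t num_pages;
	void **pages;          /* per-page pointers, filled when populating */
	uint64_t *dma_address; /* per-page DMA addresses */
};

/* Models the sg-style init: only the DMA address array is allocated. */
static int model_sg_tt_init(struct model_tt *tt, size_t num_pages)
{
	tt->num_pages = num_pages;
	tt->pages = NULL;
	tt->dma_address = calloc(num_pages, sizeof(*tt->dma_address));
	return tt->dma_address ? 0 : -1;
}

/* Models the dma-style init: both arrays are allocated. */
static int model_dma_tt_init(struct model_tt *tt, size_t num_pages)
{
	tt->num_pages = num_pages;
	tt->pages = calloc(num_pages, sizeof(*tt->pages));
	tt->dma_address = calloc(num_pages, sizeof(*tt->dma_address));
	if (!tt->pages || !tt->dma_address) {
		free(tt->pages);
		free(tt->dma_address);
		tt->pages = NULL;
		tt->dma_address = NULL;
		return -1;
	}
	return 0;
}

/* Releases whatever the init functions allocated. */
static void model_tt_fini(struct model_tt *tt)
{
	free(tt->pages);
	free(tt->dma_address);
	tt->pages = NULL;
	tt->dma_address = NULL;
}

/*
 * Models a populate step that writes the page array. It reports an error
 * when the array is missing instead of dereferencing NULL.
 */
static int model_populate_pages(struct model_tt *tt)
{
	size_t i;

	if (!tt->pages)
		return -1;
	for (i = 0; i < tt->num_pages; i++)
		tt->pages[i] = tt; /* placeholder value, not a real page */
	return 0;
}
```

With a non-zero page count, model_populate_pages() fails after model_sg_tt_init() and succeeds after model_dma_tt_init(), which is why swapping the init function hides the crash without explaining why populate touched the page array at all.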

>
> How did you manage to trigger this?

DRI_PRIME=1 with Unigine Heaven.

Regards,
Jerry


* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference when run App with DRI_PRIME=1
       [not found]         ` <5B07B933.4060701-5C7GfCeVMHo@public.gmane.org>
@ 2018-05-25  7:54           ` Christian König
       [not found]             ` <2a876b66-abaa-5ca0-5975-2b458ab9dba5-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2018-05-25  7:54 UTC (permalink / raw)
  To: Zhang, Jerry (Junwei), amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 25.05.2018 at 09:20, Zhang, Jerry (Junwei) wrote:
> On 05/25/2018 02:44 PM, Christian König wrote:
>> NAK, that probably just fixed the symptom but not the underlying 
>> problem.
>>
>> Somebody is accessing the page array when it should never be accessed.
>
> If prime import as GTT bo by default(now it's CPU bo), it would 
> happens quickly when GTT sg bo creation rather than next cs validation.
>
> Since ttm_sg_tt_init() only allocates gtt->ttm.dma_address if sg bo is 
> created, it would fail to access ttm->pages when ttm populate.

And that's exactly the problem: an imported BO should never populate.
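
As an illustration of that invariant only, the following user-space sketch shows a conversion helper that fills the DMA address array and treats the page array as optional, so an imported buffer never needs one. It is an assumption-laden model, not the DRM helper and not the actual fix: struct sketch_sg_entry and sketch_sg_to_arrays() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one scatter/gather entry: one page, one DMA address. */
struct sketch_sg_entry {
	void *page;
	uint64_t dma_addr;
};

/*
 * Copies up to max_entries entries into the output arrays.
 * Either output array may be NULL, in which case it is skipped.
 * Returns the number of entries processed.
 */
static size_t sketch_sg_to_arrays(const struct sketch_sg_entry *sg, size_t count,
				  void **pages, uint64_t *addrs,
				  size_t max_entries)
{
	size_t n = count < max_entries ? count : max_entries;
	size_t i;

	for (i = 0; i < n; i++) {
		if (pages)
			pages[i] = sg[i].page;
		if (addrs)
			addrs[i] = sg[i].dma_addr;
	}
	return n;
}
```

Called with pages == NULL, the helper still fills addrs, which is the situation an sg-initialized object without a page array would be in.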

>
> current error happens in ttm populate from cs validation, the sg bo is 
> imported from exporter.
>
>>
>> How did you manage to trigger this?
>
> PRI_PRIME=1 with Unigine heaven.

I'm going to give that a try, but the last time I checked, it worked as expected.

Thanks,
Christian.



* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference when run App with DRI_PRIME=1
       [not found]             ` <2a876b66-abaa-5ca0-5975-2b458ab9dba5-5C7GfCeVMHo@public.gmane.org>
@ 2018-05-25  8:23               ` Zhang, Jerry (Junwei)
       [not found]                 ` <5B07C807.4050806-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Zhang, Jerry (Junwei) @ 2018-05-25  8:23 UTC (permalink / raw)
  To: Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 05/25/2018 03:54 PM, Christian König wrote:
> Am 25.05.2018 um 09:20 schrieb Zhang, Jerry (Junwei):
>> On 05/25/2018 02:44 PM, Christian König wrote:
>>> NAK, that probably just fixed the symptom but not the underlying problem.
>>>
>>> Somebody is accessing the page array when it should never be accessed.
>>
>> If prime import as GTT bo by default(now it's CPU bo), it would happens
>> quickly when GTT sg bo creation rather than next cs validation.
>>
>> Since ttm_sg_tt_init() only allocates gtt->ttm.dma_address if sg bo is
>> created, it would fail to access ttm->pages when ttm populate.
>
> And exactly that's the problem, and imported BO should never populate.
>
>>
>> current error happens in ttm populate from cs validation, the sg bo is
>> imported from exporter.
>>
>>>
>>> How did you manage to trigger this?
>>
>> PRI_PRIME=1 with Unigine heaven.
>
> Going to give that a try, but the last time I check that worked as expected.

FYI: DRI_PRIME=1 glxinfo does not trigger it, but the game does.

Jerry


* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference when run App with DRI_PRIME=1
       [not found]                 ` <5B07C807.4050806-5C7GfCeVMHo@public.gmane.org>
@ 2018-05-25  9:35                   ` Christian König
       [not found]                     ` <98ef8cf3-31bf-229b-fee2-88f426e3a91f-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2018-05-25  9:35 UTC (permalink / raw)
  To: Zhang, Jerry (Junwei), amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 25.05.2018 at 10:23, Zhang, Jerry (Junwei) wrote:
> On 05/25/2018 03:54 PM, Christian König wrote:
>> Am 25.05.2018 um 09:20 schrieb Zhang, Jerry (Junwei):
>>> On 05/25/2018 02:44 PM, Christian König wrote:
>>>> NAK, that probably just fixed the symptom but not the underlying 
>>>> problem.
>>>>
>>>> Somebody is accessing the page array when it should never be accessed.
>>>
>>> If prime import as GTT bo by default(now it's CPU bo), it would happens
>>> quickly when GTT sg bo creation rather than next cs validation.
>>>
>>> Since ttm_sg_tt_init() only allocates gtt->ttm.dma_address if sg bo is
>>> created, it would fail to access ttm->pages when ttm populate.
>>
>> And exactly that's the problem, and imported BO should never populate.
>>
>>>
>>> current error happens in ttm populate from cs validation, the sg bo is
>>> imported from exporter.
>>>
>>>>
>>>> How did you manage to trigger this?
>>>
>>> PRI_PRIME=1 with Unigine heaven.
>>
>> Going to give that a try, but the last time I check that worked as 
>> expected.
>
> FYI.
> PRI_PRIME=1 glxinfo will not trigger that, but the game does.

Just tested and it works perfectly fine.

Is that on the closed stack or the open stack?

Christian.



* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference when run App with DRI_PRIME=1
       [not found]                     ` <98ef8cf3-31bf-229b-fee2-88f426e3a91f-5C7GfCeVMHo@public.gmane.org>
@ 2018-05-25  9:51                       ` Zhang, Jerry (Junwei)
       [not found]                         ` <5B07DCAA.509-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Zhang, Jerry (Junwei) @ 2018-05-25  9:51 UTC (permalink / raw)
  To: Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 05/25/2018 05:35 PM, Christian König wrote:
> Am 25.05.2018 um 10:23 schrieb Zhang, Jerry (Junwei):
>> On 05/25/2018 03:54 PM, Christian König wrote:
>>> Am 25.05.2018 um 09:20 schrieb Zhang, Jerry (Junwei):
>>>> On 05/25/2018 02:44 PM, Christian König wrote:
>>>>> NAK, that probably just fixed the symptom but not the underlying problem.
>>>>>
>>>>> Somebody is accessing the page array when it should never be accessed.
>>>>
>>>> If prime import as GTT bo by default(now it's CPU bo), it would happens
>>>> quickly when GTT sg bo creation rather than next cs validation.
>>>>
>>>> Since ttm_sg_tt_init() only allocates gtt->ttm.dma_address if sg bo is
>>>> created, it would fail to access ttm->pages when ttm populate.
>>>
>>> And exactly that's the problem, and imported BO should never populate.
>>>
>>>>
>>>> current error happens in ttm populate from cs validation, the sg bo is
>>>> imported from exporter.
>>>>
>>>>>
>>>>> How did you manage to trigger this?
>>>>
>>>> PRI_PRIME=1 with Unigine heaven.
>>>
>>> Going to give that a try, but the last time I check that worked as expected.
>>
>> FYI.
>> PRI_PRIME=1 glxinfo will not trigger that, but the game does.
>
> Just tested and it works perfectly fine.
>
> Is that on the closed stack or the open stack?

I used the unified driver (latest 18.20 build) + drm-next kernel, installed as
the all-open stack on an A+A platform.
(The issue was found with the 18.20 build, all-open stack (DKMS driver).)

BTW, how did you get the UMD? Via apt-get, or did you build it yourself?


Jerry


* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference when run App with DRI_PRIME=1
       [not found]                         ` <5B07DCAA.509-5C7GfCeVMHo@public.gmane.org>
@ 2018-05-25 11:23                           ` Christian König
       [not found]                             ` <31572675-9d62-2acf-5cbf-7a389ac87554-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Christian König @ 2018-05-25 11:23 UTC (permalink / raw)
  To: Zhang, Jerry (Junwei),
	Christian König, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 25.05.2018 at 11:51, Zhang, Jerry (Junwei) wrote:
> On 05/25/2018 05:35 PM, Christian König wrote:
>> Am 25.05.2018 um 10:23 schrieb Zhang, Jerry (Junwei):
>>> On 05/25/2018 03:54 PM, Christian König wrote:
>>>> Am 25.05.2018 um 09:20 schrieb Zhang, Jerry (Junwei):
>>>>> On 05/25/2018 02:44 PM, Christian König wrote:
>>>>>> NAK, that probably just fixed the symptom but not the underlying 
>>>>>> problem.
>>>>>>
>>>>>> Somebody is accessing the page array when it should never be 
>>>>>> accessed.
>>>>>
>>>>> If prime import as GTT bo by default(now it's CPU bo), it would 
>>>>> happens
>>>>> quickly when GTT sg bo creation rather than next cs validation.
>>>>>
>>>>> Since ttm_sg_tt_init() only allocates gtt->ttm.dma_address if sg 
>>>>> bo is
>>>>> created, it would fail to access ttm->pages when ttm populate.
>>>>
>>>> And exactly that's the problem, and imported BO should never populate.
>>>>
>>>>>
>>>>> current error happens in ttm populate from cs validation, the sg 
>>>>> bo is
>>>>> imported from exporter.
>>>>>
>>>>>>
>>>>>> How did you manage to trigger this?
>>>>>
>>>>> PRI_PRIME=1 with Unigine heaven.
>>>>
>>>> Going to give that a try, but the last time I check that worked as 
>>>> expected.
>>>
>>> FYI.
>>> PRI_PRIME=1 glxinfo will not trigger that, but the game does.
>>
>> Just tested and it works perfectly fine.
>>
>> Is that on the closed stack or the open stack?
>
> I used unified driver(latest 18.20 build) + drm-next kernel, installed 
> as all open stack on A+A platform.
> (issue was found by 18.20 build, all open stack(dkms driver))
>
> BTW, How did you get the UMD? apt-get or build by yourself?

That's a self-built Mesa + libdrm.

Do you have the apt URL and/or the package versions you used for the test
at hand?

Christian.

>
>
> Jerry


* Re: [PATCH] drm/amdgpu: fix NULL pointer dereference when run App with DRI_PRIME=1
       [not found]                             ` <31572675-9d62-2acf-5cbf-7a389ac87554-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-05-28  7:23                               ` Zhang, Jerry (Junwei)
  0 siblings, 0 replies; 9+ messages in thread
From: Zhang, Jerry (Junwei) @ 2018-05-28  7:23 UTC (permalink / raw)
  To: christian.koenig-5C7GfCeVMHo, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 05/25/2018 07:23 PM, Christian König wrote:
> Am 25.05.2018 um 11:51 schrieb Zhang, Jerry (Junwei):
>> On 05/25/2018 05:35 PM, Christian König wrote:
>>> Am 25.05.2018 um 10:23 schrieb Zhang, Jerry (Junwei):
>>>> On 05/25/2018 03:54 PM, Christian König wrote:
>>>>> Am 25.05.2018 um 09:20 schrieb Zhang, Jerry (Junwei):
>>>>>> On 05/25/2018 02:44 PM, Christian König wrote:
>>>>>>> NAK, that probably just fixed the symptom but not the underlying problem.
>>>>>>>
>>>>>>> Somebody is accessing the page array when it should never be accessed.
>>>>>>
>>>>>> If prime import as GTT bo by default(now it's CPU bo), it would happens
>>>>>> quickly when GTT sg bo creation rather than next cs validation.
>>>>>>
>>>>>> Since ttm_sg_tt_init() only allocates gtt->ttm.dma_address if sg bo is
>>>>>> created, it would fail to access ttm->pages when ttm populate.
>>>>>
>>>>> And exactly that's the problem, and imported BO should never populate.
>>>>>
>>>>>>
>>>>>> current error happens in ttm populate from cs validation, the sg bo is
>>>>>> imported from exporter.
>>>>>>
>>>>>>>
>>>>>>> How did you manage to trigger this?
>>>>>>
>>>>>> PRI_PRIME=1 with Unigine heaven.
>>>>>
>>>>> Going to give that a try, but the last time I check that worked as expected.
>>>>
>>>> FYI.
>>>> PRI_PRIME=1 glxinfo will not trigger that, but the game does.
>>>
>>> Just tested and it works perfectly fine.
>>>
>>> Is that on the closed stack or the open stack?
>>
>> I used unified driver(latest 18.20 build) + drm-next kernel, installed as all
>> open stack on A+A platform.
>> (issue was found by 18.20 build, all open stack(dkms driver))
>>
>> BTW, How did you get the UMD? apt-get or build by yourself?
>
> That's self build Mesa+libdrm.
>
> Do you have the apt url and/or package versions at hand you used for the test?

I found that the Ubuntu 4.13/4.15 kernels do not have the patch below:
   * https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&id=186ca446aea19e49d2e1433dd170c6e1c211a52a

So we could fix that in the DKMS support rather than upstream.

I double-checked, and the drm-next kernel has no such issue.
(I'm not sure what went wrong last week; I did get the latest code and build
the kernel, and it failed. Sorry for the inconvenience.)

Thanks for taking the time to check it.

Jerry

>
> Christian.
>
>>
>>
>> Jerry
>
