amd-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
@ 2020-02-19  4:04 Dennis Li
  2020-02-19  6:26 ` Zhang, Hawking
  0 siblings, 1 reply; 12+ messages in thread
From: Dennis Li @ 2020-02-19  4:04 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, Tao.Zhou1, Hawking.Zhang, Guchun.Chen
  Cc: Dennis Li

check whether the queue of entity is null to avoid null
pointer dereference.

Change-Id: I08d56774012cf229ba2fe7a011c1359e8d1e2781
Signed-off-by: Dennis Li <Dennis.Li@amd.com>

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 4cc7881f438c..67cca463ddcc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -95,6 +95,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	int r;
 
 	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
+	if (!entity->rq)
+		return 0;
+
 	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
 
 	WARN_ON(ib->length_dw == 0);
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* RE: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
  2020-02-19  4:04 [PATCH] drm/amdgpu: fix a bug NULL pointer dereference Dennis Li
@ 2020-02-19  6:26 ` Zhang, Hawking
  2020-02-19 10:49   ` Christian König
  0 siblings, 1 reply; 12+ messages in thread
From: Zhang, Hawking @ 2020-02-19  6:26 UTC (permalink / raw)
  To: Li, Dennis, amd-gfx, Deucher, Alexander, Zhou1, Tao, Chen, Guchun
  Cc: Li, Dennis

[AMD Official Use Only - Internal Distribution Only]

Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

Regards,
Hawking
-----Original Message-----
From: Dennis Li <Dennis.Li@amd.com> 
Sent: Wednesday, February 19, 2020 12:05
To: amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
Cc: Li, Dennis <Dennis.Li@amd.com>
Subject: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

check whether the queue of entity is null to avoid null pointer dereference.

Change-Id: I08d56774012cf229ba2fe7a011c1359e8d1e2781
Signed-off-by: Dennis Li <Dennis.Li@amd.com>

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index 4cc7881f438c..67cca463ddcc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -95,6 +95,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	int r;
 
 	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
+	if (!entity->rq)
+		return 0;
+
 	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
 
 	WARN_ON(ib->length_dw == 0);
--
2.17.1
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
  2020-02-19  6:26 ` Zhang, Hawking
@ 2020-02-19 10:49   ` Christian König
  2020-02-19 11:29     ` 回复: " Liu, Monk
  0 siblings, 1 reply; 12+ messages in thread
From: Christian König @ 2020-02-19 10:49 UTC (permalink / raw)
  To: Zhang, Hawking, Li, Dennis, amd-gfx, Deucher, Alexander, Zhou1,
	Tao, Chen, Guchun

Well of hand this patch looks like a clear NAK to me.

Returning without raising an error is certainly the wrong thing to do 
here because we just drop the necessary page table updates.

How does the entity->rq ends up as NULL in the first place?

Regards,
Christian.

Am 19.02.20 um 07:26 schrieb Zhang, Hawking:
> [AMD Official Use Only - Internal Distribution Only]
>
> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
>
> Regards,
> Hawking
> -----Original Message-----
> From: Dennis Li <Dennis.Li@amd.com>
> Sent: Wednesday, February 19, 2020 12:05
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
> Cc: Li, Dennis <Dennis.Li@amd.com>
> Subject: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
>
> check whether the queue of entity is null to avoid null pointer dereference.
>
> Change-Id: I08d56774012cf229ba2fe7a011c1359e8d1e2781
> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 4cc7881f438c..67cca463ddcc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>   	int r;
>   
>   	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> +	if (!entity->rq)
> +		return 0;
> +
>   	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>   
>   	WARN_ON(ib->length_dw == 0);
> --
> 2.17.1
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
  2020-02-19 10:49   ` Christian König
@ 2020-02-19 11:29     ` Liu, Monk
  2020-02-20  8:41       ` Li, Dennis
  0 siblings, 1 reply; 12+ messages in thread
From: Liu, Monk @ 2020-02-19 11:29 UTC (permalink / raw)
  To: Koenig, Christian, Zhang, Hawking, Li, Dennis, amd-gfx, Deucher,
	Alexander, Zhou1, Tao, Chen, Guchun

> +	if (!entity->rq)
> +		return 0;
> +

Yes, supposedly we shouldn't get 'entity->rq == NULL' case , that looks the true bug 

-----邮件原件-----
发件人: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> 代表 Christian K?nig
发送时间: 2020年2月19日 18:50
收件人: Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
主题: Re: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

Well of hand this patch looks like a clear NAK to me.

Returning without raising an error is certainly the wrong thing to do here because we just drop the necessary page table updates.

How does the entity->rq ends up as NULL in the first place?

Regards,
Christian.

Am 19.02.20 um 07:26 schrieb Zhang, Hawking:
> [AMD Official Use Only - Internal Distribution Only]
>
> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
>
> Regards,
> Hawking
> -----Original Message-----
> From: Dennis Li <Dennis.Li@amd.com>
> Sent: Wednesday, February 19, 2020 12:05
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Zhang, 
> Hawking <Hawking.Zhang@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
> Cc: Li, Dennis <Dennis.Li@amd.com>
> Subject: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
>
> check whether the queue of entity is null to avoid null pointer dereference.
>
> Change-Id: I08d56774012cf229ba2fe7a011c1359e8d1e2781
> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 4cc7881f438c..67cca463ddcc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>   	int r;
>   
>   	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> +	if (!entity->rq)
> +		return 0;
> +
>   	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>   
>   	WARN_ON(ib->length_dw == 0);
> --
> 2.17.1
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmo
> nk.liu%40amd.com%7C28e7260af3a24eec758f08d7b52975e3%7C3dd8961fe4884e60
> 8e11a82d994e183d%7C0%7C0%7C637177062003213431&amp;sdata=vMXmhwTlN8lAav
> uqhYhpmKLM6V%2F%2B2%2FubFBbsk%2BGY%2Bjw%3D&amp;reserved=0

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7C28e7260af3a24eec758f08d7b52975e3%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637177062003213431&amp;sdata=vMXmhwTlN8lAavuqhYhpmKLM6V%2F%2B2%2FubFBbsk%2BGY%2Bjw%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
  2020-02-19 11:29     ` 回复: " Liu, Monk
@ 2020-02-20  8:41       ` Li, Dennis
  2020-02-20 10:39         ` Nirmoy
  2020-02-20 13:28         ` 回复: " Liu, Monk
  0 siblings, 2 replies; 12+ messages in thread
From: Li, Dennis @ 2020-02-20  8:41 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, Zhang, Hawking, amd-gfx, Deucher,
	Alexander, Zhou1, Tao, Chen, Guchun

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian and Monk,
      When doing SDMA copy, a RAS uncorrectable error happens, which will cause this issue.  The RAS uncorrectable error event will trigger driver to do BACO reset which will set the status of SDMA scheduler to no ready. And then drm_sched_entity_get_free_sched will return NULL in drm_sched_entity_select_rq, which cause entity->rq to NULL. 

Best Regards
Dennis Li
-----Original Message-----
From: Liu, Monk <Monk.Liu@amd.com> 
Sent: Wednesday, February 19, 2020 7:30 PM
To: Koenig, Christian <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
Subject: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

> +	if (!entity->rq)
> +		return 0;
> +

Yes, supposedly we shouldn't get 'entity->rq == NULL' case , that looks the true bug 

-----邮件原件-----
发件人: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> 代表 Christian K?nig
发送时间: 2020年2月19日 18:50
收件人: Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
主题: Re: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

Well of hand this patch looks like a clear NAK to me.

Returning without raising an error is certainly the wrong thing to do here because we just drop the necessary page table updates.

How does the entity->rq ends up as NULL in the first place?

Regards,
Christian.

Am 19.02.20 um 07:26 schrieb Zhang, Hawking:
> [AMD Official Use Only - Internal Distribution Only]
>
> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
>
> Regards,
> Hawking
> -----Original Message-----
> From: Dennis Li <Dennis.Li@amd.com>
> Sent: Wednesday, February 19, 2020 12:05
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Zhang, 
> Hawking <Hawking.Zhang@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
> Cc: Li, Dennis <Dennis.Li@amd.com>
> Subject: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
>
> check whether the queue of entity is null to avoid null pointer dereference.
>
> Change-Id: I08d56774012cf229ba2fe7a011c1359e8d1e2781
> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 4cc7881f438c..67cca463ddcc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>   	int r;
>   
>   	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> +	if (!entity->rq)
> +		return 0;
> +
>   	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>   
>   	WARN_ON(ib->length_dw == 0);
> --
> 2.17.1
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmo
> nk.liu%40amd.com%7C28e7260af3a24eec758f08d7b52975e3%7C3dd8961fe4884e60
> 8e11a82d994e183d%7C0%7C0%7C637177062003213431&amp;sdata=vMXmhwTlN8lAav
> uqhYhpmKLM6V%2F%2B2%2FubFBbsk%2BGY%2Bjw%3D&amp;reserved=0

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7C28e7260af3a24eec758f08d7b52975e3%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637177062003213431&amp;sdata=vMXmhwTlN8lAavuqhYhpmKLM6V%2F%2B2%2FubFBbsk%2BGY%2Bjw%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
  2020-02-20  8:41       ` Li, Dennis
@ 2020-02-20 10:39         ` Nirmoy
  2020-02-20 13:28         ` 回复: " Liu, Monk
  1 sibling, 0 replies; 12+ messages in thread
From: Nirmoy @ 2020-02-20 10:39 UTC (permalink / raw)
  To: amd-gfx


On 2/20/20 9:41 AM, Li, Dennis wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Christian and Monk,
>        When doing SDMA copy, a RAS uncorrectable error happens, which will cause this issue.  The RAS uncorrectable error event will trigger driver to do BACO reset which will set the status of SDMA scheduler to no ready. And then drm_sched_entity_get_free_sched will return NULL in drm_sched_entity_select_rq, which cause entity->rq to NULL.

This can happen only in drm_sched_job_init() which gets called in 
amdgpu_job_submit() which is after when we get this NULL ptr exception.

         entity = p->direct ? &p->vm->direct : &p->vm->delayed;
         ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);

         WARN_ON(ib->length_dw == 0);
         amdgpu_ring_pad_ib(ring, ib);
         WARN_ON(ib->length_dw > p->num_dw_left);
         r = amdgpu_job_submit(p->job, entity, AMDGPU_FENCE_OWNER_VM, 
&f);  <-- Here
         if (r)
                 goto error;

Could it be possible that we are doing  drm_sched_entity_init() for 
direct and indirect entity with zero scheds in amdgpu_vm_init() ?


Regards,

Nirmoy

> Best Regards
> Dennis Li
> -----Original Message-----
> From: Liu, Monk <Monk.Liu@amd.com>
> Sent: Wednesday, February 19, 2020 7:30 PM
> To: Koenig, Christian <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
> Subject: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
>
>> +	if (!entity->rq)
>> +		return 0;
>> +
> Yes, supposedly we shouldn't get 'entity->rq == NULL' case , that looks the true bug
>
> -----邮件原件-----
> 发件人: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> 代表 Christian K?nig
> 发送时间: 2020年2月19日 18:50
> 收件人: Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
> 主题: Re: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
>
> Well of hand this patch looks like a clear NAK to me.
>
> Returning without raising an error is certainly the wrong thing to do here because we just drop the necessary page table updates.
>
> How does the entity->rq ends up as NULL in the first place?
>
> Regards,
> Christian.
>
> Am 19.02.20 um 07:26 schrieb Zhang, Hawking:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
>>
>> Regards,
>> Hawking
>> -----Original Message-----
>> From: Dennis Li <Dennis.Li@amd.com>
>> Sent: Wednesday, February 19, 2020 12:05
>> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander
>> <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Zhang,
>> Hawking <Hawking.Zhang@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
>> Cc: Li, Dennis <Dennis.Li@amd.com>
>> Subject: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
>>
>> check whether the queue of entity is null to avoid null pointer dereference.
>>
>> Change-Id: I08d56774012cf229ba2fe7a011c1359e8d1e2781
>> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> index 4cc7881f438c..67cca463ddcc 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> @@ -95,6 +95,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>>    	int r;
>>    
>>    	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
>> +	if (!entity->rq)
>> +		return 0;
>> +
>>    	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>>    
>>    	WARN_ON(ib->length_dw == 0);
>> --
>> 2.17.1
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
>> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmo
>> nk.liu%40amd.com%7C28e7260af3a24eec758f08d7b52975e3%7C3dd8961fe4884e60
>> 8e11a82d994e183d%7C0%7C0%7C637177062003213431&amp;sdata=vMXmhwTlN8lAav
>> uqhYhpmKLM6V%2F%2B2%2FubFBbsk%2BGY%2Bjw%3D&amp;reserved=0
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cnirmoy.das%40amd.com%7C3af940db61dd474eb3f608d7b5e0a5cb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637177848788408220&amp;sdata=QiKFb5KAI21G4%2B5OiNxf22OIlmSu0078xhFquDVrIXA%3D&amp;reserved=0
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cnirmoy.das%40amd.com%7C3af940db61dd474eb3f608d7b5e0a5cb%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637177848788408220&amp;sdata=QiKFb5KAI21G4%2B5OiNxf22OIlmSu0078xhFquDVrIXA%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
  2020-02-20  8:41       ` Li, Dennis
  2020-02-20 10:39         ` Nirmoy
@ 2020-02-20 13:28         ` Liu, Monk
  2020-02-20 13:35           ` Liu, Monk
  1 sibling, 1 reply; 12+ messages in thread
From: Liu, Monk @ 2020-02-20 13:28 UTC (permalink / raw)
  To: Li, Dennis, Koenig, Christian, Zhang, Hawking, amd-gfx, Deucher,
	Alexander, Zhou1, Tao, Chen, Guchun

so the scenario is:
KMD doing page update -> ECC -> baco reset (mute SDMA scheduler) -> drm_sched_entity_get_free_sched() return NULL -> you get your NULL pointer

And your approach is to introduce a bail out in the amdgpu_vm_sdma_commit() so the original amdgpu vm update jobs will be dropped,

That really can fix your NULL pointer issue, but actually bring another issue: you didn't finish the vm update work !

my initial idea is you do something like:

static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,

	...

	struct amdgpu_device *adev = p->adev;
	while (!mutex_trylock(&adev->lock_reset)) //means GPU is under recovering 	
		msleep(xxx);

	....

	mutex_unlock(&adev->lock_reset); //this put at the end of this routine
}


that do make our implement more complicated, but at least it doesn't drop the necessary vm update, otherwise you will run into other problems ...

@Koenig, Christian do you have another solution ?

Thanks 
	

-----邮件原件-----
发件人: Li, Dennis <Dennis.Li@amd.com> 
发送时间: 2020年2月20日 16:41
收件人: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
主题: RE: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian and Monk,
      When doing SDMA copy, a RAS uncorrectable error happens, which will cause this issue.  The RAS uncorrectable error event will trigger driver to do BACO reset which will set the status of SDMA scheduler to no ready. And then drm_sched_entity_get_free_sched will return NULL in drm_sched_entity_select_rq, which cause entity->rq to NULL. 

Best Regards
Dennis Li
-----Original Message-----
From: Liu, Monk <Monk.Liu@amd.com>
Sent: Wednesday, February 19, 2020 7:30 PM
To: Koenig, Christian <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
Subject: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

> +	if (!entity->rq)
> +		return 0;
> +

Yes, supposedly we shouldn't get 'entity->rq == NULL' case , that looks the true bug 

-----邮件原件-----
发件人: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> 代表 Christian K?nig
发送时间: 2020年2月19日 18:50
收件人: Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
主题: Re: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

Well of hand this patch looks like a clear NAK to me.

Returning without raising an error is certainly the wrong thing to do here because we just drop the necessary page table updates.

How does the entity->rq ends up as NULL in the first place?

Regards,
Christian.

Am 19.02.20 um 07:26 schrieb Zhang, Hawking:
> [AMD Official Use Only - Internal Distribution Only]
>
> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
>
> Regards,
> Hawking
> -----Original Message-----
> From: Dennis Li <Dennis.Li@amd.com>
> Sent: Wednesday, February 19, 2020 12:05
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Zhang, 
> Hawking <Hawking.Zhang@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
> Cc: Li, Dennis <Dennis.Li@amd.com>
> Subject: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
>
> check whether the queue of entity is null to avoid null pointer dereference.
>
> Change-Id: I08d56774012cf229ba2fe7a011c1359e8d1e2781
> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 4cc7881f438c..67cca463ddcc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>   	int r;
>   
>   	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> +	if (!entity->rq)
> +		return 0;
> +
>   	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>   
>   	WARN_ON(ib->length_dw == 0);
> --
> 2.17.1
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmo
> nk.liu%40amd.com%7C28e7260af3a24eec758f08d7b52975e3%7C3dd8961fe4884e60
> 8e11a82d994e183d%7C0%7C0%7C637177062003213431&amp;sdata=vMXmhwTlN8lAav
> uqhYhpmKLM6V%2F%2B2%2FubFBbsk%2BGY%2Bjw%3D&amp;reserved=0

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7C28e7260af3a24eec758f08d7b52975e3%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637177062003213431&amp;sdata=vMXmhwTlN8lAavuqhYhpmKLM6V%2F%2B2%2FubFBbsk%2BGY%2Bjw%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
  2020-02-20 13:28         ` 回复: " Liu, Monk
@ 2020-02-20 13:35           ` Liu, Monk
  2020-02-20 15:15             ` Nirmoy
  0 siblings, 1 reply; 12+ messages in thread
From: Liu, Monk @ 2020-02-20 13:35 UTC (permalink / raw)
  To: Liu, Monk, Li, Dennis, Koenig,  Christian, Zhang, Hawking,
	amd-gfx, Deucher, Alexander, Zhou1, Tao, Chen, Guchun

Sorry, my previous idea still leave RQ null, please check if below method works:

29 static struct drm_sched_rq *
130 drm_sched_entity_get_free_sched(struct drm_sched_entity *entity)
131 {
132     struct drm_sched_rq *rq = NULL;
133     unsigned int min_jobs = UINT_MAX, num_jobs;
134     int i;

135
		While (!mutex_trylock(....))
			Sleep()

136     for (i = 0; i < entity->num_rq_list; ++i) {
137         struct drm_gpu_scheduler *sched = entity->rq_list[i]->sched;
138
139         if (!entity->rq_list[i]->sched->ready) {    //we take the gpu reset mutex lock, so now sched->ready won't be set to "not ready"
140             DRM_WARN("sched%s is not ready, skipping", sched->name);				
141             continue;
142         }
143
144         num_jobs = atomic_read(&sched->num_jobs);
145         if (num_jobs < min_jobs) {
146             min_jobs = num_jobs;
147             rq = entity->rq_list[i];
148         }
149     }

		Mutex_unlock(...)

150
151     return rq;
152 }



-----邮件原件-----
发件人: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> 代表 Liu, Monk
发送时间: 2020年2月20日 21:28
收件人: Li, Dennis <Dennis.Li@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
主题: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

so the scenario is:
KMD doing page update -> ECC -> baco reset (mute SDMA scheduler) -> drm_sched_entity_get_free_sched() return NULL -> you get your NULL pointer

And your approach is to introduce a bail out in the amdgpu_vm_sdma_commit() so the original amdgpu vm update jobs will be dropped,

That really can fix your NULL pointer issue, but actually bring another issue: you didn't finish the vm update work !

my initial idea is you do something like:

static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,

	...

	struct amdgpu_device *adev = p->adev;
	while (!mutex_trylock(&adev->lock_reset)) //means GPU is under recovering 	
		msleep(xxx);

	....

	mutex_unlock(&adev->lock_reset); //this put at the end of this routine }


that do make our implement more complicated, but at least it doesn't drop the necessary vm update, otherwise you will run into other problems ...

@Koenig, Christian do you have another solution ?

Thanks 
	

-----邮件原件-----
发件人: Li, Dennis <Dennis.Li@amd.com>
发送时间: 2020年2月20日 16:41
收件人: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
主题: RE: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian and Monk,
      When doing SDMA copy, a RAS uncorrectable error happens, which will cause this issue.  The RAS uncorrectable error event will trigger driver to do BACO reset which will set the status of SDMA scheduler to no ready. And then drm_sched_entity_get_free_sched will return NULL in drm_sched_entity_select_rq, which cause entity->rq to NULL. 

Best Regards
Dennis Li
-----Original Message-----
From: Liu, Monk <Monk.Liu@amd.com>
Sent: Wednesday, February 19, 2020 7:30 PM
To: Koenig, Christian <Christian.Koenig@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
Subject: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

> +	if (!entity->rq)
> +		return 0;
> +

Yes, supposedly we shouldn't get 'entity->rq == NULL' case , that looks the true bug 

-----邮件原件-----
发件人: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> 代表 Christian K?nig
发送时间: 2020年2月19日 18:50
收件人: Zhang, Hawking <Hawking.Zhang@amd.com>; Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
主题: Re: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

Well of hand this patch looks like a clear NAK to me.

Returning without raising an error is certainly the wrong thing to do here because we just drop the necessary page table updates.

How does the entity->rq ends up as NULL in the first place?

Regards,
Christian.

Am 19.02.20 um 07:26 schrieb Zhang, Hawking:
> [AMD Official Use Only - Internal Distribution Only]
>
> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
>
> Regards,
> Hawking
> -----Original Message-----
> From: Dennis Li <Dennis.Li@amd.com>
> Sent: Wednesday, February 19, 2020 12:05
> To: amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; Zhou1, Tao <Tao.Zhou1@amd.com>; Zhang, 
> Hawking <Hawking.Zhang@amd.com>; Chen, Guchun <Guchun.Chen@amd.com>
> Cc: Li, Dennis <Dennis.Li@amd.com>
> Subject: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
>
> check whether the queue of entity is null to avoid null pointer dereference.
>
> Change-Id: I08d56774012cf229ba2fe7a011c1359e8d1e2781
> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index 4cc7881f438c..67cca463ddcc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,9 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>   	int r;
>   
>   	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> +	if (!entity->rq)
> +		return 0;
> +
>   	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>   
>   	WARN_ON(ib->length_dw == 0);
> --
> 2.17.1
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmo
> nk.liu%40amd.com%7C28e7260af3a24eec758f08d7b52975e3%7C3dd8961fe4884e60
> 8e11a82d994e183d%7C0%7C0%7C637177062003213431&amp;sdata=vMXmhwTlN8lAav
> uqhYhpmKLM6V%2F%2B2%2FubFBbsk%2BGY%2Bjw%3D&amp;reserved=0

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7C7acd21e839804ca7b70e08d7b608b9af%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637178021336998643&amp;sdata=cxadPi08Q1dqMZYtodebNWm55GruAr5BOmcsZR%2BFR3M%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7C7acd21e839804ca7b70e08d7b608b9af%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637178021336998643&amp;sdata=cxadPi08Q1dqMZYtodebNWm55GruAr5BOmcsZR%2BFR3M%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
  2020-02-20 13:35           ` Liu, Monk
@ 2020-02-20 15:15             ` Nirmoy
  2020-02-20 15:20               ` Nirmoy
  0 siblings, 1 reply; 12+ messages in thread
From: Nirmoy @ 2020-02-20 15:15 UTC (permalink / raw)
  To: amd-gfx


On 2/20/20 2:35 PM, Liu, Monk wrote:
> Sorry, my previous idea still leave RQ null, please check if below method works:
>
> 29 static struct drm_sched_rq *
> 130 drm_sched_entity_get_free_sched(struct drm_sched_entity *entity)
> 131 {
> 132     struct drm_sched_rq *rq = NULL;
> 133     unsigned int min_jobs = UINT_MAX, num_jobs;
> 134     int i;
>
> 135
> 		While (!mutex_trylock(....))
> 			Sleep()
We can't do that drm_sched_entity_get_free_sched is in another 
module(drm scheduler) independent of amdgpu
> 136     for (i = 0; i < entity->num_rq_list; ++i) {
> 137         struct drm_gpu_scheduler *sched = entity->rq_list[i]->sched;
> 138
> 139         if (!entity->rq_list[i]->sched->ready) {    //we take the gpu reset mutex lock, so now sched->ready won't be set to "not ready"
> 140             DRM_WARN("sched%s is not ready, skipping", sched->name);				
> 141             continue;
> 142         }
> 143
> 144         num_jobs = atomic_read(&sched->num_jobs);
> 145         if (num_jobs < min_jobs) {
> 146             min_jobs = num_jobs;
> 147             rq = entity->rq_list[i];
> 148         }
> 149     }
>
> 		Mutex_unlock(...)
>
> 150
> 151     return rq;
> 152 }
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
  2020-02-20 15:15             ` Nirmoy
@ 2020-02-20 15:20               ` Nirmoy
  2020-02-20 15:24                 ` 回复: " Liu, Monk
  0 siblings, 1 reply; 12+ messages in thread
From: Nirmoy @ 2020-02-20 15:20 UTC (permalink / raw)
  To: amd-gfx

Hi Monk,

How  can I reproduce this bug ?

Regards,

Nirmoy

On 2/20/20 4:15 PM, Nirmoy wrote:
>
> On 2/20/20 2:35 PM, Liu, Monk wrote:
>> Sorry, my previous idea still leave RQ null, please check if below 
>> method works:
>>
>> 29 static struct drm_sched_rq *
>> 130 drm_sched_entity_get_free_sched(struct drm_sched_entity *entity)
>> 131 {
>> 132     struct drm_sched_rq *rq = NULL;
>> 133     unsigned int min_jobs = UINT_MAX, num_jobs;
>> 134     int i;
>>
>> 135
>>         While (!mutex_trylock(....))
>>             Sleep()
> We can't do that drm_sched_entity_get_free_sched is in another 
> module(drm scheduler) independent of amdgpu
>> 136     for (i = 0; i < entity->num_rq_list; ++i) {
>> 137         struct drm_gpu_scheduler *sched = entity->rq_list[i]->sched;
>> 138
>> 139         if (!entity->rq_list[i]->sched->ready) { //we take the 
>> gpu reset mutex lock, so now sched->ready won't be set to "not ready"
>> 140             DRM_WARN("sched%s is not ready, skipping", sched->name);
>> 141             continue;
>> 142         }
>> 143
>> 144         num_jobs = atomic_read(&sched->num_jobs);
>> 145         if (num_jobs < min_jobs) {
>> 146             min_jobs = num_jobs;
>> 147             rq = entity->rq_list[i];
>> 148         }
>> 149     }
>>
>>         Mutex_unlock(...)
>>
>> 150
>> 151     return rq;
>> 152 }
>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cnirmoy.das%40amd.com%7C42d1848a368c4371069a08d7b6175779%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637178083693348777&amp;sdata=FWeewzJ8jYBlBxAVMkMeq%2BIEOzGapqMiRfW22VYr1SI%3D&amp;reserved=0 
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* 回复: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
  2020-02-20 15:20               ` Nirmoy
@ 2020-02-20 15:24                 ` Liu, Monk
  2020-02-20 15:25                   ` Liu, Monk
  0 siblings, 1 reply; 12+ messages in thread
From: Liu, Monk @ 2020-02-20 15:24 UTC (permalink / raw)
  To: Das, Nirmoy, amd-gfx

I don't know

The patch is from me, I just think the patch is dropping a vm update which looks not perfect to me

-----邮件原件-----
发件人: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> 代表 Nirmoy
发送时间: 2020年2月20日 23:21
收件人: amd-gfx@lists.freedesktop.org
主题: Re: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

Hi Monk,

How  can I reproduce this bug ?

Regards,

Nirmoy

On 2/20/20 4:15 PM, Nirmoy wrote:
>
> On 2/20/20 2:35 PM, Liu, Monk wrote:
>> Sorry, my previous idea still leave RQ null, please check if below 
>> method works:
>>
>> 29 static struct drm_sched_rq *
>> 130 drm_sched_entity_get_free_sched(struct drm_sched_entity *entity)
>> 131 {
>> 132     struct drm_sched_rq *rq = NULL;
>> 133     unsigned int min_jobs = UINT_MAX, num_jobs;
>> 134     int i;
>>
>> 135
>>         While (!mutex_trylock(....))
>>             Sleep()
> We can't do that drm_sched_entity_get_free_sched is in another 
> module(drm scheduler) independent of amdgpu
>> 136     for (i = 0; i < entity->num_rq_list; ++i) {
>> 137         struct drm_gpu_scheduler *sched = 
>> entity->rq_list[i]->sched;
>> 138
>> 139         if (!entity->rq_list[i]->sched->ready) { //we take the 
>> gpu reset mutex lock, so now sched->ready won't be set to "not ready"
>> 140             DRM_WARN("sched%s is not ready, skipping", 
>> sched->name);
>> 141             continue;
>> 142         }
>> 143
>> 144         num_jobs = atomic_read(&sched->num_jobs);
>> 145         if (num_jobs < min_jobs) {
>> 146             min_jobs = num_jobs;
>> 147             rq = entity->rq_list[i];
>> 148         }
>> 149     }
>>
>>         Mutex_unlock(...)
>>
>> 150
>> 151     return rq;
>> 152 }
>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmo
> nk.liu%40amd.com%7C3635fc8baa4d4b90387108d7b6188069%7C3dd8961fe4884e60
> 8e11a82d994e183d%7C0%7C0%7C637178088661253651&amp;sdata=Ki0SmulPESLIj7
> dGyT10Be0zXungK4U9fXONp45NjaA%3D&amp;reserved=0
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7C3635fc8baa4d4b90387108d7b6188069%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637178088661253651&amp;sdata=Ki0SmulPESLIj7dGyT10Be0zXungK4U9fXONp45NjaA%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

* 回复: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference
  2020-02-20 15:24                 ` 回复: " Liu, Monk
@ 2020-02-20 15:25                   ` Liu, Monk
  0 siblings, 0 replies; 12+ messages in thread
From: Liu, Monk @ 2020-02-20 15:25 UTC (permalink / raw)
  To: Liu, Monk, Das, Nirmoy, amd-gfx

Isn't from me... damn keyboard 

-----邮件原件-----
发件人: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> 代表 Liu, Monk
发送时间: 2020年2月20日 23:24
收件人: Das, Nirmoy <Nirmoy.Das@amd.com>; amd-gfx@lists.freedesktop.org
主题: 回复: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

I don't know

The patch is from me, I just think the patch is dropping a vm update which looks not perfect to me

-----邮件原件-----
发件人: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> 代表 Nirmoy
发送时间: 2020年2月20日 23:21
收件人: amd-gfx@lists.freedesktop.org
主题: Re: 回复: [PATCH] drm/amdgpu: fix a bug NULL pointer dereference

Hi Monk,

How  can I reproduce this bug ?

Regards,

Nirmoy

On 2/20/20 4:15 PM, Nirmoy wrote:
>
> On 2/20/20 2:35 PM, Liu, Monk wrote:
>> Sorry, my previous idea still leave RQ null, please check if below 
>> method works:
>>
>> 29 static struct drm_sched_rq *
>> 130 drm_sched_entity_get_free_sched(struct drm_sched_entity *entity)
>> 131 {
>> 132     struct drm_sched_rq *rq = NULL;
>> 133     unsigned int min_jobs = UINT_MAX, num_jobs;
>> 134     int i;
>>
>> 135
>>         While (!mutex_trylock(....))
>>             Sleep()
> We can't do that drm_sched_entity_get_free_sched is in another 
> module(drm scheduler) independent of amdgpu
>> 136     for (i = 0; i < entity->num_rq_list; ++i) {
>> 137         struct drm_gpu_scheduler *sched =
>> entity->rq_list[i]->sched;
>> 138
>> 139         if (!entity->rq_list[i]->sched->ready) { //we take the 
>> gpu reset mutex lock, so now sched->ready won't be set to "not ready"
>> 140             DRM_WARN("sched%s is not ready, skipping",
>> sched->name);
>> 141             continue;
>> 142         }
>> 143
>> 144         num_jobs = atomic_read(&sched->num_jobs);
>> 145         if (num_jobs < min_jobs) {
>> 146             min_jobs = num_jobs;
>> 147             rq = entity->rq_list[i];
>> 148         }
>> 149     }
>>
>>         Mutex_unlock(...)
>>
>> 150
>> 151     return rq;
>> 152 }
>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flist
> s.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmo
> nk.liu%40amd.com%7C3635fc8baa4d4b90387108d7b6188069%7C3dd8961fe4884e60
> 8e11a82d994e183d%7C0%7C0%7C637178088661253651&amp;sdata=Ki0SmulPESLIj7
> dGyT10Be0zXungK4U9fXONp45NjaA%3D&amp;reserved=0
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7C81ad7d959e7a4ed972bb08d7b618fc41%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637178090738811091&amp;sdata=7BAfVNQKI92Mbw7B2kBRYoEnM2OxDwtedu%2FM7r6cJZ4%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Famd-gfx&amp;data=02%7C01%7Cmonk.liu%40amd.com%7C81ad7d959e7a4ed972bb08d7b618fc41%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637178090738811091&amp;sdata=7BAfVNQKI92Mbw7B2kBRYoEnM2OxDwtedu%2FM7r6cJZ4%3D&amp;reserved=0
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-02-20 15:25 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-19  4:04 [PATCH] drm/amdgpu: fix a bug NULL pointer dereference Dennis Li
2020-02-19  6:26 ` Zhang, Hawking
2020-02-19 10:49   ` Christian König
2020-02-19 11:29     ` 回复: " Liu, Monk
2020-02-20  8:41       ` Li, Dennis
2020-02-20 10:39         ` Nirmoy
2020-02-20 13:28         ` 回复: " Liu, Monk
2020-02-20 13:35           ` Liu, Monk
2020-02-20 15:15             ` Nirmoy
2020-02-20 15:20               ` Nirmoy
2020-02-20 15:24                 ` 回复: " Liu, Monk
2020-02-20 15:25                   ` Liu, Monk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).