AMD-GFX Archive on lore.kernel.org
* [PATCH] drm/amdgpu: Check entity rq
@ 2020-03-25 11:07 xinhui pan
  2020-03-25 11:14 ` Nirmoy
  0 siblings, 1 reply; 11+ messages in thread
From: xinhui pan @ 2020-03-25 11:07 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher, Felix Kuehling, xinhui pan, Christian König

GPU recovery calls SDMA suspend/resume. During that window the rings are
disabled, so the vm_pte_scheds (sdma.instance[X].ring.sched)->ready flag
will be false.

If we submit any job in this ring-disabled window, we fail to pick an rq
for the vm entity and entity->rq is set to NULL.
amdgpu_vm_sdma_commit did not check entity->rq, so fix it; otherwise we
hit a panic.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index cf96c335b258..d30d103e48a2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	int r;
 
 	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
+	if (!entity->rq)
+		return -ENOENT;
 	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
 
 	WARN_ON(ib->length_dw == 0);
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdgpu: Check entity rq
  2020-03-25 11:14 ` Nirmoy
@ 2020-03-25 11:13   ` Koenig, Christian
  2020-03-25 11:34     ` Pan, Xinhui
  2020-03-25 11:37     ` Pan, Xinhui
  0 siblings, 2 replies; 11+ messages in thread
From: Koenig, Christian @ 2020-03-25 11:13 UTC (permalink / raw)
  To: Das, Nirmoy; +Cc: Deucher, Alexander, Kuehling, Felix, Pan, Xinhui, amd-gfx


Hi guys,

thanks for pointing this out Nirmoy.

Yeah, could be that I forgot to commit the patch. Currently I don't know at which end of the chaos I should start to clean up.

Christian.

Am 25.03.2020 12:09 schrieb "Das, Nirmoy" <Nirmoy.Das@amd.com>:
Hi Xinhui,


Can you please check if you can reproduce the crash with
https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html

Christian fixed this earlier; I think he forgot to push it.


Regards,

Nirmoy

On 3/25/20 12:07 PM, xinhui pan wrote:
> gpu recover will call sdma suspend/resume. In this period, ring will be
> disabled. So the vm_pte_scheds(sdma.instance[X].ring.sched)->ready will
> be false.
>
> If we submit any jobs in this ring-disabled period. We fail to pick up
> a rq for vm entity and entity->rq will set to NULL.
> amdgpu_vm_sdma_commit did not check the entity->rq, so fix it. Otherwise
> hit panic.
>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index cf96c335b258..d30d103e48a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>        int r;
>
>        entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> +     if (!entity->rq)
> +             return -ENOENT;
>        ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>
>        WARN_ON(ib->length_dw == 0);




* Re: [PATCH] drm/amdgpu: Check entity rq
  2020-03-25 11:07 [PATCH] drm/amdgpu: Check entity rq xinhui pan
@ 2020-03-25 11:14 ` Nirmoy
  2020-03-25 11:13   ` Koenig, Christian
  0 siblings, 1 reply; 11+ messages in thread
From: Nirmoy @ 2020-03-25 11:14 UTC (permalink / raw)
  To: xinhui pan, amd-gfx; +Cc: Alex Deucher, Felix Kuehling, Christian König

Hi Xinhui,


Can you please check if you can reproduce the crash with 
https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html

Christian fixed this earlier; I think he forgot to push it.


Regards,

Nirmoy

On 3/25/20 12:07 PM, xinhui pan wrote:
> gpu recover will call sdma suspend/resume. In this period, ring will be
> disabled. So the vm_pte_scheds(sdma.instance[X].ring.sched)->ready will
> be false.
>
> If we submit any jobs in this ring-disabled period. We fail to pick up
> a rq for vm entity and entity->rq will set to NULL.
> amdgpu_vm_sdma_commit did not check the entity->rq, so fix it. Otherwise
> hit panic.
>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index cf96c335b258..d30d103e48a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>   	int r;
>   
>   	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> +	if (!entity->rq)
> +		return -ENOENT;
>   	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>   
>   	WARN_ON(ib->length_dw == 0);

* Re: [PATCH] drm/amdgpu: Check entity rq
  2020-03-25 11:13   ` Koenig, Christian
@ 2020-03-25 11:34     ` Pan, Xinhui
  2020-03-25 11:37     ` Pan, Xinhui
  1 sibling, 0 replies; 11+ messages in thread
From: Pan, Xinhui @ 2020-03-25 11:34 UTC (permalink / raw)
  To: Das, Nirmoy, Koenig, Christian
  Cc: Deucher, Alexander, Kuehling, Felix, amd-gfx


Well, submitting a job with the HW disabled should do no harm.

The only concern is that we might use up IBs if we park the scheduler during recovery. I have seen recovery get stuck in the SA allocation path (amdgpu_sa_bo_new).

The ring test allocates IBs to check whether recovery succeeded. But if there are not enough IBs, it waits for fences to signal. However, we have parked the scheduler thread, so the job will never run and no fence will be signaled.

So it is indeed a deadlock. Now that we are allowing job submission here, it is more likely that the IBs will be used up.

________________________________
From: Koenig, Christian <Christian.Koenig@amd.com>
Sent: Wednesday, March 25, 2020 7:13:13 PM
To: Das, Nirmoy <Nirmoy.Das@amd.com>
Cc: Pan, Xinhui <Xinhui.Pan@amd.com>; amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>; Deucher, Alexander <Alexander.Deucher@amd.com>; Kuehling, Felix <Felix.Kuehling@amd.com>
Subject: Re: [PATCH] drm/amdgpu: Check entity rq

Hi guys,

thanks for pointing this out Nirmoy.

Yeah, could be that I forgot to commit the patch. Currently I don't know at which end of the chaos I should start to clean up.

Christian.

Am 25.03.2020 12:09 schrieb "Das, Nirmoy" <Nirmoy.Das@amd.com>:
Hi Xinhui,


Can you please check if you can reproduce the crash with
https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html

Christian fixed this earlier; I think he forgot to push it.


Regards,

Nirmoy

On 3/25/20 12:07 PM, xinhui pan wrote:
> gpu recover will call sdma suspend/resume. In this period, ring will be
> disabled. So the vm_pte_scheds(sdma.instance[X].ring.sched)->ready will
> be false.
>
> If we submit any jobs in this ring-disabled period. We fail to pick up
> a rq for vm entity and entity->rq will set to NULL.
> amdgpu_vm_sdma_commit did not check the entity->rq, so fix it. Otherwise
> hit panic.
>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index cf96c335b258..d30d103e48a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>        int r;
>
>        entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> +     if (!entity->rq)
> +             return -ENOENT;
>        ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>
>        WARN_ON(ib->length_dw == 0);




* Re: [PATCH] drm/amdgpu: Check entity rq
  2020-03-25 11:13   ` Koenig, Christian
  2020-03-25 11:34     ` Pan, Xinhui
@ 2020-03-25 11:37     ` Pan, Xinhui
  1 sibling, 0 replies; 11+ messages in thread
From: Pan, Xinhui @ 2020-03-25 11:37 UTC (permalink / raw)
  To: Koenig, Christian
  Cc: Deucher, Alexander, Kuehling, Felix, Pan, Xinhui, Das, Nirmoy, amd-gfx

Well, submitting a job with the HW disabled should do no harm.

The only concern is that we might use up IBs if we park the scheduler thread during recovery.
I have seen recovery get stuck in the SA allocation path (amdgpu_sa_bo_new).
The ring test allocates IBs to check whether recovery succeeded. But if there are not enough IBs, it waits for fences to signal.
However, we have parked the scheduler thread, so the job will never run and no fence will be signaled.

So it is indeed a deadlock. Now that we are allowing job submission here, it is more likely that the IBs will be used up.

Deadlock call trace:
271384 [27069.375047] INFO: task gnome-shell:2507 blocked for more than 120 seconds.
271385 [27069.382510]       Tainted: G        W         5.4.0-rc7+ #1
271386 [27069.388207] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
271387 [27069.396221] gnome-shell     D    0  2507   2487 0x00000000
271388 [27069.401869] Call Trace:
271389 [27069.404404]  __schedule+0x2ab/0x860
271390 [27069.408009]  ? dma_fence_wait_any_timeout+0x1a4/0x2b0
271391 [27069.413198]  schedule+0x3a/0xc0
271392 [27069.416432]  schedule_timeout+0x21d/0x3c0
271393 [27069.420583]  ? trace_hardirqs_on+0x3b/0xf0
271394 [27069.424815]  ? dma_fence_add_callback+0x6e/0xe0
271395 [27069.429449]  ? dma_fence_wait_any_timeout+0x1a4/0x2b0
271396 [27069.434640]  dma_fence_wait_any_timeout+0x205/0x2b0
271397 [27069.439633]  ? dma_fence_wait_any_timeout+0x238/0x2b0
271398 [27069.444944]  amdgpu_sa_bo_new+0x4d7/0x5c0 [amdgpu]
271399 [27069.449949]  amdgpu_ib_get+0x36/0xa0 [amdgpu]
271400 [27069.454534]  amdgpu_job_alloc_with_ib+0x4d/0x70 [amdgpu]
271401 [27069.460057]  amdgpu_vm_sdma_prepare+0x28/0x60 [amdgpu]
271402 [27069.465370]  amdgpu_vm_bo_update_mapping+0xd7/0x1f0 [amdgpu]
271403 [27069.471171]  ? mark_held_locks+0x4d/0x80
271404 [27069.475281]  amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
271405 [27069.480538]  amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
271406 [27069.485838]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
271407 [27069.491380]  drm_ioctl_kernel+0xb0/0x100 [drm]
271408 [27069.496045]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
271409 [27069.501569]  ? drm_ioctl_kernel+0xb0/0x100 [drm]
271410 [27069.506353]  drm_ioctl+0x389/0x450 [drm]
271411 [27069.510458]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
271412 [27069.516000]  ? trace_hardirqs_on+0x3b/0xf0
271413 [27069.520305]  amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
271414 [27069.525048]  do_vfs_ioctl+0xa9/0x6f0
271415 [27069.528753]  ? tomoyo_file_ioctl+0x19/0x20
271416 [27069.532972]  ksys_ioctl+0x75/0x80
271417 [27069.536396]  ? do_syscall_64+0x17/0x230
271418 [27069.540357]  __x64_sys_ioctl+0x1a/0x20
271419 [27069.544239]  do_syscall_64+0x5f/0x230


> On 2020-03-25 at 19:13, Koenig, Christian <Christian.Koenig@amd.com> wrote:
> 
> Hi guys,
> 
> thanks for pointing this out Nirmoy.
> 
> Yeah, could be that I forgot to commit the patch. Currently I don't know at which end of the chaos I should start to clean up.
> 
> Christian.
> 
> Am 25.03.2020 12:09 schrieb "Das, Nirmoy" <Nirmoy.Das@amd.com>:
> Hi Xinhui,
> 
> 
> Can you please check if you can reproduce the crash with 
> https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html
> 
> Christian fixed this earlier; I think he forgot to push it.
> 
> 
> Regards,
> 
> Nirmoy
> 
> On 3/25/20 12:07 PM, xinhui pan wrote:
> > gpu recover will call sdma suspend/resume. In this period, ring will be
> > disabled. So the vm_pte_scheds(sdma.instance[X].ring.sched)->ready will
> > be false.
> >
> > If we submit any jobs in this ring-disabled period. We fail to pick up
> > a rq for vm entity and entity->rq will set to NULL.
> > amdgpu_vm_sdma_commit did not check the entity->rq, so fix it. Otherwise
> > hit panic.
> >
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Alex Deucher <alexander.deucher@amd.com>
> > Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> > Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > index cf96c335b258..d30d103e48a2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
> >        int r;
> >   
> >        entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> > +     if (!entity->rq)
> > +             return -ENOENT;
> >        ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
> >   
> >        WARN_ON(ib->length_dw == 0);
> 
> 


* Re: [PATCH] drm/amdgpu: Check entity rq
  2020-03-25 11:03     ` Nirmoy
@ 2020-03-30 11:11       ` Christian König
  0 siblings, 0 replies; 11+ messages in thread
From: Christian König @ 2020-03-30 11:11 UTC (permalink / raw)
  To: Nirmoy, amd-gfx

Am 25.03.20 um 12:03 schrieb Nirmoy:
>
> On 3/25/20 10:23 AM, Pan, Xinhui wrote:
>>
>>> On 2020-03-25 at 15:48, Koenig, Christian <Christian.Koenig@amd.com> wrote:
>>>
>>>
>>> Am 25.03.20 um 06:47 schrieb xinhui pan:
>>>> Hit panic during GPU recovery test. drm_sched_entity_select_rq might
>>>> set rq to NULL. So add a check like drm_sched_job_init does.
>>> NAK, the rq should never be set to NULL in the first place.
>>>
>>> How did that happened?
>> Well, I have not checked the details, but just got the call trace
>> below. It looks like the scheduler is not ready, and
>> drm_sched_entity_select_rq sets entity->rq to NULL. In the next
>> amdgpu_vm_sdma_commit, we hit a panic when we dereference
>> entity->rq.
>
> "drm/amdgpu: stop disable the scheduler during HW fini" from Christian 
> should've fixed it already. But
>
> I can't find that commit in brahma/amd-staging-drm-next.

Yeah, my fault. I actually forgot to push it.

Should be fixed by now,
Christian.

>
> Regards,
>
> Nirmoy
>
>>
>> 297567 [   44.667677] amdgpu 0000:03:00.0: GPU reset begin!
>> 297568 [   44.929047] [drm] scheduler sdma0 is not ready, skipping
>> 297569 [   44.929048] [drm] scheduler sdma1 is not ready, skipping
>> 297570 [   44.934608] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* 
>> Couldn't update BO_VA (-2)
>> 297571 [   44.947941] BUG: kernel NULL pointer dereference, address: 
>> 0000000000000038
>> 297572 [   44.955132] #PF: supervisor read access in kernel mode
>> 297573 [   44.960451] #PF: error_code(0x0000) - not-present page
>> 297574 [   44.965714] PGD 0 P4D 0
>> 297575 [   44.968331] Oops: 0000 [#1] SMP PTI
>> 297576 [   44.971911] CPU: 7 PID: 2496 Comm: gnome-shell Tainted: 
>> G        W         5.4.0-rc7+ #1
>> 297577 [   44.980221] Hardware name: System manufacturer System 
>> Product Name/Z170-A, BIOS 1702 01/28/2016
>> 297578 [   44.989177] RIP: 0010:amdgpu_vm_sdma_commit+0x55/0x190 
>> [amdgpu]
>> 297579 [   44.995242] Code: 47 20 80 7f 10 00 4c 8b a0 88 01 00 00 48 
>> 8b 47 08 4c 8d a8 70 01 00 00 75 07 4c 8d a8 88 02 00 00 49 8b 45 10 
>> 41 8b 54 24 08 <48> 8b 40 38 85 d2 48 8d b8 30 ff ff ff 0f 84 
>> 06 01 00 00 48 8b 80
>> 297580 [   45.014931] RSP: 0018:ffffb66e008839d0 EFLAGS: 00010246
>> 297581 [   45.020504] RAX: 0000000000000000 RBX: ffffb66e00883a30 
>> RCX: 0000000000100400
>> 297582 [   45.028062] RDX: 000000000000003c RSI: ffff8df123662138 
>> RDI: ffffb66e00883a30
>> 297583 [   45.035662] RBP: ffffb66e00883a00 R08: ffffb66e0088395c 
>> R09: ffffb66e00883960
>> 297584 [   45.043298] R10: 0000000000100240 R11: 0000000000000035 
>> R12: ffff8df1425385e8
>> 297585 [   45.050916] R13: ffff8df13cfd1288 R14: ffff8df123662138 
>> R15: ffff8df13cfd1000
>> 297586 [   45.058524] FS:  00007fcc8f6b2100(0000) 
>> GS:ffff8df15e380000(0000) knlGS:0000000000000000
>> 297587 [   45.067114] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> 297588 [   45.073206] CR2: 0000000000000038 CR3: 0000000641fb6006 
>> CR4: 00000000003606e0
>> 297589 [   45.080791] DR0: 0000000000000000 DR1: 0000000000000000 
>> DR2: 0000000000000000
>> 297590 [   45.088277] DR3: 0000000000000000 DR6: 00000000fffe0ff0 
>> DR7: 0000000000000400
>> 297591 [   45.095773] Call Trace:
>> 297592 [   45.098354]  amdgpu_vm_bo_update_mapping+0x1c1/0x1f0 [amdgpu]
>> 297593 [   45.104427]  ? mark_held_locks+0x4d/0x80
>> 297594 [   45.108682]  amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
>> 297595 [   45.114049]  ? rcu_read_lock_sched_held+0x4f/0x80
>> 297596 [   45.119111]  amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
>> 297597 [   45.124495]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
>> 297598 [   45.130250]  drm_ioctl_kernel+0xb0/0x100 [drm]
>> 297599 [   45.134988]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
>> 297600 [   45.140742]  ? drm_ioctl_kernel+0xb0/0x100 [drm]
>> 297601 [   45.145622]  drm_ioctl+0x389/0x450 [drm]
>> 297602 [   45.149804]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
>> 297603 [   45.155551]  ? trace_hardirqs_on+0x3b/0xf0
>> 297604 [   45.159892]  amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>> 297605 [   45.172104]  do_vfs_ioctl+0xa9/0x6f0
>> 297606 [   45.175909]  ? tomoyo_file_ioctl+0x19/0x20
>> 297607 [   45.180241]  ksys_ioctl+0x75/0x80
>> 297608 [   45.183760]  ? do_syscall_64+0x17/0x230
>> 297609 [   45.187833]  __x64_sys_ioctl+0x1a/0x20
>> 297610 [   45.191846]  do_syscall_64+0x5f/0x230
>> 297611 [   45.195764]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> 297612 [   45.201126] RIP: 0033:0x7fcc8c7725d7
>>
>>> Regards,
>>> Christian.
>>>
>>>> Cc: Christian König <christian.koenig@amd.com>
>>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>>> ---
>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>>>>   1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>>> index cf96c335b258..d30d103e48a2 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>>> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct 
>>>> amdgpu_vm_update_params *p,
>>>>       int r;
>>>>         entity = p->direct ? &p->vm->direct : &p->vm->delayed;
>>>> +    if (!entity->rq)
>>>> +        return -ENOENT;
>>>>       ring = container_of(entity->rq->sched, struct amdgpu_ring, 
>>>> sched);
>>>>         WARN_ON(ib->length_dw == 0);


* Re: [PATCH] drm/amdgpu: Check entity rq
  2020-03-25  9:23   ` Pan, Xinhui
  2020-03-25 10:54     ` Pan, Xinhui
@ 2020-03-25 11:03     ` Nirmoy
  2020-03-30 11:11       ` Christian König
  1 sibling, 1 reply; 11+ messages in thread
From: Nirmoy @ 2020-03-25 11:03 UTC (permalink / raw)
  To: amd-gfx


On 3/25/20 10:23 AM, Pan, Xinhui wrote:
>
>> On 2020-03-25 at 15:48, Koenig, Christian <Christian.Koenig@amd.com> wrote:
>>
>> Am 25.03.20 um 06:47 schrieb xinhui pan:
>>> Hit panic during GPU recovery test. drm_sched_entity_select_rq might
>>> set rq to NULL. So add a check like drm_sched_job_init does.
>> NAK, the rq should never be set to NULL in the first place.
>>
>> How did that happen?
> Well, I have not checked the details, but just got the call trace below.
> It looks like the scheduler is not ready, and drm_sched_entity_select_rq sets entity->rq to NULL.
> In the next amdgpu_vm_sdma_commit, we hit a panic when we dereference entity->rq.

"drm/amdgpu: stop disable the scheduler during HW fini" from Christian
should've fixed it already, but I can't find that commit in
brahma/amd-staging-drm-next.

Regards,

Nirmoy

>
> 297567 [   44.667677] amdgpu 0000:03:00.0: GPU reset begin!
> 297568 [   44.929047] [drm] scheduler sdma0 is not ready, skipping
> 297569 [   44.929048] [drm] scheduler sdma1 is not ready, skipping
> 297570 [   44.934608] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
> 297571 [   44.947941] BUG: kernel NULL pointer dereference, address: 0000000000000038
> 297572 [   44.955132] #PF: supervisor read access in kernel mode
> 297573 [   44.960451] #PF: error_code(0x0000) - not-present page
> 297574 [   44.965714] PGD 0 P4D 0
> 297575 [   44.968331] Oops: 0000 [#1] SMP PTI
> 297576 [   44.971911] CPU: 7 PID: 2496 Comm: gnome-shell Tainted: G        W         5.4.0-rc7+ #1
> 297577 [   44.980221] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1702 01/28/2016
> 297578 [   44.989177] RIP: 0010:amdgpu_vm_sdma_commit+0x55/0x190 [amdgpu]
> 297579 [   44.995242] Code: 47 20 80 7f 10 00 4c 8b a0 88 01 00 00 48 8b 47 08 4c 8d a8 70 01 00 00 75 07 4c 8d a8 88 02 00 00 49 8b 45 10 41 8b 54 24 08 <48> 8b 40 38 85 d2 48 8d b8 30 ff ff ff 0f 84 06 01 00 00 48 8b 80
> 297580 [   45.014931] RSP: 0018:ffffb66e008839d0 EFLAGS: 00010246
> 297581 [   45.020504] RAX: 0000000000000000 RBX: ffffb66e00883a30 RCX: 0000000000100400
> 297582 [   45.028062] RDX: 000000000000003c RSI: ffff8df123662138 RDI: ffffb66e00883a30
> 297583 [   45.035662] RBP: ffffb66e00883a00 R08: ffffb66e0088395c R09: ffffb66e00883960
> 297584 [   45.043298] R10: 0000000000100240 R11: 0000000000000035 R12: ffff8df1425385e8
> 297585 [   45.050916] R13: ffff8df13cfd1288 R14: ffff8df123662138 R15: ffff8df13cfd1000
> 297586 [   45.058524] FS:  00007fcc8f6b2100(0000) GS:ffff8df15e380000(0000) knlGS:0000000000000000
> 297587 [   45.067114] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 297588 [   45.073206] CR2: 0000000000000038 CR3: 0000000641fb6006 CR4: 00000000003606e0
> 297589 [   45.080791] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> 297590 [   45.088277] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> 297591 [   45.095773] Call Trace:
> 297592 [   45.098354]  amdgpu_vm_bo_update_mapping+0x1c1/0x1f0 [amdgpu]
> 297593 [   45.104427]  ? mark_held_locks+0x4d/0x80
> 297594 [   45.108682]  amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
> 297595 [   45.114049]  ? rcu_read_lock_sched_held+0x4f/0x80
> 297596 [   45.119111]  amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
> 297597 [   45.124495]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> 297598 [   45.130250]  drm_ioctl_kernel+0xb0/0x100 [drm]
> 297599 [   45.134988]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> 297600 [   45.140742]  ? drm_ioctl_kernel+0xb0/0x100 [drm]
> 297601 [   45.145622]  drm_ioctl+0x389/0x450 [drm]
> 297602 [   45.149804]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> 297603 [   45.155551]  ? trace_hardirqs_on+0x3b/0xf0
> 297604 [   45.159892]  amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
> 297605 [   45.172104]  do_vfs_ioctl+0xa9/0x6f0
> 297606 [   45.175909]  ? tomoyo_file_ioctl+0x19/0x20
> 297607 [   45.180241]  ksys_ioctl+0x75/0x80
> 297608 [   45.183760]  ? do_syscall_64+0x17/0x230
> 297609 [   45.187833]  __x64_sys_ioctl+0x1a/0x20
> 297610 [   45.191846]  do_syscall_64+0x5f/0x230
> 297611 [   45.195764]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 297612 [   45.201126] RIP: 0033:0x7fcc8c7725d7
>
>> Regards,
>> Christian.
>>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>>>   1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> index cf96c335b258..d30d103e48a2 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>>>  	int r;
>>>  
>>>  	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
>>> +	if (!entity->rq)
>>> +		return -ENOENT;
>>>  	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>>>  
>>>  	WARN_ON(ib->length_dw == 0);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdgpu: Check entity rq
  2020-03-25  9:23   ` Pan, Xinhui
@ 2020-03-25 10:54     ` Pan, Xinhui
  2020-03-25 11:03     ` Nirmoy
  1 sibling, 0 replies; 11+ messages in thread
From: Pan, Xinhui @ 2020-03-25 10:54 UTC (permalink / raw)
  To: Pan, Xinhui
  Cc: Deucher, Alexander, Kuehling, Felix, Pan, Xinhui, Koenig,
	Christian, amd-gfx



> On 2020-03-25 17:23, Pan, Xinhui <Xinhui.Pan@amd.com> wrote:
> 
> 
> 
>> On 2020-03-25 15:48, Koenig, Christian <Christian.Koenig@amd.com> wrote:
>> 
>> On 25.03.20 at 06:47, xinhui pan wrote:
>>> Hit panic during GPU recovery test. drm_sched_entity_select_rq might
>>> set rq to NULL. So add a check like drm_sched_job_init does.
>> 
>> NAK, the rq should never be set to NULL in the first place.
>> 
>> How did that happen?
> 
> well, I have not checked the details.

So recovery disables the sdma ring, and sched->ready then becomes false.
Any job submitted during the suspend/resume window hits this issue.

[   99.011614] amdgpu 0000:03:00.0: GPU reset begin!
[   99.265504] CPU: 5 PID: 163 Comm: kworker/5:1 Tainted: G        W         5.4.0-rc7+ #1
[   99.273659] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1702 01/28/2016
[   99.282522] Workqueue: events drm_sched_job_timedout [gpu_sched]
[   99.288682] Call Trace:
[   99.291193]  dump_stack+0x98/0xd5
[   99.294629]  sdma_v5_0_enable+0x1ab/0x1d0 [amdgpu]
[   99.299563]  sdma_v5_0_suspend+0x2a/0x30 [amdgpu]
[   99.304360]  amdgpu_device_ip_suspend_phase2+0xa3/0x110 [amdgpu]
[   99.310504]  ? amdgpu_device_ip_suspend_phase1+0x5b/0xe0 [amdgpu]
[   99.316727]  amdgpu_device_ip_suspend+0x37/0x60 [amdgpu]
[   99.322159]  amdgpu_device_pre_asic_reset+0x81/0x1f0 [amdgpu]
[   99.328054]  amdgpu_device_gpu_recover+0x27f/0xc60 [amdgpu]
[   99.333767]  amdgpu_job_timedout+0x123/0x140 [amdgpu]
[   99.338898]  drm_sched_job_timedout+0x85/0xe0 [gpu_sched]
[   99.344445]  ? amdgpu_cgs_destroy_device+0x10/0x10 [amdgpu]
[   99.350145]  ? drm_sched_job_timedout+0x85/0xe0 [gpu_sched]
[   99.355834]  process_one_work+0x231/0x5c0
[   99.359927]  worker_thread+0x3f/0x3b0
[   99.363641]  ? __kthread_parkme+0x61/0x90
[   99.367701]  kthread+0x12c/0x150
[   99.371010]  ? process_one_work+0x5c0/0x5c0
[   99.375318]  ? kthread_park+0x90/0x90
[   99.379042]  ret_from_fork+0x3a/0x50


> but I just got the call trace below.
> It looks like the sched is not ready, so drm_sched_entity_select_rq set entity->rq to NULL;
> in the next amdgpu_vm_sdma_commit, we hit the panic when we dereferenced entity->rq.
> 
> 297567 [   44.667677] amdgpu 0000:03:00.0: GPU reset begin!
> 297568 [   44.929047] [drm] scheduler sdma0 is not ready, skipping
> 297569 [   44.929048] [drm] scheduler sdma1 is not ready, skipping
> 297570 [   44.934608] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
> 297571 [   44.947941] BUG: kernel NULL pointer dereference, address: 0000000000000038
> 297572 [   44.955132] #PF: supervisor read access in kernel mode
> 297573 [   44.960451] #PF: error_code(0x0000) - not-present page
> 297574 [   44.965714] PGD 0 P4D 0
> 297575 [   44.968331] Oops: 0000 [#1] SMP PTI
> 297576 [   44.971911] CPU: 7 PID: 2496 Comm: gnome-shell Tainted: G        W         5.4.0-rc7+ #1
> 297577 [   44.980221] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1702 01/28/2016
> 297578 [   44.989177] RIP: 0010:amdgpu_vm_sdma_commit+0x55/0x190 [amdgpu]
> 297579 [   44.995242] Code: 47 20 80 7f 10 00 4c 8b a0 88 01 00 00 48 8b 47 08 4c 8d a8 70 01 00 00 75 07 4c 8d a8 88 02 00 00 49 8b 45 10 41 8b 54 24 08 <48> 8b 40 38 85 d2 48 8d b8 30 ff ff ff 0f 84 06 01 00 00 48 8b 80
> 297580 [   45.014931] RSP: 0018:ffffb66e008839d0 EFLAGS: 00010246
> 297581 [   45.020504] RAX: 0000000000000000 RBX: ffffb66e00883a30 RCX: 0000000000100400
> 297582 [   45.028062] RDX: 000000000000003c RSI: ffff8df123662138 RDI: ffffb66e00883a30
> 297583 [   45.035662] RBP: ffffb66e00883a00 R08: ffffb66e0088395c R09: ffffb66e00883960
> 297584 [   45.043298] R10: 0000000000100240 R11: 0000000000000035 R12: ffff8df1425385e8
> 297585 [   45.050916] R13: ffff8df13cfd1288 R14: ffff8df123662138 R15: ffff8df13cfd1000
> 297586 [   45.058524] FS:  00007fcc8f6b2100(0000) GS:ffff8df15e380000(0000) knlGS:0000000000000000
> 297587 [   45.067114] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 297588 [   45.073206] CR2: 0000000000000038 CR3: 0000000641fb6006 CR4: 00000000003606e0
> 297589 [   45.080791] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> 297590 [   45.088277] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> 297591 [   45.095773] Call Trace:
> 297592 [   45.098354]  amdgpu_vm_bo_update_mapping+0x1c1/0x1f0 [amdgpu]
> 297593 [   45.104427]  ? mark_held_locks+0x4d/0x80
> 297594 [   45.108682]  amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
> 297595 [   45.114049]  ? rcu_read_lock_sched_held+0x4f/0x80
> 297596 [   45.119111]  amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
> 297597 [   45.124495]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> 297598 [   45.130250]  drm_ioctl_kernel+0xb0/0x100 [drm]
> 297599 [   45.134988]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> 297600 [   45.140742]  ? drm_ioctl_kernel+0xb0/0x100 [drm]
> 297601 [   45.145622]  drm_ioctl+0x389/0x450 [drm]
> 297602 [   45.149804]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
> 297603 [   45.155551]  ? trace_hardirqs_on+0x3b/0xf0
> 297604 [   45.159892]  amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
> 297605 [   45.172104]  do_vfs_ioctl+0xa9/0x6f0
> 297606 [   45.175909]  ? tomoyo_file_ioctl+0x19/0x20
> 297607 [   45.180241]  ksys_ioctl+0x75/0x80
> 297608 [   45.183760]  ? do_syscall_64+0x17/0x230
> 297609 [   45.187833]  __x64_sys_ioctl+0x1a/0x20
> 297610 [   45.191846]  do_syscall_64+0x5f/0x230
> 297611 [   45.195764]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 297612 [   45.201126] RIP: 0033:0x7fcc8c7725d7
> 
>> 
>> Regards,
>> Christian.
>> 
>>> 
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: Alex Deucher <alexander.deucher@amd.com>
>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>> 
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> index cf96c335b258..d30d103e48a2 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>>> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>>>  	int r;
>>>  
>>>  	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
>>> +	if (!entity->rq)
>>> +		return -ENOENT;
>>>  	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>>>  
>>>  	WARN_ON(ib->length_dw == 0);
>> 
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdgpu: Check entity rq
  2020-03-25  7:48 ` Christian König
@ 2020-03-25  9:23   ` Pan, Xinhui
  2020-03-25 10:54     ` Pan, Xinhui
  2020-03-25 11:03     ` Nirmoy
  0 siblings, 2 replies; 11+ messages in thread
From: Pan, Xinhui @ 2020-03-25  9:23 UTC (permalink / raw)
  To: Koenig, Christian
  Cc: Deucher, Alexander, Kuehling, Felix, Pan, Xinhui, amd-gfx



> On 2020-03-25 15:48, Koenig, Christian <Christian.Koenig@amd.com> wrote:
> 
> On 25.03.20 at 06:47, xinhui pan wrote:
>> Hit panic during GPU recovery test. drm_sched_entity_select_rq might
>> set rq to NULL. So add a check like drm_sched_job_init does.
> 
> NAK, the rq should never be set to NULL in the first place.
> 
> How did that happen?

well, I have not checked the details yet,
but I just got the call trace below.
It looks like the sched is not ready, so drm_sched_entity_select_rq set entity->rq to NULL;
the next amdgpu_vm_sdma_commit then hit the panic when it dereferenced entity->rq.

297567 [   44.667677] amdgpu 0000:03:00.0: GPU reset begin!
297568 [   44.929047] [drm] scheduler sdma0 is not ready, skipping
297569 [   44.929048] [drm] scheduler sdma1 is not ready, skipping
297570 [   44.934608] [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-2)
297571 [   44.947941] BUG: kernel NULL pointer dereference, address: 0000000000000038
297572 [   44.955132] #PF: supervisor read access in kernel mode
297573 [   44.960451] #PF: error_code(0x0000) - not-present page
297574 [   44.965714] PGD 0 P4D 0
297575 [   44.968331] Oops: 0000 [#1] SMP PTI
297576 [   44.971911] CPU: 7 PID: 2496 Comm: gnome-shell Tainted: G        W         5.4.0-rc7+ #1
297577 [   44.980221] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1702 01/28/2016
297578 [   44.989177] RIP: 0010:amdgpu_vm_sdma_commit+0x55/0x190 [amdgpu]
297579 [   44.995242] Code: 47 20 80 7f 10 00 4c 8b a0 88 01 00 00 48 8b 47 08 4c 8d a8 70 01 00 00 75 07 4c 8d a8 88 02 00 00 49 8b 45 10 41 8b 54 24 08 <48> 8b 40 38 85 d2 48 8d b8 30 ff ff ff 0f 84 06 01 00 00 48 8b 80
297580 [   45.014931] RSP: 0018:ffffb66e008839d0 EFLAGS: 00010246
297581 [   45.020504] RAX: 0000000000000000 RBX: ffffb66e00883a30 RCX: 0000000000100400
297582 [   45.028062] RDX: 000000000000003c RSI: ffff8df123662138 RDI: ffffb66e00883a30
297583 [   45.035662] RBP: ffffb66e00883a00 R08: ffffb66e0088395c R09: ffffb66e00883960
297584 [   45.043298] R10: 0000000000100240 R11: 0000000000000035 R12: ffff8df1425385e8
297585 [   45.050916] R13: ffff8df13cfd1288 R14: ffff8df123662138 R15: ffff8df13cfd1000
297586 [   45.058524] FS:  00007fcc8f6b2100(0000) GS:ffff8df15e380000(0000) knlGS:0000000000000000
297587 [   45.067114] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
297588 [   45.073206] CR2: 0000000000000038 CR3: 0000000641fb6006 CR4: 00000000003606e0
297589 [   45.080791] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
297590 [   45.088277] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
297591 [   45.095773] Call Trace:
297592 [   45.098354]  amdgpu_vm_bo_update_mapping+0x1c1/0x1f0 [amdgpu]
297593 [   45.104427]  ? mark_held_locks+0x4d/0x80
297594 [   45.108682]  amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
297595 [   45.114049]  ? rcu_read_lock_sched_held+0x4f/0x80
297596 [   45.119111]  amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
297597 [   45.124495]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
297598 [   45.130250]  drm_ioctl_kernel+0xb0/0x100 [drm]
297599 [   45.134988]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
297600 [   45.140742]  ? drm_ioctl_kernel+0xb0/0x100 [drm]
297601 [   45.145622]  drm_ioctl+0x389/0x450 [drm]
297602 [   45.149804]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
297603 [   45.155551]  ? trace_hardirqs_on+0x3b/0xf0
297604 [   45.159892]  amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
297605 [   45.172104]  do_vfs_ioctl+0xa9/0x6f0
297606 [   45.175909]  ? tomoyo_file_ioctl+0x19/0x20
297607 [   45.180241]  ksys_ioctl+0x75/0x80
297608 [   45.183760]  ? do_syscall_64+0x17/0x230
297609 [   45.187833]  __x64_sys_ioctl+0x1a/0x20
297610 [   45.191846]  do_syscall_64+0x5f/0x230
297611 [   45.195764]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
297612 [   45.201126] RIP: 0033:0x7fcc8c7725d7

> 
> Regards,
> Christian.
> 
>> 
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>>  1 file changed, 2 insertions(+)
>> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> index cf96c335b258..d30d103e48a2 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
>> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>>  	int r;
>>  
>>  	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
>> +	if (!entity->rq)
>> +		return -ENOENT;
>>  	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>>  
>>  	WARN_ON(ib->length_dw == 0);
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/amdgpu: Check entity rq
  2020-03-25  5:47 xinhui pan
@ 2020-03-25  7:48 ` Christian König
  2020-03-25  9:23   ` Pan, Xinhui
  0 siblings, 1 reply; 11+ messages in thread
From: Christian König @ 2020-03-25  7:48 UTC (permalink / raw)
  To: xinhui pan, amd-gfx; +Cc: Alex Deucher, Felix Kuehling

On 25.03.20 at 06:47, xinhui pan wrote:
> Hit panic during GPU recovery test. drm_sched_entity_select_rq might
> set rq to NULL. So add a check like drm_sched_job_init does.

NAK, the rq should never be set to NULL in the first place.

How did that happen?

Regards,
Christian.

>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index cf96c335b258..d30d103e48a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>   	int r;
>   
>   	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> +	if (!entity->rq)
> +		return -ENOENT;
>   	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>   
>   	WARN_ON(ib->length_dw == 0);


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] drm/amdgpu: Check entity rq
@ 2020-03-25  5:47 xinhui pan
  2020-03-25  7:48 ` Christian König
  0 siblings, 1 reply; 11+ messages in thread
From: xinhui pan @ 2020-03-25  5:47 UTC (permalink / raw)
  To: amd-gfx; +Cc: Alex Deucher, Felix Kuehling, xinhui pan, Christian König

Hit panic during GPU recovery test. drm_sched_entity_select_rq might
set rq to NULL. So add a check like drm_sched_job_init does.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
index cf96c335b258..d30d103e48a2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
@@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
 	int r;
 
 	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
+	if (!entity->rq)
+		return -ENOENT;
 	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
 
 	WARN_ON(ib->length_dw == 0);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, back to index

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-25 11:07 [PATCH] drm/amdgpu: Check entity rq xinhui pan
2020-03-25 11:14 ` Nirmoy
2020-03-25 11:13   ` Koenig, Christian
2020-03-25 11:34     ` Pan, Xinhui
2020-03-25 11:37     ` Pan, Xinhui
  -- strict thread matches above, loose matches on Subject: below --
2020-03-25  5:47 xinhui pan
2020-03-25  7:48 ` Christian König
2020-03-25  9:23   ` Pan, Xinhui
2020-03-25 10:54     ` Pan, Xinhui
2020-03-25 11:03     ` Nirmoy
2020-03-30 11:11       ` Christian König

AMD-GFX Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/amd-gfx/0 amd-gfx/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 amd-gfx amd-gfx/ https://lore.kernel.org/amd-gfx \
		amd-gfx@lists.freedesktop.org
	public-inbox-index amd-gfx

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.freedesktop.lists.amd-gfx


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git