AMD-GFX Archive on lore.kernel.org
From: "Pan, Xinhui" <Xinhui.Pan@amd.com>
To: "Koenig, Christian" <Christian.Koenig@amd.com>
Cc: "Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Kuehling, Felix" <Felix.Kuehling@amd.com>,
	"Pan, Xinhui" <Xinhui.Pan@amd.com>,
	"Das, Nirmoy" <Nirmoy.Das@amd.com>,
	"amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] drm/amdgpu: Check entity rq
Date: Wed, 25 Mar 2020 11:37:27 +0000
Message-ID: <7E315CB2-0C2D-4F61-B8BD-BA0C9772390E@amd.com> (raw)
In-Reply-To: <32e5b144-228c-44d9-8576-3941dc99d8d5@email.android.com>

Well, submitting jobs with the HW disabled should do no harm.

The only concern is that we might use up all the IBs if we park the scheduler thread during recovery.
I have seen recovery get stuck in the SA allocation function.
The ring test allocates IBs to check whether recovery succeeded. But if there are not enough free IBs, it waits for fences to signal.
However, since we have parked the scheduler thread, the jobs will never run and no fences will ever be signaled.

So that is indeed a deadlock. Now that we allow job submission here, it is even more likely that the IBs get used up.
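
To spell out the dependency (a rough sketch pieced together from the call trace below, not the exact driver code):

    gnome-shell ioctl -> page table update needs an IB
        amdgpu_ib_get() -> amdgpu_sa_bo_new()
            -> no free suballocation slot left
            -> dma_fence_wait_any_timeout()   /* wait for an in-flight IB to retire */

    GPU recovery running at the same time
        scheduler thread parked -> pending jobs never run -> their fences never signal

    => amdgpu_sa_bo_new() waits forever and the ioctl never returns.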

Deadlock call trace:
[27069.375047] INFO: task gnome-shell:2507 blocked for more than 120 seconds.
[27069.382510]       Tainted: G        W         5.4.0-rc7+ #1
[27069.388207] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27069.396221] gnome-shell     D    0  2507   2487 0x00000000
[27069.401869] Call Trace:
[27069.404404]  __schedule+0x2ab/0x860
[27069.408009]  ? dma_fence_wait_any_timeout+0x1a4/0x2b0
[27069.413198]  schedule+0x3a/0xc0
[27069.416432]  schedule_timeout+0x21d/0x3c0
[27069.420583]  ? trace_hardirqs_on+0x3b/0xf0
[27069.424815]  ? dma_fence_add_callback+0x6e/0xe0
[27069.429449]  ? dma_fence_wait_any_timeout+0x1a4/0x2b0
[27069.434640]  dma_fence_wait_any_timeout+0x205/0x2b0
[27069.439633]  ? dma_fence_wait_any_timeout+0x238/0x2b0
[27069.444944]  amdgpu_sa_bo_new+0x4d7/0x5c0 [amdgpu]
[27069.449949]  amdgpu_ib_get+0x36/0xa0 [amdgpu]
[27069.454534]  amdgpu_job_alloc_with_ib+0x4d/0x70 [amdgpu]
[27069.460057]  amdgpu_vm_sdma_prepare+0x28/0x60 [amdgpu]
[27069.465370]  amdgpu_vm_bo_update_mapping+0xd7/0x1f0 [amdgpu]
[27069.471171]  ? mark_held_locks+0x4d/0x80
[27069.475281]  amdgpu_vm_bo_update+0x3b7/0x960 [amdgpu]
[27069.480538]  amdgpu_gem_va_ioctl+0x4f3/0x510 [amdgpu]
[27069.485838]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[27069.491380]  drm_ioctl_kernel+0xb0/0x100 [drm]
[27069.496045]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[27069.501569]  ? drm_ioctl_kernel+0xb0/0x100 [drm]
[27069.506353]  drm_ioctl+0x389/0x450 [drm]
[27069.510458]  ? amdgpu_gem_va_map_flags+0x70/0x70 [amdgpu]
[27069.516000]  ? trace_hardirqs_on+0x3b/0xf0
[27069.520305]  amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
[27069.525048]  do_vfs_ioctl+0xa9/0x6f0
[27069.528753]  ? tomoyo_file_ioctl+0x19/0x20
[27069.532972]  ksys_ioctl+0x75/0x80
[27069.536396]  ? do_syscall_64+0x17/0x230
[27069.540357]  __x64_sys_ioctl+0x1a/0x20
[27069.544239]  do_syscall_64+0x5f/0x230


> On Mar 25, 2020, at 19:13, Koenig, Christian <Christian.Koenig@amd.com> wrote:
> 
> Hi guys,
> 
> thanks for pointing this out Nirmoy.
> 
> Yeah, could be that I forgot to commit the patch. Currently I don't know at which end of the chaos I should start to clean up.
> 
> Christian.
> 
> On 25.03.2020 12:09, "Das, Nirmoy" <Nirmoy.Das@amd.com> wrote:
> Hi Xinhui,
> 
> 
> Can you please check if you can reproduce the crash with 
> https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html
> 
> Christian fixed it earlier; I think he just forgot to push it.
> 
> 
> Regards,
> 
> Nirmoy
> 
> On 3/25/20 12:07 PM, xinhui pan wrote:
> > GPU recovery calls SDMA suspend/resume. During this period the ring is
> > disabled, so vm_pte_scheds (sdma.instance[X].ring.sched)->ready will
> > be false.
> >
> > If we submit any jobs during this ring-disabled period, we fail to pick
> > an rq for the vm entity and entity->rq is set to NULL.
> > amdgpu_vm_sdma_commit did not check entity->rq, so fix it. Otherwise we
> > hit a panic.
> >
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: Alex Deucher <alexander.deucher@amd.com>
> > Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> > Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > index cf96c335b258..d30d103e48a2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> > @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
> >        int r;
> >   
> >        entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> > +     if (!entity->rq)
> > +             return -ENOENT;
> >        ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
> >   
> >        WARN_ON(ib->length_dw == 0);
> 
> 
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Thread overview: 11+ messages
2020-03-25 11:07 xinhui pan
2020-03-25 11:14 ` Nirmoy
2020-03-25 11:13   ` Koenig, Christian
2020-03-25 11:34     ` Pan, Xinhui
2020-03-25 11:37     ` Pan, Xinhui [this message]
  -- strict thread matches above, loose matches on Subject: below --
2020-03-25  5:47 xinhui pan
2020-03-25  7:48 ` Christian König
2020-03-25  9:23   ` Pan, Xinhui
2020-03-25 10:54     ` Pan, Xinhui
2020-03-25 11:03     ` Nirmoy
2020-03-30 11:11       ` Christian König
