Hi guys,

thanks for pointing this out Nirmoy.

Yeah, could be that I forgot to commit the patch. Currently I don't know at which end of the chaos I should start to clean up.

Christian.

On 25.03.2020 12:09, "Das, Nirmoy" wrote:

Hi Xinhui,

Can you please check if you can reproduce the crash with
https://lists.freedesktop.org/archives/amd-gfx/2020-February/046414.html

Christian fixed it earlier; I think he forgot to push it.

Regards,
Nirmoy

On 3/25/20 12:07 PM, xinhui pan wrote:
> gpu recover will call sdma suspend/resume. In this period, ring will be
> disabled. So the vm_pte_scheds(sdma.instance[X].ring.sched)->ready will
> be false.
>
> If we submit any jobs in this ring-disabled period. We fail to pick up
> a rq for vm entity and entity->rq will set to NULL.
> amdgpu_vm_sdma_commit did not check the entity->rq, so fix it. Otherwise
> hit panic.
>
> Cc: Christian König
> Cc: Alex Deucher
> Cc: Felix Kuehling
> Signed-off-by: xinhui pan
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> index cf96c335b258..d30d103e48a2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c
> @@ -95,6 +95,8 @@ static int amdgpu_vm_sdma_commit(struct amdgpu_vm_update_params *p,
>  	int r;
>
>  	entity = p->direct ? &p->vm->direct : &p->vm->delayed;
> +	if (!entity->rq)
> +		return -ENOENT;
>  	ring = container_of(entity->rq->sched, struct amdgpu_ring, sched);
>
>  	WARN_ON(ib->length_dw == 0);
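
For anyone following the thread, the guard added by the patch can be illustrated outside the kernel. What follows is only a minimal userspace sketch, not the driver code: the sketch_* types and sketch_entity_to_ring() are hypothetical stand-ins for the scheduler structures, and container_of() is re-implemented locally. It just shows why entity->rq has to be checked before entity->rq->sched is dereferenced.

```c
#include <errno.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, simplified stand-ins for the scheduler structures. */
struct sketch_sched { int ready; };
struct sketch_rq { struct sketch_sched *sched; };
struct sketch_entity { struct sketch_rq *rq; };
struct sketch_ring {
	int idx;
	struct sketch_sched sched;	/* embedded member used by container_of() */
};

/*
 * Resolve the ring behind an entity. Returns 0 and stores the ring in
 * *out, or -ENOENT when no run queue was picked for the entity.
 */
static int sketch_entity_to_ring(struct sketch_entity *entity,
				 struct sketch_ring **out)
{
	/* Without this check, entity->rq->sched dereferences NULL. */
	if (!entity->rq)
		return -ENOENT;

	*out = container_of(entity->rq->sched, struct sketch_ring, sched);
	return 0;
}
```

With rq set to NULL the helper returns -ENOENT instead of faulting; with a valid rq it recovers the enclosing ring from the embedded sched member. Whether -ENOENT is the right error for callers in the real driver is a separate question that this sketch does not answer.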