Re: [PATCH 2/5] drm/amdgpu: add ring soft recovery v2

From: "Marek Olšák" <maraeo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: "Christian König" <christian.koenig-5C7GfCeVMHo@public.gmane.org>
Cc: Alex Deucher
	<alexdeucher-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	amd-gfx mailing list
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Subject: Re: [PATCH 2/5] drm/amdgpu: add ring soft recovery v2
Date: Thu, 23 Aug 2018 16:21:44 -0400	[thread overview]
Message-ID: <CAAxE2A4TKbg-Z07u9nsr=k3ujzA7p8m-2NEiAFmHP8Lab_Zy4g@mail.gmail.com> (raw)
In-Reply-To: <b4b062be-2bc8-035b-e5db-2d17de698826-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Thu, Aug 23, 2018 at 2:51 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 22.08.2018 um 21:32 schrieb Marek Olšák:
> > On Wed, Aug 22, 2018 at 12:56 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> >> On Wed, Aug 22, 2018 at 6:05 AM Christian König
> >> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>> Instead of hammering hard on the GPU try a soft recovery first.
> >>>
> >>> v2: reorder code a bit
> >>>
> >>> Signed-off-by: Christian König <christian.koenig@amd.com>
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  6 ++++++
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 ++++++++++++++++++++++++
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  4 ++++
> >>>   3 files changed, 34 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> index 265ff90f4e01..d93e31a5c4e7 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> @@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>>          struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> >>>          struct amdgpu_job *job = to_amdgpu_job(s_job);
> >>>
> >>> +       if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
> >>> +               DRM_ERROR("ring %s timeout, but soft recovered\n",
> >>> +                         s_job->sched->name);
> >>> +               return;
> >>> +       }
> >> I think we should still bubble up the error to userspace even if we
> >> can recover.  Data is lost when the wave is killed.  We should treat
> >> it like a GPU reset.
> > Yes, please increment gpu_reset_counter, so that we are compliant with
> > OpenGL. Being able to recover from infinite loops is great, but test
> > suites also expect this to be properly reported to userspace via the
> > per-context query.
>
> Sure that shouldn't be a problem.
>
> > Also please bump the deadline to 1 second. Even you if you kill all
> > shaders, the IB can also contain CP DMA, which may take longer than 1
> > ms.
>
> Is there any way we can get a feedback from the SQ if the kill was
> successfully?

I don't think so. The kill should be finished pretty quickly, but more
waves with infinite loops may be waiting to be launched, so you still
need to repeat the kill command. And we should ideally repeat it for 1
second.

The reason is that vertex shader waves take a lot of time to launch. A
very very very large draw call can keep launching new waves for 1
second with the same infinite loop. You would have to soft-reset all
VGTs to stop that.

>
> 1 second is way to long, since in the case of a blocked MC we need to
> start up hard reset relative fast.

10 seconds have already passed.

I think that some hangs from corrupted descriptors may still be
recoverable just by killing waves.

Marek

>
> Regards,
> Christian.
>
> >
> > Marek
> >
> > Marek
> >
> >> Alex
> >>
> >>> +
> >>>          DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
> >>>                    job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
> >>>                    ring->fence_drv.sync_seq);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> index 5dfd26be1eec..c045a4e38ad1 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> @@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
> >>>          amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
> >>>   }
> >>>
> >>> +/**
> >>> + * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
> >>> + *
> >>> + * @ring: ring to try the recovery on
> >>> + * @vmid: VMID we try to get going again
> >>> + * @fence: timedout fence
> >>> + *
> >>> + * Tries to get a ring proceeding again when it is stuck.
> >>> + */
> >>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> >>> +                              struct dma_fence *fence)
> >>> +{
> >>> +       ktime_t deadline = ktime_add_us(ktime_get(), 1000);
> >>> +
> >>> +       if (!ring->funcs->soft_recovery)
> >>> +               return false;
> >>> +
> >>> +       while (!dma_fence_is_signaled(fence) &&
> >>> +              ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
> >>> +               ring->funcs->soft_recovery(ring, vmid);
> >>> +
> >>> +       return dma_fence_is_signaled(fence);
> >>> +}
> >>> +
> >>>   /*
> >>>    * Debugfs info
> >>>    */
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> index 409fdd9b9710..9cc239968e40 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> @@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
> >>>          /* priority functions */
> >>>          void (*set_priority) (struct amdgpu_ring *ring,
> >>>                                enum drm_sched_priority priority);
> >>> +       /* Try to soft recover the ring to make the fence signal */
> >>> +       void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid);
> >>>   };
> >>>
> >>>   struct amdgpu_ring {
> >>> @@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
> >>>   void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
> >>>                                                  uint32_t reg0, uint32_t val0,
> >>>                                                  uint32_t reg1, uint32_t val1);
> >>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> >>> +                              struct dma_fence *fence);
> >>>
> >>>   static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
> >>>   {
> >>> --
> >>> 2.14.1
> >>>
> >>> _______________________________________________
> >>> amd-gfx mailing list
> >>> amd-gfx@lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx