All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Marek Olšák" <maraeo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
To: "Christian König" <christian.koenig-5C7GfCeVMHo@public.gmane.org>
Cc: Alex Deucher
	<alexdeucher-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	amd-gfx mailing list
	<amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org>
Subject: Re: [PATCH 2/5] drm/amdgpu: add ring soft recovery v2
Date: Thu, 23 Aug 2018 16:21:44 -0400	[thread overview]
Message-ID: <CAAxE2A4TKbg-Z07u9nsr=k3ujzA7p8m-2NEiAFmHP8Lab_Zy4g@mail.gmail.com> (raw)
In-Reply-To: <b4b062be-2bc8-035b-e5db-2d17de698826-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Thu, Aug 23, 2018 at 2:51 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Am 22.08.2018 um 21:32 schrieb Marek Olšák:
> > On Wed, Aug 22, 2018 at 12:56 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> >> On Wed, Aug 22, 2018 at 6:05 AM Christian König
> >> <ckoenig.leichtzumerken@gmail.com> wrote:
> >>> Instead of hammering hard on the GPU try a soft recovery first.
> >>>
> >>> v2: reorder code a bit
> >>>
> >>> Signed-off-by: Christian König <christian.koenig@amd.com>
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  6 ++++++
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 ++++++++++++++++++++++++
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  4 ++++
> >>>   3 files changed, 34 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> index 265ff90f4e01..d93e31a5c4e7 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> @@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
> >>>          struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> >>>          struct amdgpu_job *job = to_amdgpu_job(s_job);
> >>>
> >>> +       if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
> >>> +               DRM_ERROR("ring %s timeout, but soft recovered\n",
> >>> +                         s_job->sched->name);
> >>> +               return;
> >>> +       }
> >> I think we should still bubble up the error to userspace even if we
> >> can recover.  Data is lost when the wave is killed.  We should treat
> >> it like a GPU reset.
> > Yes, please increment gpu_reset_counter, so that we are compliant with
> > OpenGL. Being able to recover from infinite loops is great, but test
> > suites also expect this to be properly reported to userspace via the
> > per-context query.
>
> Sure that shouldn't be a problem.
>
> > Also please bump the deadline to 1 second. Even you if you kill all
> > shaders, the IB can also contain CP DMA, which may take longer than 1
> > ms.
>
> Is there any way we can get a feedback from the SQ if the kill was
> successfully?

I don't think so. The kill should be finished pretty quickly, but more
waves with infinite loops may be waiting to be launched, so you still
need to repeat the kill command. And we should ideally repeat it for 1
second.

The reason is that vertex shader waves take a lot of time to launch. A
very very very large draw call can keep launching new waves for 1
second with the same infinite loop. You would have to soft-reset all
VGTs to stop that.

>
> 1 second is way to long, since in the case of a blocked MC we need to
> start up hard reset relative fast.

10 seconds have already passed.

I think that some hangs from corrupted descriptors may still be
recoverable just by killing waves.

Marek

>
> Regards,
> Christian.
>
> >
> > Marek
> >
> > Marek
> >
> >> Alex
> >>
> >>> +
> >>>          DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
> >>>                    job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
> >>>                    ring->fence_drv.sync_seq);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> index 5dfd26be1eec..c045a4e38ad1 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> @@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
> >>>          amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
> >>>   }
> >>>
> >>> +/**
> >>> + * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
> >>> + *
> >>> + * @ring: ring to try the recovery on
> >>> + * @vmid: VMID we try to get going again
> >>> + * @fence: timedout fence
> >>> + *
> >>> + * Tries to get a ring proceeding again when it is stuck.
> >>> + */
> >>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> >>> +                              struct dma_fence *fence)
> >>> +{
> >>> +       ktime_t deadline = ktime_add_us(ktime_get(), 1000);
> >>> +
> >>> +       if (!ring->funcs->soft_recovery)
> >>> +               return false;
> >>> +
> >>> +       while (!dma_fence_is_signaled(fence) &&
> >>> +              ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
> >>> +               ring->funcs->soft_recovery(ring, vmid);
> >>> +
> >>> +       return dma_fence_is_signaled(fence);
> >>> +}
> >>> +
> >>>   /*
> >>>    * Debugfs info
> >>>    */
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> index 409fdd9b9710..9cc239968e40 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> @@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
> >>>          /* priority functions */
> >>>          void (*set_priority) (struct amdgpu_ring *ring,
> >>>                                enum drm_sched_priority priority);
> >>> +       /* Try to soft recover the ring to make the fence signal */
> >>> +       void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid);
> >>>   };
> >>>
> >>>   struct amdgpu_ring {
> >>> @@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
> >>>   void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
> >>>                                                  uint32_t reg0, uint32_t val0,
> >>>                                                  uint32_t reg1, uint32_t val1);
> >>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> >>> +                              struct dma_fence *fence);
> >>>
> >>>   static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
> >>>   {
> >>> --
> >>> 2.14.1
> >>>
> >>> _______________________________________________
> >>> amd-gfx mailing list
> >>> amd-gfx@lists.freedesktop.org
> >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> >> _______________________________________________
> >> amd-gfx mailing list
> >> amd-gfx@lists.freedesktop.org
> >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

  parent reply	other threads:[~2018-08-23 20:21 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-22 10:05 [PATCH 1/5] drm/amdgpu: cleanup GPU recovery check a bit Christian König
     [not found] ` <20180822100531.52863-1-christian.koenig-5C7GfCeVMHo@public.gmane.org>
2018-08-22 10:05   ` [PATCH 2/5] drm/amdgpu: add ring soft recovery v2 Christian König
     [not found]     ` <20180822100531.52863-2-christian.koenig-5C7GfCeVMHo@public.gmane.org>
2018-08-22 16:55       ` Alex Deucher
     [not found]         ` <CADnq5_PU5TT4Q9oE+egAd2S4UgDENLu-Las_fnUZrNJUS=rmCw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-08-22 19:32           ` Marek Olšák
     [not found]             ` <CAAxE2A5L5UjsWVMSBGVe82GyVS3LAPtVh-5QBb5Ope5ZVEHKig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-08-23  6:51               ` Christian König
     [not found]                 ` <b4b062be-2bc8-035b-e5db-2d17de698826-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-08-23 20:21                   ` Marek Olšák [this message]
     [not found]                     ` <CAAxE2A4TKbg-Z07u9nsr=k3ujzA7p8m-2NEiAFmHP8Lab_Zy4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-08-24  7:28                       ` Christian König
2018-08-23  7:17           ` Huang Rui
2018-08-23  9:21             ` Christian König
     [not found]               ` <11e0af50-f589-b2e1-3431-8125fe2014ad-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-08-23 13:03                 ` Huang Rui
2018-08-22 10:05   ` [PATCH 3/5] drm/amdgpu: implement soft_recovery for GFX7 Christian König
2018-08-22 10:05   ` [PATCH 4/5] drm/amdgpu: implement soft_recovery for GFX8 v2 Christian König
2018-08-22 10:05   ` [PATCH 5/5] drm/amdgpu: implement soft_recovery for GFX9 Christian König
2018-08-22 14:56   ` [PATCH 1/5] drm/amdgpu: cleanup GPU recovery check a bit Andrey Grodzovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAAxE2A4TKbg-Z07u9nsr=k3ujzA7p8m-2NEiAFmHP8Lab_Zy4g@mail.gmail.com' \
    --to=maraeo-re5jqeeqqe8avxtiumwx3w@public.gmane.org \
    --cc=alexdeucher-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
    --cc=amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org \
    --cc=christian.koenig-5C7GfCeVMHo@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.