dri-devel.lists.freedesktop.org archive mirror
From: Daniel Vetter <daniel@ffwll.ch>
To: "Marek Olšák" <maraeo@gmail.com>
Cc: "Rob Clark" <robdclark@chromium.org>,
	"amd-gfx list" <amd-gfx@lists.freedesktop.org>,
	"Sharma, Shashank" <Shashank.Sharma@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Olsak, Marek" <Marek.Olsak@amd.com>,
	"Somalapuram, Amaranath" <Amaranath.Somalapuram@amd.com>,
	"Abhinav Kumar" <quic_abhinavk@quicinc.com>,
	dri-devel <dri-devel@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Shashank Sharma" <contactshashanksharma@gmail.com>,
	"Christian König" <christian.koenig@amd.com>
Subject: Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
Date: Wed, 30 Mar 2022 11:49:39 +0200
Message-ID: <YkQnsw9Js1T41qW/@phenom.ffwll.local>
In-Reply-To: <CAAxE2A642QK0NFRLYsq5PuossG_mLExiJD8SzipVc4xVp_V=tw@mail.gmail.com>

On Tue, Mar 29, 2022 at 12:25:55PM -0400, Marek Olšák wrote:
> I don't know what iris does, but I would guess that the same problems as
> with AMD GPUs apply, making GPU resets very fragile.

iris_batch_check_for_reset -> replace_kernel_ctx -> iris_lost_context_state

is, I think, the main call chain through which a reset is detected and
handled. There's also a side chain which handles -EIO from execbuf.

Also, iris uses non-recoverable contexts: any time a context is hit by a
GPU reset (either because it is guilty itself, or as collateral damage of
a reset that shot more than just the guilty context), it stops entirely
and refuses any further execbuf with -EIO.
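
As a rough illustration of the behaviour described above, here is a toy
Python model. It is not the iris or i915 code. The three method names
check_for_reset, replace_kernel_ctx and lost_context_state mirror the call
chain named above; every other name (ToyKernel, ToyKernelContext, ToyBatch,
gpu_reset, submit, the banned and state_lost flags) is hypothetical. Dropping
the failed batch instead of replaying it is an assumption of the sketch, not
a statement about what iris does.

```python
import errno


class ToyKernelContext:
    """Hypothetical stand-in for a non-recoverable kernel GPU context."""

    def __init__(self, ctx_id):
        self.ctx_id = ctx_id
        self.banned = False

    def execbuf(self, batch):
        # Once hit by a reset, the context refuses all further work with -EIO.
        if self.banned:
            return -errno.EIO
        return 0


class ToyKernel:
    """Hypothetical kernel side: hands out contexts and performs resets."""

    def __init__(self):
        self._next_id = 1

    def create_context(self):
        ctx = ToyKernelContext(self._next_id)
        self._next_id += 1
        return ctx

    def gpu_reset(self, affected):
        # Guilty and collateral contexts are treated the same way here.
        for ctx in affected:
            ctx.banned = True


class ToyBatch:
    """Hypothetical userspace side, shaped after the call chain above."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.ctx = kernel.create_context()
        self.state_lost = False

    def lost_context_state(self):
        # All GPU-side state must be re-emitted on the fresh context.
        self.state_lost = True

    def replace_kernel_ctx(self):
        self.ctx = self.kernel.create_context()
        self.lost_context_state()

    def check_for_reset(self):
        # Main chain: notice the dead context and swap in a new one.
        if not self.ctx.banned:
            return False
        self.replace_kernel_ctx()
        return True

    def submit(self, batch):
        # Side chain: -EIO from execbuf also triggers the replacement.
        ret = self.ctx.execbuf(batch)
        if ret == -errno.EIO:
            self.replace_kernel_ctx()
            return False  # assumption: this batch is dropped, not replayed
        return ret == 0


kernel = ToyKernel()
batch = ToyBatch(kernel)
bystander = kernel.create_context()
kernel.gpu_reset([batch.ctx, bystander])
print(batch.submit("draw"))                      # False: old context refused it
print(batch.submit("draw"))                      # True: new context accepts work
print(bystander.execbuf("draw") == -errno.EIO)   # True: collateral stays dead
```

The point of the sketch is only that recovery lives entirely in userspace:
the kernel never replays anything, it just keeps returning -EIO until
userspace creates a new context.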

Cheers, Daniel

> 
> Marek
> 
> On Tue., Mar. 29, 2022, 08:14 Christian König, <christian.koenig@amd.com>
> wrote:
> 
> > My main question is what does the iris driver do better than radeonsi
> > when the client doesn't support the robustness extension?
> >
> > From Daniel's description it sounds like they have at least a partial
> > recovery mechanism in place.
> >
> > Apart from that I completely agree with what you said below.
> >
> > Christian.
> >
> > On 26.03.22 at 01:53, Olsak, Marek wrote:
> >
> > [AMD Official Use Only]
> >
> > amdgpu has 2 resets: soft reset and hard reset.
> >
> > The soft reset is able to recover from an infinite loop and even some GPU
> > hangs due to bad shaders or bad states. The soft reset uses a signal that
> > kills all currently-running shaders of a certain process (VM context),
> > which unblocks the graphics pipeline, so draws and command buffers finish,
> > but not correctly. This can then cause a hard hang if the shader was
> > supposed to signal work completion through a shader store instruction and a
> > non-shader consumer is waiting for it (skipping the store instruction by
> > killing the shader won't signal the work, and thus the consumer will be
> > stuck, requiring a hard reset).
> >
> > The hard reset can recover from other hangs, which is great, but it may
> > use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory
> > contents, but we should assume that any process that had running jobs on
> > the GPU during a GPU reset has its memory resources in an inconsistent
> > state, and thus following command buffers can cause another GPU hang. The
> > shader store example above is enough to cause another hard hang due to
> > incorrect content in memory resources, which can contain synchronization
> > primitives that are used internally by the hardware.
> >
> > Asking the driver to replay a command buffer that caused a hang is a sure
> > way to hang it again. Unrelated processes can be affected due to lost VRAM
> > or the misfortune of using the GPU while the GPU hang occurred. The window
> > system should recreate GPU resources and redraw everything without
> > affecting applications. If apps use GL, they should do the same. Processes
> > that can't recover by redrawing content can be terminated or left alone,
> > but they shouldn't be allowed to submit work to the GPU anymore.
> >
> > dEQP only exercises the soft reset. I think WebGL is only able to trigger
> > a soft reset at this point, but Vulkan can also trigger a hard reset.
> >
> > Marek
> > ------------------------------
> > *From:* Koenig, Christian <Christian.Koenig@amd.com>
> > *Sent:* March 23, 2022 11:25
> > *To:* Daniel Vetter <daniel@ffwll.ch>; Daniel Stone <daniel@fooishbar.org>;
> > Olsak, Marek <Marek.Olsak@amd.com>; Grodzovsky, Andrey
> > <Andrey.Grodzovsky@amd.com>
> > *Cc:* Rob Clark <robdclark@gmail.com>; Rob Clark <robdclark@chromium.org>;
> > Sharma, Shashank <Shashank.Sharma@amd.com>; Christian König
> > <ckoenig.leichtzumerken@gmail.com>; Somalapuram, Amaranath
> > <Amaranath.Somalapuram@amd.com>; Abhinav Kumar <quic_abhinavk@quicinc.com>;
> > dri-devel <dri-devel@lists.freedesktop.org>; amd-gfx list
> > <amd-gfx@lists.freedesktop.org>; Deucher, Alexander
> > <Alexander.Deucher@amd.com>; Shashank Sharma
> > <contactshashanksharma@gmail.com>
> > *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
> >
> > [Adding Marek and Andrey as well]
> >
> > On 23.03.22 at 16:14, Daniel Vetter wrote:
> > > On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
> > >> Hi,
> > >>
> > >> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
> > >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> > >>> <christian.koenig@amd.com> wrote:
> > >>>> Well you can, it just means that their contexts are lost as well.
> > >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> > >>> take down your compositor ;-)
> > >> Yeah. Or anything WebGL.
> > >>
> > >> System-wide collateral damage is definitely a non-starter. If that
> > >> means that the userspace driver has to do what iris does and ensure
> > >> everything's recreated and resubmitted, that works too, just as long
> > >> as the response to 'my adblocker didn't detect a crypto miner ad'  is
> > >> something better than 'shoot the entire user session'.
> > > Not sure where that idea came from, I thought at least I made it clear
> > > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > > which should die without recovery attempt.
> > >
> > > The entire discussion here is who should be responsible for replay and
> > > at least if you can decide the uapi, then punting that entirely to
> > > userspace is a good approach.
> >
> > Yes, completely agree. We have the approach of re-submitting things in
> > the kernel and that failed quite miserably.
> >
> > In other words, currently a GPU reset has something like a 99% chance of
> > taking down your whole desktop.
> >
> > Daniel, can you briefly explain what exactly iris does when a lost
> > context is detected without gl robustness?
> >
> > It sounds like you guys got that working quite well.
> >
> > Thanks,
> > Christian.
> >
> > >
> > > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > > not currently on the gpu, or on different engines and all that
> > > shouldn't be nuked, if possible.
> > >
> > > Also ofc since msm uapi is that the kernel tries to recover there's
> > > not much we can do there, contexts cannot be shot. But still trying to
> > > replay them as much as possible feels a bit like overkill.
> > > -Daniel
> > >
> > >> Cheers,
> > >> Daniel
> > >
> > >
> >
> >
> >

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
