From: "Christian König" <christian.koenig@amd.com>
To: "Olsak, Marek" <Marek.Olsak@amd.com>,
Daniel Vetter <daniel@ffwll.ch>,
Daniel Stone <daniel@fooishbar.org>,
"Grodzovsky, Andrey" <Andrey.Grodzovsky@amd.com>
Cc: "Rob Clark" <robdclark@chromium.org>,
"Sharma, Shashank" <Shashank.Sharma@amd.com>,
"Christian König" <ckoenig.leichtzumerken@gmail.com>,
"Somalapuram, Amaranath" <Amaranath.Somalapuram@amd.com>,
"Abhinav Kumar" <quic_abhinavk@quicinc.com>,
dri-devel <dri-devel@lists.freedesktop.org>,
"amd-gfx list" <amd-gfx@lists.freedesktop.org>,
"Deucher, Alexander" <Alexander.Deucher@amd.com>,
"Shashank Sharma" <contactshashanksharma@gmail.com>
Subject: Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
Date: Tue, 29 Mar 2022 14:14:18 +0200
Message-ID: <5818c2a4-80c4-8af2-9937-d2787054c149@amd.com>
In-Reply-To: <DM6PR12MB473154C6C678EA97C03979A4F91B9@DM6PR12MB4731.namprd12.prod.outlook.com>
My main question is: what does the iris driver do better than radeonsi when
the client doesn't support the robustness extension?
From Daniel's description it sounds like they have at least a partial
recovery mechanism in place.
Apart from that I completely agree with what you said below.
Christian.
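
To make the hard-reset consequences Marek describes in the quoted text easier
to follow, here is a minimal sketch. It assumes a toy model only: GpuProcess
and must_recreate_resources() are hypothetical names invented for
illustration and do not correspond to anything in amdgpu or DRM. It encodes
just two statements from the mail: a PCI reset erases VRAM on dGPUs, and on
APUs any process that had jobs running during the reset should be treated as
having inconsistent memory resources.

```python
from dataclasses import dataclass


@dataclass
class GpuProcess:
    # Hypothetical model of a GPU client; not a kernel or driver structure.
    name: str
    had_running_jobs: bool  # had work executing when the reset happened


def must_recreate_resources(proc: GpuProcess, is_dgpu: bool, pci_reset: bool) -> bool:
    """Return True if the process should treat its GPU memory as unusable."""
    # A PCI reset erases VRAM on dGPUs, so every process is affected,
    # including unrelated ones that were idle at the time.
    vram_lost = is_dgpu and pci_reset
    # On APUs memory contents survive, but a process with jobs in flight
    # has to be assumed inconsistent.
    return vram_lost or proc.had_running_jobs


if __name__ == "__main__":
    idle = GpuProcess("idle-client", had_running_jobs=False)
    busy = GpuProcess("busy-client", had_running_jobs=True)
    for is_dgpu in (True, False):
        for proc in (idle, busy):
            print(proc.name, "dGPU" if is_dgpu else "APU",
                  must_recreate_resources(proc, is_dgpu, pci_reset=True))
```

The sketch deliberately has no "replay" path: per the quoted mail, replaying
the command buffer that caused the hang is a sure way to hang again.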
On 26.03.22 at 01:53, Olsak, Marek wrote:
>
> amdgpu has 2 resets: soft reset and hard reset.
>
> The soft reset is able to recover from an infinite loop and even some
> GPU hangs due to bad shaders or bad states. The soft reset uses a
> signal that kills all currently-running shaders of a certain process
> (VM context), which unblocks the graphics pipeline, so draws and
> command buffers finish, but not correctly. This can then cause a
> hard hang if the shader was supposed to signal work completion through
> a shader store instruction and a non-shader consumer is waiting for it
> (skipping the store instruction by killing the shader won't signal the
> work, and thus the consumer will be stuck, requiring a hard reset).
>
> The hard reset can recover from other hangs, which is great, but it
> may use a PCI reset, which erases VRAM on dGPUs. APUs don't lose
> memory contents, but we should assume that any process that had
> running jobs on the GPU during a GPU reset has its memory resources in
> an inconsistent state, and thus following command buffers can cause
> another GPU hang. The shader store example above is enough to cause
> another hard hang due to incorrect content in memory resources, which
> can contain synchronization primitives that are used internally by the
> hardware.
>
> Asking the driver to replay a command buffer that caused a hang is a
> sure way to hang it again. Unrelated processes can be affected due to
> lost VRAM or the misfortune of using the GPU while the GPU hang
> occurred. The window system should recreate GPU resources and redraw
> everything without affecting applications. If apps use GL, they should
> do the same. Processes that can't recover by redrawing content can be
> terminated or left alone, but they shouldn't be allowed to submit work
> to the GPU anymore.
>
> dEQP only exercises the soft reset. I think WebGL is only able to
> trigger a soft reset at this point, but Vulkan can also trigger a hard
> reset.
>
> Marek
> ------------------------------------------------------------------------
> *From:* Koenig, Christian <Christian.Koenig@amd.com>
> *Sent:* March 23, 2022 11:25
> *To:* Daniel Vetter <daniel@ffwll.ch>; Daniel Stone
> <daniel@fooishbar.org>; Olsak, Marek <Marek.Olsak@amd.com>;
> Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> *Cc:* Rob Clark <robdclark@gmail.com>; Rob Clark
> <robdclark@chromium.org>; Sharma, Shashank <Shashank.Sharma@amd.com>;
> Christian König <ckoenig.leichtzumerken@gmail.com>; Somalapuram,
> Amaranath <Amaranath.Somalapuram@amd.com>; Abhinav Kumar
> <quic_abhinavk@quicinc.com>; dri-devel
> <dri-devel@lists.freedesktop.org>; amd-gfx list
> <amd-gfx@lists.freedesktop.org>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Shashank Sharma
> <contactshashanksharma@gmail.com>
> *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
> [Adding Marek and Andrey as well]
>
> On 23.03.22 at 16:14, Daniel Vetter wrote:
> > On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
> >> Hi,
> >>
> >> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
> >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> Well you can, it just means that their contexts are lost as well.
> >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> >>> take down your compositor ;-)
> >> Yeah. Or anything WebGL.
> >>
> >> System-wide collateral damage is definitely a non-starter. If that
> >> means that the userspace driver has to do what iris does and ensure
> >> everything's recreated and resubmitted, that works too, just as long
> >> as the response to 'my adblocker didn't detect a crypto miner ad' is
> >> something better than 'shoot the entire user session'.
> > Not sure where that idea came from, I thought at least I made it clear
> > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > which should die without recovery attempt.
> >
> > The entire discussion here is who should be responsible for replay and
> > at least if you can decide the uapi, then punting that entirely to
> > userspace is a good approach.
>
> Yes, completely agree. We have the approach of re-submitting things in
> the kernel and that failed quite miserably.
>
> In other words, currently a GPU reset has something like a 99% chance of
> taking down your whole desktop.
>
> Daniel can you briefly explain what exactly iris does when a lost
> context is detected without gl robustness?
>
> It sounds like you guys got that working quite well.
>
> Thanks,
> Christian.
>
> >
> > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > not currently on the gpu, or on different engines and all that
> > shouldn't be nuked, if possible.
> >
> > Also ofc since msm uapi is that the kernel tries to recover there's
> > not much we can do there, contexts cannot be shot. But still trying to
> > replay them as much as possible feels a bit like overkill.
> > -Daniel
> >
> >> Cheers,
> >> Daniel
> >
> >
>
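
The split of responsibilities Daniel Vetter describes in the quoted text
(legacy GL has to recover, while Vulkan and GL with robustness lose the
context without a recovery attempt) can be sketched as a small decision
function. This is only an illustration under stated assumptions: ClientApi,
LostContextAction and on_context_lost() are hypothetical names, and the
sketch says nothing about what iris or radeonsi actually implement, which is
the open question in this mail.

```python
from enum import Enum


class ClientApi(Enum):
    # Hypothetical classification of a GPU client by the API it uses.
    LEGACY_GL = "GL without the robustness extension"
    ROBUST_GL = "GL with the robustness extension"
    VULKAN = "Vulkan"


class LostContextAction(Enum):
    # Userspace driver recreates the context and resubmits, hidden from the app.
    RECREATE_AND_RESUBMIT = "recreate and resubmit in the userspace driver"
    # Context stays dead; the application is informed and decides what to do.
    REPORT_LOSS_TO_APP = "report the loss to the application"


def on_context_lost(api: ClientApi) -> LostContextAction:
    """Pick the reaction to a lost context, following the quoted policy."""
    if api is ClientApi.LEGACY_GL:
        # The application has no way to learn about the reset, so the
        # userspace driver is the only place that can recover.
        return LostContextAction.RECREATE_AND_RESUBMIT
    # Robust GL and Vulkan clients have an API-level way to observe the loss.
    return LostContextAction.REPORT_LOSS_TO_APP


if __name__ == "__main__":
    for api in ClientApi:
        print(f"{api.value}: {on_context_lost(api).value}")
```

Keeping the decision in userspace matches the point made in the thread that
punting replay entirely to userspace is a good approach when the uapi allows
it.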