From: "Marek Olšák" <maraeo@gmail.com>
To: "Christian König" <christian.koenig@amd.com>
Cc: "Rob Clark" <robdclark@chromium.org>,
	"amd-gfx list" <amd-gfx@lists.freedesktop.org>,
	"Sharma, Shashank" <Shashank.Sharma@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Olsak, Marek" <Marek.Olsak@amd.com>,
	"Somalapuram, Amaranath" <Amaranath.Somalapuram@amd.com>,
	"Abhinav Kumar" <quic_abhinavk@quicinc.com>,
	dri-devel <dri-devel@lists.freedesktop.org>,
	"Deucher, Alexander" <Alexander.Deucher@amd.com>,
	"Shashank Sharma" <contactshashanksharma@gmail.com>
Subject: Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
Date: Tue, 29 Mar 2022 12:25:55 -0400
Message-ID: <CAAxE2A642QK0NFRLYsq5PuossG_mLExiJD8SzipVc4xVp_V=tw@mail.gmail.com>
In-Reply-To: <5818c2a4-80c4-8af2-9937-d2787054c149@amd.com>

I don't know what iris does, but I would guess that the same problems as
with AMD GPUs apply, making GPU resets very fragile.

Marek

On Tue., Mar. 29, 2022, 08:14 Christian König, <christian.koenig@amd.com>
wrote:

> My main question is: what does the iris driver do better than radeonsi
> when the client doesn't support the robustness extension?
>
> From Daniel's description it sounds like they have at least a partial
> recovery mechanism in place.
>
> Apart from that I completely agree with what you said below.
>
> Christian.
>
> On 26.03.22 at 01:53, Olsak, Marek wrote:
>
> [AMD Official Use Only]
>
> amdgpu has 2 resets: soft reset and hard reset.
>
> The soft reset is able to recover from an infinite loop and even some GPU
> hangs due to bad shaders or bad states. The soft reset uses a signal that
> kills all currently-running shaders of a certain process (VM context),
> which unblocks the graphics pipeline, so draws and command buffers finish,
> but not correctly. This can then cause a hard hang if the shader was
> supposed to signal work completion through a shader store instruction and a
> non-shader consumer is waiting for it (skipping the store instruction by
> killing the shader won't signal the work, and thus the consumer will be
> stuck, requiring a hard reset).
>
> The hard reset can recover from other hangs, which is great, but it may
> use a PCI reset, which erases VRAM on dGPUs. APUs don't lose memory
> contents, but we should assume that any process that had running jobs on
> the GPU during a GPU reset has its memory resources in an inconsistent
> state, and thus following command buffers can cause another GPU hang. The
> shader store example above is enough to cause another hard hang due to
> incorrect content in memory resources, which can contain synchronization
> primitives that are used internally by the hardware.
>
> Asking the driver to replay a command buffer that caused a hang is a sure
> way to hang it again. Unrelated processes can be affected due to lost VRAM
> or the misfortune of using the GPU while the GPU hang occurred. The window
> system should recreate GPU resources and redraw everything without
> affecting applications. If apps use GL, they should do the same. Processes
> that can't recover by redrawing content can be terminated or left alone,
> but they shouldn't be allowed to submit work to the GPU anymore.
>
> dEQP only exercises the soft reset. I think WebGL is only able to trigger
> a soft reset at this point, but Vulkan can also trigger a hard reset.
>
> Marek
> ------------------------------
> *From:* Koenig, Christian <Christian.Koenig@amd.com>
> *Sent:* March 23, 2022 11:25
> *To:* Daniel Vetter <daniel@ffwll.ch>; Daniel Stone
> <daniel@fooishbar.org>; Olsak, Marek <Marek.Olsak@amd.com>; Grodzovsky,
> Andrey <Andrey.Grodzovsky@amd.com>
> *Cc:* Rob Clark <robdclark@gmail.com>; Rob Clark <robdclark@chromium.org>;
> Sharma, Shashank <Shashank.Sharma@amd.com>; Christian König
> <ckoenig.leichtzumerken@gmail.com>; Somalapuram, Amaranath
> <Amaranath.Somalapuram@amd.com>; Abhinav Kumar <quic_abhinavk@quicinc.com>;
> dri-devel <dri-devel@lists.freedesktop.org>; amd-gfx list
> <amd-gfx@lists.freedesktop.org>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; Shashank Sharma
> <contactshashanksharma@gmail.com>
> *Subject:* Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
>
> [Adding Marek and Andrey as well]
>
> On 23.03.22 at 16:14, Daniel Vetter wrote:
> > On Wed, 23 Mar 2022 at 15:07, Daniel Stone <daniel@fooishbar.org> wrote:
> >> Hi,
> >>
> >> On Mon, 21 Mar 2022 at 16:02, Rob Clark <robdclark@gmail.com> wrote:
> >>> On Mon, Mar 21, 2022 at 2:30 AM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>> Well you can, it just means that their contexts are lost as well.
> >>> Which is rather inconvenient when deqp-egl reset tests, for example,
> >>> take down your compositor ;-)
> >> Yeah. Or anything WebGL.
> >>
> >> System-wide collateral damage is definitely a non-starter. If that
> >> means that the userspace driver has to do what iris does and ensure
> >> everything's recreated and resubmitted, that works too, just as long
> >> as the response to 'my adblocker didn't detect a crypto miner ad' is
> >> something better than 'shoot the entire user session'.
> > Not sure where that idea came from, I thought at least I made it clear
> > that legacy gl _has_ to recover. It's only vk and arb_robustness gl
> > which should die without recovery attempt.
> >
> > The entire discussion here is who should be responsible for replay and
> > at least if you can decide the uapi, then punting that entirely to
> > userspace is a good approach.
>
> Yes, completely agree. We have the approach of re-submitting things in
> the kernel, and that has failed quite miserably.
>
> In other words, a GPU reset currently has something like a 99% chance of
> taking down your whole desktop.
>
> Daniel, can you briefly explain what exactly iris does when a lost
> context is detected without gl robustness?
>
> It sounds like you guys got that working quite well.
>
> Thanks,
> Christian.
>
> >
> > Ofc it'd be nice if the collateral damage is limited, i.e. requests
> > not currently on the gpu, or on different engines and all that
> > shouldn't be nuked, if possible.
> >
> > Also ofc since msm uapi is that the kernel tries to recover there's
> > not much we can do there, contexts cannot be shot. But still trying to
> > replay them as much as possible feels a bit like overkill.
> > -Daniel
> >
> >> Cheers,
> >> Daniel
> >
> >
>
>
>
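
The recovery policy described in the quoted mail from 26.03.22 can be
sketched as a small model. This is only an illustration under stated
assumptions, not driver code: the names Process, Action,
needs_hard_reset and action_after_reset are hypothetical and do not
correspond to any amdgpu, radeonsi or DRM interface. The sketch encodes
two points from the mail: a soft reset leaves the pipeline stuck when a
killed shader was supposed to signal completion through a store that a
non-shader consumer waits on, and after a reset an affected process
either recreates its resources and redraws or is kept from submitting
further work.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    UNAFFECTED = auto()            # nothing to do for this process
    RECREATE_AND_REDRAW = auto()   # rebuild GPU resources, redraw everything
    BLOCK_SUBMISSIONS = auto()     # no further GPU work is accepted


@dataclass
class Process:
    # Hypothetical description of a GPU client, for illustration only.
    name: str
    had_running_jobs: bool   # had work on the GPU when the reset happened
    uses_vram: bool          # owns resources placed in VRAM
    can_redraw: bool         # e.g. a window system or GL app able to rebuild state


def needs_hard_reset(signals_via_shader_store: bool,
                     non_shader_consumer_waiting: bool) -> bool:
    """After a soft reset killed the shader, is the pipeline still stuck?

    The killed shader never executes its store, so a non-shader consumer
    waiting on that store never gets unblocked.
    """
    return signals_via_shader_store and non_shader_consumer_waiting


def action_after_reset(proc: Process, vram_lost: bool) -> Action:
    """Pick the reaction for one process after a GPU reset.

    vram_lost models a hard reset that used a PCI reset on a dGPU.
    """
    affected = proc.had_running_jobs or (vram_lost and proc.uses_vram)
    if not affected:
        return Action.UNAFFECTED
    if proc.can_redraw:
        return Action.RECREATE_AND_REDRAW
    # Replaying the old command buffers is deliberately not an option here.
    return Action.BLOCK_SUBMISSIONS
```

Note that the model has no "replay" action at all, which mirrors the
point that resubmitting a command buffer that caused a hang is likely to
hang the GPU again.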
