From: Alex Deucher <alexdeucher@gmail.com>
To: Pekka Paalanen <ppaalanen@gmail.com>
Cc: "Rob Clark" <robdclark@chromium.org>,
	"Sharma, Shashank" <shashank.sharma@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Amaranath Somalapuram" <amaranath.somalapuram@amd.com>,
	"Abhinav Kumar" <quic_abhinavk@quicinc.com>,
	dri-devel <dri-devel@lists.freedesktop.org>,
	"amd-gfx list" <amd-gfx@lists.freedesktop.org>,
	"Alexandar Deucher" <alexander.deucher@amd.com>,
	"Shashank Sharma" <contactshashanksharma@gmail.com>,
	"Christian Koenig" <christian.koenig@amd.com>
Subject: Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
Date: Tue, 15 Mar 2022 10:54:38 -0400
Message-ID: <CADnq5_NsxipfFFXfRSXvVQin3e1gj0Q_p9p-shi3VZ2pSCwwfw@mail.gmail.com>
In-Reply-To: <20220314172647.223658d2@eldfell>

On Mon, Mar 14, 2022 at 11:26 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
>
> On Mon, 14 Mar 2022 10:23:27 -0400
> Alex Deucher <alexdeucher@gmail.com> wrote:
>
> > On Fri, Mar 11, 2022 at 3:30 AM Pekka Paalanen <ppaalanen@gmail.com> wrote:
> > >
> > > On Thu, 10 Mar 2022 11:56:41 -0800
> > > Rob Clark <robdclark@gmail.com> wrote:
> > >
> > > > For something like just notifying a compositor that a gpu crash
> > > > happened, perhaps drm_event is more suitable.  See
> > > > virtio_gpu_fence_event_create() for an example of adding new event
> > > > types.  Although maybe you want it to be an event which is not device
> > > > specific.  This isn't so much of a debugging use-case as simply
> > > > notification.
> > >
> > > Hi,
> > >
> > > for this particular use case, are we now talking about the display
> > > device (KMS) crashing or the rendering device (OpenGL/Vulkan) crashing?
> > >
> > > If the former, I wasn't aware that display device crashes are a thing.
> > > How should a userspace display server react to those?
> > >
> > > If the latter, don't we have EGL extensions or Vulkan API already to
> > > deliver that?
> > >
> > > The above would be about device crashes that directly affect the
> > > display server. Is that the use case in mind here, or is it instead
> > > about notifying the display server that some application has caused a
> > > driver/hardware crash? If the latter, how should a display server react
> > > to that? Disconnect the application?
> > >
> > > Shashank, what is the actual use case you are developing this for?
> > >
> > > I've read all the emails here so far, and I don't recall seeing it
> > > explained.
> > >
> >
> > The idea is that a support daemon or compositor would listen for GPU
> > reset notifications and do something useful with them (kill the guilty
> > app, restart the desktop environment, etc.).  Today when the GPU
> > resets, most applications just continue assuming nothing is wrong,
> > meanwhile the GPU has stopped accepting work until the apps re-init
> > their context so all of their command submissions just get rejected.
> >
> > > Btw. somewhat relatedly, there has been work aiming to allow
> > > graceful hot-unplug of DRM devices. There is a kernel doc outlining how
> > > the various APIs should react towards userspace when a DRM device
> > > suddenly disappears. That seems to have some overlap here IMO.
> > >
> > > See https://www.kernel.org/doc/html/latest/gpu/drm-uapi.html#device-hot-unplug
> > > which also has a couple pointers to EGL and Vulkan APIs.
> >
> > The problem is most applications don't use the GL or VK robustness
> > APIs.
>
> Hi,
>
> how would this new event help with that?

This event would provide notification that a GPU reset occurred.

>
> I mean, yeah, there could be a daemon that kills those GPU users, but
> then what? You still lose any unsaved work, and may need to manually
> restart them.
>
> Is the idea that it is better to have the app crash and disappear than
> to look like it froze while it otherwise still runs?

Yes.  The daemon could also send the user some sort of notification
that a GPU reset occurred.

>
> If some daemon or compositor goes killing apps that trigger GPU resets,
> then how do we stop that for an app that actually does use the
> appropriate EGL or Vulkan APIs to detect and remedy that situation
> itself?

I guess the daemon could keep some sort of whitelist.  OTOH, very few
applications, if any, and especially games, actually support these
extensions.
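
To make the whitelist idea concrete, here is a minimal sketch of the
policy such a daemon might apply.  Everything in it is hypothetical:
the app names, the action strings, and the assumption that the daemon
already knows the name of the guilty process.  It is not taken from the
patch.

```python
from typing import Optional

# Hypothetical list of apps known to use the GL/VK robustness APIs and
# recover from a lost context on their own.
ROBUST_APPS = frozenset({"robust-viewer", "robust-compositor"})


def choose_action(guilty_name: Optional[str],
                  robust_apps: frozenset = ROBUST_APPS) -> str:
    """Decide what the daemon does after a GPU reset.

    guilty_name is the process name blamed for the reset, or None if
    the reset could not be attributed to any process.
    """
    if guilty_name is None:
        # Nothing to kill; just tell the user a reset happened.
        return "notify"
    if guilty_name in robust_apps:
        # The app is expected to re-init its own context, so leave it alone.
        return "notify"
    # The app will most likely keep submitting rejected work; kill it.
    return "kill-and-notify"
```

A reset with no identified guilty process still produces a user
notification, which matches the point that the event is useful even
when nothing gets killed.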

>
> >  You could use something like that in the compositor, but those
> > APIs tend to be focused more on the application itself rather than the
> > GPU in general.  E.g., Is my context lost.  Which is fine for
> > restarting your context, but doesn't really help if you want to try
> > and do something with another application (i.e., the likely guilty
> > app).  Also, on dGPU at least, when you reset the GPU, vram is usually
> > lost (either due to the memory controller being reset, or vram being
> > zero'd on init due to ECC support), so even if you are not the guilty
> > process, in that case you'd need to re-init your context anyway.
>
> Why should something like a compositor listen for this and kill apps
> that triggered GPU resets, instead of e.g. Mesa noticing that in the app
> and killing itself? Mesa in the app would know if robustness API is
> being used.

That's another possibility, but it doesn't handle the case where the
compositor itself doesn't support any sort of robustness extension.  In
that case, if the GPU was reset, you'd lose your desktop anyway, even
if the app kept running.

>
> Would be really nice to have the answers to all these questions to be
> collected and reiterated in the next version of this proposal.

The idea is to provide the notification of a GPU reset.  What the
various desktop environments or daemons do with it is up to them.  I
still think there is value in a notification even if you don't kill
apps or anything like that.  E.g., you can have a daemon running that
gets notified and logs the error, collects debug info, sends an email,
etc.
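
As a rough sketch of such a daemon, the following listens for kernel
uevents on a netlink socket and picks out GPU resets.  The uevent wire
format (an "ACTION@DEVPATH" summary followed by NUL-separated KEY=VALUE
strings) is the standard kernel one.  The "RESET=1" key is an
assumption about what the proposed event would carry, and the handler
callback is hypothetical.

```python
import socket

NETLINK_KOBJECT_UEVENT = 15  # value from <linux/netlink.h>
KERNEL_UEVENT_GROUP = 1      # multicast group used for kernel uevents


def parse_uevent(msg: bytes) -> dict:
    """Return the KEY=VALUE pairs carried by one kernel uevent datagram."""
    env = {}
    # The first field is the "ACTION@DEVPATH" summary; skip it.
    for field in msg.split(b"\0")[1:]:
        key, sep, value = field.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def is_gpu_reset(env: dict) -> bool:
    """True for a drm uevent carrying the (assumed) RESET=1 key."""
    return env.get("SUBSYSTEM") == "drm" and env.get("RESET") == "1"


def listen(handler) -> None:
    """Block forever, calling handler(env) for each GPU reset uevent."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    sock.bind((0, KERNEL_UEVENT_GROUP))  # pid 0: let the kernel pick an id
    while True:
        env = parse_uevent(sock.recv(8192))
        if is_gpu_reset(env):
            handler(env)
```

Something like listen(lambda env: print("GPU reset:", env.get("DEVPATH")))
would be enough for the log-only case; what else the handler does is up
to the daemon.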

Alex

>
>
> Thanks,
> pq
