dri-devel.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
From: Daniel Vetter <daniel@ffwll.ch>
To: Rob Clark <robdclark@gmail.com>
Cc: "Rob Clark" <robdclark@chromium.org>,
	"Sharma, Shashank" <shashank.sharma@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Amaranath Somalapuram" <amaranath.somalapuram@amd.com>,
	"Abhinav Kumar" <quic_abhinavk@quicinc.com>,
	dri-devel <dri-devel@lists.freedesktop.org>,
	"Alexandar Deucher" <alexander.deucher@amd.com>,
	"amd-gfx list" <amd-gfx@lists.freedesktop.org>,
	"Shashank Sharma" <contactshashanksharma@gmail.com>,
	"Christian Koenig" <christian.koenig@amd.com>
Subject: Re: [PATCH v2 1/2] drm: Add GPU reset sysfs event
Date: Thu, 17 Mar 2022 18:26:58 +0100	[thread overview]
Message-ID: <YjNvYmcuDKiIneDx@phenom.ffwll.local> (raw)
In-Reply-To: <CAF6AEGu79jDW1xqJOaNCB=jAxhKiCRFJwoCNiC9ye97KvpWEtQ@mail.gmail.com>

On Thu, Mar 17, 2022 at 08:40:51AM -0700, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > > Am 16.03.22 um 16:36 schrieb Rob Clark:
> > > > [SNIP]
> > > > just one point of clarification.. in the msm and i915 case it is
> > > > purely for debugging and telemetry (ie. sending crash logs back to
> > > > distro for analysis if user has crash reporting enabled).. it isn't
> > > > used for triggering any action like killing app or compositor.
> > >
> > > By the way, how does msm it's memory management for the devcoredumps?
> >
> > GFP_NORECLAIM all the way. It's purely best effort.
> >
> > Note that the fancy new plan for i915 discrete gpu is to only support gpu
> > crash dumps on non-recoverable gpu contexts, i.e. those that do not
> > continue to the next batch when something bad happens. This is what vk
> > wants and also what iris now uses (we do context recovery in userspace in
> > all cases), and non-recoverable contexts greatly simplify the crash dump
> > gather: Only thing you need to gather is the register state from hw
> > (before you reset it), all the batchbuffer bo and indirect state bo (in
> > i915 you can mark which bo to capture in the CS ioctl) can be captured in
> > a worker later on. Which for non-recoverable context is no issue, since
> > subsequent batchbuffers won't trample over any of these things.
> 
> fwiw, we snapshot everything (cmdstream and bo's marked with dump
> flag, in addition to hw state) before resuming the GPU, so there is no
> danger of things being trampled.  After state is captured and GPU
> reset, we "replay" the submits that were written into the ringbuffer
> after the faulting submit.  GPU crashes should be a thing you don't
> need to try to optimize.

Not sure why you think we optimize anything here?

> (At some point, I'd like to use scheduler for the replay, and actually
> use drm_sched_stop()/etc.. but last time I looked there were still
> some sched bugs in that area which prevented me from deleting a bunch
> of code ;-))

Not sure about your hw, but at least on intel replaying tends to just
result in follow-on fun. And that holds even more so the more complex a
workload is. This is why vk just dies immediately and does not try to
replay anything, offloading it to the app. Same with arb robusteness.
Afaik it's really only media and classic gl which insist that the driver
stack somehow recover.

And recovering from a mess in userspace is a lot simpler than trying to
pull of the same magic in the kernel. Plus it also helps with a few of the
dma_fence rules, which is a nice bonus.
-Daniel

> 
> BR,
> -R
> 
> >
> > And that way you can record the crashdump (or at least the big pieces like
> > all the indirect state stuff) with GFP_KERNEL.
> >
> > msm probably gets it wrong since embedded drivers have much less shrinker
> > and generally no mmu notifiers going on :-)
> >
> > > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > > path.
> > >
> > > > I would however *strongly* recommend devcoredump support in other GPU
> > > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > > to debug and fix a couple obscure issues that I was not able to
> > > > reproduce by myself.
> > >
> > > Yes, completely agree as well.
> >
> > +1
> >
> > Cheers, Daniel
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

  reply	other threads:[~2022-03-17 17:27 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-08 18:04 [PATCH v2 1/2] drm: Add GPU reset sysfs event Shashank Sharma
2022-03-08 18:04 ` [PATCH v2 2/2] drm/amdgpu: add work function for GPU reset event Shashank Sharma
2022-03-09  7:47 ` [PATCH v2 1/2] drm: Add GPU reset sysfs event Simon Ser
2022-03-09 11:18   ` Sharma, Shashank
2022-03-09  8:09 ` Christian König
2022-03-09  9:56 ` Pierre-Eric Pelloux-Prayer
2022-03-09 10:10   ` Simon Ser
2022-03-09 10:24     ` Christian König
2022-03-09 10:28       ` Simon Ser
2022-03-09 10:28       ` Pierre-Eric Pelloux-Prayer
2022-03-09 18:12 ` Rob Clark
2022-03-10  9:55   ` Christian König
2022-03-10 15:24     ` Rob Clark
2022-03-10 16:21       ` Sharma, Shashank
2022-03-10 16:27         ` Andrey Grodzovsky
2022-03-10 17:16           ` Rob Clark
2022-03-10 17:10         ` Rob Clark
2022-03-10 17:19           ` Sharma, Shashank
2022-03-10 17:40             ` Rob Clark
2022-03-10 18:33               ` Abhinav Kumar
2022-03-10 19:14                 ` Sharma, Shashank
2022-03-10 19:35                   ` Rob Clark
2022-03-10 19:44                     ` Sharma, Shashank
2022-03-10 19:56                       ` Rob Clark
2022-03-10 20:17                         ` Sharma, Shashank
2022-03-11  8:30                         ` Pekka Paalanen
2022-03-14 14:23                           ` Alex Deucher
2022-03-14 15:26                             ` Pekka Paalanen
2022-03-15 14:54                               ` Alex Deucher
2022-03-16  8:48                                 ` Pekka Paalanen
2022-03-16 14:12                                   ` Alex Deucher
2022-03-16 15:36                                     ` Rob Clark
2022-03-16 15:48                                       ` Alex Deucher
2022-03-16 16:30                                         ` Rob Clark
2022-03-17  7:03                                       ` Christian König
2022-03-17  9:29                                         ` Daniel Vetter
2022-03-17  9:46                                           ` Christian König
2022-03-17 15:34                                           ` Rob Clark
2022-03-17 17:23                                             ` Daniel Vetter
2022-03-17 15:40                                           ` Rob Clark
2022-03-17 17:26                                             ` Daniel Vetter [this message]
2022-03-17 17:31                                               ` Rob Clark
2022-03-18  7:42                                                 ` Christian König
2022-03-18 15:12                                                   ` Rob Clark
2022-03-21  9:30                                                     ` Christian König
2022-03-21 16:03                                                       ` Rob Clark
2022-03-23 14:07                                                         ` Daniel Stone
2022-03-23 15:14                                                           ` Daniel Vetter
2022-03-23 15:25                                                             ` Christian König
2022-03-26  0:53                                                               ` Olsak, Marek
2022-03-29 12:14                                                                 ` Christian König
2022-03-29 16:25                                                                   ` Marek Olšák
2022-03-30  9:49                                                                     ` Daniel Vetter
2022-03-23 17:30                                                             ` Rob Clark
2022-03-21 14:15                                                     ` Daniel Vetter
2022-03-15  7:13                             ` Dave Airlie
2022-03-15  7:25                               ` Simon Ser
2022-03-15  7:25                               ` Christian König
2022-03-17  9:25                             ` Daniel Vetter
2022-03-16 21:50 ` Rob Clark
2022-03-17  8:42   ` Sharma, Shashank
2022-03-17  9:21     ` Christian König
2022-03-17 10:31       ` Daniel Stone

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YjNvYmcuDKiIneDx@phenom.ffwll.local \
    --to=daniel@ffwll.ch \
    --cc=alexander.deucher@amd.com \
    --cc=amaranath.somalapuram@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=ckoenig.leichtzumerken@gmail.com \
    --cc=contactshashanksharma@gmail.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=quic_abhinavk@quicinc.com \
    --cc=robdclark@chromium.org \
    --cc=robdclark@gmail.com \
    --cc=shashank.sharma@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).