linux-sgx.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Asahi Lina <lina@asahilina.net>
To: "Faith Ekstrand" <faith.ekstrand@collabora.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Maarten Lankhorst" <maarten.lankhorst@linux.intel.com>,
	"Maxime Ripard" <mripard@kernel.org>,
	"Thomas Zimmermann" <tzimmermann@suse.de>,
	"David Airlie" <airlied@gmail.com>,
	"Daniel Vetter" <daniel@ffwll.ch>,
	"Miguel Ojeda" <ojeda@kernel.org>,
	"Alex Gaynor" <alex.gaynor@gmail.com>,
	"Wedson Almeida Filho" <wedsonaf@gmail.com>,
	"Boqun Feng" <boqun.feng@gmail.com>,
	"Gary Guo" <gary@garyguo.net>,
	"Björn Roy Baron" <bjorn3_gh@protonmail.com>,
	"Sumit Semwal" <sumit.semwal@linaro.org>,
	"Luben Tuikov" <luben.tuikov@amd.com>,
	"Jarkko Sakkinen" <jarkko@kernel.org>,
	"Dave Hansen" <dave.hansen@linux.intel.com>
Cc: Alyssa Rosenzweig <alyssa@rosenzweig.io>,
	Karol Herbst <kherbst@redhat.com>,
	Ella Stanforth <ella@iglunix.org>, Mary <mary@mary.zone>,
	linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
	rust-for-linux@vger.kernel.org, linux-media@vger.kernel.org,
	linaro-mm-sig@lists.linaro.org, linux-sgx@vger.kernel.org,
	asahi@lists.linux.dev
Subject: Re: [PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler is torn down
Date: Fri, 10 Mar 2023 18:58:18 +0900	[thread overview]
Message-ID: <25349030-5cc6-90a2-f1d2-4f9a67c4f194@asahilina.net> (raw)
In-Reply-To: <777dea65ef81c402d0765b1244b40633c483f4b2.camel@collabora.com>

On 10/03/2023 04.59, Faith Ekstrand wrote:
> On Thu, 2023-03-09 at 18:43 +0900, Asahi Lina wrote:
>> On 09/03/2023 17.42, Christian König wrote:
>>> Long story short: Don't do this! This is what the Windows drivers
>>> have 
>>> been doing and it creates tons of problems.
> 
> Yeah, we tried to do a bit of that in the GL days.  It was a bad idea.

I think I should clarify: I was proposing re-queueing innocent jobs from
innocent queues/VMs that were impacted by a fault. The reason is that we
may be able to tweak firmware state to force it to do that safely,
during the firmware recovery cycle, such that an aborted job restarts
and then subsequent jobs/commands continue as normal. We can't leave it
to userspace because if we do nothing, the affected job ends up
incomplete but then everything after it that is already queued still
runs, and that is definitely a recipe for a bigger mess if userspace
wants to seamlessly recover. The firmware recovery cycle is a
"stop-the-world" situation for the GPU (the firmware literally
busy-loops waiting for the driver to set a continue flag in memory...),
so that's the only real chance that the driver gets to make decisions
about what is going to happen next.

Of course, that only works if individual possibly concurrently running
commands are idempotent, but I think a lot of typical GPU work is? (E.g.
any render pass without side effects other than the render targets and
where the background shader does no loads, or even render passes that do
loads but where all draws are opaque, which are all things the current
Gallium driver is intimately familiar with since Crazy Tiler
Optimizations™ need that info to be provided anyway). So I was wondering
whether it'd make sense to have such an idempotency/restartable flag on
job submission, and then the driver would do its best to recover and
rerun it if it gets killed by an unrelated concurrent bad job.

Then again this all depends on an investigation into what we *can* do
during firmware recovery that hasn't happened at all yet. It might be
that it isn't safe to do anything really, or that doing things depends
on touching even deeper firmware state structs that we treat as opaque
right now and we really don't want to have to touch...

But maybe none of this is worth it in practice, it just sounded like it
could be useful maybe?

Now that I look at it, we have a lovely "what is this flag doing anyway"
bit already passed from Mesa through to the firmware we called
ASAHI_RENDER_SET_WHEN_RELOADING_Z_OR_S which, now that I look at it, is
actually getting set when any attachment (any color, Z, S) is not being
cleared for that pass (so it's loaded). That could very well be an "is
not idempotent" flag... and maybe that means the firmware does this for
us already? Sounds like something to test... I might have some 16Kx16K
GLmark runs to do concurrent with an evil faulting job now ^^ (and then
that also means we need to set it when shaders have side effects and
stuff, which right now we don't).

>>> Just signal the problem back to userspace and let the user space
>>> driver 
>>> decide what to do.
>>>
>>> The background is that most graphics applications (games etc..)
>>> then 
>>> rather start on the next frame instead of submitting the current
>>> one 
>>> again while compute applications make sure that the abort and tell
>>> the 
>>> user that the calculations might be corrupted and need to be
>>> redone.
> 
> The guarantee that Vulkan makes is that, if you idle the GPU and you
> haven't gotten a DEVICE_LOST yet, your data is good.  If you get a
> DEVICE_LOST, all bets are off.  The problem is that, no matter how fast
> the error propagation may be in the kernel or userspace driver, errors
> can still show up in strange ways.  An OOB buffer access could end up
> modifying a shader binary which gets run 3 frames later and causes a
> corruption.  Once you've faulted, you really have no idea how far back
> is good or what memory is corrupted.  You have to assume that
> everything mapped to the GPU VA space is potentially toast.

Yes of course, for the actually faulting VM all bets are off after a
fault (though we can try a bit harder at least... I have a READ_ONLY BO
flag now, I should set it on the shader pools!).

>> Actually I wanted to ask about error notifications. Right now we have
>> an
>> out-of-band mechanism to provide detailed fault info to userspace
>> which
>> works fine, but in principle it's optional.
> 
> This is fine, in principal.  Because of the nature of errors, async is
> fine as long as the error shows up eventually.  Faster is better, for
> sure, but error latency doesn't really matter in practice.
> 
>> However, I also mark the hw
>>  fences as errored when a fault happens (with an errno that describes
>> the overall situation), but that never makes it into the drm_sched
>> job
>> complete fence. I looked at the drm_sched code and I didn't see any
>> error propagation. Is that supposed to work, or am I supposed to
>> directly mark the drm_sched side fence as complete, or did I
>> misunderstand all this? I get the feeling maybe existing drivers just
>> rely on the recovery/timeout/etc paths to mark jobs as errored (since
>> those do it explicitly) and never need error forwarding from the hw
>> fence?
> 
> The end behavior needs to be that all fences for all jobs submitted to
> the queue get signaled.  That's needed to satisfy the finite time
> guarantees of dma_fence.  Exactly how that happens (let the job run,
> abort all the jobs, etc.) is an implementation detail for the driver to
> decide.  If you want, you can also set a bit on the context (or queue)
> to mark it as dead and start returning EIO or similar from any ioctls
> trying to submit more work if you wanted.  Not required but you can.

Fences have an error flag though, does that get reported to userspace
somehow? I thought it did, but maybe not, or maybe only drm_sched not
propagating it is the issue?

In other words, absent my fancy stats reporting BO system, what is the
normal way that an explicit sync driver signals to userspace that the
job associated with a syncobj has failed?

(If there is no way, then I'll probably want to change the stats BO
system to be configurable, so if you ask for no stats/time info, you
only get overall job status and faults, which has less overhead.)

~~ Lina

  reply	other threads:[~2023-03-10  9:58 UTC|newest]

Thread overview: 122+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-07 14:25 [PATCH RFC 00/18] Rust DRM subsystem abstractions (& preview AGX driver) Asahi Lina
2023-03-07 14:25 ` [PATCH RFC 01/18] rust: drm: ioctl: Add DRM ioctl abstraction Asahi Lina
2023-03-07 14:48   ` Karol Herbst
2023-03-07 14:51     ` Karol Herbst
2023-03-07 15:32   ` Maíra Canal
2023-03-09  5:32     ` Asahi Lina
2023-03-09  6:15       ` Dave Airlie
2023-03-09 12:09         ` Maíra Canal
2023-03-07 17:34   ` Björn Roy Baron
2023-03-09  6:04     ` Asahi Lina
2023-03-09 20:24       ` Faith Ekstrand
2023-03-09 20:39         ` Karol Herbst
2023-03-10  6:21           ` Asahi Lina
2023-04-13  9:23   ` Daniel Vetter
2023-03-07 14:25 ` [PATCH RFC 02/18] rust: drm: Add Device and Driver abstractions Asahi Lina
2023-03-07 18:19   ` Björn Roy Baron
2023-03-09  6:10     ` Asahi Lina
2023-03-10 18:56   ` Boqun Feng
2023-03-11  5:41   ` Boqun Feng
2023-04-05 17:10   ` Daniel Vetter
2023-03-07 14:25 ` [PATCH RFC 03/18] rust: drm: file: Add File abstraction Asahi Lina
2023-03-09 21:16   ` Faith Ekstrand
2023-03-09 22:16     ` Asahi Lina
2023-03-13 17:49       ` Faith Ekstrand
2023-03-14  2:07         ` Boqun Feng
2023-04-05 11:25           ` Daniel Vetter
2023-03-07 14:25 ` [PATCH RFC 04/18] rust: drm: gem: Add GEM object abstraction Asahi Lina
2023-04-05 11:08   ` Daniel Vetter
2023-04-05 11:19     ` Miguel Ojeda
2023-04-05 11:22       ` Daniel Vetter
2023-04-05 12:32         ` Miguel Ojeda
2023-04-05 12:36           ` Daniel Vetter
2023-03-07 14:25 ` [PATCH RFC 05/18] drm/gem-shmem: Export VM ops functions Asahi Lina
2023-03-07 14:25 ` [PATCH RFC 06/18] rust: drm: gem: shmem: Add DRM shmem helper abstraction Asahi Lina
2023-03-08 13:38   ` Maíra Canal
2023-03-09  5:25     ` Asahi Lina
2023-03-09 11:47       ` Maíra Canal
2023-03-09 14:16         ` Asahi Lina
2023-03-07 14:25 ` [PATCH RFC 07/18] rust: drm: mm: Add DRM MM Range Allocator abstraction Asahi Lina
2023-04-06 14:15   ` Daniel Vetter
2023-04-06 15:28     ` Miguel Ojeda
2023-04-06 15:45       ` Daniel Vetter
2023-04-06 17:19         ` Miguel Ojeda
2023-04-06 15:53     ` Asahi Lina
2023-04-06 16:13       ` [Linaro-mm-sig] " Daniel Vetter
2023-04-06 16:39         ` Asahi Lina
2023-03-07 14:25 ` [PATCH RFC 08/18] rust: dma_fence: Add DMA Fence abstraction Asahi Lina
2023-04-05 11:10   ` Daniel Vetter
2023-03-07 14:25 ` [PATCH RFC 09/18] rust: drm: syncobj: Add DRM Sync Object abstraction Asahi Lina
2023-04-05 12:33   ` Daniel Vetter
2023-04-06 16:04     ` Asahi Lina
2023-03-07 14:25 ` [PATCH RFC 10/18] drm/scheduler: Add can_run_job callback Asahi Lina
2023-03-08  8:46   ` Christian König
2023-03-08  9:41     ` Asahi Lina
2023-03-08 10:00       ` Christian König
2023-03-08 14:53         ` Asahi Lina
2023-03-08 15:30           ` Christian König
2023-03-08 16:44             ` Asahi Lina
2023-03-08 17:57               ` Christian König
2023-03-08 19:05                 ` Asahi Lina
2023-03-08 19:12                   ` Christian König
2023-03-08 19:45                     ` Asahi Lina
2023-03-08 20:14                       ` Christian König
2023-03-09  6:30                         ` Asahi Lina
2023-03-09  8:05                           ` Christian König
2023-03-09  9:14                             ` Asahi Lina
2023-03-09 18:50                               ` Faith Ekstrand
2023-03-10  9:16                                 ` Asahi Lina
2023-03-08 12:39     ` Karol Herbst
2023-03-08 13:47       ` Christian König
2023-03-08 14:43         ` Karol Herbst
2023-03-08 15:02           ` Christian König
2023-03-08 15:19             ` Karol Herbst
2023-03-16 13:40               ` Daniel Vetter
2023-04-05 13:40   ` Daniel Vetter
2023-04-05 14:14     ` Christian König
2023-04-05 14:21       ` Daniel Vetter
2023-03-07 14:25 ` [PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler is torn down Asahi Lina
2023-03-08  9:57   ` Maarten Lankhorst
2023-03-08 10:03     ` Christian König
2023-03-08 15:18       ` Asahi Lina
2023-03-08 15:42         ` Christian König
2023-03-08 17:32           ` Asahi Lina
2023-03-08 18:12             ` Christian König
2023-03-08 19:37               ` Asahi Lina
2023-03-09  8:42                 ` Christian König
2023-03-09  9:43                   ` Asahi Lina
2023-03-09 11:47                     ` Christian König
2023-03-09 13:48                       ` Asahi Lina
2023-03-09 19:59                     ` Faith Ekstrand
2023-03-10  9:58                       ` Asahi Lina [this message]
2023-03-13 20:11                         ` Faith Ekstrand
2023-03-08 17:39           ` alyssa
2023-03-08 17:44             ` Asahi Lina
2023-03-08 18:13             ` Christian König
2023-04-05 13:52   ` Daniel Vetter
2023-03-07 14:25 ` [PATCH RFC 12/18] rust: drm: sched: Add GPU scheduler abstraction Asahi Lina
2023-04-05 15:43   ` Daniel Vetter
2023-04-05 19:29     ` Daniel Vetter
2023-04-18  8:45       ` Daniel Vetter
2023-03-07 14:25 ` [PATCH RFC 13/18] drm/gem: Add a flag to control whether objects can be exported Asahi Lina
2023-04-05 14:55   ` Daniel Vetter
2023-03-07 14:25 ` [PATCH RFC 14/18] rust: drm: gem: Add set_exportable() method Asahi Lina
2023-03-07 14:25 ` [PATCH RFC 15/18] drm/asahi: Add the Asahi driver UAPI [DO NOT MERGE] Asahi Lina
2023-03-07 15:28   ` Karol Herbst
2023-03-07 14:25 ` [PATCH RFC 16/18] rust: bindings: Bind the Asahi DRM UAPI Asahi Lina
2023-03-07 14:25 ` [PATCH RFC 17/18] rust: macros: Add versions macro Asahi Lina
2023-03-07 16:17 ` [PATCH RFC 00/18] Rust DRM subsystem abstractions (& preview AGX driver) Asahi Lina
     [not found] ` <20230307-rust-drm-v1-18-917ff5bc80a8@asahilina.net>
2023-04-05 14:44   ` [PATCH RFC 18/18] drm/asahi: Add the Asahi driver for Apple AGX GPUs Daniel Vetter
2023-04-06  5:02     ` Asahi Lina
2023-04-06  5:09       ` Asahi Lina
2023-04-06 11:25       ` [Linaro-mm-sig] " Daniel Vetter
2023-04-06 13:32         ` Asahi Lina
2023-04-06 13:54           ` Daniel Vetter
     [not found]   ` <ZC2HtBOaoUAzVCVH@phenom.ffwll.local>
2023-04-06  4:44     ` Asahi Lina
2023-04-06  5:09       ` Asahi Lina
2023-04-06 11:26         ` Daniel Vetter
2023-04-06 10:42       ` [Linaro-mm-sig] " Daniel Vetter
2023-04-06 11:55       ` Daniel Vetter
2023-04-06 13:15         ` Asahi Lina
2023-04-06 13:48           ` Daniel Vetter
2023-04-06 15:19             ` Asahi Lina

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=25349030-5cc6-90a2-f1d2-4f9a67c4f194@asahilina.net \
    --to=lina@asahilina.net \
    --cc=airlied@gmail.com \
    --cc=alex.gaynor@gmail.com \
    --cc=alyssa@rosenzweig.io \
    --cc=asahi@lists.linux.dev \
    --cc=bjorn3_gh@protonmail.com \
    --cc=boqun.feng@gmail.com \
    --cc=christian.koenig@amd.com \
    --cc=daniel@ffwll.ch \
    --cc=dave.hansen@linux.intel.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=ella@iglunix.org \
    --cc=faith.ekstrand@collabora.com \
    --cc=gary@garyguo.net \
    --cc=jarkko@kernel.org \
    --cc=kherbst@redhat.com \
    --cc=linaro-mm-sig@lists.linaro.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-media@vger.kernel.org \
    --cc=linux-sgx@vger.kernel.org \
    --cc=luben.tuikov@amd.com \
    --cc=maarten.lankhorst@linux.intel.com \
    --cc=mary@mary.zone \
    --cc=mripard@kernel.org \
    --cc=ojeda@kernel.org \
    --cc=rust-for-linux@vger.kernel.org \
    --cc=sumit.semwal@linaro.org \
    --cc=tzimmermann@suse.de \
    --cc=wedsonaf@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).