Re: [PATCH v5 02/16] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr

From: Daniel Vetter <daniel@ffwll.ch>
To: "Christian König" <christian.koenig@amd.com>
Cc: Emma Anholt <emma@anholt.net>,
	Tomeu Vizoso <tomeu.vizoso@collabora.com>,
	dri-devel@lists.freedesktop.org,
	Steven Price <steven.price@arm.com>,
	Rob Herring <robh+dt@kernel.org>,
	Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>,
	Boris Brezillon <boris.brezillon@collabora.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Qiang Yu <yuq825@gmail.com>, Robin Murphy <robin.murphy@arm.com>
Subject: Re: [PATCH v5 02/16] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr
Date: Tue, 29 Jun 2021 16:05:30 +0200	[thread overview]
Message-ID: <YNsoqiGKMCRzhOfx@phenom.ffwll.local> (raw)
In-Reply-To: <532d1f9d-1092-18c3-c87d-463cfda60ed7@amd.com>

On Tue, Jun 29, 2021 at 01:24:53PM +0200, Christian König wrote:
> Am 29.06.21 um 13:18 schrieb Boris Brezillon:
> > Hi Christian,
> > 
> > On Tue, 29 Jun 2021 13:03:58 +0200
> > Christian König <christian.koenig@amd.com> wrote:
> > 
> > > Am 29.06.21 um 09:34 schrieb Boris Brezillon:
> > > > Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
> > > > reset. This leads to extra complexity when we need to synchronize timeout
> > > > works with the reset work. One solution to address that is to have an
> > > > ordered workqueue at the driver level that will be used by the different
> > > > schedulers to queue their timeout work. Thanks to the serialization
> > > > provided by the ordered workqueue we are guaranteed that timeout
> > > > handlers are executed sequentially, and can thus easily reset the GPU
> > > > from the timeout handler without extra synchronization.
> > > Well, we had already tried this and it didn't worked the way it is expected.
> > > 
> > > The major problem is that you not only want to serialize the queue, but
> > > rather have a single reset for all queues.
> > > 
> > > Otherwise you schedule multiple resets for each hardware queue. E.g. for
> > > your 3 hardware queues you would reset the GPU 3 times if all of them
> > > time out at the same time (which is rather likely).
> > > 
> > > Using a single delayed work item doesn't work either because you then
> > > only have one timeout.
> > > 
> > > What could be done is to cancel all delayed work items from all stopped
> > > schedulers.
> > drm_sched_stop() does that already, and since we call drm_sched_stop()
> > on all queues in the timeout handler, we end up with only one global
> > reset happening even if several queues report a timeout at the same
> > time.
> 
> Ah, nice. Yeah, in this case it should indeed work as expected.
> 
> Feel free to add an Acked-by: Christian König <christian.koenig@amd.com> to
> it.

Yeah this is also what we want for the case where you have cascading
reset. Like if you have per-engine reset, but if that fails, have to reset
the entire chip. That's similar but not quite to the soft recover, then
chip reset amdgpu does since for the engine reset you do have to stop the
scheduler. For full context the i915 reset cascade:

- try to evict the offending work with preemption (similar in it's impact
  to amdgpu's soft recovery I think), other requests on the same engine
  are unaffected here.

- reset the affected engine

- reset the entire chip

So my plan at least is that we'd only call drm_sched_stop as we escalate.
Also on earlier chips first and second might not be available.

Lucky for us chip reset only resets the render side of things, display
survives on gen4+, so we don't have the nasty locking issue amdgpu has.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch