* [RFC 00/11] TDR/watchdog timeout support for gen8
@ 2015-06-08 17:03 Tomas Elf
From: Tomas Elf @ 2015-06-08 17:03 UTC
  To: Intel-GFX

This patch series introduces the following features:

* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.

TDR is an umbrella term for everything involved in detecting and recovering
from GPU hangs; it is a term more widely used outside of the upstream driver.
This feature introduces an extensible framework that currently supports gen8
but can easily be extended to support gen7 as well (something that has already
been done in GMIN, although unfortunately not in a form that is quite
upstreamable). The code contained in this submission represents the essentials
of what is currently in GMIN, merged with what was in upstream at the time
this work commenced a few months back.

This feature adds a new hang recovery path alongside the legacy GPU reset
path; the new path takes care of recovering individual engines only. Aside
from adding support for per-engine recovery, this feature also introduces
rules for how to promote a potential per-engine reset to a legacy, full GPU
reset (see the sketch below).
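
As a rough illustration of the kind of promotion rule meant here (the struct,
field, threshold and function names below are purely hypothetical, not the
identifiers used in the patches): if repeated per-engine resets fail to get an
engine going again, fall back to a full GPU reset.

/*
 * Hedged sketch only: promote to a full GPU reset when per-engine resets
 * on the same engine keep failing. All names and the threshold are made
 * up for illustration.
 */
#include <stdbool.h>

struct engine_recovery_state {
	unsigned int failed_engine_resets;	/* consecutive per-engine resets that did not help */
};

#define ENGINE_RESET_PROMOTION_THRESHOLD 3

static bool should_promote_to_full_gpu_reset(const struct engine_recovery_state *s)
{
	/* Engine reset is evidently not helping; escalate to the legacy path. */
	return s->failed_engine_resets >= ENGINE_RESET_PROMOTION_THRESHOLD;
}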

The hang checker now integrates with the error handler in a slightly different
way: it allows hang recovery on multiple engines at the same time by passing
the error handler an engine flag mask in which the flags of all hung engines
are set. This allows us to schedule hang recovery once for all currently hung
engines instead of scheduling one hang recovery per detected engine hang.
Previously, when only full GPU reset was supported, this made no difference:
whether one engine or four engines were hung at any given point, the outcome
was the same - the GPU got reset. Now the behaviour differs depending on which
engines are hung, since each engine is reset separately from all the others,
so we have to think about this in terms of scheduling cost and recovery
latency (see the open question below and the sketch that follows).
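
To make the flag-mask idea concrete, here is a minimal sketch of a hang
checker building such a mask and handing it to the error handler in a single
call. The engine enumeration and the two helpers are hypothetical stand-ins
for the driver's own engine identifiers, hang detection and error handler.

/*
 * Hedged sketch: collect all hung engines into one mask and schedule
 * recovery once. Helper names are placeholders, not the real driver API.
 */
#include <stdbool.h>

enum example_engine { ENGINE_RCS, ENGINE_BCS, ENGINE_VCS, ENGINE_VECS, ENGINE_COUNT };

bool engine_seems_hung(enum example_engine e);		/* hypothetical detector */
void schedule_hang_recovery(unsigned int engine_mask);	/* hypothetical error handler entry */

static void hang_check_all_engines(void)
{
	unsigned int hung_engines = 0;
	int i;

	for (i = 0; i < ENGINE_COUNT; i++)
		if (engine_seems_hung(i))
			hung_engines |= 1u << i;

	if (hung_engines)
		schedule_hang_recovery(hung_engines);	/* one call covers all hung engines */
}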

OPEN QUESTIONS:

	1. Do we want to investigate the possibility of per-engine hang
	detection? In the current upstream driver there is only one work queue
	that handles the hang checker and everything from initial hang
	detection to final hang recovery runs in this thread. This makes sense
	if you're only supporting one form of hang recovery - using full GPU
	reset and nothing tied to any particular engine. However, as part
	of this patch series we're changing that by introducing per-engine
	hang recovery. It could make sense to introduce multiple work
	queues - one per engine - to run multiple hang checking threads in
	parallel.

	This could reduce recovery latency: the hang checker would no longer
	have to scan all engines every time it is scheduled, and neither
	would the error handler. Instead, each engine would have its own work
	queue invoking a hang checker that only checks _that_ particular
	engine, followed by an error handler invocation for _that_ particular
	engine only. If one engine hangs, the latency for reaching its hang
	recovery path would be (time to hang-check one engine) + (time to
	error-handle one engine), rather than the time it takes to hang-check
	all engines plus the time it takes to error-handle all engines
	detected as hung (which in the worst case would be all of them).
	There could be as many hang checker and error handling threads in
	flight as there are engines in the hardware, but they would all run
	in parallel without any significant locking. The first point at which
	any such thread needs exclusive access to the driver is the actual
	hang recovery, but the time to get there would theoretically be
	lower, and per-engine hang recovery itself takes considerably less
	time than reliably detecting a hang in the first place.

	How much such a change would save still needs to be analysed and
	compared against the current single-threaded model, but it makes
	sense from a theoretical design point of view.

	2. How does per-engine reset integrate with the public reset stats
	IOCTL? These stats are used for the GL robustness interface, and the
	corresponding tests currently fail when running per-engine hang
	recovery, since we treat per-engine recovery differently from full
	GPU recovery - a distinction userland knows nothing about. When
	userland expects to hang the hardware, it expects the reset stats
	interface to reflect that, and this behaviour has changed as part of
	this code submission. There's more than one way to solve this; a
	reference sketch of the current interface follows this list. Here are
	two options:

		1. Expose per-engine reset statistics and set contexts as
		guilty the same way for per-engine reset as for full GPU
		resets.

		That would make this change to the hang recovery mechanism
		transparent to userland but it would change the semantics since
		an active context in the reset stats no longer implies that the
		GPU was fully reset.

		2. Add a new set of statistics for per-engine reset (one group
		of statistics for each engine) to reflect the extended
		capabilities that per-engine hang recovery offers.

		Would that be breaking the ABI?

		... Or is there some other way of doing this?
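
For reference, this is roughly how userland queries the existing reset stats
interface today (a minimal userspace sketch using libdrm; error handling
omitted). Option 1 above would keep this call unchanged but let per-engine
resets bump the same counters; option 2 would add per-engine counters
alongside these fields or behind a new interface.

/* Minimal userspace sketch of the current reset stats query via libdrm. */
#include <string.h>
#include <xf86drm.h>
#include <i915_drm.h>

static int query_reset_stats(int fd, __u32 ctx_id,
			     struct drm_i915_reset_stats *stats)
{
	memset(stats, 0, sizeof(*stats));
	stats->ctx_id = ctx_id;	/* context to query; flags/pad must stay zero */

	/*
	 * reset_count counts GPU resets globally; batch_active and
	 * batch_pending count batches lost for this context. The open
	 * question is how per-engine resets should be reflected here.
	 */
	return drmIoctl(fd, DRM_IOCTL_I915_GET_RESET_STATS, stats);
}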

* Feature 2: Watchdog Timeout (a.k.a. "media engine reset") for gen8.

This feature allows userland applications to control, for each individual
batch buffer, whether to enable a first-level, fine-grained, hardware-based
hang detection mechanism on top of the ordinary, software-based periodic hang
checker that is already in the driver. The advantage over relying solely on
the current software-based hang checker is that the watchdog timeout mechanism
is about 1000x quicker and more precise. Since it is not a full driver-level
hang detection mechanism but only targets one individual batch buffer at a
time, it can afford to be that quick without risking an increase in false
positive hang detection.

This feature includes the following changes:

a) Watchdog timeout interrupt service routine for handling watchdog interrupts
and connecting these to per-engine hang recovery.

b) Injection of watchdog timer enablement/cancellation instructions
before/after the batch buffer start instruction in the ring buffer so that
watchdog timeout is connected to the submission of an individual batch buffer.

c) Extension of the DRM batch buffer interface, exposing the watchdog timeout
feature to userland. We've got two open source groups in VPG currently in the
process of integrating support for this feature, which should make it possible
in principle to upstream this extension. (A rough sketch of changes (b) and
(c) follows below.)
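
To give a rough picture of changes (b) and (c), here is a hedged sketch of the
ring emission around a batch buffer start, keyed off a per-batch request from
userland. The helper names and the notion of an execbuffer flag are
placeholders; the actual register writes, flag name and emission code are in
the patches.

/*
 * Hedged sketch only: arm the engine watchdog just before the batch
 * buffer start instruction and cancel it just after, when userland
 * requested it for this batch. All names below are illustrative.
 */
#include <stdint.h>
#include <stdbool.h>

struct ring_emitter;	/* stand-in for the engine's ring buffer state */

/*
 * Hypothetical helpers; in practice these would emit register writes
 * (e.g. MI_LOAD_REGISTER_IMM) to the engine's watchdog control registers.
 */
void emit_watchdog_start(struct ring_emitter *ring);
void emit_watchdog_cancel(struct ring_emitter *ring);
void emit_batch_buffer_start(struct ring_emitter *ring, uint64_t batch_addr);

static void submit_batch(struct ring_emitter *ring, uint64_t batch_addr,
			 bool want_watchdog /* set from a per-batch execbuffer flag */)
{
	if (want_watchdog)
		emit_watchdog_start(ring);	/* change (b): enable before the batch */

	emit_batch_buffer_start(ring, batch_addr);

	if (want_watchdog)
		emit_watchdog_cancel(ring);	/* change (b): cancel after the batch */
}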

There is currently full watchdog timeout support for gen7 in GMIN, and it is
quite similar to the gen8 implementation, so there is nothing obvious
preventing us from upstreaming that code along with the gen8 code. However,
watchdog timeout is fully dependent on the per-engine hang recovery path, and
that is not part of this code submission for gen7. Therefore, watchdog timeout
support for gen7 has been excluded until per-engine hang recovery support for
gen7 has landed upstream.

As part of this submission we've had to reinstate the work queue that was
previously in place between the error handler and the hang recovery path. The
reason for this is that the per-engine recovery path is called directly from
the interrupt handler in the case of watchdog timeout. In that situation
there's no way of grabbing the struct_mutex, which is a requirement for the
hang recovery path. Therefore, by reinstating the work queue we provide a
unified execution context for the hang recovery code that allows the hang
recovery code to grab whatever locks it needs without sacrificing interrupt
latency too much or sleeping indefinitely in hard interrupt context.
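
A minimal sketch of this deferral pattern, using the standard kernel workqueue
API (the surrounding struct and names are illustrative, not the driver's
actual ones): the interrupt handler only records which engine hung and
schedules work; the work item then runs in process context where struct_mutex
can be taken.

/*
 * Hedged sketch of deferring hang recovery from hard IRQ context to a
 * work item. Struct and function names are illustrative.
 */
#include <linux/workqueue.h>
#include <linux/mutex.h>
#include <linux/bitops.h>

struct hang_recovery {
	struct work_struct work;
	struct mutex *struct_mutex;	/* the sleeping lock the recovery path needs */
	unsigned long hung_engine_mask;
};

static void hang_recovery_work(struct work_struct *work)
{
	struct hang_recovery *hr = container_of(work, struct hang_recovery, work);

	mutex_lock(hr->struct_mutex);		/* allowed: process context */
	/* ... per-engine reset and context resubmission would happen here ... */
	mutex_unlock(hr->struct_mutex);
}

static void hang_recovery_init(struct hang_recovery *hr, struct mutex *struct_mutex)
{
	INIT_WORK(&hr->work, hang_recovery_work);
	hr->struct_mutex = struct_mutex;
	hr->hung_engine_mask = 0;
}

/* Called from the watchdog timeout interrupt service routine. */
static void hang_recovery_from_irq(struct hang_recovery *hr, unsigned int engine)
{
	set_bit(engine, &hr->hung_engine_mask);	/* atomic, safe in IRQ context */
	schedule_work(&hr->work);		/* cannot take struct_mutex here */
}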

* Feature 3: Context Submission Status Consistency checking

Something that becomes apparent when you run long-duration operation tests
with concurrent rendering processes and intermittently injected hangs is that
the GPU sometimes seems to forget to send context completion interrupts to the
driver. What this means is that the driver sometimes gets stuck on a context
that never seems to finish, while the hardware has in fact completed it and is
waiting for more work.

The problem with this is that the per-engine hang recovery path relies on
context resubmission to kick off the hardware again following an engine reset.
This can only be done safely if the hardware and driver share the same opinion
about the current state. Therefore we've extended the periodic hang checker to
check for context submission state inconsistencies aside from the hang checking
it already does.

If such a state is detected, it is assumed (based on experience) that a
context completion interrupt has somehow been lost. If the state persists for
some time, an attempt is made to correct it by faking the presumably lost
context completion interrupt: the execlist interrupt handler, which is
normally called from the main interrupt handler in response to a received
context event interrupt, is called manually. Just because an interrupt goes
missing does not mean that the context status buffer (CSB) is not
appropriately updated by the hardware, so we can expect to find all the recent
changes to the context states for each engine captured there. If there are
outstanding context status changes in store, the faked context event interrupt
lets the interrupt handler act on them. In the case of lost context completion
interrupts, this prompts the driver to remove the already completed context
from the execlist queue and move on to the next pending piece of work, thereby
eliminating the inconsistency.
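
A hedged sketch of the check described above, as it might sit inside the
periodic hang checker; the helpers, threshold and state tracking are
illustrative stand-ins for what the patches actually do.

/*
 * Hedged sketch of the submission status consistency check. Names,
 * threshold and helpers are illustrative only.
 */
#include <stdbool.h>

struct engine_consistency {
	unsigned int inconsistent_checks;	/* consecutive hang-check runs in disagreement */
};

#define FAKE_IRQ_AFTER_N_CHECKS 2

/* Hypothetical helpers: what the driver thinks vs. what the HW reports. */
bool driver_has_context_in_flight(int engine);
bool hw_engine_is_idle(int engine);
void execlists_irq_handler(int engine);	/* the normal context-event handler */

static void check_submission_consistency(struct engine_consistency *c, int engine)
{
	/* Driver thinks a context is executing, HW says it is idle: suspicious. */
	if (driver_has_context_in_flight(engine) && hw_engine_is_idle(engine)) {
		if (++c->inconsistent_checks >= FAKE_IRQ_AFTER_N_CHECKS) {
			/*
			 * Assume a context completion interrupt was lost: fake it by
			 * calling the execlist IRQ handler, which reads the CSB and
			 * retires the completed context from the execlist queue.
			 */
			execlists_irq_handler(engine);
			c->inconsistent_checks = 0;
		}
	} else {
		c->inconsistent_checks = 0;
	}
}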

* Feature 4: Debugfs extensions for per-engine hang recovery and TDR/watchdog
trace points.
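
As a rough illustration of the debugfs side (not the actual interface in the
patches; the file name, permissions and the reset entry point below are
hypothetical), a write-only debugfs file whose value selects an engine to
reset could look roughly like this:

/*
 * Hedged sketch only: a debugfs file that triggers per-engine hang
 * recovery for testing. Names and wiring are illustrative.
 */
#include <linux/debugfs.h>
#include <linux/fs.h>

/* Hypothetical entry point into the per-engine recovery path. */
void request_engine_reset(unsigned int engine);

static int engine_reset_set(void *data, u64 val)
{
	request_engine_reset((unsigned int)val);	/* written value selects the engine */
	return 0;
}

DEFINE_SIMPLE_ATTRIBUTE(engine_reset_fops, NULL, engine_reset_set, "%llu\n");

static void register_tdr_debugfs(struct dentry *root)
{
	/* e.g. echo 2 > <debugfs>/dri/0/i915_engine_reset */
	debugfs_create_file("i915_engine_reset", 0200, root, NULL,
			    &engine_reset_fops);
}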


Tomas Elf (11):
  drm/i915: Early exit from semaphore_waits_for for execlist mode.
  drm/i915: Introduce uevent for full GPU reset.
  drm/i915: Add reset stats entry point for per-engine reset.
  drm/i915: Adding TDR / per-engine reset support for gen8.
  drm/i915: Extending i915_gem_check_wedge to check engine reset in
    progress
  drm/i915: Disable warnings for TDR interruptions in the display
    driver.
  drm/i915: Reinstate hang recovery work queue.
  drm/i915: Watchdog timeout support for gen8.
  drm/i915: Fake lost context interrupts through forced CSB check.
  drm/i915: Debugfs interface for per-engine hang recovery.
  drm/i915: TDR/watchdog trace points.

 drivers/gpu/drm/i915/i915_debugfs.c     |  146 +++++-
 drivers/gpu/drm/i915/i915_dma.c         |   79 +++
 drivers/gpu/drm/i915/i915_drv.c         |  201 ++++++++
 drivers/gpu/drm/i915/i915_drv.h         |   91 +++-
 drivers/gpu/drm/i915/i915_gem.c         |   93 +++-
 drivers/gpu/drm/i915/i915_gpu_error.c   |    2 +-
 drivers/gpu/drm/i915/i915_irq.c         |  378 ++++++++++++--
 drivers/gpu/drm/i915/i915_params.c      |   10 +
 drivers/gpu/drm/i915/i915_reg.h         |   13 +
 drivers/gpu/drm/i915/i915_trace.h       |  298 +++++++++++
 drivers/gpu/drm/i915/intel_display.c    |   16 +-
 drivers/gpu/drm/i915/intel_lrc.c        |  858 ++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_lrc.h        |   16 +-
 drivers/gpu/drm/i915/intel_lrc_tdr.h    |   40 ++
 drivers/gpu/drm/i915/intel_ringbuffer.c |   87 +++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  109 ++++
 drivers/gpu/drm/i915/intel_uncore.c     |  241 ++++++++-
 include/uapi/drm/i915_drm.h             |    5 +-
 18 files changed, 2589 insertions(+), 94 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h

-- 
1.7.9.5

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
