* [PATCH 00/20] TDR/watchdog support for gen8
From: Arun Siluvery @ 2016-01-13 17:28 UTC
  To: intel-gfx

These patches were sent previously a while ago[1]; they have now been rebased
onto the latest nightly and are being resent for feedback.

This patch series adds support for per-engine reset and watchdog timeout
reset. Please see [1] for a detailed description.

[1] http://lists.freedesktop.org/archives/intel-gfx/2015-October/078696.html
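
Once the series is applied, per-engine hang recovery is off by default and is
controlled by two new module parameters introduced in patch 03 (a usage
sketch; parameter names as defined in i915_params.c):

  # Enable per-engine reset; promote to full GPU reset if an engine
  # hangs again within 10 seconds of its previous reset.
  modprobe i915 enable_engine_reset=1 gpu_reset_promotion_time=10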

Tim Gore (1):
  drm/i915: drm/i915 changes to simulated hangs

Tomas Elf (19):
  drm/i915: Make i915_gem_reset_ring_status() public
  drm/i915: Generalise common GPU engine reset request/unrequest code
  drm/i915: TDR / per-engine hang recovery support for gen8.
  drm/i915: TDR / per-engine hang detection
  drm/i915: Extending i915_gem_check_wedge to check engine reset in
    progress
  drm/i915: Reinstate hang recovery work queue.
  drm/i915: Watchdog timeout: Hang detection integration into error
    handler
  drm/i915: Watchdog timeout: IRQ handler for gen8
  drm/i915: Watchdog timeout: Ringbuffer command emission for gen8
  drm/i915: Watchdog timeout: DRM kernel interface enablement
  drm/i915: Fake lost context event interrupts through forced CSB
    checking.
  drm/i915: Debugfs interface for per-engine hang recovery.
  drm/i915: Test infrastructure for context state inconsistency
    simulation
  drm/i915: TDR/watchdog trace points.
  drm/i915: Port of Added scheduler support to __wait_request() calls
  drm/i915: Fix __i915_wait_request() behaviour during hang detection.
  drm/i915: Extended error state with TDR count, watchdog count and
    engine reset count
  drm/i915: TDR / per-engine hang recovery kernel docs
  drm/i915: Enable TDR / per-engine hang recovery

 Documentation/DocBook/gpu.tmpl          | 476 ++++++++++++++++++
 drivers/gpu/drm/i915/i915_debugfs.c     | 163 +++++-
 drivers/gpu/drm/i915/i915_dma.c         |  80 +++
 drivers/gpu/drm/i915/i915_drv.c         | 328 ++++++++++++
 drivers/gpu/drm/i915/i915_drv.h         |  90 +++-
 drivers/gpu/drm/i915/i915_gem.c         | 152 +++++-
 drivers/gpu/drm/i915/i915_gpu_error.c   |   8 +-
 drivers/gpu/drm/i915/i915_irq.c         | 263 ++++++++--
 drivers/gpu/drm/i915/i915_params.c      |  19 +
 drivers/gpu/drm/i915/i915_params.h      |   2 +
 drivers/gpu/drm/i915/i915_reg.h         |   9 +
 drivers/gpu/drm/i915/i915_trace.h       | 354 ++++++++++++-
 drivers/gpu/drm/i915/intel_display.c    |   5 +-
 drivers/gpu/drm/i915/intel_lrc.c        | 865 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_lrc.h        |  16 +-
 drivers/gpu/drm/i915/intel_lrc_tdr.h    |  39 ++
 drivers/gpu/drm/i915/intel_ringbuffer.c |  90 +++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  95 ++++
 drivers/gpu/drm/i915/intel_uncore.c     | 197 +++++++-
 include/uapi/drm/i915_drm.h             |   5 +-
 20 files changed, 3134 insertions(+), 122 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h

-- 
1.9.1

* [PATCH 01/20] drm/i915: Make i915_gem_reset_ring_status() public
From: Arun Siluvery @ 2016-01-13 17:28 UTC
  To: intel-gfx; +Cc: Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

Make i915_gem_reset_ring_status() public for use from the engine reset path,
in order to replicate the same behaviour as in full GPU reset but for a
single engine.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h | 2 ++
 drivers/gpu/drm/i915/i915_gem.c | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 104bd18..703a320 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3007,6 +3007,8 @@ static inline bool i915_stop_ring_allow_warn(struct drm_i915_private *dev_priv)
 }
 
 void i915_gem_reset(struct drm_device *dev);
+void i915_gem_reset_ring_status(struct drm_i915_private *dev_priv,
+			        struct intel_engine_cs *ring);
 bool i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
 int __must_check i915_gem_init(struct drm_device *dev);
 int i915_gem_init_rings(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6c60e04..e3cfed2 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2775,8 +2775,8 @@ i915_gem_find_active_request(struct intel_engine_cs *ring)
 	return NULL;
 }
 
-static void i915_gem_reset_ring_status(struct drm_i915_private *dev_priv,
-				       struct intel_engine_cs *ring)
+void i915_gem_reset_ring_status(struct drm_i915_private *dev_priv,
+			        struct intel_engine_cs *ring)
 {
 	struct drm_i915_gem_request *request;
 	bool ring_hung;
-- 
1.9.1

* [PATCH 02/20] drm/i915: Generalise common GPU engine reset request/unrequest code
From: Arun Siluvery @ 2016-01-13 17:28 UTC
  To: intel-gfx; +Cc: Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

GPU engine reset handshaking is applicable to both full GPU reset and engine
reset, the latter being part of the upcoming TDR per-engine hang recovery
patches. Break out the common engine reset request/unrequest code (originally
written by Mika Kuoppala) for reuse later in the TDR enablement patch series.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 46 ++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index c3c13dc..2df4246 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1529,32 +1529,50 @@ static int wait_for_register(struct drm_i915_private *dev_priv,
 	return wait_for((I915_READ(reg) & mask) == value, timeout_ms);
 }
 
+static inline int gen8_request_engine_reset(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->dev->dev_private;
+	int ret = 0;
+
+	I915_WRITE(RING_RESET_CTL(engine->mmio_base),
+		   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
+
+	ret = wait_for_register(dev_priv,
+			      RING_RESET_CTL(engine->mmio_base),
+			      RESET_CTL_READY_TO_RESET,
+			      RESET_CTL_READY_TO_RESET,
+			      700);
+	if (ret)
+		DRM_ERROR("%s: reset request timeout\n", engine->name);
+
+	return ret;
+}
+
+static inline int gen8_unrequest_engine_reset(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->dev->dev_private;
+
+	I915_WRITE(RING_RESET_CTL(engine->mmio_base),
+		_MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
+
+	return 0;
+}
+
 static int gen8_do_reset(struct drm_device *dev)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct intel_engine_cs *engine;
 	int i;
 
-	for_each_ring(engine, dev_priv, i) {
-		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
-			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
-
-		if (wait_for_register(dev_priv,
-				      RING_RESET_CTL(engine->mmio_base),
-				      RESET_CTL_READY_TO_RESET,
-				      RESET_CTL_READY_TO_RESET,
-				      700)) {
-			DRM_ERROR("%s: reset request timeout\n", engine->name);
+	for_each_ring(engine, dev_priv, i)
+		if (gen8_request_engine_reset(engine))
 			goto not_ready;
-		}
-	}
 
 	return gen6_do_reset(dev);
 
 not_ready:
 	for_each_ring(engine, dev_priv, i)
-		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
-			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
+		gen8_unrequest_engine_reset(engine);
 
 	return -EIO;
 }
-- 
1.9.1

* [PATCH 03/20] drm/i915: TDR / per-engine hang recovery support for gen8.
From: Arun Siluvery @ 2016-01-13 17:28 UTC
  To: intel-gfx; +Cc: Ian Lister, Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

TDR = Timeout Detection and Recovery.

This change introduces support for TDR-style per-engine reset as an initial,
less intrusive hang recovery option to be attempted before falling back to the
legacy full GPU reset recovery mode if necessary. Initially we only support
gen8, but adding support for gen7 is straightforward since we've already
established an extensible framework into which gen7 support can be plugged
(add corresponding versions of intel_ring_enable, intel_ring_disable,
intel_ring_save, intel_ring_restore, etc.).

1. Per-engine recovery vs. Full GPU recovery

To capture the state of a single engine being detected as hung there is now a
new flag for every engine that can be set once the decision has been made to
schedule hang recovery for that particular engine. This patch only provides
the hang recovery path, not the hang detection integration, so for now there
is no way of detecting an individual engine as hung and targeting it for
per-engine hang recovery.

The following algorithm is used to determine which recovery mode to use,
given that hang detection has detected a hang on an individual engine and
that per-engine hang recovery has been enabled (which it is not by default):

	1. The error handler checks all engines that have been marked as hung
	by the hang checker and determines how long it has been since
	per-engine hang recovery was last attempted on each currently hung
	engine. If that time falls within a certain window, i.e. the last
	per-engine hang recovery was done too recently, the previous attempt
	is deemed ineffective and the current hang is promoted to a full GPU
	reset. The default value for this time window is 10 seconds, meaning
	any hang happening within 10 seconds of a previous hang on the same
	engine will be promoted to a full GPU reset. (Of course, as long as
	the per-engine hang recovery option is disabled this won't matter and
	the error handler will always go for legacy full GPU reset.)

	2. If the error handler determines that no currently hung engine has
	recently had hang recovery, a per-engine hang recovery is scheduled.

	3. If the decision to go with per-engine hang recovery is not taken,
	or if per-engine hang recovery is attempted but fails for whatever
	reason, TDR falls back to legacy full GPU recovery.

NOTE: Gen7 and earlier will always promote to full GPU reset since there is
currently no per-engine reset support for these gens.
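
The promotion decision in step 1 boils down to a per-engine time-window
check. A minimal sketch (hypothetical helper name; it mirrors the logic this
patch adds to i915_handle_error()):

	/*
	 * Sketch: should a hang on this engine be promoted to a full GPU
	 * reset? last_engine_reset_time holds the timestamp of the last
	 * per-engine reset; i915.gpu_reset_promotion_time is the window
	 * (default: 10 seconds).
	 */
	static bool promote_to_full_reset(struct intel_engine_cs *engine)
	{
		u32 now = get_seconds();
		u32 diff = now - engine->hangcheck.last_engine_reset_time;

		engine->hangcheck.last_engine_reset_time = now;

		/*
		 * Hung again within the window: the per-engine reset was
		 * evidently ineffective, so fall back to full GPU reset.
		 */
		return diff < i915.gpu_reset_promotion_time;
	}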

2. Context Submission Status Consistency.

Per-engine hang recovery on gen8 (or execlist submission mode in general)
relies on the basic concept of context submission status consistency: the
hardware's and the driver's view of the submission status of the currently
running context on any engine must agree. For example, when submitting a
context to the corresponding ELSP
port of an engine we expect the owning request of that context to be at the
head of the corresponding execution list queue. Likewise, as long as the
context is executing on the GPU we expect the EXECLIST_STATUS register and the
context status buffer (CSB) to reflect this. Thus, if the context submission
status is consistent the ID of the currently executing context should be in
EXECLIST_STATUS and it should be consistent with the context of the head
request element in the execution list queue corresponding to that engine.

The reason why this is important for per-engine hang recovery in execlist mode
is because this recovery mode relies on context resubmission in order to resume
execution following the recovery. If a context has been determined to be hung
and the per-engine hang recovery mode is engaged, leading to the resubmission
of that context, it's important that the hardware is not in fact busy doing
something else or sitting idle, since a resubmission in such a state could
cause unforeseen side-effects such as unexpected preemptions.

There are rare, although consistently reproducible, situations that have shown
up in practice where the driver and hardware are no longer consistent with each
other, e.g. due to lost context completion interrupts, after which the hardware
would be idle but the driver would still think that a context is active.
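
Conceptually, checking for consistency means comparing the context ID that the
hardware reports in the EXECLIST_STATUS register with the context of the head
request in the driver's execlist queue. A rough sketch (the helper that reads
the EXECLIST_STATUS context ID is a stand-in; the real check is
intel_execlists_TDR_get_current_request(), added below):

	static bool context_submission_consistent(struct intel_engine_cs *ring)
	{
		struct drm_i915_gem_request *head_req =
			list_first_entry_or_null(&ring->execlist_queue,
						 struct drm_i915_gem_request,
						 execlist_link);
		/*
		 * Stand-in for reading the context ID dword of the
		 * EXECLIST_STATUS register of this engine.
		 */
		u32 hw_ctx_id = read_execlist_status_ctx_id(ring);

		if (!head_req)
			/* Consistent only if the hardware is idle too. */
			return hw_ctx_id == 0;

		return hw_ctx_id ==
		       intel_execlists_ctx_id(head_req->ctx->engine[ring->id].state);
	}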

3. There is a new reset path for engine reset alongside the legacy full GPU
reset path. This path does the following:

	1) Check for context submission consistency to make sure that the
	context that the hardware is currently stuck on is actually what the
	driver is working on. If not, then clearly we're not in a consistently
	hung state and we bail out early.

	2) Disable/idle the engine. This is done through reset handshaking on
	gen8+ unlike earlier gens where this was done by clearing the ring
	valid bits in MI_MODE and ring control registers, which are no longer
	supported on gen8+. Reset handshaking translates to setting the reset
	request bit in the reset control register.

	3) Save the current engine state. On gen8 this translates to simply
	reading the current value of the head register and nudging it so that
	it points to the next valid instruction in the ring buffer. Since we
	assume that execution is currently stuck in a batch buffer, the effect
	of this is that the batch buffer start instruction of the hung batch
	buffer is skipped, so that when execution resumes following hang
	recovery completion it resumes immediately after the batch buffer (see
	the head-nudging sketch after this list).

	This effectively means that we're forcefully terminating the currently
	active, hung batch buffer. Obviously, the outcome of this intervention
	is potentially undefined but there are not many good options in this
	scenario. It's better than resetting the entire GPU in the vast
	majority of cases.

	Save the nudged head value to be applied later.

	4) Reset the engine.

	5) Apply the nudged head value to the head register.

	6) Reenable the engine. For gen8 this means resubmitting the fixed-up
	context, allowing execution to resume. In order to resubmit a context
	without relying on the currently hung execlist queue we use a new,
	privileged API that is dedicated to TDR use only. This submission API
	bypasses any currently queued work and gets exclusive access to the
	ELSP ports.

	7) If the engine hang recovery procedure fails at any point in between
	disablement and reenablement of the engine there is a back-off
	procedure: For gen8 it's possible to back out of the reset handshake by
	clearing the reset request bit in the reset control register.
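
To make the head nudge in step 3 concrete: it is just QWORD-alignment
arithmetic on the head register value. A simplified sketch of what
gen8_ring_save() below does (ring wrap and tail clamping omitted):

	/*
	 * head typically points at the last DWORD of the hung BB_START
	 * instruction, which the driver pads with an MI_NOOP. Rounding
	 * up to the next QWORD boundary therefore skips the hung batch
	 * buffer so that execution resumes right after it.
	 */
	head_addr = head & HEAD_ADDR;
	if (force_advance)
		/* Force to the next QWORD even if already aligned. */
		head_addr = (head_addr & ~0x7) + 8;
	else if (head_addr & 0x7)
		/* Align up to the next QWORD boundary. */
		head_addr = (head_addr + 0x7) & ~0x7;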

NOTE:
It's possible that some of Ben Widawsky's original per-engine reset patches
from 3 years ago are in this commit, but since this work has gone through the
hands of at least 3 people already, any kind of ownership tracking was lost a
long time ago. If you think that you should be on the SOB list just let me
know.

* RFCv2: (Chris Wilson / Daniel Vetter)
- Simply use the previously private function i915_gem_reset_ring_status() from
  the engine hang recovery path to set active/pending context status. This
  replicates the same behaviour as in full GPU reset but for a single,
  targeted engine.

- Remove all additional uevents for both full GPU reset and per-engine reset.
  Adapted uevent behaviour to the new per-engine hang recovery mode in that it
  will only send one uevent regardless of which form of recovery is employed.
  If a per-engine reset is attempted first then one uevent will be dispatched.
  If that recovery mode fails and the hang is promoted to a full GPU reset no
  further uevents will be dispatched at that point.

- Tidied up the TDR context resubmission path in intel_lrc.c. Reduced the
  amount of duplication by relying entirely on the normal unqueue function.
  Added a new parameter to the unqueue function that takes into consideration
  if the unqueue call is for a first-time context submission or a resubmission
  and adapts the handling of elsp_submitted accordingly. The reason for
  this is that for context resubmission we don't expect any further
  interrupts for the submission or the following context completion. A more
  elegant way of handling this would be to phase out elsp_submitted
  altogether, however that's part of a LRC/execlist cleanup effort that is
  happening independently of this patch series. For now we make this change
  as simple as possible with as few non-TDR-related side-effects as
  possible.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_dma.c         |  18 +
 drivers/gpu/drm/i915/i915_drv.c         | 206 ++++++++++++
 drivers/gpu/drm/i915/i915_drv.h         |  58 ++++
 drivers/gpu/drm/i915/i915_irq.c         | 169 +++++++++-
 drivers/gpu/drm/i915/i915_params.c      |  19 ++
 drivers/gpu/drm/i915/i915_params.h      |   2 +
 drivers/gpu/drm/i915/i915_reg.h         |   2 +
 drivers/gpu/drm/i915/intel_lrc.c        | 565 +++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_lrc.h        |  14 +
 drivers/gpu/drm/i915/intel_lrc_tdr.h    |  36 ++
 drivers/gpu/drm/i915/intel_ringbuffer.c |  84 ++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  64 ++++
 drivers/gpu/drm/i915/intel_uncore.c     | 147 +++++++++
 13 files changed, 1358 insertions(+), 26 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 44a896c..c45ec353 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -837,6 +837,22 @@ static void intel_device_info_runtime_init(struct drm_device *dev)
 			 info->has_eu_pg ? "y" : "n");
 }
 
+static void
+i915_hangcheck_init(struct drm_device *dev)
+{
+	int i;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+
+	for (i = 0; i < I915_NUM_RINGS; i++) {
+		struct intel_engine_cs *engine = &dev_priv->ring[i];
+		struct intel_ring_hangcheck *hc = &engine->hangcheck;
+
+		i915_hangcheck_reinit(engine);
+		hc->reset_count = 0;
+		hc->tdr_count = 0;
+	}
+}
+
 static void intel_init_dpio(struct drm_i915_private *dev_priv)
 {
 	/*
@@ -1034,6 +1050,8 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
 
 	i915_gem_load(dev);
 
+	i915_hangcheck_init(dev);
+
 	/* On the 945G/GM, the chipset reports the MSI capability on the
 	 * integrated graphics even though the support isn't actually there
 	 * according to the published specs.  It doesn't appear to function
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index f17a2b0..c0ad003 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -34,6 +34,7 @@
 #include "i915_drv.h"
 #include "i915_trace.h"
 #include "intel_drv.h"
+#include "intel_lrc_tdr.h"
 
 #include <linux/console.h>
 #include <linux/module.h>
@@ -571,6 +572,7 @@ static int i915_drm_suspend(struct drm_device *dev)
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	pci_power_t opregion_target_state;
 	int error;
+	int i;
 
 	/* ignore lid events during suspend */
 	mutex_lock(&dev_priv->modeset_restore_lock);
@@ -596,6 +598,16 @@ static int i915_drm_suspend(struct drm_device *dev)
 
 	intel_guc_suspend(dev);
 
+	/*
+	 * Clear any pending reset requests. They should be picked up
+	 * after resume when new work is submitted
+	 */
+	for (i = 0; i < I915_NUM_RINGS; i++)
+		atomic_set(&dev_priv->ring[i].hangcheck.flags, 0);
+
+	atomic_clear_mask(I915_RESET_IN_PROGRESS_FLAG,
+		&dev_priv->gpu_error.reset_counter);
+
 	intel_suspend_gt_powersave(dev);
 
 	/*
@@ -948,6 +960,200 @@ int i915_reset(struct drm_device *dev)
 	return 0;
 }
 
+/**
+ * i915_reset_engine - reset GPU engine after a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected. Returns zero on successful
+ * reset or otherwise an error code.
+ *
+ * Procedure is fairly simple:
+ *
+ *	- Force engine to idle.
+ *
+ *	- Save current head register value and nudge it past the point of the hang in the
+ *	  ring buffer, which is typically the BB_START instruction of the hung batch buffer,
+ *	  on to the following instruction.
+ *
+ *	- Reset engine.
+ *
+ *	- Restore the previously saved, nudged head register value.
+ *
+ *	- Re-enable engine to resume running. On gen8 this requires the previously hung
+ *	  context to be resubmitted to ELSP via the dedicated TDR-execlists interface.
+ *
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+	struct drm_device *dev = engine->dev;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct drm_i915_gem_request *current_request = NULL;
+	uint32_t head;
+	bool force_advance = false;
+	int ret = 0;
+	int err_ret = 0;
+
+	WARN_ON(!mutex_is_locked(&dev->struct_mutex));
+
+	/* Take wake lock to prevent power saving mode */
+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+	i915_gem_reset_ring_status(dev_priv, engine);
+
+	if (i915.enable_execlists) {
+		enum context_submission_status status =
+			intel_execlists_TDR_get_current_request(engine, NULL);
+
+		/*
+		 * If the context submission state in hardware is not
+		 * consistent with the corresponding state in the driver, or
+		 * if for some reason there is no current context in the
+		 * process of being submitted, then bail out and try again. Do
+		 * not proceed unless we have reliable current context state
+		 * information. The reason why this is important is because
+		 * per-engine hang recovery relies on context resubmission in
+		 * order to force the execution to resume following the hung
+		 * batch buffer. If the hardware is not currently running the
+		 * same context as the driver thinks is hung then anything can
+		 * happen at the point of context resubmission, e.g. unexpected
+		 * preemptions or the previously hung context could be
+		 * submitted when the hardware is idle which makes no sense.
+		 */
+		if (status != CONTEXT_SUBMISSION_STATUS_OK) {
+			ret = -EAGAIN;
+			goto reset_engine_error;
+		}
+	}
+
+	ret = intel_ring_disable(engine);
+	if (ret != 0) {
+		DRM_ERROR("Failed to disable %s\n", engine->name);
+		goto reset_engine_error;
+	}
+
+	if (i915.enable_execlists) {
+		enum context_submission_status status;
+		bool inconsistent;
+
+		status = intel_execlists_TDR_get_current_request(engine,
+				&current_request);
+
+		inconsistent = (status != CONTEXT_SUBMISSION_STATUS_OK);
+		if (inconsistent) {
+			/*
+			 * If we somehow have reached this point with
+			 * an inconsistent context submission status then
+			 * back out of the previously requested reset and
+			 * retry later.
+			 */
+			WARN(inconsistent,
+			     "Inconsistent context status on %s: %u\n",
+			     engine->name, status);
+
+			ret = -EAGAIN;
+			goto reenable_reset_engine_error;
+		}
+	}
+
+	/* Sample the current ring head position */
+	head = I915_READ_HEAD(engine) & HEAD_ADDR;
+
+	if (head == engine->hangcheck.last_head) {
+		/*
+		 * The engine has not advanced since the last
+		 * time it hung so force it to advance to the
+		 * next QWORD. In most cases the engine head
+		 * pointer will automatically advance to the
+		 * next instruction as soon as it has read the
+		 * current instruction, without waiting for it
+		 * to complete. This seems to be the default
+		 * behaviour, however an MBOX wait inserted
+		 * directly to the VCS/BCS engines does not behave
+		 * in the same way, instead the head pointer
+		 * will still be pointing at the MBOX instruction
+		 * until it completes.
+		 */
+		force_advance = true;
+	}
+
+	engine->hangcheck.last_head = head;
+
+	ret = intel_ring_save(engine, current_request, force_advance);
+	if (ret) {
+		DRM_ERROR("Failed to save %s engine state\n", engine->name);
+		goto reenable_reset_engine_error;
+	}
+
+	ret = intel_gpu_engine_reset(engine);
+	if (ret) {
+		DRM_ERROR("Failed to reset %s\n", engine->name);
+		goto reenable_reset_engine_error;
+	}
+
+	ret = intel_ring_restore(engine, current_request);
+	if (ret) {
+		DRM_ERROR("Failed to restore %s engine state\n", engine->name);
+		goto reenable_reset_engine_error;
+	}
+
+	/* Correct driver state */
+	intel_gpu_engine_reset_resample(engine, current_request);
+
+	/*
+	 * Reenable engine
+	 *
+	 * In execlist mode on gen8+ this is implicit by simply resubmitting
+	 * the previously hung context. In ring buffer submission mode on gen7
+	 * and earlier we need to actively turn on the engine first.
+	 */
+	if (i915.enable_execlists)
+		intel_execlists_TDR_context_resubmission(engine);
+	else
+		ret = intel_ring_enable(engine);
+
+	if (ret) {
+		DRM_ERROR("Failed to enable %s again after reset\n",
+			engine->name);
+
+		goto reset_engine_error;
+	}
+
+	/* Clear reset flags to allow future hangchecks */
+	atomic_set(&engine->hangcheck.flags, 0);
+
+	/* Wake up anything waiting on this engine's queue */
+	wake_up_all(&engine->irq_queue);
+
+	if (i915.enable_execlists && current_request)
+		i915_gem_request_unreference(current_request);
+
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+
+	return ret;
+
+reenable_reset_engine_error:
+
+	err_ret = intel_ring_enable(engine);
+	if (err_ret)
+		DRM_ERROR("Failed to reenable %s following error during reset (%d)\n",
+			engine->name, err_ret);
+
+reset_engine_error:
+
+	/* Clear reset flags to allow future hangchecks */
+	atomic_set(&engine->hangcheck.flags, 0);
+
+	/* Wake up anything waiting on this engine's queue */
+	wake_up_all(&engine->irq_queue);
+
+	if (i915.enable_execlists && current_request)
+		i915_gem_request_unreference(current_request);
+
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+
+	return ret;
+}
+
 static int i915_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
 	struct intel_device_info *intel_info =
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 703a320..e866f14 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2432,6 +2432,48 @@ struct drm_i915_cmd_table {
 	int count;
 };
 
+/*
+ * Context submission status
+ *
+ * CONTEXT_SUBMISSION_STATUS_OK:
+ *	Context submitted to ELSP and state of execlist queue is the same as
+ *	the state of EXECLIST_STATUS register. Software and hardware states
+ *	are consistent and can be trusted.
+ *
+ * CONTEXT_SUBMISSION_STATUS_INCONSISTENT:
+ *	Context has been submitted to the execlist queue but the state of the
+ *	EXECLIST_STATUS register is different from the execlist queue state.
+ *	This could mean any of the following:
+ *
+ *		1. The context is in the head position of the execlist queue
+ *		   but has not yet been submitted to ELSP.
+ *
+ *		2. The hardware just recently completed the context but the
+ *		   context is pending removal from the execlist queue.
+ *
+ *		3. The driver has lost a context state transition interrupt.
+ *		   Typically what this means is that hardware has completed and
+ *		   is now idle but the driver thinks the hardware is still
+ *		   busy.
+ *
+ *	Overall what this means is that the context submission status is
+ *	currently in transition and cannot be trusted until it settles down.
+ *
+ * CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED:
+ *	No context submitted to the execlist queue and the EXECLIST_STATUS
+ *	register shows no context being processed.
+ *
+ * CONTEXT_SUBMISSION_STATUS_UNDEFINED:
+ *	Initial state before submission status has been determined.
+ *
+ */
+enum context_submission_status {
+	CONTEXT_SUBMISSION_STATUS_OK = 0,
+	CONTEXT_SUBMISSION_STATUS_INCONSISTENT,
+	CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED,
+	CONTEXT_SUBMISSION_STATUS_UNDEFINED
+};
+
 /* Note that the (struct drm_i915_private *) cast is just to shut up gcc. */
 #define __I915__(p) ({ \
 	struct drm_i915_private *__p; \
@@ -2690,8 +2732,12 @@ extern long i915_compat_ioctl(struct file *filp, unsigned int cmd,
 			      unsigned long arg);
 #endif
 extern int intel_gpu_reset(struct drm_device *dev);
+extern int intel_gpu_engine_reset(struct intel_engine_cs *engine);
+extern int intel_request_gpu_engine_reset(struct intel_engine_cs *engine);
+extern int intel_unrequest_gpu_engine_reset(struct intel_engine_cs *engine);
 extern bool intel_has_gpu_reset(struct drm_device *dev);
 extern int i915_reset(struct drm_device *dev);
+extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern unsigned long i915_chipset_val(struct drm_i915_private *dev_priv);
 extern unsigned long i915_mch_val(struct drm_i915_private *dev_priv);
 extern unsigned long i915_gfx_val(struct drm_i915_private *dev_priv);
@@ -2704,6 +2750,18 @@ void intel_hpd_init(struct drm_i915_private *dev_priv);
 void intel_hpd_init_work(struct drm_i915_private *dev_priv);
 void intel_hpd_cancel_work(struct drm_i915_private *dev_priv);
 bool intel_hpd_pin_to_port(enum hpd_pin pin, enum port *port);
+static inline void i915_hangcheck_reinit(struct intel_engine_cs *engine)
+{
+	struct intel_ring_hangcheck *hc = &engine->hangcheck;
+
+	hc->acthd = 0;
+	hc->max_acthd = 0;
+	hc->seqno = 0;
+	hc->score = 0;
+	hc->action = HANGCHECK_IDLE;
+	hc->deadlock = 0;
+}
+
 
 /* i915_irq.c */
 void i915_queue_hangcheck(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index f04d799..6a0ec37 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2470,10 +2470,70 @@ static void i915_reset_and_wakeup(struct drm_device *dev)
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
 	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
 	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
-	int ret;
+	bool reset_complete = false;
+	struct intel_engine_cs *ring;
+	int ret = 0;
+	int i;
+
+	mutex_lock(&dev->struct_mutex);
 
 	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, error_event);
 
+	for_each_ring(ring, dev_priv, i) {
+
+		/*
+		 * Skip further individual engine reset requests if full GPU
+		 * reset requested.
+		 */
+		if (i915_reset_in_progress(error))
+			break;
+
+		if (atomic_read(&ring->hangcheck.flags) &
+			I915_ENGINE_RESET_IN_PROGRESS) {
+
+			if (!reset_complete)
+				kobject_uevent_env(&dev->primary->kdev->kobj,
+						   KOBJ_CHANGE,
+						   reset_event);
+
+			reset_complete = true;
+
+			ret = i915_reset_engine(ring);
+
+			/*
+			 * Execlist mode only:
+			 *
+			 * -EAGAIN means that between detecting a hang (and
+			 * also determining that the currently submitted
+			 * context is stable and valid) and trying to recover
+			 * from the hang the current context changed state.
+			 * This means that we are probably not completely hung
+			 * after all. Just fail and retry by exiting all the
+			 * way back and wait for the next hang detection. If we
+			 * have a true hang on our hands then we will detect it
+			 * again, otherwise we will continue like nothing
+			 * happened.
+			 */
+			if (ret == -EAGAIN) {
+				DRM_ERROR("Reset of %s aborted due to " \
+					  "change in context submission " \
+					  "state - retrying!", ring->name);
+				ret = 0;
+			}
+
+			if (ret) {
+				DRM_ERROR("Reset of %s failed! (%d)", ring->name, ret);
+
+				atomic_or(I915_RESET_IN_PROGRESS_FLAG,
+					&dev_priv->gpu_error.reset_counter);
+				break;
+			}
+		}
+	}
+
+	/* The full GPU reset will grab the struct_mutex when it needs it */
+	mutex_unlock(&dev->struct_mutex);
+
 	/*
 	 * Note that there's only one work item which does gpu resets, so we
 	 * need not worry about concurrent gpu resets potentially incrementing
@@ -2486,8 +2546,13 @@ static void i915_reset_and_wakeup(struct drm_device *dev)
 	 */
 	if (i915_reset_in_progress(error) && !i915_terminally_wedged(error)) {
 		DRM_DEBUG_DRIVER("resetting chip\n");
-		kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE,
-				   reset_event);
+
+		if (!reset_complete)
+			kobject_uevent_env(&dev->primary->kdev->kobj,
+					   KOBJ_CHANGE,
+					   reset_event);
+
+		reset_complete = true;
 
 		/*
 		 * In most cases it's guaranteed that we get here with an RPM
@@ -2520,23 +2585,36 @@ static void i915_reset_and_wakeup(struct drm_device *dev)
 			 *
 			 * Since unlock operations are a one-sided barrier only,
 			 * we need to insert a barrier here to order any seqno
-			 * updates before
-			 * the counter increment.
+			 * updates before the counter increment.
+			 *
+			 * The increment clears I915_RESET_IN_PROGRESS_FLAG.
 			 */
 			smp_mb__before_atomic();
 			atomic_inc(&dev_priv->gpu_error.reset_counter);
 
-			kobject_uevent_env(&dev->primary->kdev->kobj,
-					   KOBJ_CHANGE, reset_done_event);
+			/*
+			 * If any per-engine resets were promoted to full GPU
+			 * reset don't forget to clear those reset flags.
+			 */
+			for_each_ring(ring, dev_priv, i)
+				atomic_set(&ring->hangcheck.flags, 0);
 		} else {
+			/* Terminal wedge condition */
+			WARN(1, "i915_reset failed, declaring GPU as wedged!\n");
 			atomic_or(I915_WEDGED, &error->reset_counter);
 		}
+	}
 
-		/*
-		 * Note: The wake_up also serves as a memory barrier so that
-		 * waiters see the update value of the reset counter atomic_t.
-		 */
+	/*
+	 * Note: The wake_up also serves as a memory barrier so that
+	 * waiters see the update value of the reset counter atomic_t.
+	 */
+	if (reset_complete) {
 		i915_error_wake_up(dev_priv, true);
+
+		if (ret == 0)
+			kobject_uevent_env(&dev->primary->kdev->kobj,
+					   KOBJ_CHANGE, reset_done_event);
 	}
 }
 
@@ -2649,6 +2727,14 @@ void i915_handle_error(struct drm_device *dev, bool wedged,
 	va_list args;
 	char error_msg[80];
 
+	struct intel_engine_cs *engine;
+
+	/*
+	 * NB: Placeholder until the hang checker supports
+	 * per-engine hang detection.
+	 */
+	u32 engine_mask = 0;
+
 	va_start(args, fmt);
 	vscnprintf(error_msg, sizeof(error_msg), fmt, args);
 	va_end(args);
@@ -2657,8 +2743,65 @@ void i915_handle_error(struct drm_device *dev, bool wedged,
 	i915_report_and_clear_eir(dev);
 
 	if (wedged) {
-		atomic_or(I915_RESET_IN_PROGRESS_FLAG,
-				&dev_priv->gpu_error.reset_counter);
+		/*
+		 * Defer to full GPU reset if any of the following is true:
+		 *	0. Engine reset disabled.
+		 * 	1. The caller did not ask for per-engine reset.
+		 *	2. The hardware does not support it (pre-gen7).
+		 *	3. We already tried per-engine reset recently.
+		 */
+		bool full_reset = true;
+
+		if (!i915.enable_engine_reset) {
+			DRM_INFO("Engine reset disabled: Using full GPU reset.\n");
+			engine_mask = 0x0;
+		}
+
+		/*
+		 * TBD: We currently only support per-engine reset for gen8+.
+		 * Implement support for gen7.
+		 */
+		if (engine_mask && (INTEL_INFO(dev)->gen >= 8)) {
+			u32 i;
+
+			for_each_ring(engine, dev_priv, i) {
+				u32 now, last_engine_reset_timediff;
+
+				if (!(intel_ring_flag(engine) & engine_mask))
+					continue;
+
+				/* Measure the time since this engine was last reset */
+				now = get_seconds();
+				last_engine_reset_timediff =
+					now - engine->hangcheck.last_engine_reset_time;
+
+				full_reset = last_engine_reset_timediff <
+					i915.gpu_reset_promotion_time;
+
+				engine->hangcheck.last_engine_reset_time = now;
+
+				/*
+				 * This engine was not reset too recently - go ahead
+				 * with engine reset instead of falling back to full
+				 * GPU reset.
+				 *
+				 * Flag that we want to try and reset this engine.
+				 * This can still be overridden by a global
+				 * reset e.g. if per-engine reset fails.
+				 */
+				if (!full_reset)
+					atomic_or(I915_ENGINE_RESET_IN_PROGRESS,
+						&engine->hangcheck.flags);
+				else
+					break;
+
+			} /* for_each_ring */
+		}
+
+		if (full_reset) {
+			atomic_or(I915_RESET_IN_PROGRESS_FLAG,
+					&dev_priv->gpu_error.reset_counter);
+		}
 
 		/*
 		 * Wakeup waiting processes so that the reset function
diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 8d90c25..5cf9c11 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -37,6 +37,8 @@ struct i915_params i915 __read_mostly = {
 	.enable_fbc = -1,
 	.enable_execlists = -1,
 	.enable_hangcheck = true,
+	.enable_engine_reset = false,
+	.gpu_reset_promotion_time = 10,
 	.enable_ppgtt = -1,
 	.enable_psr = 0,
 	.preliminary_hw_support = IS_ENABLED(CONFIG_DRM_I915_PRELIMINARY_HW_SUPPORT),
@@ -116,6 +118,23 @@ MODULE_PARM_DESC(enable_hangcheck,
 	"WARNING: Disabling this can cause system wide hangs. "
 	"(default: true)");
 
+module_param_named_unsafe(enable_engine_reset, i915.enable_engine_reset, bool, 0644);
+MODULE_PARM_DESC(enable_engine_reset,
+	"Enable GPU engine hang recovery mode. Used as a soft, low-impact form "
+	"of hang recovery that targets individual GPU engines rather than the "
+	"entire GPU"
+	"(default: false)");
+
+module_param_named(gpu_reset_promotion_time,
+               i915.gpu_reset_promotion_time, int, 0644);
+MODULE_PARM_DESC(gpu_reset_promotion_time,
+               "Catch excessive engine resets. Each engine maintains a "
+	       "timestamp of the last time it was reset. If it hangs again "
+	       "within this period then fall back to full GPU reset to try and"
+	       " recover from the hang. Only applicable if enable_engine_reset "
+	       "is enabled."
+               "default=10 seconds");
+
 module_param_named_unsafe(enable_ppgtt, i915.enable_ppgtt, int, 0400);
 MODULE_PARM_DESC(enable_ppgtt,
 	"Override PPGTT usage. "
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index 5299290..60f3d23 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -49,8 +49,10 @@ struct i915_params {
 	int use_mmio_flip;
 	int mmio_debug;
 	int edp_vswing;
+	unsigned int gpu_reset_promotion_time;
 	/* leave bools at the end to not create holes */
 	bool enable_hangcheck;
+	bool enable_engine_reset;
 	bool fastboot;
 	bool prefault_disable;
 	bool load_detect_test;
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 0a98889..3fc5d75 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -164,6 +164,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
 #define  GEN6_GRDOM_RENDER		(1 << 1)
 #define  GEN6_GRDOM_MEDIA		(1 << 2)
 #define  GEN6_GRDOM_BLT			(1 << 3)
+#define  GEN6_GRDOM_VECS		(1 << 4)
+#define  GEN8_GRDOM_MEDIA2		(1 << 7)
 
 #define RING_PP_DIR_BASE(ring)		_MMIO((ring)->mmio_base+0x228)
 #define RING_PP_DIR_BASE_READ(ring)	_MMIO((ring)->mmio_base+0x518)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ab344e0..fcec476 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -136,6 +136,7 @@
 #include <drm/i915_drm.h>
 #include "i915_drv.h"
 #include "intel_mocs.h"
+#include "intel_lrc_tdr.h"
 
 #define GEN9_LR_CONTEXT_RENDER_SIZE (22 * PAGE_SIZE)
 #define GEN8_LR_CONTEXT_RENDER_SIZE (20 * PAGE_SIZE)
@@ -325,7 +326,8 @@ uint64_t intel_lr_context_descriptor(struct intel_context *ctx,
 }
 
 static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
-				 struct drm_i915_gem_request *rq1)
+				 struct drm_i915_gem_request *rq1,
+				 bool tdr_resubmission)
 {
 
 	struct intel_engine_cs *ring = rq0->ring;
@@ -335,13 +337,17 @@ static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
 
 	if (rq1) {
 		desc[1] = intel_lr_context_descriptor(rq1->ctx, rq1->ring);
-		rq1->elsp_submitted++;
+
+		if (!tdr_resubmission)
+			rq1->elsp_submitted++;
 	} else {
 		desc[1] = 0;
 	}
 
 	desc[0] = intel_lr_context_descriptor(rq0->ctx, rq0->ring);
-	rq0->elsp_submitted++;
+
+	if (!tdr_resubmission)
+		rq0->elsp_submitted++;
 
 	/* You must always write both descriptors in the order below. */
 	spin_lock(&dev_priv->uncore.lock);
@@ -359,6 +365,182 @@ static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
 	spin_unlock(&dev_priv->uncore.lock);
 }
 
+/**
+ * execlist_get_context_reg_page() - Get memory page for context object
+ * @engine: engine
+ * @ctx: context running on engine
+ * @page: returned page
+ *
+ * Return: 0 if successful, otherwise propagates error codes.
+ */
+static inline int execlist_get_context_reg_page(struct intel_engine_cs *engine,
+		struct intel_context *ctx,
+		struct page **page)
+{
+	struct drm_i915_gem_object *ctx_obj;
+
+	if (!page)
+		return -EINVAL;
+
+	if (!ctx)
+		ctx = engine->default_context;
+
+	ctx_obj = ctx->engine[engine->id].state;
+
+	if (WARN(!ctx_obj, "Context object not set up!\n"))
+		return -EINVAL;
+
+	WARN(!i915_gem_obj_is_pinned(ctx_obj),
+	     "Context object is not pinned!\n");
+
+	*page = i915_gem_object_get_page(ctx_obj, LRC_STATE_PN);
+
+	if (WARN(!*page, "Context object page could not be resolved!\n"))
+		return -EINVAL;
+
+	return 0;
+}
+
+/**
+ * execlists_write_context_reg() - Write value to Context register
+ * @engine: Engine
+ * @ctx: Context running on engine
+ * @ctx_reg: Index into context image pointing to register location
+ * @mmio_reg: MMIO register struct
+ * @val: Value to be written
+ * @mmio_reg_name_str: Designated register name
+ *
+ * Return: 0 if successful, otherwise propagates error codes.
+ */
+static inline int execlists_write_context_reg(struct intel_engine_cs *engine,
+					      struct intel_context *ctx,
+					      u32 ctx_reg,
+					      i915_reg_t mmio_reg,
+					      u32 val,
+					      const char *mmio_reg_name_str)
+{
+	struct page *page = NULL;
+	uint32_t *reg_state;
+
+	int ret = execlist_get_context_reg_page(engine, ctx, &page);
+	if (WARN(ret, "[write %s:%u] Failed to get context memory page for %s!\n",
+		 mmio_reg_name_str, (unsigned int) mmio_reg.reg, engine->name)) {
+		return ret;
+	}
+
+	reg_state = kmap_atomic(page);
+
+	WARN(reg_state[ctx_reg] != mmio_reg.reg,
+	     "[write %s:%u]: Context reg addr (%x) != MMIO reg addr (%x)!\n",
+	     mmio_reg_name_str,
+	     (unsigned int) mmio_reg.reg,
+	     (unsigned int) reg_state[ctx_reg],
+	     (unsigned int) mmio_reg.reg);
+
+	reg_state[ctx_reg+1] = val;
+	kunmap_atomic(reg_state);
+
+	return ret;
+}
+
+/**
+ * execlists_read_context_reg() - Read value from Context register
+ * @engine: Engine
+ * @ctx: Context running on engine
+ * @ctx_reg: Index into context image pointing to register location
+ * @mmio_reg: MMIO register struct
+ * @val: Output parameter returning register value
+ * @mmio_reg_name_str: Designated register name
+ *
+ * Return: 0 if successful, otherwise propagates error codes.
+ */
+static inline int execlists_read_context_reg(struct intel_engine_cs *engine,
+					     struct intel_context *ctx,
+					     u32 ctx_reg,
+					     i915_reg_t mmio_reg,
+					     u32 *val,
+					     const char *mmio_reg_name_str)
+{
+	struct page *page = NULL;
+	uint32_t *reg_state;
+	int ret = 0;
+
+	if (!val)
+		return -EINVAL;
+
+	ret = execlist_get_context_reg_page(engine, ctx, &page);
+	if (WARN(ret, "[read %s:%u] Failed to get context memory page for %s!\n",
+		 mmio_reg_name_str, (unsigned int) mmio_reg.reg, engine->name)) {
+		return ret;
+	}
+
+	reg_state = kmap_atomic(page);
+
+	WARN(reg_state[ctx_reg] != mmio_reg.reg,
+	     "[read %s:%u]: Context reg addr (%x) != MMIO reg addr (%x)!\n",
+	     mmio_reg_name_str,
+	     (unsigned int) ctx_reg,
+	     (unsigned int) reg_state[ctx_reg],
+	     (unsigned int) mmio_reg.reg);
+
+	*val = reg_state[ctx_reg+1];
+	kunmap_atomic(reg_state);
+
+	return ret;
+}
+
+/*
+ * Generic macros for generating function implementation for context register
+ * read/write functions.
+ *
+ * Macro parameters
+ * ----------------
+ * reg_name: Designated name of context register (e.g. tail, head, buffer_ctl)
+ *
+ * reg_def: Context register macro definition (e.g. CTX_RING_TAIL)
+ *
+ * mmio_reg_def: Name of macro function used to determine the address
+ *		 of the corresponding MMIO register (e.g. RING_TAIL, RING_HEAD).
+ *		 This macro function is assumed to be defined on the form of:
+ *
+ *			#define mmio_reg_def(base) (base+register_offset)
+ *
+ *		 Where "base" is the MMIO base address of the respective ring
+ *		 and "register_offset" is the offset relative to "base".
+ *
+ * Function parameters
+ * -------------------
+ * engine: The engine that the context is running on
+ * ctx: The context of the register that is to be accessed
+ * reg_name: Value to be written/read to/from the register.
+ */
+#define INTEL_EXECLISTS_WRITE_REG(reg_name, reg_def, mmio_reg_def) \
+	int intel_execlists_write_##reg_name(struct intel_engine_cs *engine, \
+					     struct intel_context *ctx, \
+					     u32 reg_name) \
+{ \
+	return execlists_write_context_reg(engine, ctx, (reg_def), \
+			mmio_reg_def(engine->mmio_base), (reg_name), \
+			(#reg_name)); \
+}
+
+#define INTEL_EXECLISTS_READ_REG(reg_name, reg_def, mmio_reg_def) \
+	int intel_execlists_read_##reg_name(struct intel_engine_cs *engine, \
+					    struct intel_context *ctx, \
+					    u32 *reg_name) \
+{ \
+	return execlists_read_context_reg(engine, ctx, (reg_def), \
+			mmio_reg_def(engine->mmio_base), (reg_name), \
+			(#reg_name)); \
+}
+
+INTEL_EXECLISTS_READ_REG(tail, CTX_RING_TAIL, RING_TAIL)
+INTEL_EXECLISTS_WRITE_REG(head, CTX_RING_HEAD, RING_HEAD)
+INTEL_EXECLISTS_READ_REG(head, CTX_RING_HEAD, RING_HEAD)
+
+#undef INTEL_EXECLISTS_READ_REG
+#undef INTEL_EXECLISTS_WRITE_REG
+
 static int execlists_update_context(struct drm_i915_gem_request *rq)
 {
 	struct intel_engine_cs *ring = rq->ring;
@@ -396,17 +578,18 @@ static int execlists_update_context(struct drm_i915_gem_request *rq)
 }
 
 static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
-				      struct drm_i915_gem_request *rq1)
+				      struct drm_i915_gem_request *rq1,
+				      bool tdr_resubmission)
 {
 	execlists_update_context(rq0);
 
 	if (rq1)
 		execlists_update_context(rq1);
 
-	execlists_elsp_write(rq0, rq1);
+	execlists_elsp_write(rq0, rq1, tdr_resubmission);
 }
 
-static void execlists_context_unqueue(struct intel_engine_cs *ring)
+static void execlists_context_unqueue(struct intel_engine_cs *ring, bool tdr_resubmission)
 {
 	struct drm_i915_gem_request *req0 = NULL, *req1 = NULL;
 	struct drm_i915_gem_request *cursor = NULL, *tmp = NULL;
@@ -440,6 +623,16 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
 		}
 	}
 
+	/*
+	 * Only do TDR resubmission of the second head request if it's already
+	 * been submitted. The intention is to restore the original submission
+	 * state from the situation when the hang originally happened. If it
+	 * was never submitted we don't want to submit it for the first time at
+	 * this point.
+	 */
+	if (tdr_resubmission && req1 && !req1->elsp_submitted)
+		req1 = NULL;
+
 	if (IS_GEN8(ring->dev) || IS_GEN9(ring->dev)) {
 		/*
 		 * WaIdleLiteRestore: make sure we never cause a lite
@@ -460,9 +653,32 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
 		}
 	}
 
-	WARN_ON(req1 && req1->elsp_submitted);
+	WARN_ON(req1 && req1->elsp_submitted && !tdr_resubmission);
 
-	execlists_submit_requests(req0, req1);
+	execlists_submit_requests(req0, req1, tdr_resubmission);
+}
+
+/**
+ * intel_execlists_TDR_context_resubmission() - ELSP context resubmission
+ * @ring: engine to do resubmission for.
+ *
+ * Context submission mechanism exclusively used by TDR that bypasses the
+ * execlist queue. This is necessary since at the point of TDR hang recovery
+ * the hardware will be hung and resubmitting a fixed context (the context that
+ * the TDR has identified as hung and fixed up in order to move past the
+ * blocking batch buffer) to a hung execlist queue will lock up the TDR.
+ * Instead, opt for direct ELSP submission without depending on the rest of the
+ * driver.
+ */
+void intel_execlists_TDR_context_resubmission(struct intel_engine_cs *ring)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ring->execlist_lock, flags);
+	WARN_ON(list_empty(&ring->execlist_queue));
+
+	execlists_context_unqueue(ring, true);
+	spin_unlock_irqrestore(&ring->execlist_lock, flags);
 }
 
 static bool execlists_check_remove_request(struct intel_engine_cs *ring,
@@ -560,9 +776,9 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
 		/* Prevent a ctx to preempt itself */
 		if ((status & GEN8_CTX_STATUS_ACTIVE_IDLE) &&
 		    (submit_contexts != 0))
-			execlists_context_unqueue(ring);
+			execlists_context_unqueue(ring, false);
 	} else if (submit_contexts != 0) {
-		execlists_context_unqueue(ring);
+		execlists_context_unqueue(ring, false);
 	}
 
 	spin_unlock(&ring->execlist_lock);
@@ -613,7 +829,7 @@ static int execlists_context_queue(struct drm_i915_gem_request *request)
 
 	list_add_tail(&request->execlist_link, &ring->execlist_queue);
 	if (num_elements == 0)
-		execlists_context_unqueue(ring);
+		execlists_context_unqueue(ring, false);
 
 	spin_unlock_irq(&ring->execlist_lock);
 
@@ -1536,7 +1752,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *ring)
 	ring->next_context_status_buffer = next_context_status_buffer_hw;
 	DRM_DEBUG_DRIVER("Execlists enabled for %s\n", ring->name);
 
-	memset(&ring->hangcheck, 0, sizeof(ring->hangcheck));
+	i915_hangcheck_reinit(ring);
 
 	return 0;
 }
@@ -1888,6 +2104,187 @@ out:
 	return ret;
 }
 
+static int
+gen8_ring_disable(struct intel_engine_cs *ring)
+{
+	intel_request_gpu_engine_reset(ring);
+	return 0;
+}
+
+static int
+gen8_ring_enable(struct intel_engine_cs *ring)
+{
+	intel_unrequest_gpu_engine_reset(ring);
+	return 0;
+}
+
+/**
+ * gen8_ring_save() - save minimum engine state
+ * @ring: engine whose state is to be saved
+ * @req: request containing the context currently running on engine
+ * @force_advance: indicates whether or not we should nudge the head
+ *		  forward or not
+ *
+ * Saves the head MMIO register to scratch memory while engine is reset and
+ * reinitialized. Before saving the head register we nudge the head position to
+ * be correctly aligned with a QWORD boundary, which brings it up to the next
+ * presumably valid instruction. Typically, at the point of hang recovery the
+ * head register will be pointing to the last DWORD of the BB_START
+ * instruction, which is followed by a padding MI_NOOP inserted by the
+ * driver.
+ *
+ * Returns:
+ * 	0 if ok, otherwise propagates error codes.
+ */
+static int
+gen8_ring_save(struct intel_engine_cs *ring, struct drm_i915_gem_request *req,
+		bool force_advance)
+{
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct intel_ringbuffer *ringbuf = NULL;
+	struct intel_context *ctx;
+	int ret = 0;
+	int clamp_to_tail = 0;
+	uint32_t head;
+	uint32_t tail;
+	uint32_t head_addr;
+	uint32_t tail_addr;
+
+	if (WARN_ON(!req))
+	    return -EINVAL;
+
+	ctx = req->ctx;
+	ringbuf = ctx->engine[ring->id].ringbuf;
+
+	/*
+	 * Read head from MMIO register since it contains the
+	 * most up to date value of head at this point.
+	 */
+	head = I915_READ_HEAD(ring);
+
+	/*
+	 * Read tail from the context because the execlist queue
+	 * updates the tail value there first during submission.
+	 * The MMIO tail register is not updated until the actual
+	 * ring submission completes.
+	 */
+	ret = I915_READ_TAIL_CTX(ring, ctx, tail);
+	if (ret)
+		return ret;
+
+	/*
+	 * head_addr and tail_addr are the head and tail values
+	 * excluding ring wrapping information and aligned to DWORD
+	 * boundary
+	 */
+	head_addr = head & HEAD_ADDR;
+	tail_addr = tail & TAIL_ADDR;
+
+	/*
+	 * The head must always chase the tail.
+	 * If the tail is beyond the head then do not allow
+	 * the head to overtake it. If the tail is less than
+	 * the head then the tail has already wrapped and
+	 * there is no problem in advancing the head or even
+	 * wrapping the head back to 0 as worst case it will
+	 * become equal to tail
+	 */
+	if (head_addr <= tail_addr)
+		clamp_to_tail = 1;
+
+	if (force_advance) {
+
+		/* Force head pointer to next QWORD boundary */
+		head_addr &= ~0x7;
+		head_addr += 8;
+
+	} else if (head & 0x7) {
+
+		/* Ensure head pointer is pointing to a QWORD boundary */
+		head += 0x7;
+		head &= ~0x7;
+		head_addr = head;
+	}
+
+	if (clamp_to_tail && (head_addr > tail_addr)) {
+		head_addr = tail_addr;
+	} else if (head_addr >= ringbuf->size) {
+		/* Wrap head back to start if it exceeds ring size */
+		head_addr = 0;
+	}
+
+	head &= ~HEAD_ADDR;
+	head |= (head_addr & HEAD_ADDR);
+	ring->saved_head = head;
+
+	return 0;
+}
+
+
+/**
+ * gen8_ring_restore() - restore previously saved engine state
+ * @ring: engine whose state is to be restored
+ * @req: request containing the context currently running on engine
+ *
+ * Reinitializes engine and restores the previously saved engine state.
+ * See: gen8_ring_save()
+ *
+ * Returns:
+ * 	0 if ok, otherwise propagates error codes.
+ */
+static int
+gen8_ring_restore(struct intel_engine_cs *ring, struct drm_i915_gem_request *req)
+{
+	struct drm_i915_private *dev_priv = ring->dev->dev_private;
+	struct intel_context *ctx;
+
+	if (WARN_ON(!req))
+	    return -EINVAL;
+
+	ctx = req->ctx;
+
+	/* Re-initialize ring */
+	if (ring->init_hw) {
+		int ret = ring->init_hw(ring);
+		if (ret != 0) {
+			DRM_ERROR("Failed to re-initialize %s\n",
+					ring->name);
+			return ret;
+		}
+	} else {
+		DRM_ERROR("ring init function pointer not set up\n");
+		return -EINVAL;
+	}
+
+	if (ring->id == RCS) {
+		/*
+		 * These register reinitializations are only located here
+		 * temporarily until they are moved out of the
+		 * init_clock_gating function to some function we can
+		 * call from here.
+		 */
+
+		/* WaVSRefCountFullforceMissDisable:chv */
+		/* WaDSRefCountFullforceMissDisable:chv */
+		I915_WRITE(GEN7_FF_THREAD_MODE,
+			   I915_READ(GEN7_FF_THREAD_MODE) &
+			   ~(GEN8_FF_DS_REF_CNT_FFME | GEN7_FF_VS_REF_CNT_FFME));
+
+		I915_WRITE(_3D_CHICKEN3,
+			   _3D_CHICKEN_SDE_LIMIT_FIFO_POLY_DEPTH(2));
+
+		/* WaSwitchSolVfFArbitrationPriority:bdw */
+		I915_WRITE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) | HSW_ECOCHK_ARB_PRIO_SOL);
+	}
+
+	/* Restore head */
+
+	I915_WRITE_HEAD(ring, ring->saved_head);
+	I915_WRITE_HEAD_CTX(ring, ctx, ring->saved_head);
+
+	return 0;
+}
+
 static int gen8_init_rcs_context(struct drm_i915_gem_request *req)
 {
 	int ret;
@@ -2021,6 +2418,10 @@ static int logical_render_ring_init(struct drm_device *dev)
 	ring->irq_get = gen8_logical_ring_get_irq;
 	ring->irq_put = gen8_logical_ring_put_irq;
 	ring->emit_bb_start = gen8_emit_bb_start;
+	ring->enable = gen8_ring_enable;
+	ring->disable = gen8_ring_disable;
+	ring->save = gen8_ring_save;
+	ring->restore = gen8_ring_restore;
 
 	ring->dev = dev;
 
@@ -2073,6 +2474,10 @@ static int logical_bsd_ring_init(struct drm_device *dev)
 	ring->irq_get = gen8_logical_ring_get_irq;
 	ring->irq_put = gen8_logical_ring_put_irq;
 	ring->emit_bb_start = gen8_emit_bb_start;
+	ring->enable = gen8_ring_enable;
+	ring->disable = gen8_ring_disable;
+	ring->save = gen8_ring_save;
+	ring->restore = gen8_ring_restore;
 
 	return logical_ring_init(dev, ring);
 }
@@ -2098,6 +2503,10 @@ static int logical_bsd2_ring_init(struct drm_device *dev)
 	ring->irq_get = gen8_logical_ring_get_irq;
 	ring->irq_put = gen8_logical_ring_put_irq;
 	ring->emit_bb_start = gen8_emit_bb_start;
+	ring->enable = gen8_ring_enable;
+	ring->disable = gen8_ring_disable;
+	ring->save = gen8_ring_save;
+	ring->restore = gen8_ring_restore;
 
 	return logical_ring_init(dev, ring);
 }
@@ -2128,6 +2537,10 @@ static int logical_blt_ring_init(struct drm_device *dev)
 	ring->irq_get = gen8_logical_ring_get_irq;
 	ring->irq_put = gen8_logical_ring_put_irq;
 	ring->emit_bb_start = gen8_emit_bb_start;
+	ring->enable = gen8_ring_enable;
+	ring->disable = gen8_ring_disable;
+	ring->save = gen8_ring_save;
+	ring->restore = gen8_ring_restore;
 
 	return logical_ring_init(dev, ring);
 }
@@ -2158,6 +2571,10 @@ static int logical_vebox_ring_init(struct drm_device *dev)
 	ring->irq_get = gen8_logical_ring_get_irq;
 	ring->irq_put = gen8_logical_ring_put_irq;
 	ring->emit_bb_start = gen8_emit_bb_start;
+	ring->enable = gen8_ring_enable;
+	ring->disable = gen8_ring_disable;
+	ring->save = gen8_ring_save;
+	ring->restore = gen8_ring_restore;
 
 	return logical_ring_init(dev, ring);
 }
@@ -2587,3 +3004,127 @@ void intel_lr_context_reset(struct drm_device *dev,
 		ringbuf->tail = 0;
 	}
 }
+
+/**
+ * intel_execlists_TDR_get_current_request() - return request currently
+ * processed by engine
+ *
+ * @ring: Engine currently running context to be returned.
+ *
+ * @req:  Output parameter containing the current request (the request at the
+ *	  head of execlist queue corresponding to the given ring). May be NULL
+ *	  if no request has been submitted to the execlist queue of this
+ *	  engine. If the req parameter passed in to the function is not NULL
+ *	  and a request is found and returned, the request is referenced
+ *	  before it is returned. It is the responsibility of the caller to
+ *	  drop that reference at the end of the request's life cycle.
+ *
+ * Return:
+ *	CONTEXT_SUBMISSION_STATUS_OK if request is found to be submitted and its
+ *	context is currently running on engine.
+ *
+ *	CONTEXT_SUBMISSION_STATUS_INCONSISTENT if request is found to be submitted
+ *	but its context is not in a state that is consistent with current
+ *	hardware state for the given engine. This has been observed in three cases:
+ *
+ *		1. Before the engine has switched to this context after it has
+ *		been submitted to the execlist queue.
+ *
+ *		2. After the engine has switched away from this context but
+ *		before the context has been removed from the execlist queue.
+ *
+ *		3. The driver has lost an interrupt. Typically the hardware has
+ *		gone to idle but the driver still thinks the context belonging to
+ *		the request at the head of the queue is still executing.
+ *
+ *	CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED if no context has been found
+ *	to be submitted to the execlist queue and if the hardware is idle.
+ */
+enum context_submission_status
+intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
+		struct drm_i915_gem_request **req)
+{
+	struct drm_i915_private *dev_priv;
+	unsigned long flags;
+	struct drm_i915_gem_request *tmpreq = NULL;
+	struct intel_context *tmpctx = NULL;
+	unsigned hw_context = 0;
+	unsigned sw_context = 0;
+	bool hw_active = false;
+	enum context_submission_status status =
+			CONTEXT_SUBMISSION_STATUS_UNDEFINED;
+
+	if (WARN_ON(!ring))
+		return status;
+
+	dev_priv = ring->dev->dev_private;
+
+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+	spin_lock_irqsave(&ring->execlist_lock, flags);
+	hw_context = I915_READ(RING_EXECLIST_STATUS_CTX_ID(ring));
+
+	hw_active = (I915_READ(RING_EXECLIST_STATUS_LO(ring)) &
+		EXECLIST_STATUS_CURRENT_ACTIVE_ELEMENT_STATUS) ? true : false;
+
+	tmpreq = list_first_entry_or_null(&ring->execlist_queue,
+		struct drm_i915_gem_request, execlist_link);
+
+	if (tmpreq) {
+		sw_context = intel_execlists_ctx_id((tmpreq->ctx)->engine[ring->id].state);
+
+		/*
+		 * Only acknowledge the request in the execlist queue if it's
+		 * actually been submitted to hardware, otherwise there's the
+		 * risk of a false inconsistency detection between the
+		 * (unsubmitted) request and the idle hardware state.
+		 */
+		if (tmpreq->elsp_submitted > 0) {
+			/*
+			 * If the caller has not passed a non-NULL req
+			 * parameter then it is not interested in getting a
+			 * request reference back.  Don't temporarily grab a
+			 * reference since holding the execlist lock is enough
+			 * to ensure that the execlist code will hold its
+			 * reference all throughout this function. As long as
+			 * that reference is kept there is no need for us to
+			 * take yet another reference.  The reason why this is
+			 * of interest is because certain callers, such as the
+			 * TDR hang checker, cannot grab struct_mutex before
+			 * calling and because of that we cannot unreference
+			 * any requests (DRM might assert if we do). Just rely
+			 * on the execlist code to provide indirect protection.
+			 */
+			if (req)
+				i915_gem_request_reference(tmpreq);
+
+			if (tmpreq->ctx)
+				tmpctx = tmpreq->ctx;
+		}
+	}
+
+	if (tmpctx) {
+		status = ((hw_context == sw_context) && hw_active) ?
+				CONTEXT_SUBMISSION_STATUS_OK :
+				CONTEXT_SUBMISSION_STATUS_INCONSISTENT;
+	} else {
+		/*
+		 * If we don't have any queue entries and the
+		 * EXECLIST_STATUS register points to zero we are
+		 * clearly not processing any context right now
+		 */
+		WARN((hw_context || hw_active), "hw_context=%x, hardware %s!\n",
+			hw_context, hw_active ? "not idle":"idle");
+
+		status = (hw_context || hw_active) ?
+			CONTEXT_SUBMISSION_STATUS_INCONSISTENT :
+			CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED;
+	}
+
+	if (req)
+		*req = tmpreq;
+
+	spin_unlock_irqrestore(&ring->execlist_lock, flags);
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+
+	return status;
+}
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index de41ad6..d9acb31 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -29,7 +29,9 @@
 /* Execlists regs */
 #define RING_ELSP(ring)				_MMIO((ring)->mmio_base + 0x230)
 #define RING_EXECLIST_STATUS_LO(ring)		_MMIO((ring)->mmio_base + 0x234)
+#define	  EXECLIST_STATUS_CURRENT_ACTIVE_ELEMENT_STATUS	(0x3 << 14)
 #define RING_EXECLIST_STATUS_HI(ring)		_MMIO((ring)->mmio_base + 0x234 + 4)
+#define RING_EXECLIST_STATUS_CTX_ID(ring)	RING_EXECLIST_STATUS_HI(ring)
 #define RING_CONTEXT_CONTROL(ring)		_MMIO((ring)->mmio_base + 0x244)
 #define	  CTX_CTRL_INHIBIT_SYN_CTX_SWITCH	(1 << 3)
 #define	  CTX_CTRL_ENGINE_CTX_RESTORE_INHIBIT	(1 << 0)
@@ -118,4 +120,16 @@ u32 intel_execlists_ctx_id(struct drm_i915_gem_object *ctx_obj);
 void intel_lrc_irq_handler(struct intel_engine_cs *ring);
 void intel_execlists_retire_requests(struct intel_engine_cs *ring);
 
+int intel_execlists_read_tail(struct intel_engine_cs *ring,
+			 struct intel_context *ctx,
+			 u32 *tail);
+
+int intel_execlists_write_head(struct intel_engine_cs *ring,
+			  struct intel_context *ctx,
+			  u32 head);
+
+int intel_execlists_read_head(struct intel_engine_cs *ring,
+			 struct intel_context *ctx,
+			 u32 *head);
+
 #endif /* _INTEL_LRC_H_ */
diff --git a/drivers/gpu/drm/i915/intel_lrc_tdr.h b/drivers/gpu/drm/i915/intel_lrc_tdr.h
new file mode 100644
index 0000000..4520753
--- /dev/null
+++ b/drivers/gpu/drm/i915/intel_lrc_tdr.h
@@ -0,0 +1,36 @@
+/*
+ * Copyright © 2015 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ */
+
+#ifndef _INTEL_LRC_TDR_H_
+#define _INTEL_LRC_TDR_H_
+
+/* Privileged execlist API used exclusively by TDR */
+
+void intel_execlists_TDR_context_resubmission(struct intel_engine_cs *ring);
+
+enum context_submission_status
+intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
+		struct drm_i915_gem_request **req);
+
+#endif /* _INTEL_LRC_TDR_H_ */
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 4060acf..def0dcf 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -434,6 +434,88 @@ static void ring_write_tail(struct intel_engine_cs *ring,
 	I915_WRITE_TAIL(ring, value);
 }
 
+int intel_ring_disable(struct intel_engine_cs *ring)
+{
+	if (WARN_ON(!ring))
+		return -EINVAL;
+
+	if (ring->disable)
+		return ring->disable(ring);
+
+	DRM_ERROR("Ring disable not supported on %s\n", ring->name);
+	return -EINVAL;
+}
+
+int intel_ring_enable(struct intel_engine_cs *ring)
+{
+	if (WARN_ON(!ring))
+		return -EINVAL;
+
+	if (ring->enable)
+		return ring->enable(ring);
+
+	DRM_ERROR("Ring enable not supported on %s\n", ring->name);
+	return -EINVAL;
+}
+
+int intel_ring_save(struct intel_engine_cs *ring,
+		struct drm_i915_gem_request *req,
+		bool force_advance)
+{
+	if (WARN_ON(!ring))
+		return -EINVAL;
+
+	if (ring->save)
+		return ring->save(ring, req, force_advance);
+
+	DRM_ERROR("Ring save not supported on %s\n", ring->name);
+	return -EINVAL;
+}
+
+int intel_ring_restore(struct intel_engine_cs *ring,
+		struct drm_i915_gem_request *req)
+{
+	if (WARN_ON(!ring))
+		return -EINVAL;
+
+	if (ring->restore)
+		return ring->restore(ring, req);
+
+	DRM_ERROR("Ring restore not supported on %s\n", ring->name);
+	return -EINVAL;
+}
+
+void intel_gpu_engine_reset_resample(struct intel_engine_cs *ring,
+		struct drm_i915_gem_request *req)
+{
+	struct intel_ringbuffer *ringbuf;
+
+	if (WARN_ON(!ring))
+		return;
+
+	if (i915.enable_execlists) {
+		struct intel_context *ctx;
+
+		if (WARN_ON(!req))
+			return;
+
+		ctx = req->ctx;
+		ringbuf = ctx->engine[ring->id].ringbuf;
+
+		/*
+		 * In gen8+ context head is restored during reset and
+		 * we can use it as a reference to set up the new
+		 * driver state.
+		 */
+		I915_READ_HEAD_CTX(ring, ctx, ringbuf->head);
+		ringbuf->last_retired_head = -1;
+		intel_ring_update_space(ringbuf);
+	}
+}
+
 u64 intel_ring_get_active_head(struct intel_engine_cs *ring)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
@@ -629,7 +711,7 @@ static int init_ring_common(struct intel_engine_cs *ring)
 	ringbuf->tail = I915_READ_TAIL(ring) & TAIL_ADDR;
 	intel_ring_update_space(ringbuf);
 
-	memset(&ring->hangcheck, 0, sizeof(ring->hangcheck));
+	i915_hangcheck_reinit(ring);
 
 out:
 	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 7349d92..7014778 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -49,6 +49,22 @@ struct  intel_hw_status_page {
 #define I915_READ_MODE(ring) I915_READ(RING_MI_MODE((ring)->mmio_base))
 #define I915_WRITE_MODE(ring, val) I915_WRITE(RING_MI_MODE((ring)->mmio_base), val)
 
+
+#define I915_READ_TAIL_CTX(engine, ctx, outval) \
+	intel_execlists_read_tail((engine), \
+				(ctx), \
+				&(outval))
+
+#define I915_READ_HEAD_CTX(engine, ctx, outval) \
+	intel_execlists_read_head((engine), \
+				(ctx), \
+				&(outval))
+
+#define I915_WRITE_HEAD_CTX(engine, ctx, val) \
+	intel_execlists_write_head((engine), \
+				(ctx), \
+				(val))
+
 /* seqno size is actually only a uint32, but since we plan to use MI_FLUSH_DW to
  * do the writes, and that must have qw aligned offsets, simply pretend it's 8b.
  */
@@ -94,6 +110,34 @@ struct intel_ring_hangcheck {
 	enum intel_ring_hangcheck_action action;
 	int deadlock;
 	u32 instdone[I915_NUM_INSTDONE_REG];
+
+	/*
+	 * Last recorded ring head index.
+	 * This is only ever a ring index where as active
+	 * head may be a graphics address in a ring buffer
+	 */
+	u32 last_head;
+
+	/* Flag to indicate if engine reset required */
+	atomic_t flags;
+
+	/* Indicates request to reset this engine */
+#define I915_ENGINE_RESET_IN_PROGRESS (1<<0)
+
+	/*
+	 * Timestamp (seconds) from when the last time
+	 * this engine was reset.
+	 */
+	u32 last_engine_reset_time;
+
+	/*
+	 * Number of times this engine has been
+	 * reset since boot
+	 */
+	u32 reset_count;
+
+	/* Number of TDR hang detections */
+	u32 tdr_count;
 };
 
 struct intel_ringbuffer {
@@ -205,6 +249,14 @@ struct  intel_engine_cs {
 #define I915_DISPATCH_RS     0x4
 	void		(*cleanup)(struct intel_engine_cs *ring);
 
+	int (*enable)(struct intel_engine_cs *ring);
+	int (*disable)(struct intel_engine_cs *ring);
+	int (*save)(struct intel_engine_cs *ring,
+		    struct drm_i915_gem_request *req,
+		    bool force_advance);
+	int (*restore)(struct intel_engine_cs *ring,
+		       struct drm_i915_gem_request *req);
+
 	/* GEN8 signal/wait table - never trust comments!
 	 *	  signal to	signal to    signal to   signal to      signal to
 	 *	    RCS		   VCS          BCS        VECS		 VCS2
@@ -311,6 +363,9 @@ struct  intel_engine_cs {
 
 	struct intel_ring_hangcheck hangcheck;
 
+	/* Saved head value to be restored after reset */
+	u32 saved_head;
+
 	struct {
 		struct drm_i915_gem_object *obj;
 		u32 gtt_offset;
@@ -463,6 +518,15 @@ void intel_ring_update_space(struct intel_ringbuffer *ringbuf);
 int intel_ring_space(struct intel_ringbuffer *ringbuf);
 bool intel_ring_stopped(struct intel_engine_cs *ring);
 
+void intel_gpu_engine_reset_resample(struct intel_engine_cs *ring,
+		struct drm_i915_gem_request *req);
+int intel_ring_disable(struct intel_engine_cs *ring);
+int intel_ring_enable(struct intel_engine_cs *ring);
+int intel_ring_save(struct intel_engine_cs *ring,
+		struct drm_i915_gem_request *req, bool force_advance);
+int intel_ring_restore(struct intel_engine_cs *ring,
+		struct drm_i915_gem_request *req);
+
 int __must_check intel_ring_idle(struct intel_engine_cs *ring);
 void intel_ring_init_seqno(struct intel_engine_cs *ring, u32 seqno);
 int intel_ring_flush_all_caches(struct drm_i915_gem_request *req);
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 2df4246..f20548c 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1623,6 +1623,153 @@ bool intel_has_gpu_reset(struct drm_device *dev)
 	return intel_get_gpu_reset(dev) != NULL;
 }
 
+static inline int wait_for_engine_reset(struct drm_i915_private *dev_priv,
+		unsigned int grdom)
+{
+#define _CND ((__raw_i915_read32(dev_priv, GEN6_GDRST) & grdom) == 0)
+
+	/*
+	 * Spin waiting for the device to ack the reset request.
+	 * Times out after 500 us.
+	 */
+	return wait_for_atomic_us(_CND, 500);
+
+#undef _CND
+}
+
+static int do_engine_reset_nolock(struct intel_engine_cs *engine)
+{
+	int ret = -ENODEV;
+	struct drm_i915_private *dev_priv = engine->dev->dev_private;
+
+	assert_spin_locked(&dev_priv->uncore.lock);
+
+	switch (engine->id) {
+	case RCS:
+		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_RENDER);
+		engine->hangcheck.reset_count++;
+		ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_RENDER);
+		break;
+
+	case BCS:
+		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_BLT);
+		engine->hangcheck.reset_count++;
+		ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_BLT);
+		break;
+
+	case VCS:
+		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_MEDIA);
+		engine->hangcheck.reset_count++;
+		ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_MEDIA);
+		break;
+
+	case VECS:
+		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_VECS);
+		engine->hangcheck.reset_count++;
+		ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_VECS);
+		break;
+
+	case VCS2:
+		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN8_GRDOM_MEDIA2);
+		engine->hangcheck.reset_count++;
+		ret = wait_for_engine_reset(dev_priv, GEN8_GRDOM_MEDIA2);
+		break;
+
+	default:
+		DRM_ERROR("Unexpected engine: %d\n", engine->id);
+		break;
+	}
+
+	return ret;
+}
+
+static int gen8_do_engine_reset(struct intel_engine_cs *engine)
+{
+	struct drm_device *dev = engine->dev;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	int ret = -ENODEV;
+	unsigned long irqflags;
+
+	spin_lock_irqsave(&dev_priv->uncore.lock, irqflags);
+	ret = do_engine_reset_nolock(engine);
+	spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags);
+
+	if (!ret) {
+		u32 reset_ctl = 0;
+
+		/*
+		 * Confirm that the reset control register is back to
+		 * normal following the reset.
+		 */
+		reset_ctl = I915_READ(RING_RESET_CTL(engine->mmio_base));
+		WARN(reset_ctl & (RESET_CTL_REQUEST_RESET | RESET_CTL_READY_TO_RESET),
+			"Reset control still active after reset! (0x%08x)\n",
+			reset_ctl);
+	} else {
+		DRM_ERROR("Engine reset failed! (%d)\n", ret);
+	}
+
+	return ret;
+}
+
+int intel_gpu_engine_reset(struct intel_engine_cs *engine)
+{
+	/* Reset an individual engine */
+	int ret = -ENODEV;
+	struct drm_device *dev = engine->dev;
+
+	switch (INTEL_INFO(dev)->gen) {
+	case 8:
+		ret = gen8_do_engine_reset(engine);
+		break;
+	default:
+		DRM_ERROR("Per Engine Reset not supported on Gen%d\n",
+			  INTEL_INFO(dev)->gen);
+		break;
+	}
+
+	return ret;
+}
+
+/*
+ * On gen8+ a reset request has to be issued via the reset control register
+ * before a GPU engine can be reset in order to stop the command streamer
+ * and idle the engine. This replaces the legacy way of stopping an engine
+ * by writing to the stop ring bit in the MI_MODE register.
+ */
+int intel_request_gpu_engine_reset(struct intel_engine_cs *engine)
+{
+	/* Request reset for an individual engine */
+	int ret = -ENODEV;
+	struct drm_device *dev = engine->dev;
+
+	if (INTEL_INFO(dev)->gen >= 8)
+		ret = gen8_request_engine_reset(engine);
+	else
+		DRM_ERROR("Reset request not supported on Gen%d\n",
+			  INTEL_INFO(dev)->gen);
+
+	return ret;
+}
+
+/*
+ * It is possible to back off from a previously issued reset request by simply
+ * clearing the reset request bit in the reset control register.
+ */
+int intel_unrequest_gpu_engine_reset(struct intel_engine_cs *engine)
+{
+	/* Roll back reset request for an individual engine */
+	int ret = -ENODEV;
+	struct drm_device *dev = engine->dev;
+
+	if (INTEL_INFO(dev)->gen >= 8)
+		ret = gen8_unrequest_engine_reset(engine);
+	else
+		DRM_ERROR("Reset unrequest not supported on Gen%d\n",
+			  INTEL_INFO(dev)->gen);
+
+	return ret;
+}
+
 bool intel_uncore_unclaimed_mmio(struct drm_i915_private *dev_priv)
 {
 	return check_for_unclaimed_mmio(dev_priv);
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 04/20] drm/i915: TDR / per-engine hang detection
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (2 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 03/20] drm/i915: TDR / per-engine hang recovery support for gen8 Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 20:37   ` Chris Wilson
  2016-01-13 17:28 ` [PATCH 05/20] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress Arun Siluvery
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

With the per-engine hang recovery path already in place, this patch adds
per-engine hang detection by letting the periodic hang checker detect hangs on
individual engines and communicate this to the error handler. During hang
checking every engine is checked and the per-engine results are aggregated
into a single 32-bit engine flag mask in which the engine flags (1 << ring->id)
of all hung engines are ORed together. The per-engine path in the error
handler then sets up the hangcheck state for each individual hung engine based
on the engine flag mask before potentially calling the per-engine hang
recovery path.

This allows hang detection to happen in lock-step for all engines in parallel
and lets the driver process all hung engines in turn in the error handler.
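
For example, with the driver's engine numbering (RCS = 0, VCS = 1, BCS = 2,
VECS = 3, VCS2 = 4), simultaneous hangs on RCS and VCS yield an engine mask
of (1 << 0) | (1 << 1) = 0x3, which the error handler then walks one engine
at a time.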

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c |  2 +-
 drivers/gpu/drm/i915/i915_drv.h     |  4 ++--
 drivers/gpu/drm/i915/i915_irq.c     | 41 +++++++++++++++++++++++--------------
 3 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index e3377ab..6d1b6c3 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -4720,7 +4720,7 @@ i915_wedged_set(void *data, u64 val)
 
 	intel_runtime_pm_get(dev_priv);
 
-	i915_handle_error(dev, val,
+	i915_handle_error(dev, 0x0, val,
 			  "Manually setting wedged to %llu", val);
 
 	intel_runtime_pm_put(dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e866f14..85cf692 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2765,8 +2765,8 @@ static inline void i915_hangcheck_reinit(struct intel_engine_cs *engine)
 
 /* i915_irq.c */
 void i915_queue_hangcheck(struct drm_device *dev);
-__printf(3, 4)
-void i915_handle_error(struct drm_device *dev, bool wedged,
+__printf(4, 5)
+void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
 		       const char *fmt, ...);
 
 extern void intel_irq_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 6a0ec37..fef74cf 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2712,15 +2712,29 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
 
 /**
  * i915_handle_error - handle a gpu error
- * @dev: drm device
+ * @dev: 		drm device
+ * @engine_mask: 	Bit mask containing the engine flags of all engines
+ *			associated with one or more detected errors.
+ *			May be 0x0.
+ *			If wedged is set to true this implies that one or more
+ *			engine hangs were detected. In this case we will
+ *			attempt to reset all engines that have been detected
+ *			as hung.
+ *			If a previous engine reset was attempted too recently
+ *			or if one of the current engine resets fails we fall
+ *			back to legacy full GPU reset.
+ * @wedged: 		true = Hang detected, invoke hang recovery.
+ * @fmt, ...: 		Error message describing reason for error.
  *
  * Do some basic checking of register state at error time and
  * dump it to the syslog.  Also call i915_capture_error_state() to make
  * sure we get a record and make it available in debugfs.  Fire a uevent
  * so userspace knows something bad happened (should trigger collection
- * of a ring dump etc.).
+ * of a ring dump etc.). If a hang was detected (wedged = true) try to
+ * reset the associated engine. Failing that, try to fall back to legacy
+ * full GPU reset recovery mode.
  */
-void i915_handle_error(struct drm_device *dev, bool wedged,
+void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
 		       const char *fmt, ...)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
@@ -2729,12 +2743,6 @@ void i915_handle_error(struct drm_device *dev, bool wedged,
 
 	struct intel_engine_cs *engine;
 
-	/*
-	 * NB: Placeholder until the hang checker supports
-	 * per-engine hang detection.
-	 */
-	u32 engine_mask = 0;
-
 	va_start(args, fmt);
 	vscnprintf(error_msg, sizeof(error_msg), fmt, args);
 	va_end(args);
@@ -3162,7 +3170,7 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd)
 	 */
 	tmp = I915_READ_CTL(ring);
 	if (tmp & RING_WAIT) {
-		i915_handle_error(dev, false,
+		i915_handle_error(dev, intel_ring_flag(ring), false,
 				  "Kicking stuck wait on %s",
 				  ring->name);
 		I915_WRITE_CTL(ring, tmp);
@@ -3174,7 +3182,7 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd)
 		default:
 			return HANGCHECK_HUNG;
 		case 1:
-			i915_handle_error(dev, false,
+			i915_handle_error(dev, intel_ring_flag(ring), false,
 					  "Kicking stuck semaphore on %s",
 					  ring->name);
 			I915_WRITE_CTL(ring, tmp);
@@ -3203,7 +3211,8 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	struct drm_device *dev = dev_priv->dev;
 	struct intel_engine_cs *ring;
 	int i;
-	int busy_count = 0, rings_hung = 0;
+	u32 engine_mask = 0;
+	int busy_count = 0;
 	bool stuck[I915_NUM_RINGS] = { 0 };
 #define BUSY 1
 #define KICK 5
@@ -3316,12 +3325,14 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			DRM_INFO("%s on %s\n",
 				 stuck[i] ? "stuck" : "no progress",
 				 ring->name);
-			rings_hung++;
+
+			engine_mask |= intel_ring_flag(ring);
+			ring->hangcheck.tdr_count++;
 		}
 	}
 
-	if (rings_hung) {
-		i915_handle_error(dev, true, "Ring hung");
+	if (engine_mask) {
+		i915_handle_error(dev, engine_mask, true, "Ring hung (0x%02x)", engine_mask);
 		goto out;
 	}
 
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 05/20] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (3 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 04/20] drm/i915: TDR / per-engine hang detection Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 20:49   ` Chris Wilson
  2016-01-13 17:28 ` [PATCH 06/20] drm/i915: Reinstate hang recovery work queue Arun Siluvery
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ian Lister, Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

i915_gem_check_wedge now returns a non-zero result in three different cases:

1. Legacy: A hang has been detected and full GPU reset is in progress.

2. Per-engine recovery:

	a. A single engine reference can be passed to the function, in which
	case only that engine will be checked. If that particular engine is
	detected to be hung and is to be reset this will yield a non-zero
	result but not if reset is in progress for any other engine.

	b. No engine reference is passed to the function, in which case all
	engines are checked for ongoing per-engine hang recovery.
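
Expressed as calls (signatures as introduced by this patch):

	/* Check a single engine only (case 2a) */
	ret = i915_gem_check_wedge(dev_priv, ring, interruptible);

	/* Check all engines for ongoing recovery (case 2b) */
	ret = i915_gem_check_wedge(dev_priv, NULL, interruptible);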

Also, i915_wait_request was updated to take advantage of this new
functionality. This is important since the TDR hang recovery mechanism needs a
way to force waiting threads that hold the struct_mutex to release it and try
again after the hang recovery has completed. If i915_wait_request does not take
per-engine hang recovery into account there is no way for a waiting thread to
know that a per-engine recovery is about to happen and that it needs to back
off. A sketch of the intended caller pattern follows below.
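
The sketch (illustrative only, not code from this series; the "retry" label
is hypothetical):

	retry:
		ret = i915_mutex_lock_interruptible(dev);
		if (ret)
			return ret;

		ret = i915_wait_request(req);
		mutex_unlock(&dev->struct_mutex);

		if (ret == -EAGAIN)
			goto retry;	/* recovery pending, back off and retry */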

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  3 +-
 drivers/gpu/drm/i915/i915_gem.c         | 60 +++++++++++++++++++++++++++------
 drivers/gpu/drm/i915/intel_lrc.c        |  4 +--
 drivers/gpu/drm/i915/intel_ringbuffer.c |  4 ++-
 4 files changed, 56 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 85cf692..5be7d3e 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3033,7 +3033,8 @@ i915_gem_find_active_request(struct intel_engine_cs *ring);
 
 bool i915_gem_retire_requests(struct drm_device *dev);
 void i915_gem_retire_requests_ring(struct intel_engine_cs *ring);
-int __must_check i915_gem_check_wedge(struct i915_gpu_error *error,
+int __must_check i915_gem_check_wedge(struct drm_i915_private *dev_priv,
+				      struct intel_engine_cs *engine,
 				      bool interruptible);
 
 static inline bool i915_reset_in_progress(struct i915_gpu_error *error)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index e3cfed2..e6eb45d 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -80,12 +80,38 @@ static void i915_gem_info_remove_obj(struct drm_i915_private *dev_priv,
 	spin_unlock(&dev_priv->mm.object_stat_lock);
 }
 
+static inline int
+i915_engine_reset_in_progress(struct drm_i915_private *dev_priv,
+	struct intel_engine_cs *engine)
+{
+	int ret = 0;
+
+	if (engine) {
+		ret = !!(atomic_read(&dev_priv->ring[engine->id].hangcheck.flags)
+			& I915_ENGINE_RESET_IN_PROGRESS);
+	} else {
+		int i;
+
+		for (i = 0; i < I915_NUM_RINGS; i++)
+			if (atomic_read(&dev_priv->ring[i].hangcheck.flags)
+				& I915_ENGINE_RESET_IN_PROGRESS) {
+
+				ret = 1;
+				break;
+			}
+	}
+
+	return ret;
+}
+
 static int
-i915_gem_wait_for_error(struct i915_gpu_error *error)
+i915_gem_wait_for_error(struct drm_i915_private *dev_priv)
 {
 	int ret;
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
 
-#define EXIT_COND (!i915_reset_in_progress(error) || \
+#define EXIT_COND ((!i915_reset_in_progress(error) && \
+		    !i915_engine_reset_in_progress(dev_priv, NULL)) || \
 		   i915_terminally_wedged(error))
 	if (EXIT_COND)
 		return 0;
@@ -114,7 +140,7 @@ int i915_mutex_lock_interruptible(struct drm_device *dev)
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	int ret;
 
-	ret = i915_gem_wait_for_error(&dev_priv->gpu_error);
+	ret = i915_gem_wait_for_error(dev_priv);
 	if (ret)
 		return ret;
 
@@ -1110,10 +1136,15 @@ put_rpm:
 }
 
 int
-i915_gem_check_wedge(struct i915_gpu_error *error,
+i915_gem_check_wedge(struct drm_i915_private *dev_priv,
+		     struct intel_engine_cs *engine,
 		     bool interruptible)
 {
-	if (i915_reset_in_progress(error)) {
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+
+	if (i915_reset_in_progress(error) ||
+	    i915_engine_reset_in_progress(dev_priv, engine)) {
+
 		/* Non-interruptible callers can't handle -EAGAIN, hence return
 		 * -EIO unconditionally for these. */
 		if (!interruptible)
@@ -1253,6 +1284,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 	unsigned long timeout_expire;
 	s64 before, now;
 	int ret;
+	int reset_in_progress = 0;
 
 	WARN(!intel_irqs_enabled(dev_priv), "IRQs disabled");
 
@@ -1297,11 +1329,17 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 
 		/* We need to check whether any gpu reset happened in between
 		 * the caller grabbing the seqno and now ... */
-		if (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) {
+		reset_in_progress =
+			i915_gem_check_wedge(dev_priv, NULL, interruptible);
+
+		if ((reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) ||
+		     reset_in_progress) {
+
 			/* ... but upgrade the -EAGAIN to an -EIO if the gpu
 			 * is truely gone. */
-			ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
-			if (ret == 0)
+			if (reset_in_progress)
+				ret = reset_in_progress;
+			else
 				ret = -EAGAIN;
 			break;
 		}
@@ -1470,7 +1508,7 @@ i915_wait_request(struct drm_i915_gem_request *req)
 
 	BUG_ON(!mutex_is_locked(&dev->struct_mutex));
 
-	ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
+	ret = i915_gem_check_wedge(dev_priv, NULL, interruptible);
 	if (ret)
 		return ret;
 
@@ -1560,7 +1598,7 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
 	if (!obj->active)
 		return 0;
 
-	ret = i915_gem_check_wedge(&dev_priv->gpu_error, true);
+	ret = i915_gem_check_wedge(dev_priv, NULL, true);
 	if (ret)
 		return ret;
 
@@ -4104,11 +4142,11 @@ i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
 	unsigned reset_counter;
 	int ret;
 
-	ret = i915_gem_wait_for_error(&dev_priv->gpu_error);
+	ret = i915_gem_wait_for_error(dev_priv);
 	if (ret)
 		return ret;
 
-	ret = i915_gem_check_wedge(&dev_priv->gpu_error, false);
+	ret = i915_gem_check_wedge(dev_priv, NULL, false);
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index fcec476..a2e56d4 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1066,8 +1066,8 @@ int intel_logical_ring_begin(struct drm_i915_gem_request *req, int num_dwords)
 
 	WARN_ON(req == NULL);
 	dev_priv = req->ring->dev->dev_private;
-
-	ret = i915_gem_check_wedge(&dev_priv->gpu_error,
+	ret = i915_gem_check_wedge(dev_priv,
+				   req->ring,
 				   dev_priv->mm.interruptible);
 	if (ret)
 		return ret;
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index def0dcf..f959326 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -2501,8 +2501,10 @@ int intel_ring_begin(struct drm_i915_gem_request *req,
 	ring = req->ring;
 	dev_priv = ring->dev->dev_private;
 
-	ret = i915_gem_check_wedge(&dev_priv->gpu_error,
+	ret = i915_gem_check_wedge(dev_priv,
+				   ring,
 				   dev_priv->mm.interruptible);
+
 	if (ret)
 		return ret;
 
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 06/20] drm/i915: Reinstate hang recovery work queue.
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (4 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 05/20] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 21:01   ` Chris Wilson
  2016-01-13 17:28 ` [PATCH 07/20] drm/i915: Watchdog timeout: Hang detection integration into error handler Arun Siluvery
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala, Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

There used to be a work queue separating the error handler from the hang
recovery path, which was removed a while back in this commit:

	commit b8d24a06568368076ebd5a858a011699a97bfa42
	Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
	Date:   Wed Jan 28 17:03:14 2015 +0200

	    drm/i915: Remove nested work in gpu error handling

Now we need to revert most of that commit since the work queue separating hang
detection from hang recovery is needed in preparation for the upcoming watchdog
timeout feature. The watchdog interrupt service routine will be a second
callsite of the error handler alongside the periodic hang checker, which runs
in work queue context. Since the error handler will be serving callers in hard
interrupt execution context, it must never end up in a situation where it needs
to grab the struct_mutex. Unfortunately, that is exactly what happens first
thing on the hang recovery path, which may sleep if the struct_mutex is already
held by another thread - not an option in hard interrupt context.
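
The resulting structure is the usual top half/bottom half split (a minimal
sketch; the work function body shown is illustrative):

	/* Hard interrupt context: must not sleep, only queue the work */
	schedule_work(&dev_priv->gpu_error.work);

	/* Process context: the work item is free to take the mutex */
	static void i915_error_work_func(struct work_struct *work)
	{
		...
		mutex_lock(&dev->struct_mutex);
		/* hang recovery, may sleep */
		...
	}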

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_dma.c |  1 +
 drivers/gpu/drm/i915/i915_drv.h |  1 +
 drivers/gpu/drm/i915/i915_irq.c | 31 ++++++++++++++++++++++++-------
 3 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index c45ec353..67003c2 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1203,6 +1203,7 @@ int i915_driver_unload(struct drm_device *dev)
 	/* Free error state after interrupts are fully disabled. */
 	cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work);
 	i915_destroy_error_state(dev);
+	cancel_work_sync(&dev_priv->gpu_error.work);
 
 	if (dev->pdev->msi_enabled)
 		pci_disable_msi(dev->pdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 5be7d3e..072ca37 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1337,6 +1337,7 @@ struct i915_gpu_error {
 	spinlock_t lock;
 	/* Protected by the above dev->gpu_error.lock. */
 	struct drm_i915_error_state *first_error;
+	struct work_struct work;
 
 	unsigned long missed_irq_rings;
 
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index fef74cf..8937c82 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2457,16 +2457,19 @@ static void i915_error_wake_up(struct drm_i915_private *dev_priv,
 }
 
 /**
- * i915_reset_and_wakeup - do process context error handling work
- * @dev: drm device
+ * i915_error_work_func - do process context error handling work
+ * @work: work item containing error struct, passed by the error handler
  *
  * Fire an error uevent so userspace can see that a hang or error
  * was detected.
  */
-static void i915_reset_and_wakeup(struct drm_device *dev)
+static void i915_error_work_func(struct work_struct *work)
 {
-	struct drm_i915_private *dev_priv = to_i915(dev);
-	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct i915_gpu_error *error = container_of(work, struct i915_gpu_error,
+	                                            work);
+	struct drm_i915_private *dev_priv =
+	        container_of(error, struct drm_i915_private, gpu_error);
+	struct drm_device *dev = dev_priv->dev;
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
 	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
 	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
@@ -2827,7 +2830,21 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
 		i915_error_wake_up(dev_priv, false);
 	}
 
-	i915_reset_and_wakeup(dev);
+	/*
+	 * Our reset work can grab modeset locks (since it needs to reset the
+	 * state of outstanding pageflips). Hence it must not be run on our own
+	 * dev-priv->wq work queue for otherwise the flush_work in the pageflip
+	 * code will deadlock.
+	 * If error_work is already in the work queue then it will not be added
+	 * again. It hasn't yet executed so it will see the reset flags when
+	 * it is scheduled. If it isn't in the queue or it is currently
+	 * executing then this call will add it to the queue again so that
+	 * even if it misses the reset flags during the current call it is
+	 * guaranteed to see them on the next call.
+	 */
+	schedule_work(&dev_priv->gpu_error.work);
 }
 
 /* Called from drm generic code, passed 'crtc' which
@@ -4682,7 +4699,7 @@ void intel_irq_init(struct drm_i915_private *dev_priv)
 	struct drm_device *dev = dev_priv->dev;
 
 	intel_hpd_init_work(dev_priv);
-
+	INIT_WORK(&dev_priv->gpu_error.work, i915_error_work_func);
 	INIT_WORK(&dev_priv->rps.work, gen6_pm_rps_work);
 	INIT_WORK(&dev_priv->l3_parity.error_work, ivybridge_parity_work);
 
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 07/20] drm/i915: Watchdog timeout: Hang detection integration into error handler
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (5 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 06/20] drm/i915: Reinstate hang recovery work queue Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 21:13   ` Chris Wilson
  2016-01-13 17:28 ` [PATCH 08/20] drm/i915: Watchdog timeout: IRQ handler for gen8 Arun Siluvery
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ian Lister, Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

This patch enables watchdog timeout hang detection as an entrypoint into the
driver error handler. This form of hang detection overrides the promotion logic
normally used by the periodic hang checker and instead allows for direct access
to the per-engine hang recovery path.
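
The two entrypoints into the error handler are then distinguished only by the
new watchdog parameter (calls as they appear later in this series):

	/* Periodic hang checker: reset promotion logic may apply */
	i915_handle_error(dev, engine_mask, false, true,
			  "Ring hung (0x%02x)", engine_mask);

	/* Watchdog interrupt handler: always a straight engine reset */
	i915_handle_error(ring->dev, intel_ring_flag(ring), true, true,
			  "%s watchdog timed out", ring->name);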

NOTE: I don't know if Ben Widawsky had any part in this code from 3 years
ago. There have been so many people involved in this already that I am in no
position to know. If I've missed anyone's sob line please let me know.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c |  2 +-
 drivers/gpu/drm/i915/i915_drv.h     |  6 +++---
 drivers/gpu/drm/i915/i915_irq.c     | 43 ++++++++++++++++++++++---------------
 3 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 6d1b6c3..dabddda 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -4720,7 +4720,7 @@ i915_wedged_set(void *data, u64 val)
 
 	intel_runtime_pm_get(dev_priv);
 
-	i915_handle_error(dev, 0x0, val,
+	i915_handle_error(dev, 0x0, false, val,
 			  "Manually setting wedged to %llu", val);
 
 	intel_runtime_pm_put(dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 072ca37..80e6d01 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2766,9 +2766,9 @@ static inline void i915_hangcheck_reinit(struct intel_engine_cs *engine)
 
 /* i915_irq.c */
 void i915_queue_hangcheck(struct drm_device *dev);
-__printf(4, 5)
-void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
-		       const char *fmt, ...);
+__printf(5, 6)
+void i915_handle_error(struct drm_device *dev, u32 engine_mask,
+		       bool watchdog, bool wedged, const char *fmt, ...);
 
 extern void intel_irq_init(struct drm_i915_private *dev_priv);
 int intel_irq_install(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 8937c82..0710724 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2726,6 +2726,7 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
  *			If a previous engine reset was attempted too recently
  *			or if one of the current engine resets fails we fall
  *			back to legacy full GPU reset.
+ * @watchdog: 		true = Engine hang detected by hardware watchdog.
  * @wedged: 		true = Hang detected, invoke hang recovery.
  * @fmt, ...: 		Error message describing reason for error.
  *
@@ -2737,8 +2738,8 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
  * reset the associated engine. Failing that, try to fall back to legacy
  * full GPU reset recovery mode.
  */
-void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
-		       const char *fmt, ...)
+void i915_handle_error(struct drm_device *dev, u32 engine_mask,
+                       bool watchdog, bool wedged, const char *fmt, ...)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	va_list args;
@@ -2776,20 +2777,27 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
 			u32 i;
 
 			for_each_ring(engine, dev_priv, i) {
-				u32 now, last_engine_reset_timediff;
 
 				if (!(intel_ring_flag(engine) & engine_mask))
 					continue;
 
-				/* Measure the time since this engine was last reset */
-				now = get_seconds();
-				last_engine_reset_timediff =
-					now - engine->hangcheck.last_engine_reset_time;
-
-				full_reset = last_engine_reset_timediff <
-					i915.gpu_reset_promotion_time;
-
-				engine->hangcheck.last_engine_reset_time = now;
+				if (!watchdog) {
+					/* Measure the time since this engine was last reset */
+					u32 now = get_seconds();
+					u32 last_engine_reset_timediff =
+						now - engine->hangcheck.last_engine_reset_time;
+
+					full_reset = last_engine_reset_timediff <
+						i915.gpu_reset_promotion_time;
+
+					engine->hangcheck.last_engine_reset_time = now;
+				} else {
+					/*
+					 * Watchdog timeout always results
+					 * in engine reset.
+					 */
+					full_reset = false;
+				}
 
 				/*
 				 * This engine was not reset too recently - go ahead
@@ -2800,10 +2808,11 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
 				 * This can still be overridden by a global
 				 * reset e.g. if per-engine reset fails.
 				 */
-				if (!full_reset)
+				if (watchdog || !full_reset)
 					atomic_or(I915_ENGINE_RESET_IN_PROGRESS,
 						&engine->hangcheck.flags);
-				else
+
+				if (full_reset)
 					break;
 
 			} /* for_each_ring */
@@ -3187,7 +3196,7 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd)
 	 */
 	tmp = I915_READ_CTL(ring);
 	if (tmp & RING_WAIT) {
-		i915_handle_error(dev, intel_ring_flag(ring), false,
+		i915_handle_error(dev, intel_ring_flag(ring), false, false,
 				  "Kicking stuck wait on %s",
 				  ring->name);
 		I915_WRITE_CTL(ring, tmp);
@@ -3199,7 +3208,7 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd)
 		default:
 			return HANGCHECK_HUNG;
 		case 1:
-			i915_handle_error(dev, intel_ring_flag(ring), false,
+			i915_handle_error(dev, intel_ring_flag(ring), false, false,
 					  "Kicking stuck semaphore on %s",
 					  ring->name);
 			I915_WRITE_CTL(ring, tmp);
@@ -3349,7 +3358,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	}
 
 	if (engine_mask) {
-		i915_handle_error(dev, engine_mask, true, "Ring hung (0x%02x)", engine_mask);
+		i915_handle_error(dev, engine_mask, false, true, "Ring hung (0x%02x)", engine_mask);
 		goto out;
 	}
 
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 08/20] drm/i915: Watchdog timeout: IRQ handler for gen8
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (6 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 07/20] drm/i915: Watchdog timeout: Hang detection integration into error handler Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 17:28 ` [PATCH 09/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Arun Siluvery
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ian Lister, Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

*** General ***

Watchdog timeout (or "media engine reset") is a feature that allows userland
applications to enable hang detection on individual batch buffers. The
detection mechanism itself is mostly bound to the hardware and the only thing
that the driver needs to do to support this form of hang detection is to
implement the interrupt handling support as well as watchdog command emission
before and after the emitted batch buffer start instruction in the ring buffer.

The principle of the hang detection mechanism is as follows:

1. Once the decision has been made to enable watchdog timeout for a particular
batch buffer, the driver emits a watchdog timer start instruction before, and a
watchdog timer cancellation instruction after, the batch buffer start
instruction in the ring buffer (see the sketch after this list).

2. Once the GPU execution reaches the watchdog timer start instruction the
hardware watchdog counter is started by the hardware. The counter keeps
counting until either reaching a previously configured threshold value or the
timer cancellation instruction is executed.

2a. If the counter reaches the threshold value the hardware fires a watchdog
interrupt that is picked up by the watchdog interrupt handler. This means that
a hang has been detected and the driver needs to deal with it the same way it
would deal with an engine hang detected by the periodic hang checker. The only
difference between the two is that we never promote to full GPU reset following
a watchdog timeout in case a per-engine reset was attempted too recently. Thus,
the watchdog interrupt handler calls the error handler directly passing the
engine mask of the hung engine in question, which immediately results in a
per-engine hang recovery being scheduled.

2b. If the batch buffer completes and the execution reaches the watchdog
cancellation instruction before the watchdog counter reaches its threshold
value the watchdog is cancelled and nothing more comes of it. No hang is
detected.
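
In ring buffer terms the emission around the batch looks roughly as follows
(a sketch assuming the gen8 logical ring emission helpers; watchdog_enable
and watchdog_disable stand for the per-engine control values, which are
illustrative here - the real emission is introduced in the ring buffer
command emission patch):

	/* Program the watchdog threshold for this engine */
	intel_logical_ring_emit(ringbuf, MI_LOAD_REGISTER_IMM(1));
	intel_logical_ring_emit_reg(ringbuf, RING_THRESH(ring->mmio_base));
	intel_logical_ring_emit(ringbuf, ring->watchdog_threshold);

	/* Start the watchdog counter */
	intel_logical_ring_emit(ringbuf, MI_LOAD_REGISTER_IMM(1));
	intel_logical_ring_emit_reg(ringbuf, RING_CNTR(ring->mmio_base));
	intel_logical_ring_emit(ringbuf, watchdog_enable);

	/* ... MI_BATCH_BUFFER_START for the watched batch ... */

	/* Cancel the watchdog once the batch has completed */
	intel_logical_ring_emit(ringbuf, MI_LOAD_REGISTER_IMM(1));
	intel_logical_ring_emit_reg(ringbuf, RING_CNTR(ring->mmio_base));
	intel_logical_ring_emit(ringbuf, watchdog_disable);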

*** This patch introduces: ***

1. IRQ handler code for watchdog timeout allowing direct hang recovery based on
hardware-driven hang detection, which then integrates directly with the
per-engine hang recovery path.

2. Watchdog timeout init code patch for setup of watchdog timeout threshold
values and gen-specific register information.

The current default watchdog threshold value is 60 ms, since this has been
empirically determined to be a good compromise between low-latency requirements
and a low rate of false positives.
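
With the 80 ns timestamp resolution used from gen7 onwards that default works
out to 60 * (12500000 / 1000) = 750000 timestamp counts programmed into the
per-engine threshold register.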

Currently the render engine and all available media engines support watchdog
timeout. The specifications allude to the VECS engine also being supported,
but this commit does not enable it.

NOTE: I don't know if Ben Widawsky had any part in this code from 3 years
ago. There have been so many people involved in this already that I am in no
position to know. If I've missed anyone's sob line please let me know.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_dma.c         | 59 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_drv.h         |  1 +
 drivers/gpu/drm/i915/i915_irq.c         | 24 ++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h         |  7 ++++
 drivers/gpu/drm/i915/intel_lrc.c        |  7 ++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |  9 +++++
 6 files changed, 107 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 67003c2..eb12810 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -868,6 +868,64 @@ static void intel_init_dpio(struct drm_i915_private *dev_priv)
 	}
 }
 
+void i915_watchdog_init(struct drm_device *dev)
+{
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	int freq;
+	int i;
+
+	/*
+	 * Based on the pre-defined timeout value (60 ms), calculate the
+	 * timer count thresholds needed from the timestamp frequency.
+	 *
+	 * For RCS.
+	 * The timestamp resolution changed in Gen7 and beyond to 80ns
+	 * for all pipes. Before that it was 640ns.
+	 */
+
+#define KM_RCS_ENGINE_TIMEOUT_VALUE_IN_MS 60
+#define KM_BSD_ENGINE_TIMEOUT_VALUE_IN_MS 60
+#define KM_TIMER_MILLISECOND 1000
+
+	/*
+	 * Timestamp timer resolution = 0.080 uSec,
+	 * or 12500000 counts per second
+	 */
+#define KM_TIMESTAMP_CNTS_PER_SEC_80NS 12500000
+
+	/*
+	 * Timestamp timer resolution = 0.640 uSec,
+	 * or 1562500 counts per second
+	 */
+#define KM_TIMESTAMP_CNTS_PER_SEC_640NS 1562500
+
+	if (INTEL_INFO(dev)->gen >= 7)
+		freq = KM_TIMESTAMP_CNTS_PER_SEC_80NS;
+	else
+		freq = KM_TIMESTAMP_CNTS_PER_SEC_640NS;
+
+	dev_priv->ring[RCS].watchdog_threshold =
+		((KM_RCS_ENGINE_TIMEOUT_VALUE_IN_MS) *
+		(freq / KM_TIMER_MILLISECOND));
+
+	dev_priv->ring[VCS].watchdog_threshold =
+		((KM_BSD_ENGINE_TIMEOUT_VALUE_IN_MS) *
+		(freq / KM_TIMER_MILLISECOND));
+
+	dev_priv->ring[VCS2].watchdog_threshold =
+		((KM_BSD_ENGINE_TIMEOUT_VALUE_IN_MS) *
+		(freq / KM_TIMER_MILLISECOND));
+
+	for (i = 0; i < I915_NUM_RINGS; i++)
+		dev_priv->ring[i].hangcheck.watchdog_count = 0;
+
+	DRM_INFO("Watchdog Timeout [ms], " \
+			"RCS: 0x%08X, VCS: 0x%08X, VCS2: 0x%08X\n", \
+			KM_RCS_ENGINE_TIMEOUT_VALUE_IN_MS,
+			KM_BSD_ENGINE_TIMEOUT_VALUE_IN_MS,
+			KM_BSD_ENGINE_TIMEOUT_VALUE_IN_MS);
+}
+
 /**
  * i915_driver_load - setup chip and create an initial config
  * @dev: DRM device
@@ -1051,6 +1109,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
 	i915_gem_load(dev);
 
 	i915_hangcheck_init(dev);
+	i915_watchdog_init(dev);
 
 	/* On the 945G/GM, the chipset reports the MSI capability on the
 	 * integrated graphics even though the support isn't actually there
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 80e6d01..24787ed 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2751,6 +2751,7 @@ void intel_hpd_init(struct drm_i915_private *dev_priv);
 void intel_hpd_init_work(struct drm_i915_private *dev_priv);
 void intel_hpd_cancel_work(struct drm_i915_private *dev_priv);
 bool intel_hpd_pin_to_port(enum hpd_pin pin, enum port *port);
+void i915_watchdog_init(struct drm_device *dev);
 static inline void i915_hangcheck_reinit(struct intel_engine_cs *engine)
 {
 	struct intel_ring_hangcheck *hc = &engine->hangcheck;
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 0710724..c4f888b 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1325,6 +1325,27 @@ gen8_cs_irq_handler(struct intel_engine_cs *ring, u32 iir, int test_shift)
 		notify_ring(ring);
 	if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift))
 		intel_lrc_irq_handler(ring);
+	if (iir & (GT_GEN8_RCS_WATCHDOG_INTERRUPT << test_shift)) {
+		struct drm_i915_private *dev_priv = ring->dev->dev_private;
+		u32 watchdog_disable;
+
+		/*
+		 * The RCS and VCS watchdog interrupts share bit 6. Test
+		 * against this engine's own IRQ shift: iir is shared by
+		 * the two engines served by each GT interrupt register,
+		 * so testing fixed shifts here would misattribute the
+		 * interrupt (and the reset) to the wrong ring.
+		 */
+		if (ring->id == RCS)
+			watchdog_disable = GEN6_RCS_WATCHDOG_DISABLE;
+		else
+			watchdog_disable = GEN8_VCS_WATCHDOG_DISABLE;
+
+		/* Stop the counter to prevent further interrupts */
+		I915_WRITE(RING_CNTR(ring->mmio_base), watchdog_disable);
+
+		ring->hangcheck.watchdog_count++;
+		i915_handle_error(ring->dev, intel_ring_flag(ring), true, true,
+				  "%s watchdog timed out", ring->name);
+	}
 }
 
 static irqreturn_t gen8_gt_irq_handler(struct drm_i915_private *dev_priv,
@@ -3906,11 +3927,14 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 {
 	/* These are interrupts we'll toggle with the ring mask register */
 	uint32_t gt_interrupts[] = {
+		GT_GEN8_RCS_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_RENDER_L3_PARITY_ERROR_INTERRUPT |
 			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT,
+		GT_GEN8_VCS_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
+		GT_GEN8_VCS_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
 		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 3fc5d75..cd1a695 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1583,6 +1583,8 @@ enum skl_disp_power_wells {
 #define RING_RESET_CTL(base)	_MMIO((base)+0xd0)
 #define   RESET_CTL_REQUEST_RESET  (1 << 0)
 #define   RESET_CTL_READY_TO_RESET (1 << 1)
+#define RING_CNTR(base)      _MMIO((base)+0x178)
+#define RING_THRESH(base)    _MMIO((base)+0x17C)
 
 #define HSW_GTT_CACHE_EN	_MMIO(0x4024)
 #define   GTT_CACHE_EN_ALL	0xF0007FFF
@@ -2008,6 +2010,11 @@ enum skl_disp_power_wells {
 #define GT_BSD_USER_INTERRUPT			(1 << 12)
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
 #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
+#define GT_GEN6_RENDER_WATCHDOG_INTERRUPT	(1 <<  6)
+#define GT_GEN8_RCS_WATCHDOG_INTERRUPT		(1 <<  6)
+#define   GEN6_RCS_WATCHDOG_DISABLE		(1)
+#define GT_GEN8_VCS_WATCHDOG_INTERRUPT		(1 <<  6)
+#define   GEN8_VCS_WATCHDOG_DISABLE		0xFFFFFFFF
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
 #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
 #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index a2e56d4..6efbcd7 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2400,6 +2400,9 @@ static int logical_render_ring_init(struct drm_device *dev)
 	if (HAS_L3_DPF(dev))
 		ring->irq_keep_mask |= GT_RENDER_L3_PARITY_ERROR_INTERRUPT;
 
+	ring->irq_keep_mask |=
+		(GT_GEN8_RCS_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT);
+
 	if (INTEL_INFO(dev)->gen >= 9)
 		ring->init_hw = gen9_init_render_ring;
 	else
@@ -2460,6 +2463,8 @@ static int logical_bsd_ring_init(struct drm_device *dev)
 		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT;
 	ring->irq_keep_mask =
 		GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT;
+	ring->irq_keep_mask |=
+		(GT_GEN8_VCS_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT);
 
 	ring->init_hw = gen8_init_common_ring;
 	if (IS_BXT_REVID(dev, 0, BXT_REVID_A1)) {
@@ -2494,6 +2499,8 @@ static int logical_bsd2_ring_init(struct drm_device *dev)
 		GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT;
 	ring->irq_keep_mask =
 		GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT;
+	ring->irq_keep_mask |=
+		(GT_GEN8_VCS_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT);
 
 	ring->init_hw = gen8_init_common_ring;
 	ring->get_seqno = gen8_get_seqno;
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 7014778..dbace39 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -138,6 +138,9 @@ struct intel_ring_hangcheck {
 
 	/* Number of TDR hang detections */
 	u32 tdr_count;
+
+	/* Number of watchdog hang detections for this ring */
+	u32 watchdog_count;
 };
 
 struct intel_ringbuffer {
@@ -366,6 +369,12 @@ struct  intel_engine_cs {
 	/* Saved head value to be restored after reset */
 	u32 saved_head;
 
+	/*
+	 * Watchdog timer threshold value.
+	 * Only the RCS, VCS and VCS2 rings support watchdog timeout.
+	 */
+	u32 watchdog_threshold;
+
 	struct {
 		struct drm_i915_gem_object *obj;
 		u32 gtt_offset;
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 09/20] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (7 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 08/20] drm/i915: Watchdog timeout: IRQ handler for gen8 Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 17:28 ` [PATCH 10/20] drm/i915: Watchdog timeout: DRM kernel interface enablement Arun Siluvery
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ian Lister, Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

*** General ***

Watchdog timeout (or "media engine reset") is a feature that allows userland
applications to enable hang detection on individual batch buffers. The
detection mechanism itself is mostly bound to the hardware; all the driver
needs to do to support this form of hang detection is to implement the
interrupt handling and to emit the watchdog commands before and after the
batch buffer start instruction in the ring buffer.

The principle of the hang detection mechanism is as follows:

1. Once the decision has been made to enable watchdog timeout for a particular
batch buffer, the driver emits a watchdog timer start instruction just before
and a watchdog timer cancellation instruction just after the batch buffer
start instruction in the ring buffer.

2. Once GPU execution reaches the watchdog timer start instruction the
hardware starts the watchdog counter. The counter keeps counting until it
either reaches a previously configured threshold value or the timer
cancellation instruction is executed.

2a. If the counter reaches the threshold value the hardware fires a watchdog
interrupt that is picked up by the watchdog interrupt handler. This means that
a hang has been detected and the driver needs to deal with it the same way it
would deal with an engine hang detected by the periodic hang checker. The only
difference between the two is that a watchdog timeout is never promoted to
full GPU reset, even if a per-engine reset was attempted too recently. Thus,
the watchdog interrupt handler calls the error handler directly, passing the
engine mask of the hung engine in question, which immediately results in a
per-engine hang recovery being scheduled.

2b. If the batch buffer completes and execution reaches the watchdog
cancellation instruction before the watchdog counter reaches its threshold
value, the watchdog is cancelled and nothing more comes of it. No hang is
detected.

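For reference, the net effect on the ring buffer around a watchdog-protected
batch is roughly the following (a conceptual sketch; the exact dword sequences
are in the emission functions added below):

	MI_LOAD_REGISTER_IMM: RING_THRESH <- watchdog_threshold
	MI_LOAD_REGISTER_IMM: RING_CNTR   <- I915_WATCHDOG_ENABLE (start)
	MI_BATCH_BUFFER_START: the user batch buffer
	MI_LOAD_REGISTER_IMM: RING_CNTR   <- engine watchdog disable value

If the batch hangs, the counter reaches RING_THRESH and the watchdog interrupt
fires (case 2a above); if the batch completes, the trailing RING_CNTR write
cancels the counter first (case 2b).
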
*** This patch introduces: ***

1. Command emission into the ring buffer for starting and stopping the watchdog
timer before/after batch buffer start during batch buffer submission.

2. A feature support query function that verifies that the requested engine
actually supports watchdog timeout and fails the batch buffer submission
otherwise.

NOTE: I don't know if Ben Widawsky had any part in this code from 3 years ago.
There have been so many people involved in this already that I am in no
position to know. If I've missed anyone's sob line please let me know.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c        | 99 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h | 22 ++++++++
 2 files changed, 121 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 6efbcd7..43d424f 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1095,6 +1095,80 @@ int intel_logical_ring_reserve_space(struct drm_i915_gem_request *request)
 	return intel_logical_ring_begin(request, 0);
 }
 
+static int
+gen8_ring_start_watchdog(struct drm_i915_gem_request *req)
+{
+	int ret;
+	struct intel_ringbuffer *ringbuf = req->ringbuf;
+	struct intel_engine_cs *ring = ringbuf->ring;
+
+	ret = intel_logical_ring_begin(req, 10);
+	if (ret)
+		return ret;
+
+	/*
+	 * i915_reg.h includes a warning to place a MI_NOOP
+	 * before a MI_LOAD_REGISTER_IMM
+	 */
+	intel_logical_ring_emit(ringbuf, MI_NOOP);
+	intel_logical_ring_emit(ringbuf, MI_NOOP);
+
+	/* Set counter period */
+	intel_logical_ring_emit(ringbuf, MI_LOAD_REGISTER_IMM(1));
+	intel_logical_ring_emit_reg(ringbuf, RING_THRESH(ring->mmio_base));
+	intel_logical_ring_emit(ringbuf, ring->watchdog_threshold);
+	intel_logical_ring_emit(ringbuf, MI_NOOP);
+
+	/* Start counter */
+	intel_logical_ring_emit(ringbuf, MI_LOAD_REGISTER_IMM(1));
+	intel_logical_ring_emit_reg(ringbuf, RING_CNTR(ring->mmio_base));
+	intel_logical_ring_emit(ringbuf, I915_WATCHDOG_ENABLE);
+	intel_logical_ring_emit(ringbuf, MI_NOOP);
+	intel_logical_ring_advance(ringbuf);
+
+	return 0;
+}
+
+static int
+gen8_ring_stop_watchdog(struct drm_i915_gem_request *req)
+{
+	int ret;
+	struct intel_ringbuffer *ringbuf = req->ringbuf;
+	struct intel_engine_cs *ring = ringbuf->ring;
+
+	ret = intel_logical_ring_begin(req, 6);
+	if (ret)
+		return ret;
+
+	/*
+	 * i915_reg.h includes a warning to place a MI_NOOP
+	 * before a MI_LOAD_REGISTER_IMM
+	 */
+	intel_logical_ring_emit(ringbuf, MI_NOOP);
+	intel_logical_ring_emit(ringbuf, MI_NOOP);
+
+	intel_logical_ring_emit(ringbuf, MI_LOAD_REGISTER_IMM(1));
+	intel_logical_ring_emit_reg(ringbuf, RING_CNTR(ring->mmio_base));
+
+	switch (ring->id) {
+	default:
+		WARN(1, "%s does not support watchdog timeout! " \
+			"Defaulting to render engine.\n", ring->name);
+	case RCS:
+		intel_logical_ring_emit(ringbuf, GEN6_RCS_WATCHDOG_DISABLE);
+		break;
+	case VCS:
+	case VCS2:
+		intel_logical_ring_emit(ringbuf, GEN8_VCS_WATCHDOG_DISABLE);
+		break;
+	}
+
+	intel_logical_ring_emit(ringbuf, MI_NOOP);
+	intel_logical_ring_advance(ringbuf);
+
+	return 0;
+}
+
 /**
  * execlists_submission() - submit a batchbuffer for execution, Execlists style
  * @dev: DRM device.
@@ -1124,6 +1198,12 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
 	int instp_mode;
 	u32 instp_mask;
 	int ret;
+	bool watchdog_running = false;
+	/*
+	 * NB: Place-holder until watchdog timeout is enabled through DRM
+	 * execbuf interface
+	 */
+	bool enable_watchdog = false;
 
 	instp_mode = args->flags & I915_EXEC_CONSTANTS_MASK;
 	instp_mask = I915_EXEC_CONSTANTS_MASK;
@@ -1160,6 +1240,18 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
 	if (ret)
 		return ret;
 
+	/* Start watchdog timer */
+	if (enable_watchdog) {
+		if (!intel_ring_supports_watchdog(ring))
+			return -EINVAL;
+
+		ret = gen8_ring_start_watchdog(params->request);
+		if (ret)
+			return ret;
+
+		watchdog_running = true;
+	}
+
 	if (ring == &dev_priv->ring[RCS] &&
 	    instp_mode != dev_priv->relative_constants_mode) {
 		ret = intel_logical_ring_begin(params->request, 4);
@@ -1184,6 +1276,13 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
 
 	trace_i915_gem_ring_dispatch(params->request, params->dispatch_flags);
 
+	/* Cancel watchdog timer */
+	if (watchdog_running) {
+		ret = gen8_ring_stop_watchdog(params->request);
+		if (ret)
+			return ret;
+	}
+
 	i915_gem_execbuffer_move_to_active(vmas, params->request);
 	i915_gem_execbuffer_retire_commands(params);
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index dbace39..1a78105 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -31,6 +31,8 @@ struct  intel_hw_status_page {
 	struct		drm_i915_gem_object *obj;
 };
 
+#define I915_WATCHDOG_ENABLE 0
+
 #define I915_READ_TAIL(ring) I915_READ(RING_TAIL((ring)->mmio_base))
 #define I915_WRITE_TAIL(ring, val) I915_WRITE(RING_TAIL((ring)->mmio_base), val)
 
@@ -536,6 +538,26 @@ int intel_ring_save(struct intel_engine_cs *ring,
 int intel_ring_restore(struct intel_engine_cs *ring,
 		struct drm_i915_gem_request *req);
 
+static inline bool intel_ring_supports_watchdog(struct intel_engine_cs *ring)
+{
+	bool ret = false;
+
+	if (WARN_ON(!ring))
+		goto exit;
+
+	ret = (ring->id == RCS ||
+	       ring->id == VCS ||
+	       ring->id == VCS2);
+
+	if (!ret)
+		DRM_ERROR("%s does not support watchdog timeout!\n", ring->name);
+
+exit:
+	return ret;
+}
+int intel_ring_start_watchdog(struct intel_engine_cs *ring);
+int intel_ring_stop_watchdog(struct intel_engine_cs *ring);
+
 int __must_check intel_ring_idle(struct intel_engine_cs *ring);
 void intel_ring_init_seqno(struct intel_engine_cs *ring, u32 seqno);
 int intel_ring_flush_all_caches(struct drm_i915_gem_request *req);
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 10/20] drm/i915: Watchdog timeout: DRM kernel interface enablement
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (8 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 09/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 17:28 ` [PATCH 11/20] drm/i915: Fake lost context event interrupts through forced CSB checking Arun Siluvery
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

Final enablement patch for GPU hang recovery using watchdog timeout.
Added execbuf flag for watchdog timeout in DRM kernel interface.

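For illustration, requesting a watchdog-protected batch from userspace would
look roughly like this (a minimal sketch assuming libdrm; buffer setup,
relocations, error handling and exact include paths are omitted or assumed):

	#include <stdint.h>
	#include <string.h>
	#include <xf86drm.h>
	#include <drm/i915_drm.h>

	static int submit_with_watchdog(int fd, uint64_t exec_objects,
					uint32_t nr_objects, uint32_t batch_len)
	{
		struct drm_i915_gem_execbuffer2 execbuf;

		memset(&execbuf, 0, sizeof(execbuf));
		execbuf.buffers_ptr = exec_objects;
		execbuf.buffer_count = nr_objects;
		execbuf.batch_len = batch_len;
		/* Run on the render ring with the hardware watchdog armed */
		execbuf.flags = I915_EXEC_RENDER | I915_EXEC_ENABLE_WATCHDOG;

		return drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
	}

Setting the flag for an engine without watchdog support (anything other than
RCS, VCS or VCS2) makes the submission fail with -EINVAL, as per the
intel_ring_supports_watchdog() check introduced in the previous patch.
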
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 6 ++----
 include/uapi/drm/i915_drm.h      | 5 ++++-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 43d424f..cdb1b9a 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1199,10 +1199,6 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
 	u32 instp_mask;
 	int ret;
 	bool watchdog_running = false;
-	/*
-	 * NB: Place-holder until watchdog timeout is enabled through DRM
-	 * execbuf interface
-	 */
 	bool enable_watchdog = false;
 
 	instp_mode = args->flags & I915_EXEC_CONSTANTS_MASK;
@@ -1240,6 +1236,8 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
 	if (ret)
 		return ret;
 
+	enable_watchdog = args->flags & I915_EXEC_ENABLE_WATCHDOG;
+
 	/* Start watchdog timer */
 	if (enable_watchdog) {
 		if (!intel_ring_supports_watchdog(ring))
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index acf2102..e157cc0 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -782,7 +782,10 @@ struct drm_i915_gem_execbuffer2 {
  */
 #define I915_EXEC_RESOURCE_STREAMER     (1<<15)
 
-#define __I915_EXEC_UNKNOWN_FLAGS -(I915_EXEC_RESOURCE_STREAMER<<1)
+/* Enable watchdog timer for this batch buffer */
+#define I915_EXEC_ENABLE_WATCHDOG       (1<<16)
+
+#define __I915_EXEC_UNKNOWN_FLAGS -(I915_EXEC_ENABLE_WATCHDOG<<1)
 
 #define I915_EXEC_CONTEXT_ID_MASK	(0xffffffff)
 #define i915_execbuffer2_set_context_id(eb2, context) \
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 11/20] drm/i915: Fake lost context event interrupts through forced CSB checking.
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (9 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 10/20] drm/i915: Watchdog timeout: DRM kernel interface enablement Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 17:28 ` [PATCH 12/20] drm/i915: Debugfs interface for per-engine hang recovery Arun Siluvery
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

*** General ***
A recurring issue during long-duration operations testing of concurrent
rendering tasks with intermittent hangs is that context completion interrupts
following engine resets are sometimes lost. This becomes a real problem since
the hardware might have completed a previously hung context following a
per-engine hang recovery and then gone idle somehow without sending an
interrupt telling the driver about this. At this point the driver would be
stuck waiting for context completion, thinking that the context is still
active, even though the hardware would be idle and waiting for more work. The
periodic hang checker would detect both hangs caused by stuck workloads in the
GPU and the software hangs caused by these inconsistencies. The difference lies
in how we handle these two types of hangs.

*** Rectification ***
The way hangs caused by inconsistent context submission states are resolved is
by checking the context submission state consistency as a pre-stage to the
engine recovery path. If the state is not consistent at that point then the
normal form of engine recovery is not attempted. Instead, an attempt to rectify
the inconsistency is made by faking the presumed lost context event interrupt -
or more specifically by calling the context event interrupt handler manually
from the hang recovery path outside of the normal interrupt execution context.
The reason this works is because regardless of whether or not an IRQ goes
missing the hardware always updates the CSB buffer during context state
transitions, which means that in the case of a missing IRQ there would be
outstanding CSB events waiting to be processed, out of which one might be the
context completion event belonging to the context currently blocking the work
submission flow in one of the execlist queues. The faked context event
interrupt would then end up in the interrupt handler, which would process the
outstanding events and purge the stuck context.

If this rectification attempt fails (because there are no outstanding CSB
events, at least none that could account for the inconsistency) then the engine
recovery is failed and the error handler falls back to legacy full GPU reset
mode. Assuming the full GPU reset is successful this form of recovery will
always cause the system to become consistent since the GPU is reset and forced
into an idle state and all pending driver work is discarded, which would
consistently reflect the idle GPU hardware state.

If the rectification attempt succeeds - meaning that unprocessed CSB events
were found and acted upon, which led to old contexts being purged from the
execlist queue and new work being submitted to hardware - then the
inconsistency rectification is considered to have successfully resolved the
detected hang that brought on the hang recovery. The engine recovery is
therefore ended early at that point, no further attempts at resolving the
hang are made and the hang detection is cleared, letting the driver resume.

*** Detection ***
In principle a context submission status inconsistency is detected by comparing
the ID of the context in the head request of an execlist queue with the context
ID currently in the EXECLIST_STATUS register of the same engine (the latter
denoting the ID of the context currently running on the hardware). If the two
do not match it is assumed that an interrupt was missed and that the driver is
now stuck in an inconsistent state. Of course, the driver and hardware can
go in and out of consistency momentarily many times per second as contexts
start and complete in the driver independently from the actual GPU
hardware. The only way an inconsistency detection can be trusted is by
first making sure that the detected state is stable, either by observing
sustained, initial signs of a hang in the periodic hang checker or at the
onset of the hang recovery path, at which point it has been decided that
the execution is hung and that the driver is stable in that state.

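In pseudo-form, the comparison at the heart of the detection amounts to
something like this (a sketch only; execlist_queue_head() and ctx_obj_of()
are hypothetical stand-ins for the request/queue accessors actually used by
intel_execlists_TDR_get_current_request()):

	static bool context_submission_consistent(struct intel_engine_cs *engine)
	{
		struct drm_i915_private *dev_priv = engine->dev->dev_private;
		struct drm_i915_gem_request *head_req = execlist_queue_head(engine);
		u32 hw_ctx_id = I915_READ(RING_EXECLIST_STATUS_CTX_ID(engine));

		/* Nothing submitted means nothing to be inconsistent about */
		if (!head_req)
			return true;

		/* Driver and hardware must agree on the running context */
		return intel_execlists_ctx_id(ctx_obj_of(head_req)) == hw_ctx_id;
	}

As described above, a single negative result is not trusted on its own; it is
only acted upon once the hang state has been deemed stable.
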
*** WARNING ***
In time-constrained scenarios waiting until the onset of hang recovery before
detecting and potentially rectifying context submission state inconsistencies
might cause problematic side-effects. For example, in Android the
SurfaceFlinger/HWC compositor has a hard time limit of 3 seconds after which
any unresolved hangs might cause display freezes (due to dropped display flip
requests), which can only be resolved by a reboot. If hang detection and hang
recovery takes upwards of 3 seconds then there is a distinct risk that handling
inconsistencies this late might cause issues. Whether or not this will become a
problem remains to be shown in practice. So far no issues have been spotted in
other environments such as X but it is worth being aware of.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c      | 182 +++++++++++++++++++++++++++--------
 drivers/gpu/drm/i915/i915_irq.c      |  24 +----
 drivers/gpu/drm/i915/intel_lrc.c     |  83 +++++++++++++++-
 drivers/gpu/drm/i915/intel_lrc.h     |   2 +-
 drivers/gpu/drm/i915/intel_lrc_tdr.h |   3 +
 5 files changed, 228 insertions(+), 66 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index c0ad003..6faf908 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -961,6 +961,120 @@ int i915_reset(struct drm_device *dev)
 }
 
 /**
+ * i915_gem_reset_engine_CSSC_precheck - check/rectify context inconsistency
+ * @dev_priv:	Pointer to the drm_i915_private structure.
+ * @engine: 	The engine whose state is to be checked.
+ * @req: 	Output parameter containing the request most recently submitted
+ * 		to hardware, if any.  May be NULL.
+ * @ret: 	Output parameter containing the error code to be returned to
+ * 		i915_handle_error().
+ *
+ * Before an engine reset can be attempted it is important that the submission
+ * state of the currently running (i.e. hung) context is verified as
+ * consistent. If the context submission state is inconsistent that means that
+ * the context that the driver thinks is running on hardware is in fact not
+ * running at all. It might be that the hardware is idle or is running another
+ * context altogether. The reason why this is important in the case of engine
+ * reset in particular is because at the end of the engine recovery path the
+ * fixed-up context needs to be resubmitted to hardware in order for the
+ * context changes (HEAD register nudged past the hung batch buffer) to take
+ * effect. Context resubmission requires the same context as is resubmitted to
+ * be running on hardware - otherwise we might cause unexpected preemptions or
+ * submit a context to a GPU engine that is idle, which would not make much
+ * sense. (if the engine is idle why does the driver think that the context in
+ * question is hung etc.)
+ * If an inconsistent state like this is detected then a rectification attempt
+ * is made by faking the presumed lost context event interrupt. The outcome of
+ * this attempt is returned back to the per-engine recovery path: If it was
+ * succesful the hang recovery can be aborted early since we now have resolved
+ * the hang this way. If it was not successful then fail the hang recovery and
+ * let the error handler promote to the next level of hang recovery.
+ *
+ * Returns:
+ *	True: 	Work currently in progress, consistent state.
+ *		Proceed with engine reset.
+ *	False: 	No work in progress or work in progress but state irrecoverably
+ *		inconsistent (context event IRQ faking attempted but failed).
+ *		Do not proceed with engine reset.
+ */
+static bool i915_gem_reset_engine_CSSC_precheck(
+		struct drm_i915_private *dev_priv,
+		struct intel_engine_cs *engine,
+		struct drm_i915_gem_request **req,
+		int *ret)
+{
+	bool precheck_ok = true;
+	enum context_submission_status status;
+
+	WARN_ON(!ret);
+
+	*ret = 0;
+
+	status = intel_execlists_TDR_get_current_request(engine, req);
+
+	if (status == CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED) {
+		/*
+		 * No work in flight, no way to carry out a per-engine hang
+		 * recovery in this state. Just do early exit and forget it
+		 * happened. If this state persists then the error handler will
+		 * be called by the periodic hang checker soon after this and
+		 * at that point the hang will hopefully be promoted to full
+		 * GPU reset, which will take care of it.
+		 */
+		WARN(1, "No work in flight! Aborting recovery on %s\n",
+			engine->name);
+
+		precheck_ok = false;
+		*ret = 0;
+
+	} else if (status == CONTEXT_SUBMISSION_STATUS_INCONSISTENT) {
+		if (!intel_execlists_TDR_force_CSB_check(dev_priv, engine)) {
+			DRM_ERROR("Inconsistency rectification on %s unsuccessful!\n",
+				engine->name);
+
+			/*
+			 * Context submission state is inconsistent and
+			 * faking a context event IRQ did not help.
+			 * Fail and promote to higher level of
+			 * recovery!
+			 */
+			precheck_ok = false;
+			*ret = -EINVAL;
+		} else {
+			DRM_INFO("Inconsistency rectification on %s successful!\n",
+				engine->name);
+
+			/*
+			 * Rectifying the inconsistent context
+			 * submission status helped! No reset required,
+			 * just exit and move on!
+			 */
+			precheck_ok = false;
+			*ret = 0;
+
+			/*
+			 * Reset the hangcheck state otherwise the hang checker
+			 * will detect another hang immediately. Since the
+			 * forced CSB checker resulted in more work being
+			 * submitted to hardware we know that we are not hung
+			 * anymore so it should be safe to clear any hang
+			 * detections for this engine prior to this point.
+			 */
+			i915_hangcheck_reinit(engine);
+		}
+
+	} else if (status != CONTEXT_SUBMISSION_STATUS_OK) {
+		WARN(1, "Unexpected context submission status (%u) on %s\n",
+			status, engine->name);
+
+		precheck_ok = false;
+		*ret = -EINVAL;
+	}
+
+	return precheck_ok;
+}
+
+/**
  * i915_reset_engine - reset GPU engine after a hang
  * @engine: engine to reset
  *
@@ -1001,28 +1115,22 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	i915_gem_reset_ring_status(dev_priv, engine);
 
 	if (i915.enable_execlists) {
-		enum context_submission_status status =
-			intel_execlists_TDR_get_current_request(engine, NULL);
-
 		/*
-		 * If the context submission state in hardware is not
-		 * consistent with the the corresponding state in the driver or
-		 * if there for some reason is no current context in the
-		 * process of being submitted then bail out and try again. Do
-		 * not proceed unless we have reliable current context state
-		 * information. The reason why this is important is because
-		 * per-engine hang recovery relies on context resubmission in
-		 * order to force the execution to resume following the hung
-		 * batch buffer. If the hardware is not currently running the
-		 * same context as the driver thinks is hung then anything can
-		 * happen at the point of context resubmission, e.g. unexpected
-		 * preemptions or the previously hung context could be
-		 * submitted when the hardware is idle which makes no sense.
+		 * Check context submission status consistency (CSSC) before
+		 * moving on. If the driver and hardware have different
+		 * opinions about what is going on and this inconsistency
+		 * cannot be rectified then just fail and let TDR escalate to a
+		 * higher form of hang recovery.
 		 */
-		if (status != CONTEXT_SUBMISSION_STATUS_OK) {
-			ret = -EAGAIN;
+		if (!i915_gem_reset_engine_CSSC_precheck(dev_priv,
+							 engine,
+							 NULL,
+							 &ret)) {
+			DRM_INFO("Aborting hang recovery on %s (%d)\n",
+				engine->name, ret);
+
 			goto reset_engine_error;
-		}
+		}
 	}
 
 	ret = intel_ring_disable(engine);
@@ -1032,27 +1140,25 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	}
 
 	if (i915.enable_execlists) {
-		enum context_submission_status status;
-		bool inconsistent;
-
-		status = intel_execlists_TDR_get_current_request(engine,
-				&current_request);
-
-		inconsistent = (status != CONTEXT_SUBMISSION_STATUS_OK);
-		if (inconsistent) {
-			/*
-			 * If we somehow have reached this point with
-			 * an inconsistent context submission status then
-			 * back out of the previously requested reset and
-			 * retry later.
-			 */
-			WARN(inconsistent,
-			     "Inconsistent context status on %s: %u\n",
-			     engine->name, status);
+		/*
+		 * Get a hold of the currently executing context.
+		 *
+		 * The context submission status consistency check is done
+		 * implicitly here, so we might as well repeat it post-engine
+		 * disablement since we get that option for free. Also, it's
+		 * conceivable that the context submission state might have
+		 * changed as part of the reset request on gen8+, so it's not
+		 * completely devoid of value to do this.
+		 */
+		if (!i915_gem_reset_engine_CSSC_precheck(dev_priv,
+							 engine,
+							 &current_request,
+							 &ret)) {
+			DRM_INFO("Aborting hang recovery on %s (%d)\n",
+				engine->name, ret);
 
-			ret = -EAGAIN;
 			goto reenable_reset_engine_error;
-		}
+		}
 	}
 
 	/* Sample the current ring head position */
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index c4f888b..f8fedbc 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -36,6 +36,7 @@
 #include "i915_drv.h"
 #include "i915_trace.h"
 #include "intel_drv.h"
+#include "intel_lrc_tdr.h"
 
 /**
  * DOC: interrupt handling
@@ -1324,7 +1325,7 @@ gen8_cs_irq_handler(struct intel_engine_cs *ring, u32 iir, int test_shift)
 	if (iir & (GT_RENDER_USER_INTERRUPT << test_shift))
 		notify_ring(ring);
 	if (iir & (GT_CONTEXT_SWITCH_INTERRUPT << test_shift))
-		intel_lrc_irq_handler(ring);
+		intel_lrc_irq_handler(ring, true);
 	if (iir & (GT_GEN8_RCS_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT)) {
 		struct drm_i915_private *dev_priv = ring->dev->dev_private;
 
@@ -2524,27 +2525,6 @@ static void i915_error_work_func(struct work_struct *work)
 
 			ret = i915_reset_engine(ring);
 
-			/*
-			 * Execlist mode only:
-			 *
-			 * -EAGAIN means that between detecting a hang (and
-			 * also determining that the currently submitted
-			 * context is stable and valid) and trying to recover
-			 * from the hang the current context changed state.
-			 * This means that we are probably not completely hung
-			 * after all. Just fail and retry by exiting all the
-			 * way back and wait for the next hang detection. If we
-			 * have a true hang on our hands then we will detect it
-			 * again, otherwise we will continue like nothing
-			 * happened.
-			 */
-			if (ret == -EAGAIN) {
-				DRM_ERROR("Reset of %s aborted due to " \
-					  "change in context submission " \
-					  "state - retrying!", ring->name);
-				ret = 0;
-			}
-
 			if (ret) {
 				DRM_ERROR("Reset of %s failed! (%d)", ring->name, ret);
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index cdb1b9a..b6069d3 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -726,11 +726,15 @@ static void get_context_status(struct intel_engine_cs *ring,
 /**
  * intel_lrc_irq_handler() - handle Context Switch interrupts
  * @ring: Engine Command Streamer to handle.
+ * @do_lock: Lock execlist spinlock (if false the caller is responsible for this)
  *
  * Check the unread Context Status Buffers and manage the submission of new
  * contexts to the ELSP accordingly.
+ *
+ * Return:
+ *      The number of unqueued contexts.
  */
-void intel_lrc_irq_handler(struct intel_engine_cs *ring)
+int intel_lrc_irq_handler(struct intel_engine_cs *ring, bool do_lock)
 {
 	struct drm_i915_private *dev_priv = ring->dev->dev_private;
 	u32 status_pointer;
@@ -740,6 +744,9 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
 	u32 status_id;
 	u32 submit_contexts = 0;
 
+	if (do_lock)
+		spin_lock(&ring->execlist_lock);
+
 	status_pointer = I915_READ(RING_CONTEXT_STATUS_PTR(ring));
 
 	read_pointer = ring->next_context_status_buffer;
@@ -747,8 +754,6 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
 	if (read_pointer > write_pointer)
 		write_pointer += GEN8_CSB_ENTRIES;
 
-	spin_lock(&ring->execlist_lock);
-
 	while (read_pointer < write_pointer) {
 
 		get_context_status(ring, ++read_pointer % GEN8_CSB_ENTRIES,
@@ -781,8 +786,6 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
 		execlists_context_unqueue(ring, false);
 	}
 
-	spin_unlock(&ring->execlist_lock);
-
 	if (unlikely(submit_contexts > 2))
 		DRM_ERROR("More than two context complete events?\n");
 
@@ -793,6 +796,11 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
 	I915_WRITE(RING_CONTEXT_STATUS_PTR(ring),
 		   _MASKED_FIELD(GEN8_CSB_READ_PTR_MASK,
 				 ring->next_context_status_buffer << 8));
+
+	if (do_lock)
+		spin_unlock(&ring->execlist_lock);
+
+	return submit_contexts;
 }
 
 static int execlists_context_queue(struct drm_i915_gem_request *request)
@@ -1811,6 +1819,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *ring)
 	struct drm_device *dev = ring->dev;
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	u8 next_context_status_buffer_hw;
+	unsigned long flags;
 
 	lrc_setup_hardware_status_page(ring,
 				ring->default_context->engine[ring->id].state);
@@ -1823,6 +1832,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *ring)
 		   _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
 	POSTING_READ(RING_MODE_GEN7(ring));
 
+	spin_lock_irqsave(&ring->execlist_lock, flags);
 	/*
 	 * Instead of resetting the Context Status Buffer (CSB) read pointer to
 	 * zero, we need to read the write pointer from hardware and use its
@@ -1847,6 +1857,8 @@ static int gen8_init_common_ring(struct intel_engine_cs *ring)
 		next_context_status_buffer_hw = (GEN8_CSB_ENTRIES - 1);
 
 	ring->next_context_status_buffer = next_context_status_buffer_hw;
+	spin_unlock_irqrestore(&ring->execlist_lock, flags);
+
 	DRM_DEBUG_DRIVER("Execlists enabled for %s\n", ring->name);
 
 	i915_hangcheck_reinit(ring);
@@ -3232,3 +3244,64 @@ intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
 
 	return status;
 }
+
+/**
+ * intel_execlists_TDR_force_CSB_check() - rectify inconsistency by faking IRQ
+ * @dev_priv: Pointer to the drm_i915_private structure.
+ * @engine: engine whose CSB is to be checked.
+ *
+ * Context submission status inconsistencies are caused by lost interrupts that
+ * leave CSB events unprocessed and leave contexts in the execlist queues when
+ * they should really have been removed. These stale contexts block further
+ * submissions to the hardware (all the while the hardware is sitting idle) and
+ * thereby cause a software hang. The way to rectify this is by manually
+ * checking the CSB buffer for outstanding context state transition events and
+ * acting on these. The easiest way of doing this is by simply faking the
+ * presumed lost context event interrupt by manually calling the interrupt
+ * handler. If there are indeed outstanding, unprocessed CSB events then these
+ * will be processed by the faked interrupt call and if one of these events is
+ * some form of context completion event then that will purge a stale context
+ * from the execlist queue and submit a new context to hardware from the queue,
+ * thereby resuming execution.
+ *
+ * Returns:
+ * 	True: Forced CSB check successful, state consistency restored.
+ * 	False: No CSB events found, forced CSB check unsuccessful, failed
+ * 	       trying to restore consistency.
+ */
+bool intel_execlists_TDR_force_CSB_check(struct drm_i915_private *dev_priv,
+					 struct intel_engine_cs *engine)
+{
+	unsigned long flags;
+	bool hw_active;
+	int was_effective;
+
+	hw_active =
+		(I915_READ(RING_EXECLIST_STATUS_LO(engine)) &
+			EXECLIST_STATUS_CURRENT_ACTIVE_ELEMENT_STATUS) ?
+				true : false;
+	if (hw_active) {
+		u32 hw_context;
+
+		hw_context = I915_READ(RING_EXECLIST_STATUS_CTX_ID(engine));
+		WARN(1, "Context (%x) executing on %s - " \
+				"No need for faked IRQ!\n",
+				hw_context, engine->name);
+		return false;
+	}
+
+	spin_lock_irqsave(&engine->execlist_lock, flags);
+
+	WARN(1, "%s: Inconsistent context state - Faking context event IRQ!\n",
+		engine->name);
+
+	if (!(was_effective = intel_lrc_irq_handler(engine, false)))
+		DRM_ERROR("Forced CSB check of %s ineffective!\n", engine->name);
+
+	spin_unlock_irqrestore(&engine->execlist_lock, flags);
+
+	wake_up_all(&engine->irq_queue);
+
+	return !!was_effective;
+}
+
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index d9acb31..55a582d 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -117,7 +117,7 @@ int intel_execlists_submission(struct i915_execbuffer_params *params,
 			       struct list_head *vmas);
 u32 intel_execlists_ctx_id(struct drm_i915_gem_object *ctx_obj);
 
-void intel_lrc_irq_handler(struct intel_engine_cs *ring);
+int intel_lrc_irq_handler(struct intel_engine_cs *ring, bool do_lock);
 void intel_execlists_retire_requests(struct intel_engine_cs *ring);
 
 int intel_execlists_read_tail(struct intel_engine_cs *ring,
diff --git a/drivers/gpu/drm/i915/intel_lrc_tdr.h b/drivers/gpu/drm/i915/intel_lrc_tdr.h
index 4520753..041c808 100644
--- a/drivers/gpu/drm/i915/intel_lrc_tdr.h
+++ b/drivers/gpu/drm/i915/intel_lrc_tdr.h
@@ -32,5 +32,8 @@ enum context_submission_status
 intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
 		struct drm_i915_gem_request **req);
 
+bool intel_execlists_TDR_force_CSB_check(struct drm_i915_private *dev_priv,
+					 struct intel_engine_cs *engine);
+
 #endif /* _INTEL_LRC_TDR_H_ */
 
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 12/20] drm/i915: Debugfs interface for per-engine hang recovery.
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (10 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 11/20] drm/i915: Fake lost context event interrupts through forced CSB checking Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 17:28 ` [PATCH 13/20] drm/i915: Test infrastructure for context state inconsistency simulation Arun Siluvery
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Ian Lister, Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

1. The i915_wedged_set() function now allows for both legacy full GPU reset and
   per-engine reset of one or more engines at a time:

	a) Legacy hang recovery by passing 0.

	b) Multiple engine hang recovery by passing in an engine flag mask
	where bit 0 corresponds to engine 0 = RCS, bit 1 corresponds to engine
	1 = VCS etc. This allows any combination of engine hang recoveries to
	be tested. For example, passing in the value 0x3 schedules hang
	recovery for engines 0 and 1 (RCS and VCS) at the same time (see the
	usage sketch after this list).

2. The i915_hangcheck_info() function is complemented with statistics related
   to:

	a) Number of engine hangs detected by periodic hang checker.
	b) Number of watchdog timeout hangs detected.
	c) Number of full GPU resets carried out.
	d) Number of engine resets carried out.

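As an illustration, a test could schedule simultaneous RCS and VCS recovery
along these lines (a sketch; the debugfs mount point and card index are
assumptions):

	#include <fcntl.h>
	#include <unistd.h>

	static void trigger_rcs_vcs_reset(void)
	{
		int fd = open("/sys/kernel/debug/dri/0/i915_wedged", O_WRONLY);

		if (fd < 0)
			return;

		/* Bit 0 = RCS, bit 1 = VCS: 0x3 recovers both at once */
		write(fd, "3", 1);
		close(fd);
	}
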
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 75 +++++++++++++++++++++++++++++++++++--
 1 file changed, 71 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index dabddda..62c9a41 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1357,6 +1357,8 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 	} else
 		seq_printf(m, "Hangcheck inactive\n");
 
+	seq_printf(m, "Full GPU resets = %u\n", i915_reset_count(&dev_priv->gpu_error));
+
 	for_each_ring(ring, dev_priv, i) {
 		seq_printf(m, "%s:\n", ring->name);
 		seq_printf(m, "\tseqno = %x [current %x]\n",
@@ -1368,6 +1370,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 			   (long long)ring->hangcheck.max_acthd);
 		seq_printf(m, "\tscore = %d\n", ring->hangcheck.score);
 		seq_printf(m, "\taction = %d\n", ring->hangcheck.action);
+		seq_printf(m, "\tengine resets = %u\n",
+			ring->hangcheck.reset_count);
+		seq_printf(m, "\tengine hang detections = %u\n",
+			ring->hangcheck.tdr_count);
+		seq_printf(m, "\tengine watchdog timeout detections = %u\n",
+			ring->hangcheck.watchdog_count);
 
 		if (ring->id == RCS) {
 			seq_puts(m, "\tinstdone read =");
@@ -4701,11 +4709,48 @@ i915_wedged_get(void *data, u64 *val)
 	return 0;
 }
 
+static const char *ringid_to_str(enum intel_ring_id ring_id)
+{
+	switch (ring_id) {
+	case RCS:
+		return "RCS";
+	case VCS:
+		return "VCS";
+	case BCS:
+		return "BCS";
+	case VECS:
+		return "VECS";
+	case VCS2:
+		return "VCS2";
+	}
+
+	return "unknown";
+}
+
 static int
 i915_wedged_set(void *data, u64 val)
 {
 	struct drm_device *dev = data;
 	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct intel_engine_cs *engine;
+	u32 i;
+#define ENGINE_MSGLEN 64
+	char msg[ENGINE_MSGLEN];
+
+	/*
+	 * Val contains the engine flag mask of engines to be reset.
+	 *
+	 * * Full GPU reset is caused by passing val == 0x0
+	 *
+	 * * Any combination of engine hangs is caused by setting up val as a
+	 *   mask with the following bits set for each engine to be hung:
+	 *
+	 *	Bit 0: RCS engine
+	 *	Bit 1: VCS engine
+	 *	Bit 2: BCS engine
+	 *	Bit 3: VECS engine
+	 *	Bit 4: VCS2 engine (if available)
+	 */
 
 	/*
 	 * There is no safeguard against this debugfs entry colliding
@@ -4714,14 +4759,36 @@ i915_wedged_set(void *data, u64 val)
 	 * test harness is responsible enough not to inject gpu hangs
 	 * while it is writing to 'i915_wedged'
 	 */
-
-	if (i915_reset_in_progress(&dev_priv->gpu_error))
+	if (i915_gem_check_wedge(dev_priv, NULL, true))
 		return -EAGAIN;
 
 	intel_runtime_pm_get(dev_priv);
 
-	i915_handle_error(dev, 0x0, false, val,
-			  "Manually setting wedged to %llu", val);
+	memset(msg, 0, sizeof(msg));
+
+	if (val) {
+		int len = scnprintf(msg, sizeof(msg), "Manual reset:");
+
+		/* Assemble message string */
+		for_each_ring(engine, dev_priv, i)
+			if (intel_ring_flag(engine) & val) {
+				DRM_INFO("Manual reset: %s\n", engine->name);
+
+				/* Append to msg; scnprintf must not be
+				 * passed its own buffer as a source */
+				len += scnprintf(msg + len, sizeof(msg) - len,
+						 " [%s]", ringid_to_str(i));
+			}
+
+	} else {
+		scnprintf(msg, sizeof(msg), "Manual global reset");
+	}
+
+	i915_handle_error(dev,
+			  val,
+			  false,
+			  true,
+			  msg);
 
 	intel_runtime_pm_put(dev_priv);
 
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 13/20] drm/i915: Test infrastructure for context state inconsistency simulation
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (11 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 12/20] drm/i915: Debugfs interface for per-engine hang recovery Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 17:28 ` [PATCH 14/20] drm/i915: TDR/watchdog trace points Arun Siluvery
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

Added debugfs functions and embedded test infrastructure in the context event
interrupt handler for simulating the loss of context event interrupts so that a
context submission state inconsistency can be induced. This is useful for
testing the consistency checker pre-stage of the engine hang recovery path:
in order to verify that the inconsistency detection works we first need to be
able to induce a state inconsistency that the checker can detect and act upon.

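The value written encodes one 4-bit drop counter per engine (see the layout
comment added to struct i915_gpu_error below). A test could compute it along
these lines (a sketch; the helper name is illustrative):

	/* Value requesting 'count' lost context event IRQs on engine 'id'
	 * (id: RCS = 0, VCS = 1, BCS = 2, VECS = 3, VCS2 = 4) */
	static u64 fake_lost_irqs(unsigned int id, unsigned int count)
	{
		return (u64)(count & 0xf) << (id * 4);
	}

For example, writing fake_lost_irqs(1, 2) == 0x20 to the new
i915_fake_ctx_inconsistency debugfs file drops the next two context event
IRQs on VCS.
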
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 88 +++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_dma.c     |  2 +
 drivers/gpu/drm/i915/i915_drv.c     |  3 ++
 drivers/gpu/drm/i915/i915_drv.h     | 12 +++++
 drivers/gpu/drm/i915/intel_lrc.c    | 68 ++++++++++++++++++++++++++++
 5 files changed, 173 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 62c9a41..7148a65 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -4800,6 +4800,93 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_wedged_fops,
 			"%llu\n");
 
 static int
+i915_fake_ctx_submission_inconsistency_get(void *data, u64 *val)
+{
+	struct drm_device *dev = data;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct intel_engine_cs *ring;
+	unsigned i;
+
+	DRM_INFO("Faked inconsistent context submission state: %x\n",
+		dev_priv->gpu_error.faked_lost_ctx_event_irq);
+
+	for_each_ring(ring, dev_priv, i) {
+		u32 fake_cnt =
+			(dev_priv->gpu_error.faked_lost_ctx_event_irq >> (i<<2)) & 0xf;
+
+		DRM_INFO("%s: Faking %s [%u IRQs left to drop]\n",
+			ring->name,
+			fake_cnt ? "enabled" : "disabled",
+			fake_cnt);
+	}
+
+	*val = (u64) dev_priv->gpu_error.faked_lost_ctx_event_irq;
+
+	return 0;
+}
+
+static int
+i915_fake_ctx_submission_inconsistency_set(void *data, u64 val)
+{
+	struct drm_device *dev = data;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	u32 fake_status;
+
+	/*
+	 * Set up a simulated/faked lost context event interrupt. This is used
+	 * to induce inconsistent HW/driver states for the context submission
+	 * status consistency checker (invoked as a pre-stage to GPU engine
+	 * hang recovery) to detect, which is required for validation purposes.
+	 *
+	 * val contains the new faked_lost_ctx_event_irq word that is to be
+	 * merged with the already set faked_lost_ctx_event_irq word.
+	 *
+	 * val == 0 means clear all previously set fake bits.
+	 *
+	 * Each nibble contains a number between 0-15 denoting the number of
+	 * interrupts left to lose on the engine that nibble corresponds to.
+	 *
+	 * RCS: faked_lost_ctx_event_irq[3:0]
+	 * VCS: faked_lost_ctx_event_irq[7:4]
+	 * BCS: faked_lost_ctx_event_irq[11:8]
+	 * VECS: faked_lost_ctx_event_irq[15:12]
+	 * etc
+	 *
+	 * The number in each nibble is decremented by the context event
+	 * interrupt handler in intel_lrc.c once the faked interrupt loss is
+	 * executed. If a targetted interrupt is received when bit
+	 * executed. If a targeted interrupt is received while the nibble
+	 * corresponding to that engine is non-zero, that interrupt will be dropped
+	 * hardware has entered a state where removal of a context from the
+	 * context queue is required but the driver is not informed of this and
+	 * is therefore stuck in that state until inconsistency rectification
+	 * (forced CSB checking) or reboot.
+	 */
+
+	fake_status =
+		dev_priv->gpu_error.faked_lost_ctx_event_irq;
+
+	DRM_INFO("Faking lost context event IRQ (new status: %x, old status: %x)\n",
+		(u32) val, fake_status);
+
+	if (val) {
+		dev_priv->gpu_error.faked_lost_ctx_event_irq |= ((u32) val);
+	} else {
+		DRM_INFO("Clearing lost context event IRQ mask\n");
+
+		dev_priv->gpu_error.faked_lost_ctx_event_irq = 0;
+	}
+
+
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(i915_fake_ctx_submission_inconsistency_fops,
+			i915_fake_ctx_submission_inconsistency_get,
+			i915_fake_ctx_submission_inconsistency_set,
+			"%llu\n");
+
+static int
 i915_ring_stop_get(void *data, u64 *val)
 {
 	struct drm_device *dev = data;
@@ -5455,6 +5542,7 @@ static const struct i915_debugfs_files {
 	const struct file_operations *fops;
 } i915_debugfs_files[] = {
 	{"i915_wedged", &i915_wedged_fops},
+	{"i915_fake_ctx_inconsistency", &i915_fake_ctx_submission_inconsistency_fops},
 	{"i915_max_freq", &i915_max_freq_fops},
 	{"i915_min_freq", &i915_min_freq_fops},
 	{"i915_cache_sharing", &i915_cache_sharing_fops},
diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index eb12810..5748912 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -843,6 +843,8 @@ i915_hangcheck_init(struct drm_device *dev)
 	int i;
 	struct drm_i915_private *dev_priv = dev->dev_private;
 
+	dev_priv->gpu_error.faked_lost_ctx_event_irq = 0;
+
 	for (i = 0; i < I915_NUM_RINGS; i++) {
 		struct intel_engine_cs *engine = &dev_priv->ring[i];
 		struct intel_ring_hangcheck *hc = &engine->hangcheck;
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 6faf908..c597dcc 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -909,6 +909,9 @@ int i915_reset(struct drm_device *dev)
 		}
 	}
 
+	/* Clear simulated lost context event interrupts */
+	dev_priv->gpu_error.faked_lost_ctx_event_irq = 0;
+
 	if (i915_stop_ring_allow_warn(dev_priv))
 		pr_notice("drm/i915: Resetting chip after gpu hang\n");
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 24787ed..0a223b1 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1380,6 +1380,18 @@ struct i915_gpu_error {
 #define I915_STOP_RING_ALLOW_BAN       (1 << 31)
 #define I915_STOP_RING_ALLOW_WARN      (1 << 30)
 
+	/*
+	 * Bit mask for simulation of lost context event IRQs on each
+	 * respective engine.
+	 *
+	 *   Bits 0:3: 	 Number of lost IRQs to be faked on RCS
+	 *   Bits 4:7:	 Number of lost IRQs to be faked on VCS
+	 *   Bits 8:11:  Number of lost IRQs to be faked on BCS
+	 *   Bits 12:15: Number of lost IRQs to be faked on VECS
+	 *   Bits 16:19: Number of lost IRQs to be faked on VCS2
+	*/
+	u32 faked_lost_ctx_event_irq;
+
 	/* For missed irq/seqno simulation. */
 	unsigned int test_irq_rings;
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index b6069d3..913fdbb 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -724,6 +724,52 @@ static void get_context_status(struct intel_engine_cs *ring,
 }
 
 /**
+ * fake_lost_ctx_event_irq() - Checks for pending faked lost context event IRQs.
+ * @dev_priv: Pointer to the drm_i915_private structure.
+ * @ring: Engine to check pending faked lost IRQs for.
+ *
+ * Checks the nibble in dev_priv->gpu_error.faked_lost_ctx_event_irq that
+ * corresponds to the specified engine, updates it and returns accordingly.
+ *
+ * Return:
+ * 	true: If the current IRQ is to be lost.
+ * 	false: If the current IRQ is to be processed as normal.
+ */
+static inline bool fake_lost_ctx_event_irq(struct drm_i915_private *dev_priv,
+				           struct intel_engine_cs *ring)
+{
+	u32 *faked_lost_irq_mask =
+		&dev_priv->gpu_error.faked_lost_ctx_event_irq;
+
+	/*
+	 * Bit position of the least significant bit of the nibble in the
+	 * faked lost context event IRQ mask that corresponds to this engine.
+	 */
+	u32 engine_nibble = (ring->id << 2);
+
+	/* Check engine nibble for any pending IRQs to be simulated as lost */
+	if (*faked_lost_irq_mask & (0xf << engine_nibble)) {
+		DRM_INFO("Faked lost interrupt on %s! (%x)\n",
+			ring->name,
+			*faked_lost_irq_mask);
+
+		/*
+		 * Subtract the IRQ that is to be simulated as lost from the
+		 * engine nibble.
+		 */
+		*faked_lost_irq_mask -= (0x1 << engine_nibble);
+
+		DRM_INFO("New fake lost irq mask: %x\n",
+			*faked_lost_irq_mask);
+
+		/* Tell the IRQ handler to simulate lost context event IRQ */
+		return true;
+	}
+
+	return false;
+}
+
+/**
  * intel_lrc_irq_handler() - handle Context Switch interrupts
  * @ring: Engine Command Streamer to handle.
  * @do_lock: Lock execlist spinlock (if false the caller is responsible for this)
@@ -764,6 +810,23 @@ int intel_lrc_irq_handler(struct intel_engine_cs *ring, bool do_lock)
 
 		if (status & GEN8_CTX_STATUS_PREEMPTED) {
 			if (status & GEN8_CTX_STATUS_LITE_RESTORE) {
+				if (fake_lost_ctx_event_irq(dev_priv, ring)) {
+				    /*
+				     * If we want to simulate the loss of a
+				     * context event IRQ (only for such events
+				     * that could affect the execlist queue,
+				     * since this is something that could
+				     * affect the context submission status
+				     * consistency checker) then just exit the
+				     * IRQ handler early with no side-effects!
+				     * We want to pretend like this IRQ never
+				     * happened. The next time the IRQ handler
+				     * is entered for this engine the CSB
+				     * events should remain in the CSB, waiting
+				     * to be processed.
+				     */
+				    goto exit;
+				}
 				if (execlists_check_remove_request(ring, status_id))
 					WARN(1, "Lite Restored request removed from queue\n");
 			} else
@@ -772,6 +835,10 @@ int intel_lrc_irq_handler(struct intel_engine_cs *ring, bool do_lock)
 
 		if ((status & GEN8_CTX_STATUS_ACTIVE_IDLE) ||
 		    (status & GEN8_CTX_STATUS_ELEMENT_SWITCH)) {
+
+			if (fake_lost_ctx_event_irq(dev_priv, ring))
+			    goto exit;
+
 			if (execlists_check_remove_request(ring, status_id))
 				submit_contexts++;
 		}
@@ -797,6 +864,7 @@ int intel_lrc_irq_handler(struct intel_engine_cs *ring, bool do_lock)
 		   _MASKED_FIELD(GEN8_CSB_READ_PTR_MASK,
 				 ring->next_context_status_buffer << 8));
 
+exit:
 	if (do_lock)
 		spin_unlock(&ring->execlist_lock);
 
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 14/20] drm/i915: TDR/watchdog trace points.
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (12 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 13/20] drm/i915: Test infrastructure for context state inconsistency simulation Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 17:28 ` [PATCH 15/20] drm/i915: Port of Added scheduler support to __wait_request() calls Arun Siluvery
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

Defined trace points and sprinkled the usage of these throughout the
TDR/watchdog implementation.

The following trace points are supported:

	1. i915_tdr_gpu_recovery:
	Called at the onset of the full GPU reset recovery path.

	2. i915_tdr_engine_recovery:
	Called at the onset of the per-engine recovery path.

	3. i915_tdr_recovery_start:
	Called at the onset of hang recovery before recovery mode has been
	decided.

	4. i915_tdr_recovery_complete:
	Called at the point of hang recovery completion.

	5. i915_tdr_recovery_queued:
	Called once the error handler decides to schedule the actual hang
	recovery, which marks the end of the hang detection path.

	6. i915_tdr_engine_save:
	Called at the point of saving the engine state during per-engine hang
	recovery.

	7. i915_tdr_gpu_reset_complete:
	Called at the point of full GPU reset recovery completion.

	8. i915_tdr_engine_reset_complete:
	Called at the point of per-engine recovery completion.

	9. i915_tdr_forced_csb_check:
	Called at the completion of a forced CSB check.

	10. i915_tdr_hang_check:
	Called for every engine in the periodic hang checker loop before moving
	on to the next engine. Provides an overview of all hang check stats in
	real-time. The collected stats are:

		a. Engine name.

		b. Current engine seqno.

		c. Seqno of previous hang check iteration for that engine.

		d. ACTHD register value of given engine.

		e. Current hang check score of given engine (and whether or not
		the engine has been detected as hung).

		f. Current action for given engine.

		g. Busyness of given engine.

		h. Submission status of currently running context on given engine.

	11. i915_tdr_inconsistency:
	Called when an inconsistency is detected to provide more information in
	the log about the nature of the inconsistency. The collected
	information is:

		a. Engine name.

		b. ID of the currently executing context on hardware.

		c. Is the given engine idle or not?

		d. The ID of the context that was most recently submitted to
		the ELSP port from the execlist queue for the given engine.

		e. The submission/IRQ balance of the request most recently
		submitted to hardware (elsp_submitted).

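All of these follow the usual TRACE_EVENT() pattern in i915_trace.h. As a
rough sketch of the shape of one of them (the actual per-trace-point field
sets are richer than this):

	TRACE_EVENT(i915_tdr_engine_recovery,
		TP_PROTO(struct intel_engine_cs *ring),
		TP_ARGS(ring),

		TP_STRUCT__entry(
			__field(u32, ring_id)
		),

		TP_fast_assign(
			__entry->ring_id = ring->id;
		),

		TP_printk("ring=%u", __entry->ring_id)
	);
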
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c       |   3 +
 drivers/gpu/drm/i915/i915_drv.h       |   1 +
 drivers/gpu/drm/i915/i915_gpu_error.c |   2 +-
 drivers/gpu/drm/i915/i915_irq.c       |  11 +-
 drivers/gpu/drm/i915/i915_trace.h     | 339 ++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.c      |  21 ++-
 drivers/gpu/drm/i915/intel_uncore.c   |   4 +
 7 files changed, 377 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index c597dcc..73976f9 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -888,6 +888,7 @@ int i915_reset(struct drm_device *dev)
 	bool simulated;
 	int ret;
 
+	trace_i915_tdr_gpu_recovery(dev);
 	intel_reset_gt_powersave(dev);
 
 	mutex_lock(&dev->struct_mutex);
@@ -1112,6 +1113,8 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 
 	WARN_ON(!mutex_is_locked(&dev->struct_mutex));
 
+	trace_i915_tdr_engine_recovery(engine);
+
         /* Take wake lock to prevent power saving mode */
 	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 0a223b1..a65722d 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3382,6 +3382,7 @@ void i915_destroy_error_state(struct drm_device *dev);
 
 void i915_get_extra_instdone(struct drm_device *dev, uint32_t *instdone);
 const char *i915_cache_level_str(struct drm_i915_private *i915, int type);
+const char *hangcheck_action_to_str(enum intel_ring_hangcheck_action a);
 
 /* i915_cmd_parser.c */
 int i915_cmd_parser_get_version(void);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 06ca408..a5dcc7b 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -221,7 +221,7 @@ static void print_error_buffers(struct drm_i915_error_state_buf *m,
 	}
 }
 
-static const char *hangcheck_action_to_str(enum intel_ring_hangcheck_action a)
+const char *hangcheck_action_to_str(enum intel_ring_hangcheck_action a)
 {
 	switch (a) {
 	case HANGCHECK_IDLE:
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index f8fedbc..acca5d8 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2502,6 +2502,7 @@ static void i915_error_work_func(struct work_struct *work)
 
 	mutex_lock(&dev->struct_mutex);
 
+	trace_i915_tdr_recovery_start(dev);
 	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, error_event);
 
 	for_each_ring(ring, dev_priv, i) {
@@ -2620,6 +2621,7 @@ static void i915_error_work_func(struct work_struct *work)
 			kobject_uevent_env(&dev->primary->kdev->kobj,
 					   KOBJ_CHANGE, reset_done_event);
 	}
+	trace_i915_tdr_recovery_complete(dev);
 }
 
 static void i915_report_and_clear_eir(struct drm_device *dev)
@@ -2745,6 +2747,7 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask,
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	va_list args;
 	char error_msg[80];
+	bool full_reset = true;
 
 	struct intel_engine_cs *engine;
 
@@ -2763,7 +2766,6 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask,
 		 *	2. The hardware does not support it (pre-gen7).
 		 *	3. We already tried per-engine reset recently.
 		 */
-		bool full_reset = true;
 
 		if (!i915.enable_engine_reset) {
 			DRM_INFO("Engine reset disabled: Using full GPU reset.\n");
@@ -2792,6 +2794,7 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask,
 						i915.gpu_reset_promotion_time;
 
 					engine->hangcheck.last_engine_reset_time = now;
+
 				} else {
 					/*
 					 * Watchdog timeout always results
@@ -2840,6 +2843,8 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask,
 		i915_error_wake_up(dev_priv, false);
 	}
 
+	trace_i915_tdr_recovery_queued(dev, engine_mask, watchdog, full_reset);
+
 	/*
 	 * Gen 7:
 	 *
@@ -3264,6 +3269,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	for_each_ring(ring, dev_priv, i) {
 		u64 acthd;
 		u32 seqno;
+		u32 head;
 		bool busy = true;
 
 		semaphore_clear_deadlocks(dev_priv);
@@ -3344,7 +3350,10 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 
 		ring->hangcheck.seqno = seqno;
 		ring->hangcheck.acthd = acthd;
+		head = I915_READ_HEAD(ring);
 		busy_count += busy;
+
+		trace_i915_tdr_hang_check(ring, seqno, acthd, head, busy);
 	}
 
 	for_each_ring(ring, dev_priv, i) {
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index 52b2d40..5c15d43 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -810,6 +810,345 @@ TRACE_EVENT(switch_mm,
 		  __entry->dev, __entry->ring, __entry->to, __entry->vm)
 );
 
+/**
+ * DOC: i915_tdr_gpu_recovery
+ *
+ * This tracepoint tracks the onset of the full GPU recovery path
+ */
+TRACE_EVENT(i915_tdr_gpu_recovery,
+	TP_PROTO(struct drm_device *dev),
+
+	TP_ARGS(dev),
+
+	TP_STRUCT__entry(
+			__field(u32, dev)
+	),
+
+	TP_fast_assign(
+			__entry->dev = dev->primary->index;
+	),
+
+	TP_printk("dev=%u, full GPU recovery started",
+		  __entry->dev)
+);
+
+/**
+ * DOC: i915_tdr_engine_recovery
+ *
+ * This tracepoint tracks the onset of the engine recovery path
+ */
+TRACE_EVENT(i915_tdr_engine_recovery,
+	TP_PROTO(struct intel_engine_cs *engine),
+
+	TP_ARGS(engine),
+
+	TP_STRUCT__entry(
+			__field(struct intel_engine_cs *, engine)
+	),
+
+	TP_fast_assign(
+			__entry->engine = engine;
+	),
+
+	TP_printk("dev=%u, engine=%u, recovery of %s started",
+		  __entry->engine->dev->primary->index,
+		  __entry->engine->id,
+		  __entry->engine->name)
+);
+
+/**
+ * DOC: i915_tdr_recovery_start
+ *
+ * This tracepoint tracks hang recovery start
+ */
+TRACE_EVENT(i915_tdr_recovery_start,
+	TP_PROTO(struct drm_device *dev),
+
+	TP_ARGS(dev),
+
+	TP_STRUCT__entry(
+			__field(u32, dev)
+	),
+
+	TP_fast_assign(
+			__entry->dev = dev->primary->index;
+	),
+
+	TP_printk("dev=%u, hang recovery started",
+		  __entry->dev)
+);
+
+/**
+ * DOC: i915_tdr_recovery_complete
+ *
+ * This tracepoint tracks hang recovery completion
+ */
+TRACE_EVENT(i915_tdr_recovery_complete,
+	TP_PROTO(struct drm_device *dev),
+
+	TP_ARGS(dev),
+
+	TP_STRUCT__entry(
+			__field(u32, dev)
+	),
+
+	TP_fast_assign(
+			__entry->dev = dev->primary->index;
+	),
+
+	TP_printk("dev=%u, hang recovery completed",
+		  __entry->dev)
+);
+
+/**
+ * DOC: i915_tdr_recovery_queued
+ *
+ * This tracepoint tracks the point of queuing recovery from hang check.
+ * If engine recovery is requested the engine name will be displayed, otherwise
+ * it will be set to "none". If too many engine resets were attempted in the
+ * recent past we promote to a full GPU reset, which is marked by appending
+ * the "[PROMOTED]" flag.
+ */
+TRACE_EVENT(i915_tdr_recovery_queued,
+	TP_PROTO(struct drm_device *dev,
+		 u32 hung_engines,
+		 bool watchdog,
+		 bool full_reset),
+
+	TP_ARGS(dev, hung_engines, watchdog, full_reset),
+
+	TP_STRUCT__entry(
+			__field(u32, dev)
+			__field(u32, hung_engines)
+			__field(bool, watchdog)
+			__field(bool, full_reset)
+	),
+
+	TP_fast_assign(
+			__entry->dev = dev->primary->index;
+			__entry->hung_engines = hung_engines;
+			__entry->watchdog = watchdog;
+			__entry->full_reset = full_reset;
+	),
+
+	TP_printk("dev=%u, hung_engines=0x%02x%s%s%s%s%s%s%s, watchdog=%s, full_reset=%s",
+		  __entry->dev,
+		  __entry->hung_engines,
+		  __entry->hung_engines ? " (":"",
+		  __entry->hung_engines & RENDER_RING ? " [RCS] " : "",
+		  __entry->hung_engines & BSD_RING ? 	" [VCS] " : "",
+		  __entry->hung_engines & BLT_RING ? 	" [BCS] " : "",
+		  __entry->hung_engines & VEBOX_RING ? 	" [VECS] " : "",
+		  __entry->hung_engines & BSD2_RING ? 	" [VCS2] " : "",
+		  __entry->hung_engines ? ")":"",
+		  __entry->watchdog ? "true" : "false",
+		  __entry->full_reset ?
+			(__entry->hung_engines ? "true [PROMOTED]" : "true") :
+				"false")
+);
+
+/**
+ * DOC: i915_tdr_engine_save
+ *
+ * This tracepoint tracks the point of engine state save during the engine
+ * recovery path. Logs the head pointer position at point of hang, the position
+ * after recovering and whether or not we forced a head pointer advancement or
+ * rounded up to an aligned QWORD position.
+ */
+TRACE_EVENT(i915_tdr_engine_save,
+	TP_PROTO(struct intel_engine_cs *engine,
+		 u32 old_head,
+		 u32 new_head,
+		 bool forced_advance),
+
+	TP_ARGS(engine, old_head, new_head, forced_advance),
+
+	TP_STRUCT__entry(
+			__field(struct intel_engine_cs *, engine)
+			__field(u32, old_head)
+			__field(u32, new_head)
+			__field(bool, forced_advance)
+	),
+
+	TP_fast_assign(
+			__entry->engine = engine;
+			__entry->old_head = old_head;
+			__entry->new_head = new_head;
+			__entry->forced_advance = forced_advance;
+	),
+
+	TP_printk("dev=%u, engine=%s, old_head=%u, new_head=%u, forced_advance=%s",
+		  __entry->engine->dev->primary->index,
+		  __entry->engine->name,
+		  __entry->old_head,
+		  __entry->new_head,
+		  __entry->forced_advance ? "true" : "false")
+);
+
+/**
+ * DOC: i915_tdr_gpu_reset_complete
+ *
+ * This tracepoint tracks the point of full GPU reset completion
+ */
+TRACE_EVENT(i915_tdr_gpu_reset_complete,
+	TP_PROTO(struct drm_device *dev),
+
+	TP_ARGS(dev),
+
+	TP_STRUCT__entry(
+			__field(struct drm_device *, dev)
+	),
+
+	TP_fast_assign(
+			__entry->dev = dev;
+	),
+
+	TP_printk("dev=%u, resets=%u",
+		__entry->dev->primary->index,
+		i915_reset_count(&((struct drm_i915_private *)
+			(__entry->dev)->dev_private)->gpu_error) )
+);
+
+/**
+ * DOC: i915_tdr_engine_reset_complete
+ *
+ * This tracepoint tracks the point of engine reset completion
+ */
+TRACE_EVENT(i915_tdr_engine_reset_complete,
+	TP_PROTO(struct intel_engine_cs *engine),
+
+	TP_ARGS(engine),
+
+	TP_STRUCT__entry(
+			__field(struct intel_engine_cs *, engine)
+	),
+
+	TP_fast_assign(
+			__entry->engine = engine;
+	),
+
+	TP_printk("dev=%u, engine=%s, resets=%u",
+		  __entry->engine->dev->primary->index,
+		  __entry->engine->name,
+		  __entry->engine->hangcheck.reset_count)
+);
+
+/**
+ * DOC: i915_tdr_forced_csb_check
+ *
+ * This tracepoint tracks the occurrences of forced CSB checks
+ * that the driver does when detecting inconsistent context
+ * submission states between the driver state and the current
+ * hardware engine state.
+ */
+TRACE_EVENT(i915_tdr_forced_csb_check,
+	TP_PROTO(struct intel_engine_cs *engine,
+		 bool was_effective),
+
+	TP_ARGS(engine, was_effective),
+
+	TP_STRUCT__entry(
+			__field(struct intel_engine_cs *, engine)
+			__field(bool, was_effective)
+	),
+
+	TP_fast_assign(
+			__entry->engine = engine;
+			__entry->was_effective = was_effective;
+	),
+
+	TP_printk("dev=%u, engine=%s, was_effective=%s",
+		  __entry->engine->dev->primary->index,
+		  __entry->engine->name,
+		  __entry->was_effective ? "yes" : "no")
+);
+
+/**
+ * DOC: i915_tdr_hang_check
+ *
+ * This tracepoint tracks hang checks on each engine.
+ */
+TRACE_EVENT(i915_tdr_hang_check,
+	TP_PROTO(struct intel_engine_cs *engine,
+		 u32 seqno,
+		 u64 acthd,
+		 u32 hd,
+		 bool busy),
+
+	TP_ARGS(engine, seqno, acthd, hd, busy),
+
+	TP_STRUCT__entry(
+			__field(struct intel_engine_cs *, engine)
+			__field(u32, seqno)
+			__field(u64, acthd)
+			__field(u32, hd)
+			__field(bool, busy)
+	),
+
+	TP_fast_assign(
+			__entry->engine = engine;
+			__entry->seqno = seqno;
+			__entry->acthd = acthd;
+			__entry->hd = hd;
+			__entry->busy = busy;
+	),
+
+	TP_printk("dev=%u, engine=%s, seqno=%u (%d), last seqno=%u (%d), head=%u (%d), acthd=%lu, score=%d%s, action=%u [%s], busy=%s",
+		  __entry->engine->dev->primary->index,
+		  __entry->engine->name,
+		  __entry->seqno,
+		  __entry->seqno,
+		  __entry->engine->hangcheck.seqno,
+		  __entry->engine->hangcheck.seqno,
+		  __entry->hd,
+		  __entry->hd,
+		  (long unsigned int) __entry->acthd,
+		  __entry->engine->hangcheck.score,
+		  (__entry->engine->hangcheck.score >= HANGCHECK_SCORE_RING_HUNG) ? " [HUNG]" : "",
+		  (unsigned int) __entry->engine->hangcheck.action,
+		  hangcheck_action_to_str(__entry->engine->hangcheck.action),
+		  __entry->busy ? "yes" : "no")
+);
+
+/**
+ * DOC: i915_tdr_inconsistency
+ *
+ * This tracepoint tracks detected inconsistencies
+ */
+TRACE_EVENT(i915_tdr_inconsistency,
+	TP_PROTO(struct intel_engine_cs *engine,
+		 u32 hw_context,
+		 bool hw_active,
+		 u32 sw_context,
+		 struct drm_i915_gem_request *req),
+
+	TP_ARGS(engine, hw_context, hw_active, sw_context, req),
+
+	TP_STRUCT__entry(
+			__field(struct intel_engine_cs *, engine)
+			__field(u32, hw_context)
+			__field(bool, hw_active)
+			__field(u32, sw_context)
+			__field(int, elsp_submitted)
+	),
+
+	TP_fast_assign(
+			__entry->engine = engine;
+			__entry->hw_context = hw_context;
+			__entry->hw_active = hw_active;
+			__entry->sw_context = sw_context;
+			__entry->elsp_submitted = req ? req->elsp_submitted : 0;
+	),
+
+	TP_printk("dev=%u, engine=%s, hw_context=%x, hw_active=%s, sw_context=%x, elsp_submitted=%d",
+		  __entry->engine->dev->primary->index,
+		  __entry->engine->name,
+		  __entry->hw_context,
+		  __entry->hw_active ? "true" : "false",
+		  __entry->sw_context,
+		  __entry->elsp_submitted)
+);
+
 #endif /* _I915_TRACE_H_ */
 
 /* This part must be outside protection */
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 913fdbb..85107a1 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2322,7 +2322,7 @@ gen8_ring_save(struct intel_engine_cs *ring, struct drm_i915_gem_request *req,
 	struct intel_context *ctx;
 	int ret = 0;
 	int clamp_to_tail = 0;
-	uint32_t head;
+	uint32_t head, old_head;
 	uint32_t tail;
 	uint32_t head_addr;
 	uint32_t tail_addr;
@@ -2337,7 +2337,7 @@ gen8_ring_save(struct intel_engine_cs *ring, struct drm_i915_gem_request *req,
 	 * Read head from MMIO register since it contains the
 	 * most up to date value of head at this point.
 	 */
-	head = I915_READ_HEAD(ring);
+	old_head = head = I915_READ_HEAD(ring);
 
 	/*
 	 * Read tail from the context because the execlist queue
@@ -2394,6 +2394,9 @@ gen8_ring_save(struct intel_engine_cs *ring, struct drm_i915_gem_request *req,
 	head |= (head_addr & HEAD_ADDR);
 	ring->saved_head = head;
 
+	trace_i915_tdr_engine_save(ring, old_head,
+		head, force_advance);
+
 	return 0;
 }
 
@@ -3304,6 +3307,19 @@ intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
 			CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED;
 	}
 
+	/*
+	 * This may or may not be a sustained inconsistency. Most of the time
+	 * it's only a matter of a transitory inconsistency during context
+	 * submission/completion but if we happen to detect a sustained
+	 * inconsistency then it helps to have more information.
+	 */
+	if (status == CONTEXT_SUBMISSION_STATUS_INCONSISTENT)
+		trace_i915_tdr_inconsistency(ring,
+					     hw_context,
+					     hw_active,
+					     sw_context,
+					     tmpreq);
+
 	if (req)
 		*req = tmpreq;
 
@@ -3368,6 +3384,7 @@ bool intel_execlists_TDR_force_CSB_check(struct drm_i915_private *dev_priv,
 
 	spin_unlock_irqrestore(&engine->execlist_lock, flags);
 
+	trace_i915_tdr_forced_csb_check(engine, !!was_effective);
 	wake_up_all(&engine->irq_queue);
 
 	return !!was_effective;
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index f20548c..1c527cd 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1515,6 +1515,8 @@ static int gen6_do_reset(struct drm_device *dev)
 	/* Spin waiting for the device to ack the reset request */
 	ret = wait_for((__raw_i915_read32(dev_priv, GEN6_GDRST) & GEN6_GRDOM_FULL) == 0, 500);
 
+	trace_i915_tdr_gpu_reset_complete(dev);
+
 	intel_uncore_forcewake_reset(dev, true);
 
 	return ret;
@@ -1680,6 +1682,8 @@ static int do_engine_reset_nolock(struct intel_engine_cs *engine)
 		break;
 	}
 
+	trace_i915_tdr_engine_reset_complete(engine);
+
 	return ret;
 }
 
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 15/20] drm/i915: Port of Added scheduler support to __wait_request() calls
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (13 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 14/20] drm/i915: TDR/watchdog trace points Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 17:28 ` [PATCH 16/20] drm/i915: Fix __i915_wait_request() behaviour during hang detection Arun Siluvery
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

This is a partial port of the following patch from John Harrison's GPU
scheduler patch series: (patch sent to Intel-GFX with the subject line
"[Intel-gfx] [RFC 19/39] drm/i915: Added scheduler support to __wait_request()
calls" on Fri 17 July 2015)

	Author: John Harrison <John.C.Harrison@Intel.com>
	Date:   Thu Apr 10 10:48:55 2014 +0100
	Subject: drm/i915: Added scheduler support to __wait_request() calls

Removed all scheduler references and backported it to this baseline. We need
this because Chris Wilson has pointed out that threads that don't hold the
struct_mutex should not be thrown out of __i915_wait_request during TDR hang
recovery. Therefore we need a way to determine which threads are holding the
mutex and which are not.
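
To illustrate the resulting convention (a sketch only; the call shapes below
match the hunks in this patch):

	/* Caller holds struct_mutex: pass true so that hang recovery can
	 * evict the waiter when it needs the lock. */
	ret = __i915_wait_request(req, reset_counter,
				  interruptible, NULL, NULL, true);

	/* Caller does not hold struct_mutex: pass false so the waiter may
	 * keep sleeping across a reset. */
	ret = __i915_wait_request(req, reset_counter,
				  true, NULL, rps, false);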

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: John Harrison <john.c.harrison@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  3 ++-
 drivers/gpu/drm/i915/i915_gem.c         | 16 ++++++++++------
 drivers/gpu/drm/i915/intel_display.c    |  5 +++--
 drivers/gpu/drm/i915/intel_ringbuffer.c |  2 +-
 4 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a65722d..f1c56b3 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3102,7 +3102,8 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 			unsigned reset_counter,
 			bool interruptible,
 			s64 *timeout,
-			struct intel_rps_client *rps);
+			struct intel_rps_client *rps,
+			bool is_locked);
 int __must_check i915_wait_request(struct drm_i915_gem_request *req);
 int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
 int __must_check
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index e6eb45d..7122315 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1257,6 +1257,8 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
  * @reset_counter: reset sequence associated with the given request
  * @interruptible: do an interruptible wait (normally yes)
  * @timeout: in - how long to wait (NULL forever); out - how much time remaining
+ * @rps: RPS client to charge for frequency boosts while waiting (may be NULL)
+ * @is_locked: true = Caller is holding struct_mutex
  *
  * Note: It is of utmost importance that the passed in seqno and reset_counter
  * values have been read by the caller in an smp safe manner. Where read-side
@@ -1272,7 +1274,8 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 			unsigned reset_counter,
 			bool interruptible,
 			s64 *timeout,
-			struct intel_rps_client *rps)
+			struct intel_rps_client *rps,
+			bool is_locked)
 {
 	struct intel_engine_cs *ring = i915_gem_request_get_ring(req);
 	struct drm_device *dev = ring->dev;
@@ -1514,7 +1517,7 @@ i915_wait_request(struct drm_i915_gem_request *req)
 
 	ret = __i915_wait_request(req,
 				  atomic_read(&dev_priv->gpu_error.reset_counter),
-				  interruptible, NULL, NULL);
+				  interruptible, NULL, NULL, true);
 	if (ret)
 		return ret;
 
@@ -1627,7 +1630,7 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
 	mutex_unlock(&dev->struct_mutex);
 	for (i = 0; ret == 0 && i < n; i++)
 		ret = __i915_wait_request(requests[i], reset_counter, true,
-					  NULL, rps);
+					  NULL, rps, false);
 	mutex_lock(&dev->struct_mutex);
 
 	for (i = 0; i < n; i++) {
@@ -3160,7 +3163,7 @@ i915_gem_wait_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 		if (ret == 0)
 			ret = __i915_wait_request(req[i], reset_counter, true,
 						  args->timeout_ns > 0 ? &args->timeout_ns : NULL,
-						  to_rps_client(file));
+						  to_rps_client(file), false);
 		i915_gem_request_unreference__unlocked(req[i]);
 	}
 	return ret;
@@ -3193,7 +3196,8 @@ __i915_gem_object_sync(struct drm_i915_gem_object *obj,
 					  atomic_read(&i915->gpu_error.reset_counter),
 					  i915->mm.interruptible,
 					  NULL,
-					  &i915->rps.semaphores);
+					  &i915->rps.semaphores,
+					  true); /* Is the mutex always held by this thread at this point? */
 		if (ret)
 			return ret;
 
@@ -4172,7 +4176,7 @@ i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
 	if (target == NULL)
 		return 0;
 
-	ret = __i915_wait_request(target, reset_counter, true, NULL, NULL);
+	ret = __i915_wait_request(target, reset_counter, true, NULL, NULL, false);
 	if (ret == 0)
 		queue_delayed_work(dev_priv->wq, &dev_priv->mm.retire_work, 0);
 
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index abfb5ba..74e97d5 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -11468,7 +11468,8 @@ static void intel_mmio_flip_work_func(struct work_struct *work)
 		WARN_ON(__i915_wait_request(mmio_flip->req,
 					    mmio_flip->crtc->reset_counter,
 					    false, NULL,
-					    &mmio_flip->i915->rps.mmioflips));
+					    &mmio_flip->i915->rps.mmioflips,
+					    false));
 		i915_gem_request_unreference__unlocked(mmio_flip->req);
 	}
 
@@ -13523,7 +13524,7 @@ static int intel_atomic_prepare_commit(struct drm_device *dev,
 
 			ret = __i915_wait_request(intel_plane_state->wait_req,
 						  reset_counter, true,
-						  NULL, NULL);
+						  NULL, NULL, false);
 
 			/* Swallow -EIO errors to allow updates during hw lockup. */
 			if (ret == -EIO)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index f959326..ec1b85f 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -2372,7 +2372,7 @@ int intel_ring_idle(struct intel_engine_cs *ring)
 	return __i915_wait_request(req,
 				   atomic_read(&to_i915(ring->dev)->gpu_error.reset_counter),
 				   to_i915(ring->dev)->mm.interruptible,
-				   NULL, NULL);
+				   NULL, NULL, true); /* Is the mutex always held by this thread at this point? */
 }
 
 int intel_ring_alloc_request_extras(struct drm_i915_gem_request *request)
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 16/20] drm/i915: Fix __i915_wait_request() behaviour during hang detection.
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (14 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 15/20] drm/i915: Port of Added scheduler support to __wait_request() calls Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 17:28 ` [PATCH 17/20] drm/i915: Extended error state with TDR count, watchdog count and engine reset count Arun Siluvery
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

Use the is_locked parameter in __i915_wait_request() to determine whether a
thread should be forced to back off and retry or whether it can continue
sleeping. Don't return -EIO from __i915_wait_request() since that is bad for
the upper layers; only return -EAGAIN to signify a reset in progress (unless
the driver is terminally wedged, in which case there is no mode of recovery
left to attempt and -EIO is returned).
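
A hedged sketch of what a locked caller is then expected to do on -EAGAIN
(the mutex cycling mirrors the comment added below; no new helpers are
introduced here):

	ret = __i915_wait_request(req, reset_counter,
				  interruptible, NULL, NULL, true);
	if (ret == -EAGAIN) {
		/* TDR needs struct_mutex: drop it, reacquire it (this
		 * fails while any reset is in progress) and then retry
		 * the wait. */
		mutex_unlock(&dev->struct_mutex);
		ret = i915_mutex_lock_interruptible(dev);
		/* ... on success, call __i915_wait_request() again ... */
	}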

Also, use is_locked in trace_i915_gem_request_wait_begin() trace point for more
accurate reflection of current thread's lock state.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c   | 88 ++++++++++++++++++++++++++++++++-------
 drivers/gpu/drm/i915/i915_trace.h | 15 ++-----
 2 files changed, 78 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 7122315..80df5b5 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1287,7 +1287,6 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 	unsigned long timeout_expire;
 	s64 before, now;
 	int ret;
-	int reset_in_progress = 0;
 
 	WARN(!intel_irqs_enabled(dev_priv), "IRQs disabled");
 
@@ -1312,7 +1311,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 		gen6_rps_boost(dev_priv, rps, req->emitted_jiffies);
 
 	/* Record current time in case interrupted by signal, or wedged */
-	trace_i915_gem_request_wait_begin(req);
+	trace_i915_gem_request_wait_begin(req, is_locked);
 	before = ktime_get_raw_ns();
 
 	/* Optimistic spin for the next jiffie before touching IRQs */
@@ -1327,23 +1326,84 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 
 	for (;;) {
 		struct timer_list timer;
+		bool full_gpu_reset_completed_unlocked = false;
+		bool reset_in_progress_locked = false;
 
 		prepare_to_wait(&ring->irq_queue, &wait, state);
 
-		/* We need to check whether any gpu reset happened in between
-		 * the caller grabbing the seqno and now ... */
-		reset_in_progress =
-			i915_gem_check_wedge(ring->dev->dev_private, NULL, interruptible);
+		/*
+		 * Rules for waiting with/without struct_mutex held during
+		 * asynchronous TDR hang detection/recovery:
+		 *
+		 * ("reset in progress" = TDR has detected a hang, hang
+		 * recovery may or may not have commenced)
+		 *
+		 * 1. Is the driver terminally wedged? If so, return -EIO since
+		 *    there is no point in having the caller retry - the driver
+		 *    is irrecoverably stuck.
+		 *
+		 * 2. Is this thread holding the struct_mutex and is any of the
+		 *    following true?
+		 *
+		 *	a) Is any kind of reset in progress?
+		 *	b) Has a full GPU reset happened while this thread was
+		 *	   sleeping?
+		 *
+		 *    If so:
+		 *	Return -EAGAIN. The caller should interpret this as:
+		 *	Release struct_mutex, try to acquire struct_mutex
+		 *	(through i915_mutex_lock_interruptible(), which will
+		 *	fail as long as any reset is in progress) and retry the
+		 *	call to __i915_wait_request(), which hopefully will go
+		 *	better as soon as the hang has been resolved.
+		 *
+		 * 3. Is this thread not holding the struct_mutex and has a
+		 *    full GPU reset completed? (that is, the reset count has
+		 *    changed but there is currently no full GPU reset in
+		 *    progress?)
+		 *
+		 *    If so:
+		 *	Return 0. Since the request has been purged there are no
+		 *	requests left to wait for. Just go home.
+		 *
+		 * 4. Is this thread not holding the struct_mutex and is any
+		 *    kind of reset in progress?
+		 *
+		 *    If so:
+		 *	This thread may keep on waiting.
+		 */
 
-		if ((reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) ||
-		     reset_in_progress) {
+		if (i915_terminally_wedged(&dev_priv->gpu_error)) {
+			ret = -EIO;
+			break;
+		}
 
-			/* ... but upgrade the -EAGAIN to an -EIO if the gpu
-			 * is truely gone. */
-			if (reset_in_progress)
-				ret = reset_in_progress;
-			else
-				ret = -EAGAIN;
+		reset_in_progress_locked =
+			(((i915_gem_check_wedge(ring->dev->dev_private, NULL, interruptible)) ||
+			  (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter))) &&
+			  is_locked);
+
+		if (reset_in_progress_locked) {
+			/*
+			 * If the caller is holding the struct_mutex throw them
+			 * out since TDR needs access to it.
+			 */
+			ret = -EAGAIN;
+			break;
+		}
+
+		full_gpu_reset_completed_unlocked =
+			(((reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) &&
+			(!i915_reset_in_progress(&dev_priv->gpu_error))));
+
+		if (full_gpu_reset_completed_unlocked) {
+			/*
+			 * Full GPU reset without holding the struct_mutex has
+			 * completed - just return. If recovery is still in
+			 * progress the thread will keep on sleeping until
+			 * recovery is complete.
+			 */
+			ret = 0;
 			break;
 		}
 
diff --git a/drivers/gpu/drm/i915/i915_trace.h b/drivers/gpu/drm/i915/i915_trace.h
index 5c15d43..7dcac93 100644
--- a/drivers/gpu/drm/i915/i915_trace.h
+++ b/drivers/gpu/drm/i915/i915_trace.h
@@ -591,8 +591,8 @@ DEFINE_EVENT(i915_gem_request, i915_gem_request_complete,
 );
 
 TRACE_EVENT(i915_gem_request_wait_begin,
-	    TP_PROTO(struct drm_i915_gem_request *req),
-	    TP_ARGS(req),
+	    TP_PROTO(struct drm_i915_gem_request *req, bool blocking),
+	    TP_ARGS(req, blocking),
 
 	    TP_STRUCT__entry(
 			     __field(u32, dev)
@@ -601,25 +601,18 @@ TRACE_EVENT(i915_gem_request_wait_begin,
 			     __field(bool, blocking)
 			     ),
 
-	    /* NB: the blocking information is racy since mutex_is_locked
-	     * doesn't check that the current thread holds the lock. The only
-	     * other option would be to pass the boolean information of whether
-	     * or not the class was blocking down through the stack which is
-	     * less desirable.
-	     */
 	    TP_fast_assign(
 			   struct intel_engine_cs *ring =
 						i915_gem_request_get_ring(req);
 			   __entry->dev = ring->dev->primary->index;
 			   __entry->ring = ring->id;
 			   __entry->seqno = i915_gem_request_get_seqno(req);
-			   __entry->blocking =
-				     mutex_is_locked(&ring->dev->struct_mutex);
+			   __entry->blocking = blocking;
 			   ),
 
 	    TP_printk("dev=%u, ring=%u, seqno=%u, blocking=%s",
 		      __entry->dev, __entry->ring,
-		      __entry->seqno, __entry->blocking ?  "yes (NB)" : "no")
+		      __entry->seqno, __entry->blocking ?  "yes" : "no")
 );
 
 DEFINE_EVENT(i915_gem_request, i915_gem_request_wait_end,
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 17/20] drm/i915: Extended error state with TDR count, watchdog count and engine reset count
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (15 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 16/20] drm/i915: Fix __i915_wait_request() behaviour during hang detection Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 17:28 ` [PATCH 18/20] drm/i915: TDR / per-engine hang recovery kernel docs Arun Siluvery
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

These new TDR-specific metrics have previously been added to
i915_hangcheck_info() in debugfs. During design review Chris Wilson asked for
these metrics to be added to the error state as well.
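
With these hooked up, the per-ring section of the error state dump gains
three lines of the form (values illustrative):

  TDR count: 1
  Watchdog count: 0
  Engine reset count: 2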

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h       | 3 +++
 drivers/gpu/drm/i915/i915_gpu_error.c | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index f1c56b3..22f361c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -520,6 +520,9 @@ struct drm_i915_error_state {
 		int hangcheck_score;
 		enum intel_ring_hangcheck_action hangcheck_action;
 		int num_requests;
+		int hangcheck_tdr_count;
+		int hangcheck_watchdog_count;
+		int hangcheck_reset_count;
 
 		/* our own tracking of ring head and tail */
 		u32 cpu_ring_head;
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index a5dcc7b..89ea16a 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -304,6 +304,9 @@ static void i915_ring_error_state(struct drm_i915_error_state_buf *m,
 	err_printf(m, "  hangcheck: %s [%d]\n",
 		   hangcheck_action_to_str(ring->hangcheck_action),
 		   ring->hangcheck_score);
+	err_printf(m, "  TDR count: %d\n", ring->hangcheck_tdr_count);
+	err_printf(m, "  Watchdog count: %d\n", ring->hangcheck_watchdog_count);
+	err_printf(m, "  Engine reset count: %d\n", ring->hangcheck_reset_count);
 }
 
 void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...)
@@ -940,6 +943,9 @@ static void i915_record_ring_state(struct drm_device *dev,
 
 	ering->hangcheck_score = ring->hangcheck.score;
 	ering->hangcheck_action = ring->hangcheck.action;
+	ering->hangcheck_tdr_count = ring->hangcheck.tdr_count;
+	ering->hangcheck_watchdog_count = ring->hangcheck.watchdog_count;
+	ering->hangcheck_reset_count = ring->hangcheck.reset_count;
 
 	if (USES_PPGTT(dev)) {
 		int i;
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 18/20] drm/i915: TDR / per-engine hang recovery kernel docs
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (16 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 17/20] drm/i915: Extended error state with TDR count, watchdog count and engine reset count Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 17:28 ` [PATCH 19/20] drm/i915: drm/i915 changes to simulated hangs Arun Siluvery
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 Documentation/DocBook/gpu.tmpl  | 476 ++++++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_irq.c |   8 +-
 2 files changed, 483 insertions(+), 1 deletion(-)

diff --git a/Documentation/DocBook/gpu.tmpl b/Documentation/DocBook/gpu.tmpl
index 351e801..b765bcc 100644
--- a/Documentation/DocBook/gpu.tmpl
+++ b/Documentation/DocBook/gpu.tmpl
@@ -3411,6 +3411,482 @@ int num_ioctls;</synopsis>
       </sect2>
     </sect1>
 
+    <!--
+	TODO:
+	Create sections with no subsection listing. Deal with subsection
+	references as they appear in the text.
+
+	How do we create inline references to functions with DocBook tags?
+	If you want to refer to a function it would be nice if you could tag
+	the function name so that the reader can click the tag while reading
+	and get straight to the DocBook page on that function.
+    -->
+
+    <sect1>
+    <title>GPU hang management</title>
+    <para>
+    There are two sides to handling GPU hangs: <link linkend='detection'
+    endterm="detection.title"/> and <link linkend='recovery'
+    endterm="recovery.title"/>. In this section we will discuss how the driver
+    detects hangs and what it can do to recover from them.
+    </para>
+
+    <sect2 id='detection'>
+    <title id='detection.title'>Detection</title>
+    <para>
+    There is no theoretically sound definition of what a GPU hang actually is,
+    only assumptions based on empirical observations. One such observation is
+    that if a batch buffer takes more than a certain amount of time to finish
+    then we would assume that it's hung. However, one problem with that
+    assumption is that the execution might be ongoing inside the batch buffer.
+    In fact, it's easy to determine whether or not execution is progressing
+    within a batch buffer. Taking that into account, we could create a more
+    refined hang detection algorithm. Unfortunately, there is then the
+    complication that the execution might be stuck in a never-ending loop which
+    keeps execution busy for an unbounded amount of time. These are all
+    practical problems that we need to deal with when detecting a hang and
+    whatever hang detection algorithm we come up with will have a certain
+    probability of false positives.
+    </para>
+    <para>
+    The i915 driver currently supports two forms of hang
+    detection:
+    <orderedlist>
+    <listitem>
+    <link linkend='periodic_hang_checking' endterm="periodic_hang_checking.title"/>
+    </listitem>
+    <listitem>
+    <link linkend='watchdog' endterm="watchdog.title"/>
+    </listitem>
+    </orderedlist>
+    </para>
+
+    <sect3 id='periodic_hang_checking'>
+    <title id='periodic_hang_checking.title'>Periodic Hang Checking</title>
+    <para>
+    The periodic hang checker is a work queue that keeps running in the
+    background as long as there is work outstanding that is pending execution.
+    i915_hangcheck_elapsed() implements the work function of the queue and is
+    executed at every hang checker invocation.
+    </para>
+
+    <para>
+    While being scheduled the hang checker keeps track of a hang score for each
+    individual engine. The hang score is an indication of what degree of
+    severity a hang has reached for a certain engine. The higher the score gets
+    the more radical forms of intervention are employed to force execution to
+    resume.
+    </para>
+
+    <para>
+    The hang checker is scheduled from two places:
+    </para>
+
+    <orderedlist>
+    <listitem>
+    __i915_add_request(), after a new request has been added and is pending submission.
+    </listitem>
+
+    <listitem>
+    i915_hangcheck_elapsed() itself, if work is still pending for any GPU
+    engine the hang checker is rescheduled.
+    </listitem>
+
+    </orderedlist>
+    <para>
+    The periodic hang checker keeps track of the sequence number
+    progression of the currently executing requests on every GPU engine. If
+    they keep progressing in between every hang checker invocation this is
+    interpreted as the engine being active, the hang score is cleared and
+    no intervention is made. If the sequence number has stalled for one or more
+    engines in between two hang checks that is an indication of one of two
+    things:
+    </para>
+
+    <orderedlist>
+    <listitem>
+    There is no more work pending on the given engine. If there are no
+    threads waiting for request completion this is an indication that no more
+    hang checking is necessary and the hang checker is not rescheduled. If
+    there is someone waiting for request completion the hang checker is
+    rescheduled and the hang score is continually incremented.
+    </listitem>
+
+    <listitem>
+    <para>The given engine is truly hung. In this case a number of hardware
+    state checks are made to determine the most suitable course of action,
+    and the hang score is incremented accordingly to reflect the
+    current hang severity.</para>
+    </listitem>
+    </orderedlist>
+
+    <para>
+    If the hang score of any engine reaches the hung threshold hang recovery is
+    scheduled by calling i915_handle_error() with an engine flag mask containing
+    the bits representing all currently hung engines.
+    </para>
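+
+    <para>
+    Schematically (an illustrative sketch rather than the literal hang
+    checker source; the call below uses this series' i915_handle_error()
+    signature with watchdog=false and wedged=true):
+    </para>
+
+    <programlisting>
+u32 hung = 0;
+
+for_each_ring(ring, dev_priv, i)
+	if (ring->hangcheck.score >= HANGCHECK_SCORE_RING_HUNG)
+		hung |= intel_ring_flag(ring);
+
+if (hung)
+	i915_handle_error(dev, hung, false, true, "Ring hung");
+    </programlisting>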
+
+
+    <sect4>
+    <title>Context Submission State Consistency Checking</title>
+
+    <para>
+    On top of this there is the context submission status consistency pre-check
+    in the hang checker that keeps track of driver/HW consistency. The
+    underlying problem that this pre-check is trying to solve is the fact that
+    on some occasions the driver does not receive the proper context event
+    interrupt upon context state changes. Specifically, this has been observed
+    following the completion of engine reset and the subsequent resubmission of
+    the fixed-up context. At this point the engine hang is unblocked, the
+    context completes and the hardware marks the context as complete in the
+    context status buffer (CSB) for the given engine. However, the interrupt
+    that would normally signal this to the driver is lost. What this means to
+    the driver is that it gets stuck waiting for context completion on the
+    given engine until reboot, stalling all further submissions to the engine
+    ELSP.
+    </para>
+
+    <para>
+    The way to detect this is to check for inconsistencies between the context
+    submission state in the hardware as well as in the driver. What this means
+    is that the EXECLIST_STATUS register has to be checked for every engine.
+    From this register the ID of the currently running context can be extracted
+    as well as information about whether or not the engine is idle. This
+    information can then be compared against the current state of the execlist
+    queue for the given engine. If the hardware is idle but the driver has
+    pending contexts in the execlist queue for a prolonged period of time then
+    it's safe to assume that the driver/HW state is inconsistent.
+    </para>
+
+    <para>
+    The way driver/HW state inconsistencies are rectified is by faking the
+    presumably lost context event interrupts simply by calling the execlist
+    interrupt handler manually.
+    </para>
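+
+    <para>
+    In this series that manual kick is what
+    intel_execlists_TDR_force_CSB_check() implements. A hedged sketch of the
+    hang checker side (the inconsistency counter is illustrative; the
+    threshold is the one referenced below):
+    </para>
+
+    <programlisting>
+/* Inconsistent for long enough? Fake the lost context event interrupt. */
+if (++inconsistency_count > I915_FAKED_CONTEXT_IRQ_THRESHOLD)
+	intel_execlists_TDR_force_CSB_check(dev_priv, engine);
+    </programlisting>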
+
+    <para>
+    What this means to the periodic hang checker is the following:
+    </para>
+
+    <orderedlist>
+    <listitem>
+    <para>
+    State consistency checking happens at the start of the hang check
+    procedure. If an inconsistency has been detected enough times (more
+    detections than the threshold level of I915_FAKED_CONTEXT_IRQ_THRESHOLD)
+    the hang checker will fake a context event interrupt. If there are
+    outstanding, unprocessed context events in the CSB buffer these will be
+    acted upon.
+    </para>
+    </listitem>
+
+    <listitem>
+    <para>
+    As long as the driver/HW state has been determined to be inconsistent the
+    error handler will not be called. The reason for this is that the engine
+    recovery mode, which is the hang recovery mode that the driver prefers, is
+    not effective if context submission does not work. If the driver/HW state
+    is inconsistent it might mean that the hardware is currently executing (and
+    might be hung in) a completely different context than the driver expects,
+    which would lead to unexpected pre-emptions, so trying to resubmit the
+    context that the driver has identified as hung might make the situation
+    worse. Therefore, before any recovery is scheduled the driver/HW state must
+    be confirmed as consistent and stable.
+    </para>
+    </listitem>
+
+    <listitem>
+    <para>
+    If any inconsistent driver/HW states persist regardless of any attempts to
+    rectify the situation there is a final fall-back: In case the hang score on
+    any engine reaches twice that of the normal hang threshold the error
+    handler is called with no engine mask populated, meaning that a full GPU
+    reset is forced. Going for a full GPU reset in this case makes sense since
+    there are two problems that need fixing: 1) <emphasis role="bold">The GPU
+    is hung</emphasis> and 2) <emphasis role="bold">The driver/HW state is
+    inconsistent</emphasis>. The full GPU reset solves both of these problems
+    and does not require the driver/HW state to be consistent to begin with so
+    its a sensible choice in this situation.
+    </para>
+    </listitem>
+
+    </orderedlist>
+
+    </sect4>
+    </sect3>
+
+    <sect3 id='watchdog'>
+    <title id='watchdog.title'>Watchdog Timeout</title>
+    <para>
+    Unlike the <link linkend='periodic_hang_checking'>periodic hang
+    checker</link> Watchdog Timeout is a mode of hang detection that relies on
+    the GPU hardware to notify the driver in the event of a hang. Another
+    dissimilarity is that this mode does not target every engine at all times
+    but rather targets individual batch buffers that have been selected by the
+    submitting application. A submitter opts in to Watchdog Timeout for a
+    particular batch buffer by setting the Watchdog Timeout enablement flag for
+    that batch buffer. By doing so the driver will emit instructions in the
+    ring buffer before the batch buffer start instruction to enable the
+    Watchdog HW timer and afterwards to cancel the same timer. The purpose of
+    this is to keep track of how long the execution stays inside the batch
+    buffer once the execution reaches that point. If the execution takes too
+    long to clear the batch buffer and the preset Watchdog Timer Threshold
+    elapses the GPU hardware will fire a Watchdog Timeout interrupt to the
+    driver, which is interpreted as the current batch buffer for the given
+    engine being hung. Thus, hang detection in this
+    case is purely interrupt-driven and the driver is free to do other things.
+    </para>
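+
+    <para>
+    Conceptually the ring buffer emission brackets the batch as follows (a
+    sketch only; the actual watchdog control writes are engine-specific):
+    </para>
+
+    <programlisting>
+MI_LOAD_REGISTER_IMM(watchdog control register, start timer)
+MI_BATCH_BUFFER_START(batch buffer)
+MI_LOAD_REGISTER_IMM(watchdog control register, cancel timer)
+    </programlisting>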
+
+    <para>
+    Once the GT interrupt handler receives the Watchdog Timeout interrupt it
+    then proceeds by making a direct call to i915_handle_error() with
+    information about which engine is hung and by setting the dedicated
+    watchdog priority flag that allows the error handler to circumvent the
+    normal hang promotion logic that applies to hang detections originating
+    from the periodic hang checker.
+    </para>
+
+    <para>
+    In order to enable this Watchdog Timeout for a particular batch buffer
+    userland libDRM has to enable the corresponding bit contained in
+    I915_EXEC_ENABLE_WATCHDOG in the batch buffer flag bitmask. This feature is
+    disabled by default and therefore it operates purely on an opt-in basis
+    from userland's point of view.
+    </para>
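+
+    <para>
+    A minimal sketch of the opt-in from userland (only the
+    I915_EXEC_ENABLE_WATCHDOG bit is defined by this series; everything else
+    is the usual execbuffer setup):
+    </para>
+
+    <programlisting>
+struct drm_i915_gem_execbuffer2 execbuf;
+
+memset(&amp;execbuf, 0, sizeof(execbuf));
+/* ... buffers_ptr, buffer_count, batch_len etc. as usual ... */
+execbuf.flags = I915_EXEC_RENDER | I915_EXEC_ENABLE_WATCHDOG;
+drmIoctl(fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &amp;execbuf);
+    </programlisting>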
+
+    </sect3>
+
+    </sect2>
+
+    <sect2 id='recovery'>
+    <title id='recovery.title'>Recovery</title>
+    <para>
+    Once a hang has been detected, either through periodic hang checking or
+    Watchdog Timeout, the error handler (i915_handle_error) takes over and
+    decices what to do from there on. Generally speaking there are two modes of
+    hang recovery that the error handler can choose from:
+
+    <orderedlist>
+    <listitem>
+    <link linkend='engine_reset' endterm="engine_reset.title"/>
+    </listitem>
+    <listitem>
+    <link linkend='GPU_reset' endterm="GPU_reset.title"/>
+    </listitem>
+    </orderedlist>
+
+    Exactly which recovery mode the hang is promoted to depends on a number of
+    factors (summarised in the sketch after this list):
+    </para>
+
+    <literallayout></literallayout>
+    <itemizedlist>
+    <listitem>
+    <para>
+    <emphasis role="bold">
+    Did the caller say that a hang had been detected but did not specifically ask for engine reset?
+    </emphasis>
+    If the wedged parameter is set in the call to i915_handle_error() but the
+    engine_mask parameter is set to 0 it means that we need to do some kind of
+    hang recovery but no engine is specified. In that case the outcome will
+    always be an attempt to do a GPU reset.
+    </para>
+    </listitem>
+
+    <listitem>
+    <literallayout></literallayout>
+    <para>
+    <emphasis role="bold">
+    Did the caller say that a hang had been detected and specify at least one hung engine?
+    </emphasis>
+    If one or more engines have been specified as hung the first attempt will
+    always be to do an engine reset of those hung engines. There are two
+    reasons why a GPU reset would be carried out instead of a simple engine
+    reset:
+    </para>
+    <orderedlist>
+
+    <listitem>
+    <para>
+    An engine reset was carried out on the same engine too recently. What
+    constitutes "too recent" is determined by the i915 module parameter
+    gpu_reset_promotion_time. If two engine resets were attempted within the
+    time window defined by this module parameter it is decided that the
+    previous engine reset was ineffective and therefore there is no point in
+    trying another one. Thus, a full GPU reset will be done instead.
+    </para>
+    </listitem>
+
+    <listitem>
+    <para>
+    An engine reset was carried out but failed. In this case the hang recovery
+    path (i915_error_work_func) would go straight from the failed engine reset
+    attempt (i915_reset_engine call) to a full GPU reset without delay.
+    </para>
+    </listitem>
+
+    </orderedlist>
+    </listitem>
+
+    <listitem>
+    <literallayout></literallayout>
+    <literallayout></literallayout>
+    <para>
+    <emphasis role="bold">
+    Did the Watchdog Timeout detect the hang?
+    </emphasis>
+    In case of the Watchdog Timeout calling the error handler the dedicated
+    watchdog parameter will be set and this forces the error handler to only
+    consider engine reset and not full GPU reset. We will only promote to full
+    GPU reset if the driver itself, based on its own hang detection mechanism,
+    has detected a persisting hang that will not be resolved by an engine reset.
+    Watchdog Timeout is user-controlled and is therefore not trusted the same
+    way.
+    </para>
+    <literallayout></literallayout>
+    </listitem>
+    </itemizedlist>
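+
+    <para>
+    The promotion logic therefore boils down to something like this (an
+    illustrative sketch; engine_reset_too_recent() stands in for the
+    gpu_reset_promotion_time check described above):
+    </para>
+
+    <programlisting>
+if (wedged) {
+	if (!engine_mask)
+		full_reset = true;      /* no engine identified */
+	else if (!i915.enable_engine_reset)
+		full_reset = true;      /* engine reset disabled */
+	else if (!watchdog &amp;&amp; engine_reset_too_recent(engine))
+		full_reset = true;      /* promote: engine reset ineffective */
+	else
+		full_reset = false;     /* try per-engine reset first */
+}
+    </programlisting>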
+
+    <para>
+    When the error handler reaches a decision of what hang recovery mode to use
+    it sets up the corresponding reset in progress flag. There is one main
+    reset in progress flag for GPU resets as well as one dedicated reset in
+    progress flag in each hangcheck struct for each engine. After that the
+    error handler schedules the actual hang recovery work queue, which ends up
+    in i915_error_work_func, which is the function that grabs all necessary
+    locks and actually calls the main hang recovery functions. For all engines
+    that have their respective reset in progress flags set the <link
+    linkend='engine_reset' endterm="engine_reset.title">engine reset
+    path</link> is taken for each engine in sequence. If the GPU reset in
+    progress flag is set no attempts at carrying out engine resets are made and
+    instead the legacy <link linkend='GPU_reset' endterm="GPU_reset.title">full
+    GPU reset path</link> is taken.
+    </para>
+
+    <sect3 id='engine_reset'>
+    <title id='engine_reset.title'>Engine Reset</title>
+    <para>
+    The engine reset path is implemented in i915_reset_engine and the following
+    is a summary of how that function operates:
+
+    <orderedlist>
+    <listitem>
+    <para>
+    Get the currently running context and check context submission status
+    consistency. The hang checker does a consistency check before scheduling
+    hang recovery, so if the currently running (hung) context is inconsistent
+    at this point the state must have changed since hang recovery was
+    scheduled, in which case the engine is not truly hung. If so, exit early.
+    </para>
+    </listitem>
+
+    <listitem>
+    <para>
+    Force engine to idle and save the current context image. On gen8+ this is
+    done by setting the reset request bit in the reset control register. On
+    gen7 and earlier gens the MI_MODE register in combination with the ring
+    control register has to be used to disable the engine.
+    </para>
+    </listitem>
+
+    <listitem>
+    <para>
+    Save the head MMIO register value and nudge it past the batch buffer start
+    instruction of the currently hung batch buffer, to the next valid
+    instruction in the ring buffer.
+    </para>
+    </listitem>
+
+    <listitem>
+    <para>
+    Reset engine.
+    </para>
+    </listitem>
+
+    <listitem>
+    <para>
+    Call the init() function for the previously hung engine, which should
+    reapply HW workarounds and carry out other essential state
+    reinitialization.
+    </para>
+    </listitem>
+
+    <listitem>
+    <para>
+    Write the previously nudged head register value to both MMIO and context registers.
+    </para>
+    </listitem>
+
+    <listitem>
+    <para>
+    Submit updated context to ELSP in order to force execution to resume (gen8 only).
+    </para>
+    </listitem>
+
+    <listitem>
+    <para>
+    Clear reset in progress engine flag and wake up all threads waiting for requests to complete.
+    </para>
+    </listitem>
+    </orderedlist>
+
+    <literallayout></literallayout>
+
+    </para>
+
+    <para>
+    The intended outcome of an engine reset is that the hung batch buffer is
+    dropped by forcing the execution to resume following the batch buffer start
+    instruction in the ring buffer. This should only affect the hung engine and
+    none other. No reinitialization aside from a subset of the state for the
+    hung engine should happen and pending work should be retained requiring no
+    further resubmissions.
+    </para>
+    </sect3>
+
+    <sect3 id='GPU_reset'>
+    <title id='GPU_reset.title'>GPU reset</title>
+    <para>
+    Basically the GPU reset function, i915_reset, does three things:
+    <literallayout></literallayout>
+
+    <orderedlist>
+    <listitem>
+    <para>
+    Reset GEM.
+    </para>
+    </listitem>
+
+    <listitem>
+    <para>
+    Do the actual GPU reset.
+    </para>
+    </listitem>
+
+    <listitem>
+    <para>
+    Reinitialize the GEM part of the driver, including purging all pending
+    work, reinitializing the engines and ring setup, and more.
+    </para>
+    </listitem>
+    </orderedlist>
+
+    <literallayout></literallayout>
+
+    The intended outcome of a GPU reset is that all work, including the hung
+    batch buffer as well as all batch buffers following it, is dropped and the
+    GEM part of the driver is reinitialized following the GPU reset. This means
+    that the driver goes to an idle state together with the hardware and should
+    start over from a state in which it is ready to accept more work and move
+    forwards from there. All pending work will have to be resubmitted by the
+    submitting application.
+
+    </para>
+    </sect3>
+
+    </sect2>
+
+    </sect1>
+
     <sect1>
       <title> Tracing </title>
       <para>
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index acca5d8..1618fef 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2730,7 +2730,9 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
  *			or if one of the current engine resets fails we fall
  *			back to legacy full GPU reset.
  * @watchdog: 		true = Engine hang detected by hardware watchdog.
+ *
  * @wedged: 		true = Hang detected, invoke hang recovery.
+ *
  * @fmt, ...: 		Error message describing reason for error.
  *
  * Do some basic checking of register state at error time and
@@ -3227,7 +3229,11 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd)
 	return HANGCHECK_HUNG;
 }
 
-/*
+/**
+ * i915_hangcheck_elapsed - hang checker work function
+ *
+ * @work: Work item containing reference to private DRM struct.
+ *
  * This is called when the chip hasn't reported back with completed
  * batchbuffers in a long time. We keep track per ring seqno progress and
  * if there are no progress, hangcheck score for that ring is increased.
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 19/20] drm/i915: drm/i915 changes to simulated hangs
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (17 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 18/20] drm/i915: TDR / per-engine hang recovery kernel docs Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-13 17:28 ` [PATCH 20/20] drm/i915: Enable TDR / per-engine hang recovery Arun Siluvery
  2016-01-14  8:30 ` ✗ failure: Fi.CI.BAT Patchwork
  20 siblings, 0 replies; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

From: Tim Gore <tim.gore@intel.com>

Simulated hangs, as used by drv_hangman and some other IGT tests, are not
handled correctly with the new per-engine hang recovery mode. This patch fixes
several issues to get them working in the execlist case.

1) The "simulated" hang is effected by not submitting a particular batch buffer
   to the hardware. In this way it is not handled by the hardware and hence
   remains in the software queue, leading the TDR mechanism to declare a hang.
   The place where the submission of the batch was being blocked was in
   intel_logical_ring_advance_and_submit. Because this means the request never
   enters the execlist_queue, the TDR mechanism does not detect a hang and the
   situation is never cleared. Also, blocking the batch buffer here is before
   the intel_ctx_submit_request object gets allocated. During TDR we need to
   actually complete the submission process to unhang the ring, but we are not
   in user context so cannot allocate the request object.  To overcome both
   these issues I moved the place where submission is blocked to
   execlists_context_unqueue. This means that the request enters the
   ring->execlist_queue, so the TDR mechanism detects the hang and can resubmit
   the request after the stop_rings bit is cleared.

2) A further problem arises from a workaround in i915_hangcheck_sample to deal
   with a context submission status of "...INCONSISTENT" being reported by
   intel_execlists_TDR_get_current_request() when the hardware is idle.  A
   simulated hang, because it causes the sw and hw context IDs to be out of
   sync, results in a context submission status of "...INCONSISTENT" being
   reported, triggering this workaround which resubmits the batch and clears
   the hang, avoiding a ring reset. But we want the ring reset to occur, since
   this is part of what we are testing. So, I have made
   intel_execlists_TDR_get_current_request() aware of simulated hangs, so that
   it returns a status of OK in this case.  This avoids the workaround being
   triggered, leading to the TDR mechanism declaring a ring hang and doing a
   ring reset.
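
For reference, both hunks below key off intel_ring_stopped(), which at this
point in the tree is roughly the following test against the simulated-hang
mask (paraphrased from intel_ringbuffer.h, not part of this patch):

static inline bool intel_ring_stopped(struct intel_engine_cs *ring)
{
	struct drm_i915_private *dev_priv = ring->dev->dev_private;

	/* one stop_rings bit per engine marks a simulated hang */
	return dev_priv->gpu_error.stop_rings & intel_ring_flag(ring);
}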

Issue: VIZ-5488
Signed-off-by: Tim Gore <tim.gore@intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c  | 10 ++++++++++
 drivers/gpu/drm/i915/intel_lrc.c | 26 +++++++++++++++++++++-----
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 73976f9..fd51c26 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1167,6 +1167,16 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 		 }
 	}
 
+	/* Clear any simulated hang flags */
+	if (dev_priv->gpu_error.stop_rings) {
+		DRM_INFO("Simulated gpu hang, reset stop_rings bits %08x\n",
+			(0x1 << engine->id));
+		dev_priv->gpu_error.stop_rings &= ~(0x1 << engine->id);
+		/* if all hangs are cleared, then clear the ALLOW_BAN/ERROR bits */
+		if ((dev_priv->gpu_error.stop_rings & ((1 << I915_NUM_RINGS) - 1)) == 0)
+			dev_priv->gpu_error.stop_rings = 0;
+	}
+
 	/* Sample the current ring head position */
 	head = I915_READ_HEAD(engine) & HEAD_ADDR;
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 85107a1..b565d78 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -653,6 +653,16 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring, bool tdr_res
 		}
 	}
 
+	/* Check for a simulated hang request */
+	if (intel_ring_stopped(ring)) {
+		/*
+		 * Mark the request at the head of the queue as submitted but
+		 * dont actually submit it.
+		 */
+		req0->elsp_submitted++;
+		return;
+	}
+
 	WARN_ON(req1 && req1->elsp_submitted && !tdr_resubmission);
 
 	execlists_submit_requests(req0, req1, tdr_resubmission);
@@ -1045,16 +1055,12 @@ static int logical_ring_wait_for_space(struct drm_i915_gem_request *req,
 static void
 intel_logical_ring_advance_and_submit(struct drm_i915_gem_request *request)
 {
-	struct intel_engine_cs *ring = request->ring;
 	struct drm_i915_private *dev_priv = request->i915;
 
 	intel_logical_ring_advance(request->ringbuf);
 
 	request->tail = request->ringbuf->tail;
 
-	if (intel_ring_stopped(ring))
-		return;
-
 	if (dev_priv->guc.execbuf_client)
 		i915_guc_submit(dev_priv->guc.execbuf_client, request);
 	else
@@ -3290,7 +3296,17 @@ intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
 	}
 
 	if (tmpctx) {
-		status = ((hw_context == sw_context) && hw_active) ?
+		/*
+		 * Check for a simulated hang. In this case the head entry in the
+		 * sw execlist queue will not have been submitted to the ELSP, so
+		 * the hw and sw context IDs may well disagree, but we still want
+		 * to proceed with hang recovery. So we return OK which allows
+		 * the TDR recovery mechanism to proceed with a ring reset.
+		 */
+		if (intel_ring_stopped(ring))
+			status = CONTEXT_SUBMISSION_STATUS_OK;
+		else
+			status = ((hw_context == sw_context) && hw_active) ?
 				CONTEXT_SUBMISSION_STATUS_OK :
 				CONTEXT_SUBMISSION_STATUS_INCONSISTENT;
 	} else {
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 20/20] drm/i915: Enable TDR / per-engine hang recovery
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (18 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 19/20] drm/i915: drm/i915 changes to simulated hangs Arun Siluvery
@ 2016-01-13 17:28 ` Arun Siluvery
  2016-01-14  8:30 ` ✗ failure: Fi.CI.BAT Patchwork
  20 siblings, 0 replies; 31+ messages in thread
From: Arun Siluvery @ 2016-01-13 17:28 UTC (permalink / raw)
  To: intel-gfx; +Cc: Tomas Elf

From: Tomas Elf <tomas.elf@intel.com>

This is the final enablement patch for per-engine hang recovery. It sets up
per-engine hang recovery to be used per default in favour of full GPU reset.
Legacy full GPU reset will no longer be the preferred mode of hang recovery and
will only be used as a fall-back in case of frequent hangs on individual
engines or in the case of engine hang recovery failures.
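
Note that this only flips the default: the mode can still be disabled
through the module parameter added earlier in the series, presumably by
booting with:

	i915.enable_engine_reset=0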

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 5cf9c11..c098a5a 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -37,7 +37,7 @@ struct i915_params i915 __read_mostly = {
 	.enable_fbc = -1,
 	.enable_execlists = -1,
 	.enable_hangcheck = true,
-	.enable_engine_reset = false,
+	.enable_engine_reset = true,
 	.gpu_reset_promotion_time = 10,
 	.enable_ppgtt = -1,
 	.enable_psr = 0,
@@ -123,7 +123,7 @@ MODULE_PARM_DESC(enable_engine_reset,
 	"Enable GPU engine hang recovery mode. Used as a soft, low-impact form "
 	"of hang recovery that targets individual GPU engines rather than the "
 	"entire GPU"
-	"(default: false)");
+	"(default: true)");
 
 module_param_named(gpu_reset_promotion_time,
                i915.gpu_reset_promotion_time, int, 0644);
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 04/20] drm/i915: TDR / per-engine hang detection
  2016-01-13 17:28 ` [PATCH 04/20] drm/i915: TDR / per-engine hang detection Arun Siluvery
@ 2016-01-13 20:37   ` Chris Wilson
  0 siblings, 0 replies; 31+ messages in thread
From: Chris Wilson @ 2016-01-13 20:37 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx, Tomas Elf

On Wed, Jan 13, 2016 at 05:28:16PM +0000, Arun Siluvery wrote:
> From: Tomas Elf <tomas.elf@intel.com>
> 
> With the per-engine hang recovery path already in place this patch adds
> per-engine hang detection by letting the periodic hang checker detect hangs on
> individual engines and communicate this to the error handler. During hang
> checking every engine is checked and the hang detection status for each engine
> is aggregated into a single 32-bit engine flag mask that contains all the
> engine flags (1 << ring->id) of all the hung engines or'ed together. The
> per-engine path in the error handler then sets up the hangcheck state for each
> individual hung engine based on the engine flag mask before potentially calling
> the per-engine hang recovery path.
> 
> This allows the hang detection to happen in lock-step for all engines in
> parallel and lets the driver process all hung engines in turn in the error
> handler.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c |  2 +-
>  drivers/gpu/drm/i915/i915_drv.h     |  4 ++--
>  drivers/gpu/drm/i915/i915_irq.c     | 41 +++++++++++++++++++++++--------------
>  3 files changed, 29 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index e3377ab..6d1b6c3 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -4720,7 +4720,7 @@ i915_wedged_set(void *data, u64 val)
>  
>  	intel_runtime_pm_get(dev_priv);
>  
> -	i915_handle_error(dev, val,
> +	i915_handle_error(dev, 0x0, val,
>  			  "Manually setting wedged to %llu", val);
>  
>  	intel_runtime_pm_put(dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index e866f14..85cf692 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2765,8 +2765,8 @@ static inline void i915_hangcheck_reinit(struct intel_engine_cs *engine)
>  
>  /* i915_irq.c */
>  void i915_queue_hangcheck(struct drm_device *dev);
> -__printf(3, 4)
> -void i915_handle_error(struct drm_device *dev, bool wedged,
> +__printf(4, 5)
> +void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
>  		       const char *fmt, ...);
>  
>  extern void intel_irq_init(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 6a0ec37..fef74cf 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2712,15 +2712,29 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
>  
>  /**
>   * i915_handle_error - handle a gpu error
> - * @dev: drm device
> + * @dev: 		drm device
> + * @engine_mask: 	Bit mask containing the engine flags of all engines
> + *			associated with one or more detected errors.
> + *			May be 0x0.
> + *			If wedged is set to true this implies that one or more
> + *			engine hangs were detected. In this case we will
> + *			attempt to reset all engines that have been detected
> + *			as hung.
> + *			If a previous engine reset was attempted too recently
> + *			or if one of the current engine resets fails we fall
> + *			back to legacy full GPU reset.
> + * @wedged: 		true = Hang detected, invoke hang recovery.

These two look to be tautological. If wedged == 0, we expect engine_mask
to be zero as well, since we will not be resetting the engines, just
capturing error state. It's not, though. Conversely, if wedged == 1, we
expect engine_mask to be set. Again, it's not, and I think there the
caller is wrong.
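
One possible shape that avoids carrying both (illustrative only, not a
concrete proposal):

/* engine_mask != 0 implies "hang detected, attempt recovery";
 * engine_mask == 0 means only capture error state. */
void i915_handle_error(struct drm_device *dev, u32 engine_mask,
		       const char *fmt, ...);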

> + * @fmt, ...: 		Error message describing reason for error.
>   *
>   * Do some basic checking of register state at error time and
>   * dump it to the syslog.  Also call i915_capture_error_state() to make
>   * sure we get a record and make it available in debugfs.  Fire a uevent
>   * so userspace knows something bad happened (should trigger collection
> - * of a ring dump etc.).
> + * of a ring dump etc.). If a hang was detected (wedged = true) try to
> + * reset the associated engine. Failing that, try to fall back to legacy
> + * full GPU reset recovery mode.
>   */
> -void i915_handle_error(struct drm_device *dev, bool wedged,
> +void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
>  		       const char *fmt, ...)
>  {
>  	struct drm_i915_private *dev_priv = dev->dev_private;
> @@ -2729,12 +2743,6 @@ void i915_handle_error(struct drm_device *dev, bool wedged,
>  
>  	struct intel_engine_cs *engine;
>  
> -	/*
> -	 * NB: Placeholder until the hang checker supports
> -	 * per-engine hang detection.
> -	 */
> -	u32 engine_mask = 0;
> -
>  	va_start(args, fmt);
>  	vscnprintf(error_msg, sizeof(error_msg), fmt, args);
>  	va_end(args);
> @@ -3162,7 +3170,7 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd)
>  	 */
>  	tmp = I915_READ_CTL(ring);
>  	if (tmp & RING_WAIT) {
> -		i915_handle_error(dev, false,
> +		i915_handle_error(dev, intel_ring_flag(ring), false,
>  				  "Kicking stuck wait on %s",
>  				  ring->name);
>  		I915_WRITE_CTL(ring, tmp);
> @@ -3174,7 +3182,7 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd)
>  		default:
>  			return HANGCHECK_HUNG;
>  		case 1:
> -			i915_handle_error(dev, false,
> +			i915_handle_error(dev, intel_ring_flag(ring), false,
>  					  "Kicking stuck semaphore on %s",
>  					  ring->name);
>  			I915_WRITE_CTL(ring, tmp);
> @@ -3203,7 +3211,8 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  	struct drm_device *dev = dev_priv->dev;
>  	struct intel_engine_cs *ring;
>  	int i;
> -	int busy_count = 0, rings_hung = 0;
> +	u32 engine_mask = 0;
> +	int busy_count = 0;
>  	bool stuck[I915_NUM_RINGS] = { 0 };
>  #define BUSY 1
>  #define KICK 5
> @@ -3316,12 +3325,14 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>  			DRM_INFO("%s on %s\n",
>  				 stuck[i] ? "stuck" : "no progress",
>  				 ring->name);
> -			rings_hung++;
> +
> +			engine_mask |= intel_ring_flag(ring);
> +			ring->hangcheck.tdr_count++;

tdr_count was introduced unused in the previous patch and here it has
nothing to do with tdr, but just hangcheck pure and simple. It would
actually be a useful statistic for hangcheck/error-state...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 05/20] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress
  2016-01-13 17:28 ` [PATCH 05/20] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress Arun Siluvery
@ 2016-01-13 20:49   ` Chris Wilson
  0 siblings, 0 replies; 31+ messages in thread
From: Chris Wilson @ 2016-01-13 20:49 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx, Ian Lister, Tomas Elf

On Wed, Jan 13, 2016 at 05:28:17PM +0000, Arun Siluvery wrote:
> From: Tomas Elf <tomas.elf@intel.com>
> 
> i915_gem_wedge now returns a non-zero result in three different cases:
> 
> 1. Legacy: A hang has been detected and full GPU reset is in progress.
> 
> 2. Per-engine recovery:
> 
> 	a. A single engine reference can be passed to the function, in which
> 	case only that engine will be checked. If that particular engine is
> 	detected to be hung and is to be reset this will yield a non-zero
> 	result but not if reset is in progress for any other engine.
> 
> 	b. No engine reference is passed to the function, in which case all
> 	engines are checked for ongoing per-engine hang recovery.
> 
> Also, i915_wait_request was updated to take advantage of this new
> functionality. This is important since the TDR hang recovery mechanism needs a
> way to force waiting threads that hold the struct_mutex to give up the
> struct_mutex and try again after the hang recovery has completed. If
> i915_wait_request does not take per-engine hang recovery into account there is
> no way for a waiting thread to know that a per-engine recovery is about to
> happen and that it needs to back off.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@intel.com>
> Signed-off-by: Ian Lister <ian.lister@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         |  3 +-
>  drivers/gpu/drm/i915/i915_gem.c         | 60 +++++++++++++++++++++++++++------
>  drivers/gpu/drm/i915/intel_lrc.c        |  4 +--
>  drivers/gpu/drm/i915/intel_ringbuffer.c |  4 ++-
>  4 files changed, 56 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 85cf692..5be7d3e 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -3033,7 +3033,8 @@ i915_gem_find_active_request(struct intel_engine_cs *ring);
>  
>  bool i915_gem_retire_requests(struct drm_device *dev);
>  void i915_gem_retire_requests_ring(struct intel_engine_cs *ring);
> -int __must_check i915_gem_check_wedge(struct i915_gpu_error *error,
> +int __must_check i915_gem_check_wedge(struct drm_i915_private *dev_priv,
> +				      struct intel_engine_cs *engine,
>  				      bool interruptible);
>  
>  static inline bool i915_reset_in_progress(struct i915_gpu_error *error)
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index e3cfed2..e6eb45d 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -80,12 +80,38 @@ static void i915_gem_info_remove_obj(struct drm_i915_private *dev_priv,
>  	spin_unlock(&dev_priv->mm.object_stat_lock);
>  }
>  
> +static inline int
> +i915_engine_reset_in_progress(struct drm_i915_private *dev_priv,
> +	struct intel_engine_cs *engine)
> +{
> +	int ret = 0;
> +
> +	if (engine) {
> +		ret = !!(atomic_read(&dev_priv->ring[engine->id].hangcheck.flags)
> +			& I915_ENGINE_RESET_IN_PROGRESS);
> +	} else {
> +		int i;
> +
> +		for (i = 0; i < I915_NUM_RINGS; i++)
> +			if (atomic_read(&dev_priv->ring[i].hangcheck.flags)
> +				& I915_ENGINE_RESET_IN_PROGRESS) {
> +
> +				ret = 1;
> +				break;
> +			}
> +	}

Since this side will be called far more often than the writer, could you
not make this more convenient for the reader and move it to a global set
of flags in dev_priv->gpu_error?

To avoid regressing on the EIO front, the waiter sequence should look
like

if (req->reset_counter != i915_reset_counter(&req->i915->gpu_error))
	return 0;

if (flags & LOCKED && i915_engine_reset_in_process(&req->i915->gpu_error, req->engine))
	return -EAGAIN;

Oh, and don't add a second boolean to __i915_wait_request, just
transform the first into flags.
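
Something along these lines, say (flag names invented here):

#define I915_WAIT_INTERRUPTIBLE (1 << 0)
#define I915_WAIT_LOCKED	(1 << 1)

int __i915_wait_request(struct drm_i915_gem_request *req,
			unsigned int flags /* , ... remaining args */);
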
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 06/20] drm/i915: Reinstate hang recovery work queue.
  2016-01-13 17:28 ` [PATCH 06/20] drm/i915: Reinstate hang recovery work queue Arun Siluvery
@ 2016-01-13 21:01   ` Chris Wilson
  0 siblings, 0 replies; 31+ messages in thread
From: Chris Wilson @ 2016-01-13 21:01 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx, Tomas Elf, Mika Kuoppala

On Wed, Jan 13, 2016 at 05:28:18PM +0000, Arun Siluvery wrote:
> From: Tomas Elf <tomas.elf@intel.com>
> 
> There used to be a work queue separating the error handler from the hang
> recovery path, which was removed a while back in this commit:
> 
> 	commit b8d24a06568368076ebd5a858a011699a97bfa42
> 	Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> 	Date:   Wed Jan 28 17:03:14 2015 +0200
> 
> 	    drm/i915: Remove nested work in gpu error handling
> 
> Now we need to revert most of that commit since the work queue separating hang
> detection from hang recovery is needed in preparation for the upcoming watchdog
> timeout feature. The watchdog interrupt service routine will be a second
> callsite of the error handler alongside the periodic hang checker, which runs
> in a work queue context. Seeing as the error handler will be serving a caller
> in a hard interrupt execution context that means that the error handler must
> never end up in a situation where it needs to grab the struct_mutex.
> Unfortunately, that is exactly what we need to do first at the start of the
> hang recovery path, which might potentially sleep if the struct_mutex is
> already held by another thread. Not good when you're in a hard interrupt
> context.

We also would not dream of running i915_handle_error() from inside an
interrupt handler anyway as the capture is too heavy...
 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_dma.c |  1 +
>  drivers/gpu/drm/i915/i915_drv.h |  1 +
>  drivers/gpu/drm/i915/i915_irq.c | 31 ++++++++++++++++++++++++-------
>  3 files changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index c45ec353..67003c2 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -1203,6 +1203,7 @@ int i915_driver_unload(struct drm_device *dev)
>  	/* Free error state after interrupts are fully disabled. */
>  	cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work);
>  	i915_destroy_error_state(dev);
> +	cancel_work_sync(&dev_priv->gpu_error.work);

This should be before the destroy as we could be in the process of
resetting state but after the cancel(hangcheck_work), as that may queue
the error_work.
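
i.e. the teardown order would become, roughly:

/* stop the hang checker first, since it may queue gpu_error.work */
cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work);
/* then flush any in-flight recovery work... */
cancel_work_sync(&dev_priv->gpu_error.work);
/* ...before destroying the error state it may reference */
i915_destroy_error_state(dev);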

> @@ -2827,7 +2830,21 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
>  		i915_error_wake_up(dev_priv, false);
>  	}
>  
> -	i915_reset_and_wakeup(dev);
> +	/*
> +	 * Gen 7:
> +	 *

Gen 7? A little misleading
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 07/20] drm/i915: Watchdog timeout: Hang detection integration into error handler
  2016-01-13 17:28 ` [PATCH 07/20] drm/i915: Watchdog timeout: Hang detection integration into error handler Arun Siluvery
@ 2016-01-13 21:13   ` Chris Wilson
  0 siblings, 0 replies; 31+ messages in thread
From: Chris Wilson @ 2016-01-13 21:13 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx, Ian Lister, Tomas Elf

On Wed, Jan 13, 2016 at 05:28:19PM +0000, Arun Siluvery wrote:
>  /* i915_irq.c */
>  void i915_queue_hangcheck(struct drm_device *dev);
> -__printf(4, 5)
> -void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
> -		       const char *fmt, ...);
> +__printf(5, 6)
> +void i915_handle_error(struct drm_device *dev, u32 engine_mask,
> +		       bool watchdog, bool wedged, const char *fmt, ...);
>  
>  extern void intel_irq_init(struct drm_i915_private *dev_priv);
>  int intel_irq_install(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 8937c82..0710724 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2726,6 +2726,7 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
>   *			If a previous engine reset was attempted too recently
>   *			or if one of the current engine resets fails we fall
>   *			back to legacy full GPU reset.
> + * @watchdog: 		true = Engine hang detected by hardware watchdog.
>   * @wedged: 		true = Hang detected, invoke hang recovery.

A bitmask and 2 booleans? Whilst this isn't going to be the most widely
used of functions, those parameters are just inviting trouble.
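
One way to fold them together, purely as a sketch (flag names invented
here):

#define I915_ERROR_WATCHDOG	(1 << 0)
#define I915_ERROR_WEDGED	(1 << 1)

void i915_handle_error(struct drm_device *dev, u32 engine_mask,
		       u32 error_flags, const char *fmt, ...);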

>   * @fmt, ...: 		Error message describing reason for error.
>   *
> @@ -2737,8 +2738,8 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
>   * reset the associated engine. Failing that, try to fall back to legacy
>   * full GPU reset recovery mode.
>   */
> -void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
> -		       const char *fmt, ...)
> +void i915_handle_error(struct drm_device *dev, u32 engine_mask,
> +                       bool watchdog, bool wedged, const char *fmt, ...)
>  {
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  	va_list args;
> @@ -2776,20 +2777,27 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
>  			u32 i;
>  
>  			for_each_ring(engine, dev_priv, i) {
> -				u32 now, last_engine_reset_timediff;

Oops skipped a patch, I'll be back.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 03/20] drm/i915: TDR / per-engine hang recovery support for gen8.
  2016-01-13 17:28 ` [PATCH 03/20] drm/i915: TDR / per-engine hang recovery support for gen8 Arun Siluvery
@ 2016-01-13 21:16   ` Chris Wilson
  2016-01-13 21:21   ` Chris Wilson
  2016-01-29 14:16   ` Mika Kuoppala
  2 siblings, 0 replies; 31+ messages in thread
From: Chris Wilson @ 2016-01-13 21:16 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx, Ian Lister, Tomas Elf

On Wed, Jan 13, 2016 at 05:28:15PM +0000, Arun Siluvery wrote:
> @@ -596,6 +598,16 @@ static int i915_drm_suspend(struct drm_device *dev)

> +	atomic_clear_mask(I915_RESET_IN_PROGRESS_FLAG,
> +		&dev_priv->gpu_error.reset_counter);

This could be its own little patch as we could apply it today.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 03/20] drm/i915: TDR / per-engine hang recovery support for gen8.
  2016-01-13 17:28 ` [PATCH 03/20] drm/i915: TDR / per-engine hang recovery support for gen8 Arun Siluvery
  2016-01-13 21:16   ` Chris Wilson
@ 2016-01-13 21:21   ` Chris Wilson
  2016-01-29 14:16   ` Mika Kuoppala
  2 siblings, 0 replies; 31+ messages in thread
From: Chris Wilson @ 2016-01-13 21:21 UTC (permalink / raw)
  To: Arun Siluvery; +Cc: intel-gfx, Ian Lister, Tomas Elf

On Wed, Jan 13, 2016 at 05:28:15PM +0000, Arun Siluvery wrote:
> diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
> index 8d90c25..5cf9c11 100644
> --- a/drivers/gpu/drm/i915/i915_params.c
> +++ b/drivers/gpu/drm/i915/i915_params.c
> @@ -37,6 +37,8 @@ struct i915_params i915 __read_mostly = {
>  	.enable_fbc = -1,
>  	.enable_execlists = -1,
>  	.enable_hangcheck = true,
> +	.enable_engine_reset = false,

Can we combine this with the existing i915.reset?

reset == 1 => full reset only
reset == 2 => per-engine reset, with fallback to full reset

Looks like we can. Remember this is a user parameter and they like to
mess around, fewer options means less maintenance, less hassle for us.
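
i.e. roughly (assuming 0 keeps meaning "no resets at all"):

switch (i915.reset) {
case 0: /* reset disabled; only capture error state */
	break;
case 1: /* legacy full GPU reset only */
	break;
case 2: /* per-engine reset first, fall back to full reset */
	break;
}
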
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* ✗ failure: Fi.CI.BAT
  2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
                   ` (19 preceding siblings ...)
  2016-01-13 17:28 ` [PATCH 20/20] drm/i915: Enable TDR / per-engine hang recovery Arun Siluvery
@ 2016-01-14  8:30 ` Patchwork
  20 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2016-01-14  8:30 UTC (permalink / raw)
  To: Tomas Elf; +Cc: intel-gfx

== Summary ==

HEAD is now at 058740f drm-intel-nightly: 2016y-01m-13d-17h-07m-44s UTC integration manifest
Applying: drm/i915: Make i915_gem_reset_ring_status() public
Applying: drm/i915: Generalise common GPU engine reset request/unrequest code
Applying: drm/i915: TDR / per-engine hang recovery support for gen8.
Using index info to reconstruct a base tree...
M	drivers/gpu/drm/i915/i915_dma.c
M	drivers/gpu/drm/i915/i915_irq.c
M	drivers/gpu/drm/i915/intel_lrc.c
Falling back to patching base and 3-way merge...
Auto-merging drivers/gpu/drm/i915/intel_lrc.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_lrc.c
Auto-merging drivers/gpu/drm/i915/i915_irq.c
Auto-merging drivers/gpu/drm/i915/i915_dma.c
Patch failed at 0003 drm/i915: TDR / per-engine hang recovery support for gen8.

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 02/20] drm/i915: Generalise common GPU engine reset request/unrequest code
  2016-01-13 17:28 ` [PATCH 02/20] drm/i915: Generalise common GPU engine reset request/unrequest code Arun Siluvery
@ 2016-01-22 11:24   ` Mika Kuoppala
  0 siblings, 0 replies; 31+ messages in thread
From: Mika Kuoppala @ 2016-01-22 11:24 UTC (permalink / raw)
  To: Arun Siluvery, intel-gfx; +Cc: Tomas Elf

Arun Siluvery <arun.siluvery@linux.intel.com> writes:

> From: Tomas Elf <tomas.elf@intel.com>
>
> GPU engine reset handshaking is something that is applicable to both full GPU
> reset and engine reset, which is something that is part of the upcoming TDR
> per-engine hang recovery patches. Break out the common engine reset
> request/unrequest code (originally written by Mika Kuoppala) for reuse later in
> the TDR enablement patch series.
>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/intel_uncore.c | 46 ++++++++++++++++++++++++++-----------
>  1 file changed, 32 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index c3c13dc..2df4246 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1529,32 +1529,50 @@ static int wait_for_register(struct drm_i915_private *dev_priv,
>  	return wait_for((I915_READ(reg) & mask) == value, timeout_ms);
>  }
>  
> +static inline int gen8_request_engine_reset(struct intel_engine_cs *engine)
> +{

Inline is superfluous here. Please remove.

> +	struct drm_i915_private *dev_priv = engine->dev->dev_private;
> +	int ret = 0;

No need to set ret to zero.

> +
> +	I915_WRITE(RING_RESET_CTL(engine->mmio_base),
> +		   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
> +

Indentation seems to be a bit off here...

> +	ret = wait_for_register(dev_priv,
> +			      RING_RESET_CTL(engine->mmio_base),
> +			      RESET_CTL_READY_TO_RESET,
> +			      RESET_CTL_READY_TO_RESET,
> +			      700);

and here 

> +	if (ret)
> +		DRM_ERROR("%s: reset request timeout\n", engine->name);
> +
> +	return ret;
> +}
> +
> +static inline int gen8_unrequest_engine_reset(struct intel_engine_cs
> *engine)

Remove inline, and do not return a value if there
is no use for it.

> +{
> +	struct drm_i915_private *dev_priv = engine->dev->dev_private;
> +
> +	I915_WRITE(RING_RESET_CTL(engine->mmio_base),
> +		_MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
> +
indent.
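
Taken together (no inline, no pre-initialised ret, a void unrequest and
aligned continuation lines), the two helpers would end up roughly as:

static int gen8_request_engine_reset(struct intel_engine_cs *engine)
{
	struct drm_i915_private *dev_priv = engine->dev->dev_private;
	int ret;

	I915_WRITE(RING_RESET_CTL(engine->mmio_base),
		   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));

	ret = wait_for_register(dev_priv,
				RING_RESET_CTL(engine->mmio_base),
				RESET_CTL_READY_TO_RESET,
				RESET_CTL_READY_TO_RESET,
				700);
	if (ret)
		DRM_ERROR("%s: reset request timeout\n", engine->name);

	return ret;
}

static void gen8_unrequest_engine_reset(struct intel_engine_cs *engine)
{
	struct drm_i915_private *dev_priv = engine->dev->dev_private;

	I915_WRITE(RING_RESET_CTL(engine->mmio_base),
		   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
}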

With these done, you can add my r-b.

-Mika

> +	return 0;
> +}
> +
>  static int gen8_do_reset(struct drm_device *dev)
>  {
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  	struct intel_engine_cs *engine;
>  	int i;
>  
> -	for_each_ring(engine, dev_priv, i) {
> -		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
> -			   _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
> -
> -		if (wait_for_register(dev_priv,
> -				      RING_RESET_CTL(engine->mmio_base),
> -				      RESET_CTL_READY_TO_RESET,
> -				      RESET_CTL_READY_TO_RESET,
> -				      700)) {
> -			DRM_ERROR("%s: reset request timeout\n", engine->name);
> +	for_each_ring(engine, dev_priv, i)
> +		if (gen8_request_engine_reset(engine))
>  			goto not_ready;
> -		}
> -	}
>  
>  	return gen6_do_reset(dev);
>  
>  not_ready:
>  	for_each_ring(engine, dev_priv, i)
> -		I915_WRITE(RING_RESET_CTL(engine->mmio_base),
> -			   _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
> +		gen8_unrequest_engine_reset(engine);
>  
>  	return -EIO;
>  }
> -- 
> 1.9.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 03/20] drm/i915: TDR / per-engine hang recovery support for gen8.
  2016-01-13 17:28 ` [PATCH 03/20] drm/i915: TDR / per-engine hang recovery support for gen8 Arun Siluvery
  2016-01-13 21:16   ` Chris Wilson
  2016-01-13 21:21   ` Chris Wilson
@ 2016-01-29 14:16   ` Mika Kuoppala
  2 siblings, 0 replies; 31+ messages in thread
From: Mika Kuoppala @ 2016-01-29 14:16 UTC (permalink / raw)
  To: Arun Siluvery, intel-gfx; +Cc: Ian Lister, Tomas Elf

Arun Siluvery <arun.siluvery@linux.intel.com> writes:

> From: Tomas Elf <tomas.elf@intel.com>
>
> TDR = Timeout Detection and Recovery.
>
> This change introduces support for TDR-style per-engine reset as an initial,
> less intrusive hang recovery option to be attempted before falling back to the
> legacy full GPU reset recovery mode if necessary. Initially we're only
> supporting gen8 but adding support for gen7 is straightforward since we've
> already established an extensible framework where gen7 support can be plugged
> in (add corresponding versions of intel_ring_enable, intel_ring_disable,
> intel_ring_save, intel_ring_restore, etc.).
>
> 1. Per-engine recovery vs. Full GPU recovery
>
> To capture the state of a single engine being detected as hung there is now a
> new flag for every engine that can be set once the decision has been made to
> schedule hang recovery for that particular engine. This patch only provides the
> hang recovery path but not the hang detection integration so for now there is
> no way of detecting individual engines as hung and targeting that individual
> engine for per-engine hang recovery.
>
> The following algorithm is used to determine when to use which recovery mode
> given that hang detection has somehow detected a hang on an individual engine
> and given that per-engine hang recovery has been enabled (which it by default
> is not):
>
> 	1. The error handler checks all engines that have been marked as hung
> 	by the hang checker and checks how long ago it was since it last
> 	attempted to do per-engine hang recovery for each respective, currently
> 	hung engine. If the measured time period is within a certain time
> 	window, i.e. the last per-engine hang recovery was done too recently,
> 	it is determined that the previously attempted per-engine hang recovery
> 	was ineffective and the step is taken to promote the current hang to a
> 	full GPU reset. The default value for this time window is 10 seconds,
> 	meaning any hang happening within 10 seconds of a previous hang on the
> 	same engine will be promoted to full GPU reset. (of course, as long as
> 	the per-engine hang recovery option is disabled this won't matter and
> 	the error handler will always go for legacy full GPU reset)
>
> 	2. If the error handler determines that no currently hung engine has
> 	recently had hang recovery a per-engine hang recovery is scheduled.
>
> 	3. If the decision to go with per-engine hang recovery is not taken, or
> 	if per-engine hang recovery is attempted but failed for whatever
> 	reason, TDR falls back to legacy full GPU recovery.
>
> NOTE: Gen7 and earlier will always promote to full GPU reset since there is
> currently no per-engine reset support for these gens.
>
> 2. Context Submission Status Consistency.
>
> Per-engine hang recovery on gen8 (or execlist submission mode in general)
> relies on the basic concept of context submission status consistency. What this
> means is that we make sure that the status of the hardware and the driver when
> it comes to the submission of the currently running context on any engine is
> consistent. For example, when submitting a context to the corresponding ELSP
> port of an engine we expect the owning request of that context to be at the
> head of the corresponding execution list queue. Likewise, as long as the
> context is executing on the GPU we expect the EXECLIST_STATUS register and the
> context status buffer (CSB) to reflect this. Thus, if the context submission
> status is consistent the ID of the currently executing context should be in
> EXECLIST_STATUS and it should be consistent with the context of the head
> request element in the execution list queue corresponding to that engine.
>
> The reason why this is important for per-engine hang recovery in execlist mode
> is because this recovery mode relies on context resubmission in order to resume
> execution following the recovery. If a context has been determined to be hung
> and the per-engine hang recovery mode is engaged leading to the resubmission of
> that context it's important that the hardware is in fact not busy doing
> something else or is being idle since a resubmission during this state could
> cause unforeseen side-effects such as unexpected preemptions.
>
> There are rare, although consistently reproducible, situations that have shown
> up in practice where the driver and hardware are no longer consistent with each
> other, e.g. due to lost context completion interrupts after which the hardware
> would be idle but the driver would still think that a context is still
> active.
>
> 3. There is a new reset path for engine reset alongside the legacy full GPU
> reset path. This path does the following:
>
> 	1) Check for context submission consistency to make sure that the
> 	context that the hardware is currently stuck on is actually what the
> 	driver is working on. If not then clearly we're not in a consistently
> 	hung state and we bail out early.
>
> 	2) Disable/idle the engine. This is done through reset handshaking on
> 	gen8+ unlike earlier gens where this was done by clearing the ring
> 	valid bits in MI_MODE and ring control registers, which are no longer
> 	supported on gen8+. Reset handshaking translates to setting the reset
> 	request bit in the reset control register.
>
> 	3) Save the current engine state. What this translates to on gen8 is
> 	simply to read the current value of the head register and nudge it so
> 	that it points to the next valid instruction in the ring buffer. Since
> 	we assume that the execution is currently stuck in a batch buffer the
> 	effect of this is that the batchbuffer start instruction of the hung
> 	batch buffer is skipped so that when execution resumes, following the
> 	hang recovery completion, it resumes immediately following the batch
> 	buffer.
>
> 	This effectively means that we're forcefully terminating the currently
> 	active, hung batch buffer. Obviously, the outcome of this intervention
> 	is potentially undefined but there are not many good options in this
> 	scenario. It's better than resetting the entire GPU in the vast
> 	majority of cases.
>
> 	Save the nudged head value to be applied later.
>
> 	4) Reset the engine.
>
> 	5) Apply the nudged head value to the head register.
>
> 	6) Reenable the engine. For gen8 this means resubmitting the fixed-up
> 	context, allowing execution to resume. In order to resubmit a context
> 	without relying on the currently hung execlist queue we use a new,
> 	privileged API that is dedicated to TDR use only. This submission API
> 	bypasses any currently queued work and gets exclusive access to the
> 	ELSP ports.
>
> 	7) If the engine hang recovery procedure fails at any point in between
> 	disablement and reenablement of the engine there is a back-off
> 	procedure: For gen8 it's possible to back out of the reset handshake by
> 	clearing the reset request bit in the reset control register.
>
> NOTE:
> It's possible that some of Ben Widawsky's original per-engine reset patches
> from 3 years ago are in this commit but since this work has gone through the
> hands of at least 3 people already any kind of ownership tracking has been lost
> a long time ago. If you think that you should be on the sob list just let me
> know.
>
> * RFCv2: (Chris Wilson / Daniel Vetter)
> - Simply use the previously private function i915_gem_reset_ring_status() from
>   the engine hang recovery path to set active/pending context status. This
>   replicates the same behaviour as in full GPU reset but for a single,
>   targeted engine.
>
> - Remove all additional uevents for both full GPU reset and per-engine reset.
>   Adapted uevent behaviour to the new per-engine hang recovery mode in that it
>   will only send one uevent regardless of which form of recovery is employed.
>   If a per-engine reset is attempted first then one uevent will be dispatched.
>   If that recovery mode fails and the hang is promoted to a full GPU reset no
>   further uevents will be dispatched at that point.
>
> - Tidied up the TDR context resubmission path in intel_lrc.c . Reduced the
>   amount of duplication by relying entirely on the normal unqueue function.
>   Added a new parameter to the unqueue function that takes into consideration
>   if the unqueue call is for a first-time context submission or a resubmission
>   and adapts the handling of elsp_submitted accordingly. The reason for
>   this is that for context resubmission we don't expect any further
>   interrupts for the submission or the following context completion. A more
>   elegant way of handling this would be to phase out elsp_submitted
>   altogether, however that's part of a LRC/execlist cleanup effort that is
>   happening independently of this patch series. For now we make this change
>   as simple as possible with as few non-TDR-related side-effects as
>   possible.
>

> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Ian Lister <ian.lister@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_dma.c         |  18 +
>  drivers/gpu/drm/i915/i915_drv.c         | 206 ++++++++++++
>  drivers/gpu/drm/i915/i915_drv.h         |  58 ++++
>  drivers/gpu/drm/i915/i915_irq.c         | 169 +++++++++-
>  drivers/gpu/drm/i915/i915_params.c      |  19 ++
>  drivers/gpu/drm/i915/i915_params.h      |   2 +
>  drivers/gpu/drm/i915/i915_reg.h         |   2 +
>  drivers/gpu/drm/i915/intel_lrc.c        | 565 +++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/i915/intel_lrc.h        |  14 +
>  drivers/gpu/drm/i915/intel_lrc_tdr.h    |  36 ++
>  drivers/gpu/drm/i915/intel_ringbuffer.c |  84 ++++-
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  64 ++++
>  drivers/gpu/drm/i915/intel_uncore.c     | 147 +++++++++
>  13 files changed, 1358 insertions(+), 26 deletions(-)
>  create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h
>

1332 lines of new code in a single patch. We need to figure
out how to split this.

The context register write/read code and related macros are
not needed anymore, so that will reduce the line count a lot.

But some random comments for round two inlined below...


> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index 44a896c..c45ec353 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -837,6 +837,22 @@ static void intel_device_info_runtime_init(struct drm_device *dev)
>  			 info->has_eu_pg ? "y" : "n");
>  }
>  
> +static void
> +i915_hangcheck_init(struct drm_device *dev)
> +{
> +	int i;
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +
> +	for (i = 0; i < I915_NUM_RINGS; i++) {
> +		struct intel_engine_cs *engine = &dev_priv->ring[i];
> +		struct intel_ring_hangcheck *hc = &engine->hangcheck;
> +
> +		i915_hangcheck_reinit(engine);

intel_engine_init_hangcheck(engine);


> +		hc->reset_count = 0;
> +		hc->tdr_count = 0;
> +	}
> +}
> +
>  static void intel_init_dpio(struct drm_i915_private *dev_priv)
>  {
>  	/*
> @@ -1034,6 +1050,8 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
>  
>  	i915_gem_load(dev);
>  
> +	i915_hangcheck_init(dev);
> +
>  	/* On the 945G/GM, the chipset reports the MSI capability on the
>  	 * integrated graphics even though the support isn't actually there
>  	 * according to the published specs.  It doesn't appear to function
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index f17a2b0..c0ad003 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -34,6 +34,7 @@
>  #include "i915_drv.h"
>  #include "i915_trace.h"
>  #include "intel_drv.h"
> +#include "intel_lrc_tdr.h"

We want to push pre-gen8 stuff here too, at least
eventually. So

#include "intel_tdr.h"

>  
>  #include <linux/console.h>
>  #include <linux/module.h>
> @@ -571,6 +572,7 @@ static int i915_drm_suspend(struct drm_device *dev)
>  	struct drm_i915_private *dev_priv = dev->dev_private;
>  	pci_power_t opregion_target_state;
>  	int error;
> +	int i;
>  
>  	/* ignore lid events during suspend */
>  	mutex_lock(&dev_priv->modeset_restore_lock);
> @@ -596,6 +598,16 @@ static int i915_drm_suspend(struct drm_device *dev)
>  
>  	intel_guc_suspend(dev);
>  
> +	/*
> +	 * Clear any pending reset requests. They should be picked up
> +	 * after resume when new work is submitted
> +	 */
> +	for (i = 0; i < I915_NUM_RINGS; i++)
> +		atomic_set(&dev_priv->ring[i].hangcheck.flags, 0);

This will cause havoc if you ever expand the flag space. If
the comment says that you want to clear pending resets, then
clear it with a mask.
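
A minimal sketch of that, reusing the flag the series already defines:

for (i = 0; i < I915_NUM_RINGS; i++)
	atomic_clear_mask(I915_ENGINE_RESET_IN_PROGRESS,
			  &dev_priv->ring[i].hangcheck.flags);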

> +
> +	atomic_clear_mask(I915_RESET_IN_PROGRESS_FLAG,
> +		&dev_priv->gpu_error.reset_counter);
> +
>  	intel_suspend_gt_powersave(dev);
>  
>  	/*
> @@ -948,6 +960,200 @@ int i915_reset(struct drm_device *dev)
>  	return 0;
>  }
>  
> +/**
> + * i915_reset_engine - reset GPU engine after a hang
> + * @engine: engine to reset
> + *
> + * Reset a specific GPU engine. Useful if a hang is detected. Returns zero on successful
> + * reset or otherwise an error code.
> + *
> + * Procedure is fairly simple:
> + *
> + *	- Force engine to idle.
> + *
> + *	- Save current head register value and nudge it past the point of the hang in the
> + *	  ring buffer, which is typically the BB_START instruction of the hung batch buffer,
> + *	  on to the following instruction.
> + *
> + *	- Reset engine.
> + *
> + *	- Restore the previously saved, nudged head register value.
> + *
> + *	- Re-enable engine to resume running. On gen8 this requires the previously hung
> + *	  context to be resubmitted to ELSP via the dedicated TDR-execlists interface.
> + *
> + */
> +int i915_reset_engine(struct intel_engine_cs *engine)
> +{
> +	struct drm_device *dev = engine->dev;
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	struct drm_i915_gem_request *current_request = NULL;
> +	uint32_t head;
> +	bool force_advance = false;
> +	int ret = 0;
> +	int err_ret = 0;
> +
> +	WARN_ON(!mutex_is_locked(&dev->struct_mutex));
> +
> +        /* Take wake lock to prevent power saving mode */
> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> +
> +	i915_gem_reset_ring_status(dev_priv, engine);
> 
> +	if (i915.enable_execlists) {
> +		enum context_submission_status status =
> +			intel_execlists_TDR_get_current_request(engine, NULL);
> +
> +		/*
> +		 * If the context submission state in hardware is not
> +		 * consistent with the corresponding state in the driver, or
> +		 * if for some reason there is no current context in the
> +		 * process of being submitted then bail out and try again. Do
> +		 * not proceed unless we have reliable current context state
> +		 * information. The reason why this is important is because
> +		 * per-engine hang recovery relies on context resubmission in
> +		 * order to force the execution to resume following the hung
> +		 * batch buffer. If the hardware is not currently running the
> +		 * same context as the driver thinks is hung then anything can
> +		 * happen at the point of context resubmission, e.g. unexpected
> +		 * preemptions or the previously hung context could be
> +		 * submitted when the hardware is idle which makes no sense.
> +		 */
> +		if (status != CONTEXT_SUBMISSION_STATUS_OK) {
> +			ret = -EAGAIN;
> +			goto reset_engine_error;
> +		}
> +	}

This whole ambivalence troubles me. If our hangcheck part is lacking so
that it will reset engines that really are not stuck, then we should
move/improve this logic on the hangcheck side.

We are juggling here with the the execlist lock inside the 
intel_execlist_TDR_get_current_request and on multiple calls to that.

We need to hold the execlist lock during the state save and
restore.

> +
> +	ret = intel_ring_disable(engine);
> +	if (ret != 0) {
> +		DRM_ERROR("Failed to disable %s\n", engine->name);
> +		goto reset_engine_error;
> +	}
> +
> +	if (i915.enable_execlists) {
> +		enum context_submission_status status;
> +		bool inconsistent;
> +
> +		status = intel_execlists_TDR_get_current_request(engine,
> +				&current_request);
> +

intel_execlist_get_current_request()
intel_execlist_get_submission_status()

if we have the lock, there is no need to do everything in the same function.

And move the referencing of current_request up to this context as
the unreferencing is already here.
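
i.e. split roughly along these lines (return types are my assumption):

enum context_submission_status
intel_execlist_get_submission_status(struct intel_engine_cs *engine);

struct drm_i915_gem_request *
intel_execlist_get_current_request(struct intel_engine_cs *engine);

with the caller holding the execlist lock across both calls and taking
its own reference on the returned request.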


> +		inconsistent = (status != CONTEXT_SUBMISSION_STATUS_OK);
> +		if (inconsistent) {
> +			/*
> +			 * If we somehow have reached this point with
> +			 * an inconsistent context submission status then
> +			 * back out of the previously requested reset and
> +			 * retry later.
> +			 */
> +			WARN(inconsistent,
> +			     "Inconsistent context status on %s: %u\n",
> +			     engine->name, status);
> +
> +			ret = -EAGAIN;
> +			goto reenable_reset_engine_error;
> +		}
> +	}
> +
> +	/* Sample the current ring head position */
> +	head = I915_READ_HEAD(engine) & HEAD_ADDR;

intel_ring_get_active_head(engine);

> +
> +	if (head == engine->hangcheck.last_head) {
> +		/*
> +		 * The engine has not advanced since the last
> +		 * time it hung so force it to advance to the
> +		 * next QWORD. In most cases the engine head
> +		 * pointer will automatically advance to the
> +		 * next instruction as soon as it has read the
> +		 * current instruction, without waiting for it
> +		 * to complete. This seems to be the default
> +		 * behaviour, however an MBOX wait inserted
> +		 * directly to the VCS/BCS engines does not behave
> +		 * in the same way, instead the head pointer
> +		 * will still be pointing at the MBOX instruction
> +		 * until it completes.
> +		 */
> +		force_advance = true;
> +	}
> +
> +	engine->hangcheck.last_head = head;
> +
> +	ret = intel_ring_save(engine, current_request, force_advance);

intel_engine_save()

> +	if (ret) {
> +		DRM_ERROR("Failed to save %s engine state\n", engine->name);
> +		goto reenable_reset_engine_error;
> +	}
> +
> +	ret = intel_gpu_engine_reset(engine);

intel_engine_reset()

> +	if (ret) {
> +		DRM_ERROR("Failed to reset %s\n", engine->name);
> +		goto reenable_reset_engine_error;
> +	}
> +
> +	ret = intel_ring_restore(engine, current_request);

intel_engine_restore()

> +	if (ret) {
> +		DRM_ERROR("Failed to restore %s engine state\n", engine->name);
> +		goto reenable_reset_engine_error;
> +	}
> +
> +	/* Correct driver state */
> +	intel_gpu_engine_reset_resample(engine, current_request);

This looks like it resamples the head.

intel_engine_reset_head()

> +
> +	/*
> +	 * Reenable engine
> +	 *
> +	 * In execlist mode on gen8+ this is implicit by simply resubmitting
> +	 * the previously hung context. In ring buffer submission mode on gen7
> +	 * and earlier we need to actively turn on the engine first.
> +	 */
> +	if (i915.enable_execlists)
> +		intel_execlists_TDR_context_resubmission(engine);

intel_logical_ring_enable()?

> +	else
> +		ret = intel_ring_enable(engine);
> +

> +	if (ret) {
> +		DRM_ERROR("Failed to enable %s again after reset\n",
> +			engine->name);
> +
> +		goto reset_engine_error;
> +	}
> +
> +	/* Clear reset flags to allow future hangchecks */
> +	atomic_set(&engine->hangcheck.flags, 0);
> +
> +	/* Wake up anything waiting on this engine's queue */
> +	wake_up_all(&engine->irq_queue);
> +
> +	if (i915.enable_execlists && current_request)
> +		i915_gem_request_unreference(current_request);
> +
> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> +

The reset_engine_error: block is identical to the code block above.
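
A sketch of the dedup: drop the success-path copy and let it fall through
into the shared exit block, something like:

	/* success path falls through */
reset_engine_error:
	/* Clear reset flags to allow future hangchecks */
	atomic_set(&engine->hangcheck.flags, 0);

	/* Wake up anything waiting on this engine's queue */
	wake_up_all(&engine->irq_queue);

	if (i915.enable_execlists && current_request)
		i915_gem_request_unreference(current_request);

	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);

	return ret;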

> +	return ret;
> +
> +reenable_reset_engine_error:
> +
> +	err_ret = intel_ring_enable(engine);
> +	if (err_ret)
> +		DRM_ERROR("Failed to reenable %s following error during reset (%d)\n",
> +			engine->name, err_ret);
> +
> +reset_engine_error:
> +
> +	/* Clear reset flags to allow future hangchecks */
> +	atomic_set(&engine->hangcheck.flags, 0);
> +
> +	/* Wake up anything waiting on this engine's queue */
> +	wake_up_all(&engine->irq_queue);
> +
> +	if (i915.enable_execlists && current_request)
> +		i915_gem_request_unreference(current_request);
> +
> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> +
> +	return ret;
> +}
> +
>  static int i915_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  {
>  	struct intel_device_info *intel_info =
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 703a320..e866f14 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2432,6 +2432,48 @@ struct drm_i915_cmd_table {
>  	int count;
>  };
>  
> +/*
> + * Context submission status
> + *
> + * CONTEXT_SUBMISSION_STATUS_OK:
> + *	Context submitted to ELSP and state of execlist queue is the same as
> + *	the state of EXECLIST_STATUS register. Software and hardware states
> + *	are consistent and can be trusted.
> + *
> + * CONTEXT_SUBMISSION_STATUS_INCONSISTENT:
> + *	Context has been submitted to the execlist queue but the state of the
> + *	EXECLIST_STATUS register is different from the execlist queue state.
> + *	This could mean any of the following:
> + *
> + *		1. The context is in the head position of the execlist queue
> + *		   but has not yet been submitted to ELSP.
> + *
> + *		2. The hardware just recently completed the context but the
> + *		   context is pending removal from the execlist queue.
> + *
> + *		3. The driver has lost a context state transition interrupt.
> + *		   Typically what this means is that hardware has completed and
> + *		   is now idle but the driver thinks the hardware is still
> + *		   busy.
> + *
> + *	Overall what this means is that the context submission status is
> + *	currently in transition and cannot be trusted until it settles down.
> + *
> + * CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED:
> + *	No context submitted to the execlist queue and the EXECLIST_STATUS
> + *	register shows no context being processed.
> + *
> + * CONTEXT_SUBMISSION_STATUS_NONE_UNDEFINED:
> + *	Initial state before submission status has been determined.
> + *
> + */
> +enum context_submission_status {
> +	CONTEXT_SUBMISSION_STATUS_OK = 0,
> +	CONTEXT_SUBMISSION_STATUS_INCONSISTENT,
> +	CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED,
> +	CONTEXT_SUBMISSION_STATUS_UNDEFINED
> +};
> +
>  /* Note that the (struct drm_i915_private *) cast is just to shut up gcc. */
>  #define __I915__(p) ({ \
>  	struct drm_i915_private *__p; \
> @@ -2690,8 +2732,12 @@ extern long i915_compat_ioctl(struct file *filp, unsigned int cmd,
>  			      unsigned long arg);
>  #endif
>  extern int intel_gpu_reset(struct drm_device *dev);
> +extern int intel_gpu_engine_reset(struct intel_engine_cs *engine);
> +extern int intel_request_gpu_engine_reset(struct intel_engine_cs *engine);
> +extern int intel_unrequest_gpu_engine_reset(struct intel_engine_cs *engine);
>  extern bool intel_has_gpu_reset(struct drm_device *dev);
>  extern int i915_reset(struct drm_device *dev);
> +extern int i915_reset_engine(struct intel_engine_cs *engine);
>  extern unsigned long i915_chipset_val(struct drm_i915_private *dev_priv);
>  extern unsigned long i915_mch_val(struct drm_i915_private *dev_priv);
>  extern unsigned long i915_gfx_val(struct drm_i915_private *dev_priv);
> @@ -2704,6 +2750,18 @@ void intel_hpd_init(struct drm_i915_private *dev_priv);
>  void intel_hpd_init_work(struct drm_i915_private *dev_priv);
>  void intel_hpd_cancel_work(struct drm_i915_private *dev_priv);
>  bool intel_hpd_pin_to_port(enum hpd_pin pin, enum port *port);
> +static inline void i915_hangcheck_reinit(struct intel_engine_cs *engine)
> +{
> +	struct intel_ring_hangcheck *hc = &engine->hangcheck;
> +
> +	hc->acthd = 0;
> +	hc->max_acthd = 0;
> +	hc->seqno = 0;
> +	hc->score = 0;
> +	hc->action = HANGCHECK_IDLE;
> +	hc->deadlock = 0;
> +}
> +

Rename to intel_engine_hangcheck_init and move it to intel_ringbuffer.c.

>  
>  /* i915_irq.c */
>  void i915_queue_hangcheck(struct drm_device *dev);
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index f04d799..6a0ec37 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2470,10 +2470,70 @@ static void i915_reset_and_wakeup(struct drm_device *dev)
>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>  	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>  	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
> -	int ret;
> +	bool reset_complete = false;
> +	struct intel_engine_cs *ring;
> +	int ret = 0;
> +	int i;
> +
> +	mutex_lock(&dev->struct_mutex);
>  
>  	kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, error_event);
>  
> +	for_each_ring(ring, dev_priv, i) {
> +
> +		/*
> +		 * Skip further individual engine reset requests if full GPU
> +		 * reset requested.
> +		 */
> +		if (i915_reset_in_progress(error))
> +			break;
> +
> +		if (atomic_read(&ring->hangcheck.flags) &
> +			I915_ENGINE_RESET_IN_PROGRESS) {
> +
> +			if (!reset_complete)
> +				kobject_uevent_env(&dev->primary->kdev->kobj,
> +						   KOBJ_CHANGE,
> +						   reset_event);
> +
> +			reset_complete = true;
> +
> +			ret = i915_reset_engine(ring);
> +
> +			/*
> +			 * Execlist mode only:
> +			 *
> +			 * -EAGAIN means that between detecting a hang (and
> +			 * also determining that the currently submitted
> +			 * context is stable and valid) and trying to recover
> +			 * from the hang the current context changed state.
> +			 * This means that we are probably not completely hung
> +			 * after all. Just fail and retry by exiting all the
> +			 * way back and wait for the next hang detection. If we
> +			 * have a true hang on our hands then we will detect it
> +			 * again, otherwise we will continue like nothing
> +			 * happened.
> +			 */
> +			if (ret == -EAGAIN) {
> +				DRM_ERROR("Reset of %s aborted due to " \
> +					  "change in context submission " \
> +					  "state - retrying!", ring->name);
> +				ret = 0;
> +			}
> +
> +			if (ret) {
> +				DRM_ERROR("Reset of %s failed! (%d)", ring->name, ret);
> +
> +				atomic_or(I915_RESET_IN_PROGRESS_FLAG,
> +					&dev_priv->gpu_error.reset_counter);
> +				break;
> +			}
> +		}
> +	}
> +
> +	/* The full GPU reset will grab the struct_mutex when it needs it */
> +	mutex_unlock(&dev->struct_mutex);
> +
>  	/*
>  	 * Note that there's only one work item which does gpu resets, so we
>  	 * need not worry about concurrent gpu resets potentially incrementing
> @@ -2486,8 +2546,13 @@ static void i915_reset_and_wakeup(struct drm_device *dev)
>  	 */
>  	if (i915_reset_in_progress(error) && !i915_terminally_wedged(error)) {
>  		DRM_DEBUG_DRIVER("resetting chip\n");
> -		kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE,
> -				   reset_event);
> +
> +		if (!reset_complete)
> +			kobject_uevent_env(&dev->primary->kdev->kobj,
> +					   KOBJ_CHANGE,
> +					   reset_event);
> +
> +		reset_complete = true;
>  
>  		/*
>  		 * In most cases it's guaranteed that we get here with an RPM
> @@ -2520,23 +2585,36 @@ static void i915_reset_and_wakeup(struct drm_device *dev)
>  			 *
>  			 * Since unlock operations are a one-sided barrier only,
>  			 * we need to insert a barrier here to order any seqno
> -			 * updates before
> -			 * the counter increment.
> +			 * updates before the counter increment.
> +			 *
> +			 * The increment clears I915_RESET_IN_PROGRESS_FLAG.
>  			 */
>  			smp_mb__before_atomic();
>  			atomic_inc(&dev_priv->gpu_error.reset_counter);
>  
> -			kobject_uevent_env(&dev->primary->kdev->kobj,
> -					   KOBJ_CHANGE, reset_done_event);
> +			/*
> +			 * If any per-engine resets were promoted to full GPU
> +			 * reset don't forget to clear those reset flags.
> +			 */
> +			for_each_ring(ring, dev_priv, i)
> +				atomic_set(&ring->hangcheck.flags, 0);
>  		} else {
> +			/* Terminal wedge condition */
> +			WARN(1, "i915_reset failed, declaring GPU as wedged!\n");
>  			atomic_or(I915_WEDGED, &error->reset_counter);
>  		}
> +	}
>  
> -		/*
> -		 * Note: The wake_up also serves as a memory barrier so that
> -		 * waiters see the update value of the reset counter atomic_t.
> -		 */
> +	/*
> +	 * Note: The wake_up also serves as a memory barrier so that
> +	 * waiters see the update value of the reset counter atomic_t.
> +	 */
> +	if (reset_complete) {
>  		i915_error_wake_up(dev_priv, true);
> +
> +		if (ret == 0)
> +			kobject_uevent_env(&dev->primary->kdev->kobj,
> +					   KOBJ_CHANGE, reset_done_event);
>  	}
>  }
>  
> @@ -2649,6 +2727,14 @@ void i915_handle_error(struct drm_device *dev, bool wedged,
>  	va_list args;
>  	char error_msg[80];
>  
> +	struct intel_engine_cs *engine;
> +
> +	/*
> +	 * NB: Placeholder until the hang checker supports
> +	 * per-engine hang detection.
> +	 */
> +	u32 engine_mask = 0;
> +
>  	va_start(args, fmt);
>  	vscnprintf(error_msg, sizeof(error_msg), fmt, args);
>  	va_end(args);
> @@ -2657,8 +2743,65 @@ void i915_handle_error(struct drm_device *dev, bool wedged,
>  	i915_report_and_clear_eir(dev);
>  
>  	if (wedged) {
> -		atomic_or(I915_RESET_IN_PROGRESS_FLAG,
> -				&dev_priv->gpu_error.reset_counter);
> +		/*
> +		 * Defer to full GPU reset if any of the following is true:
> +		 *	1. Engine reset disabled.
> +		 *	2. The caller did not ask for per-engine reset.
> +		 *	3. The hardware does not support it (pre-gen7).
> +		 *	4. We already tried per-engine reset recently.
> +		 */
> +		bool full_reset = true;
> +
> +		if (!i915.enable_engine_reset) {
> +			DRM_INFO("Engine reset disabled: Using full GPU reset.\n");
> +			engine_mask = 0x0;
> +		}
> +
> +		/*
> +		 * TBD: We currently only support per-engine reset for gen8+.
> +		 * Implement support for gen7.
> +		 */
> +		if (engine_mask && (INTEL_INFO(dev)->gen >= 8)) {
> +			u32 i;
> +
> +			for_each_ring(engine, dev_priv, i) {
> +				u32 now, last_engine_reset_timediff;
> +
> +				if (!(intel_ring_flag(engine) & engine_mask))
> +					continue;
> +
> +				/* Measure the time since this engine was last reset */
> +				now = get_seconds();
> +				last_engine_reset_timediff =
> +					now - engine->hangcheck.last_engine_reset_time;
> +
> +				full_reset = last_engine_reset_timediff <
> +					i915.gpu_reset_promotion_time;
> +
> +				engine->hangcheck.last_engine_reset_time = now;
> +
> +				/*
> +				 * This engine was not reset too recently - go ahead
> +				 * with engine reset instead of falling back to full
> +				 * GPU reset.
> +				 *
> +				 * Flag that we want to try and reset this engine.
> +				 * This can still be overridden by a global
> +				 * reset e.g. if per-engine reset fails.
> +				 */
> +				if (!full_reset)
> +					atomic_or(I915_ENGINE_RESET_IN_PROGRESS,
> +						&engine->hangcheck.flags);
> +				else
> +					break;
> +
> +			} /* for_each_ring */
> +		}
> +
> +		if (full_reset) {
> +			atomic_or(I915_RESET_IN_PROGRESS_FLAG,
> +					&dev_priv->gpu_error.reset_counter);
> +		}
>  
>  		/*
>  		 * Wakeup waiting processes so that the reset function
> diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
> index 8d90c25..5cf9c11 100644
> --- a/drivers/gpu/drm/i915/i915_params.c
> +++ b/drivers/gpu/drm/i915/i915_params.c
> @@ -37,6 +37,8 @@ struct i915_params i915 __read_mostly = {
>  	.enable_fbc = -1,
>  	.enable_execlists = -1,
>  	.enable_hangcheck = true,
> +	.enable_engine_reset = false,
> +	.gpu_reset_promotion_time = 10,
>  	.enable_ppgtt = -1,
>  	.enable_psr = 0,
>  	.preliminary_hw_support = IS_ENABLED(CONFIG_DRM_I915_PRELIMINARY_HW_SUPPORT),
> @@ -116,6 +118,23 @@ MODULE_PARM_DESC(enable_hangcheck,
>  	"WARNING: Disabling this can cause system wide hangs. "
>  	"(default: true)");
>  
> +module_param_named_unsafe(enable_engine_reset, i915.enable_engine_reset, bool, 0644);
> +MODULE_PARM_DESC(enable_engine_reset,
> +	"Enable GPU engine hang recovery mode. Used as a soft, low-impact form "
> +	"of hang recovery that targets individual GPU engines rather than the "
> +	"entire GPU"
> +	"(default: false)");
> +
> +module_param_named(gpu_reset_promotion_time,
> +		i915.gpu_reset_promotion_time, int, 0644);
> +MODULE_PARM_DESC(gpu_reset_promotion_time,
> +	"Catch excessive engine resets. Each engine maintains a "
> +	"timestamp of the last time it was reset. If it hangs again "
> +	"within this period then fall back to full GPU reset to try "
> +	"and recover from the hang. Only applicable if "
> +	"enable_engine_reset is enabled. "
> +	"(default: 10 seconds)");
> +
>  module_param_named_unsafe(enable_ppgtt, i915.enable_ppgtt, int, 0400);
>  MODULE_PARM_DESC(enable_ppgtt,
>  	"Override PPGTT usage. "
> diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
> index 5299290..60f3d23 100644
> --- a/drivers/gpu/drm/i915/i915_params.h
> +++ b/drivers/gpu/drm/i915/i915_params.h
> @@ -49,8 +49,10 @@ struct i915_params {
>  	int use_mmio_flip;
>  	int mmio_debug;
>  	int edp_vswing;
> +	unsigned int gpu_reset_promotion_time;
>  	/* leave bools at the end to not create holes */
>  	bool enable_hangcheck;
> +	bool enable_engine_reset;
>  	bool fastboot;
>  	bool prefault_disable;
>  	bool load_detect_test;
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 0a98889..3fc5d75 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -164,6 +164,8 @@ static inline bool i915_mmio_reg_valid(i915_reg_t reg)
>  #define  GEN6_GRDOM_RENDER		(1 << 1)
>  #define  GEN6_GRDOM_MEDIA		(1 << 2)
>  #define  GEN6_GRDOM_BLT			(1 << 3)
> +#define  GEN6_GRDOM_VECS		(1 << 4)
> +#define  GEN8_GRDOM_MEDIA2		(1 << 7)
>  
>  #define RING_PP_DIR_BASE(ring)		_MMIO((ring)->mmio_base+0x228)
>  #define RING_PP_DIR_BASE_READ(ring)	_MMIO((ring)->mmio_base+0x518)
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ab344e0..fcec476 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -136,6 +136,7 @@
>  #include <drm/i915_drm.h>
>  #include "i915_drv.h"
>  #include "intel_mocs.h"
> +#include "intel_lrc_tdr.h"
>  
>  #define GEN9_LR_CONTEXT_RENDER_SIZE (22 * PAGE_SIZE)
>  #define GEN8_LR_CONTEXT_RENDER_SIZE (20 * PAGE_SIZE)
> @@ -325,7 +326,8 @@ uint64_t intel_lr_context_descriptor(struct intel_context *ctx,
>  }
>  
>  static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
> -				 struct drm_i915_gem_request *rq1)
> +				 struct drm_i915_gem_request *rq1,
> +				 bool tdr_resubmission)
>  {
>  
>  	struct intel_engine_cs *ring = rq0->ring;
> @@ -335,13 +337,17 @@ static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
>  
>  	if (rq1) {
>  		desc[1] = intel_lr_context_descriptor(rq1->ctx, rq1->ring);
> -		rq1->elsp_submitted++;
> +
> +		if (!tdr_resubmission)
> +			rq1->elsp_submitted++;
>  	} else {
>  		desc[1] = 0;
>  	}
>  
>  	desc[0] = intel_lr_context_descriptor(rq0->ctx, rq0->ring);
> -	rq0->elsp_submitted++;
> +
> +	if (!tdr_resubmission)
> +		rq0->elsp_submitted++;
>  
>  	/* You must always write both descriptors in the order below. */
>  	spin_lock(&dev_priv->uncore.lock);
> @@ -359,6 +365,182 @@ static void execlists_elsp_write(struct drm_i915_gem_request *rq0,
>  	spin_unlock(&dev_priv->uncore.lock);
>  }
>  
> +/**
> + * execlist_get_context_reg_page() - Get memory page for context object
> + * @engine: engine
> + * @ctx: context running on engine
> + * @page: returned page
> + *
> + * Return: 0 if successful, otherwise propagates error codes.
> + */
> +static inline int execlist_get_context_reg_page(struct intel_engine_cs *engine,
> +		struct intel_context *ctx,
> +		struct page **page)
> +{

All the macros and reg_page stuff can be removed as 
there is ctx->engine[id].lrc_reg_state for pinned
ctx objects.
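
Something like this should then be enough for a pinned context (sketch,
assuming ctx->engine[id].lrc_reg_state stays valid while the context is
pinned, and following the reg_state[reg + 1] layout used in this patch):

	u32 *reg_state = ctx->engine[engine->id].lrc_reg_state;

	/* reg_state[reg] is the reg offset, reg_state[reg + 1] the value */
	reg_state[CTX_RING_HEAD + 1] = head;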

> +	struct drm_i915_gem_object *ctx_obj;
> +
> +	if (!page)
> +		return -EINVAL;
> +
> +	if (!ctx)
> +		ctx = engine->default_context;
> +

No. Add a warn which triggers if someone tries to 
touch the default_context through this mechanism.

Default should be sacred, we don't want any state to
accidentally creep into it.
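
e.g. (sketch):

	if (WARN_ON(!ctx || ctx == engine->default_context))
		return -EINVAL;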


> +	ctx_obj = ctx->engine[engine->id].state;
> +
> +	if (WARN(!ctx_obj, "Context object not set up!\n"))
> +		return -EINVAL;
> +
> +	WARN(!i915_gem_obj_is_pinned(ctx_obj),
> +	     "Context object is not pinned!\n");
> +
> +	*page = i915_gem_object_get_page(ctx_obj, LRC_STATE_PN);

> +
> +	if (WARN(!*page, "Context object page could not be resolved!\n"))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +/**
> + * execlists_write_context_reg() - Write value to Context register
> + * @engine: Engine
> + * @ctx: Context running on engine
> + * @ctx_reg: Index into context image pointing to register location
> + * @mmio_reg: MMIO register struct
> + * @val: Value to be written
> + * @mmio_reg_name_str: Designated register name
> + *
> + * Return: 0 if successful, otherwise propagates error codes.
> + */
> +static inline int execlists_write_context_reg(struct intel_engine_cs *engine,
> +					      struct intel_context *ctx,
> +					      u32 ctx_reg,
> +					      i915_reg_t mmio_reg,
> +					      u32 val,
> +					      const char *mmio_reg_name_str)
> +{

> +	struct page *page = NULL;
> +	uint32_t *reg_state;
> +
> +	int ret = execlist_get_context_reg_page(engine, ctx, &page);
> +	if (WARN(ret, "[write %s:%u] Failed to get context memory page for %s!\n",
> +		 mmio_reg_name_str, (unsigned int) mmio_reg.reg, engine->name)) {
> +		return ret;
> +	}
> +
> +	reg_state = kmap_atomic(page);
> +
> +	WARN(reg_state[ctx_reg] != mmio_reg.reg,
> +	     "[write %s:%u]: Context reg addr (%x) != MMIO reg addr (%x)!\n",
> +	     mmio_reg_name_str,
> +	     (unsigned int) mmio_reg.reg,
> +	     (unsigned int) reg_state[ctx_reg],
> +	     (unsigned int) mmio_reg.reg);
> +
> +	reg_state[ctx_reg+1] = val;
> +	kunmap_atomic(reg_state);
> +
> +	return ret;
> +}
> +
> +/**
> + * execlists_read_context_reg() - Read value from Context register
> + * @engine: Engine
> + * @ctx: Context running on engine
> + * @ctx_reg: Index into context image pointing to register location
> + * @mmio_reg: MMIO register struct
> + * @val: Output parameter returning register value
> + * @mmio_reg_name_str: Designated register name
> + *
> + * Return: 0 if successful, otherwise propagates error codes.
> + */
> +static inline int execlists_read_context_reg(struct intel_engine_cs *engine,
> +					     struct intel_context *ctx,
> +					     u32 ctx_reg,
> +					     i915_reg_t mmio_reg,
> +					     u32 *val,
> +					     const char *mmio_reg_name_str)
> +{


> +	struct page *page = NULL;
> +	uint32_t *reg_state;
> +	int ret = 0;
> +
> +	if (!val)
> +		return -EINVAL;
> +
> +	ret = execlist_get_context_reg_page(engine, ctx, &page);
> +	if (WARN(ret, "[read %s:%u] Failed to get context memory page for %s!\n",
> +		 mmio_reg_name_str, (unsigned int) mmio_reg.reg, engine->name)) {
> +		return ret;
> +	}
> +
> +	reg_state = kmap_atomic(page);
> +
> +	WARN(reg_state[ctx_reg] != mmio_reg.reg,
> +	     "[read %s:%u]: Context reg addr (%x) != MMIO reg addr (%x)!\n",
> +	     mmio_reg_name_str,
> +	     (unsigned int) ctx_reg,
> +	     (unsigned int) reg_state[ctx_reg],
> +	     (unsigned int) mmio_reg.reg);
> +
> +	*val = reg_state[ctx_reg+1];
> +	kunmap_atomic(reg_state);
> +
> +	return ret;
> +}
> +
> +/*
> + * Generic macros for generating function implementation for context register
> + * read/write functions.
> + *
> + * Macro parameters
> + * ----------------
> + * reg_name: Designated name of context register (e.g. tail, head, buffer_ctl)
> + *
> + * reg_def: Context register macro definition (e.g. CTX_RING_TAIL)
> + *
> + * mmio_reg_def: Name of macro function used to determine the address
> + *		 of the corresponding MMIO register (e.g. RING_TAIL, RING_HEAD).
> + *		 This macro function is assumed to be defined on the form of:
> + *
> + *			#define mmio_reg_def(base) (base+register_offset)
> + *
> + *		 Where "base" is the MMIO base address of the respective ring
> + *		 and "register_offset" is the offset relative to "base".
> + *
> + * Function parameters
> + * -------------------
> + * engine: The engine that the context is running on
> + * ctx: The context of the register that is to be accessed
> + * reg_name: Value to be written/read to/from the register.
> + */
> +#define INTEL_EXECLISTS_WRITE_REG(reg_name, reg_def, mmio_reg_def) \
> +	int intel_execlists_write_##reg_name(struct intel_engine_cs *engine, \
> +					     struct intel_context *ctx, \
> +					     u32 reg_name) \
> +{ \
> +	return execlists_write_context_reg(engine, ctx, (reg_def), \
> +			mmio_reg_def(engine->mmio_base), (reg_name), \
> +			(#reg_name)); \
> +}
> +
> +#define INTEL_EXECLISTS_READ_REG(reg_name, reg_def, mmio_reg_def) \
> +	int intel_execlists_read_##reg_name(struct intel_engine_cs *engine, \
> +					    struct intel_context *ctx, \
> +					    u32 *reg_name) \
> +{ \
> +	return execlists_read_context_reg(engine, ctx, (reg_def), \
> +			mmio_reg_def(engine->mmio_base), (reg_name), \
> +			(#reg_name)); \
> +}
> +
> +INTEL_EXECLISTS_READ_REG(tail, CTX_RING_TAIL, RING_TAIL)
> +INTEL_EXECLISTS_WRITE_REG(head, CTX_RING_HEAD, RING_HEAD)
> +INTEL_EXECLISTS_READ_REG(head, CTX_RING_HEAD, RING_HEAD)
> +
> +#undef INTEL_EXECLISTS_READ_REG
> +#undef INTEL_EXECLISTS_WRITE_REG
> +
>  static int execlists_update_context(struct drm_i915_gem_request *rq)
>  {
>  	struct intel_engine_cs *ring = rq->ring;
> @@ -396,17 +578,18 @@ static int execlists_update_context(struct drm_i915_gem_request *rq)
>  }
>  
>  static void execlists_submit_requests(struct drm_i915_gem_request *rq0,
> -				      struct drm_i915_gem_request *rq1)
> +				      struct drm_i915_gem_request *rq1,
> +				      bool tdr_resubmission)
>  {
>  	execlists_update_context(rq0);
>  
>  	if (rq1)
>  		execlists_update_context(rq1);
>  
> -	execlists_elsp_write(rq0, rq1);
> +	execlists_elsp_write(rq0, rq1, tdr_resubmission);
>  }
>  
> -static void execlists_context_unqueue(struct intel_engine_cs *ring)
> +static void execlists_context_unqueue(struct intel_engine_cs *ring, bool tdr_resubmission)
>  {
>  	struct drm_i915_gem_request *req0 = NULL, *req1 = NULL;
>  	struct drm_i915_gem_request *cursor = NULL, *tmp = NULL;
> @@ -440,6 +623,16 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
>  		}
>  	}
>  
> +	/*
> +	 * Only do TDR resubmission of the second head request if it's already
> +	 * been submitted. The intention is to restore the original submission
> +	 * state from the situation when the hang originally happened. If it
> +	 * was never submitted we don't want to submit it for the first time at
> +	 * this point.
> +	 */
> +	if (tdr_resubmission && req1 && !req1->elsp_submitted)
> +		req1 = NULL;
> +
>  	if (IS_GEN8(ring->dev) || IS_GEN9(ring->dev)) {
>  		/*
>  		 * WaIdleLiteRestore: make sure we never cause a lite
> @@ -460,9 +653,32 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
>  		}
>  	}
>  
> -	WARN_ON(req1 && req1->elsp_submitted);
> +	WARN_ON(req1 && req1->elsp_submitted && !tdr_resubmission);
>  
> -	execlists_submit_requests(req0, req1);
> +	execlists_submit_requests(req0, req1, tdr_resubmission);
> +}
> +
> +/**
> + * intel_execlists_TDR_context_resubmission() - ELSP context resubmission
> + * @ring: engine to do resubmission for.
> + *
> + * Context submission mechanism exclusively used by TDR that bypasses the
> + * execlist queue. This is necessary since at the point of TDR hang recovery
> + * the hardware will be hung and resubmitting a fixed context (the context that
> + * the TDR has identified as hung and fixed up in order to move past the
> + * blocking batch buffer) to a hung execlist queue will lock up the TDR.
> + * Instead, opt for direct ELSP submission without depending on the rest of the
> + * driver.
> + */
> +void intel_execlists_TDR_context_resubmission(struct intel_engine_cs *ring)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&ring->execlist_lock, flags);
> +	WARN_ON(list_empty(&ring->execlist_queue));
> +
> +	execlists_context_unqueue(ring, true);
> +	spin_unlock_irqrestore(&ring->execlist_lock, flags);
>  }
>  
>  static bool execlists_check_remove_request(struct intel_engine_cs *ring,
> @@ -560,9 +776,9 @@ void intel_lrc_irq_handler(struct intel_engine_cs *ring)
>  		/* Prevent a ctx to preempt itself */
>  		if ((status & GEN8_CTX_STATUS_ACTIVE_IDLE) &&
>  		    (submit_contexts != 0))
> -			execlists_context_unqueue(ring);
> +			execlists_context_unqueue(ring, false);
>  	} else if (submit_contexts != 0) {
> -		execlists_context_unqueue(ring);
> +		execlists_context_unqueue(ring, false);
>  	}
>  
>  	spin_unlock(&ring->execlist_lock);
> @@ -613,7 +829,7 @@ static int execlists_context_queue(struct drm_i915_gem_request *request)
>  
>  	list_add_tail(&request->execlist_link, &ring->execlist_queue);
>  	if (num_elements == 0)
> -		execlists_context_unqueue(ring);
> +		execlists_context_unqueue(ring, false);
>  
>  	spin_unlock_irq(&ring->execlist_lock);
>  
> @@ -1536,7 +1752,7 @@ static int gen8_init_common_ring(struct intel_engine_cs *ring)
>  	ring->next_context_status_buffer = next_context_status_buffer_hw;
>  	DRM_DEBUG_DRIVER("Execlists enabled for %s\n", ring->name);
>  
> -	memset(&ring->hangcheck, 0, sizeof(ring->hangcheck));
> +	i915_hangcheck_reinit(ring);
>  
>  	return 0;
>  }
> @@ -1888,6 +2104,187 @@ out:
>  	return ret;
>  }
>  
> +static int
> +gen8_ring_disable(struct intel_engine_cs *ring)
> +{
> +	intel_request_gpu_engine_reset(ring);
> +	return 0;
> +}
> +
> +static int
> +gen8_ring_enable(struct intel_engine_cs *ring)
> +{
> +	intel_unrequest_gpu_engine_reset(ring);
> +	return 0;
> +}
> +
> +/**
> + * gen8_ring_save() - save minimum engine state
> + * @ring: engine whose state is to be saved
> + * @req: request containing the context currently running on engine
> + * @force_advance: indicates whether or not we should nudge the head
> + *		  forward
> + *
> + * Saves the head MMIO register to scratch memory while engine is reset and
> + * reinitialized. Before saving the head register we nudge the head position to
> + * be correctly aligned with a QWORD boundary, which brings it up to the next
> + * presumably valid instruction. Typically, at the point of hang recovery the
> + * head register will be pointing to the last DWORD of the BB_START
> + * instruction, which is followed by a padding MI_NOOP inserted by the
> + * driver.
> + *
> + * Returns:
> + * 	0 if ok, otherwise propagates error codes.
> + */
> +static int
> +gen8_ring_save(struct intel_engine_cs *ring, struct drm_i915_gem_request *req,
> +		bool force_advance)
> +{
> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> +	struct intel_ringbuffer *ringbuf = NULL;
> +	struct intel_context *ctx;
> +	int ret = 0;
> +	int clamp_to_tail = 0;
> +	uint32_t head;
> +	uint32_t tail;
> +	uint32_t head_addr;
> +	uint32_t tail_addr;
> +
> +	if (WARN_ON(!req))
> +		return -EINVAL;
> +
> +	ctx = req->ctx;
> +	ringbuf = ctx->engine[ring->id].ringbuf;
> +
> +	/*
> +	 * Read head from MMIO register since it contains the
> +	 * most up to date value of head at this point.
> +	 */
> +	head = I915_READ_HEAD(ring);
> +
> +	/*
> +	 * Read tail from the context because the execlist queue
> +	 * updates the tail value there first during submission.
> +	 * The MMIO tail register is not updated until the actual
> +	 * ring submission completes.
> +	 */
> +	ret = I915_READ_TAIL_CTX(ring, ctx, tail);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * head_addr and tail_addr are the head and tail values
> +	 * excluding ring wrapping information and aligned to DWORD
> +	 * boundary
> +	 */
> +	head_addr = head & HEAD_ADDR;
> +	tail_addr = tail & TAIL_ADDR;
> +
> +	/*
> +	 * The head must always chase the tail.
> +	 * If the tail is beyond the head then do not allow
> +	 * the head to overtake it. If the tail is less than
> +	 * the head then the tail has already wrapped and
> +	 * there is no problem in advancing the head or even
> +	 * wrapping the head back to 0 as worst case it will
> +	 * become equal to tail
> +	 */
> +	if (head_addr <= tail_addr)
> +		clamp_to_tail = 1;
> +
> +	if (force_advance) {
> +
> +		/* Force head pointer to next QWORD boundary */
> +		head_addr &= ~0x7;
> +		head_addr += 8;
> +
> +	} else if (head & 0x7) {
> +
> +		/* Ensure head pointer is pointing to a QWORD boundary */
> +		head += 0x7;
> +		head &= ~0x7;
> +		head_addr = head;
> +	}
> +
> +	if (clamp_to_tail && (head_addr > tail_addr)) {
> +		head_addr = tail_addr;
> +	} else if (head_addr >= ringbuf->size) {
> +		/* Wrap head back to start if it exceeds ring size */
> +		head_addr = 0;
> +	}
> +
> +	head &= ~HEAD_ADDR;
> +	head |= (head_addr & HEAD_ADDR);
> +	ring->saved_head = head;
> +
> +	return 0;
> +}
> +
> +
> +/**
> + * gen8_ring_restore() - restore previously saved engine state
> + * @ring: engine whose state is to be restored
> + * @req: request containing the context currently running on engine
> + *
> + * Reinitializes engine and restores the previously saved engine state.
> + * See: gen8_ring_save()
> + *
> + * Returns:
> + * 	0 if ok, otherwise propagates error codes.
> + */
> +static int
> +gen8_ring_restore(struct intel_engine_cs *ring, struct drm_i915_gem_request *req)
> +{
> +	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> +	struct intel_context *ctx;
> +
> +	if (WARN_ON(!req))
> +		return -EINVAL;
> +
> +	ctx = req->ctx;
> +
> +	/* Re-initialize ring */
> +	if (ring->init_hw) {
> +		int ret = ring->init_hw(ring);
> +		if (ret != 0) {
> +			DRM_ERROR("Failed to re-initialize %s\n",
> +					ring->name);
> +			return ret;
> +		}
> +	} else {
> +		DRM_ERROR("ring init function pointer not set up\n");
> +		return -EINVAL;
> +	}
> +
> +	if (ring->id == RCS) {
> +		/*
> +		 * These register reinitializations are only located here
> +		 * temporarily until they are moved out of the
> +		 * init_clock_gating function to some function we can
> +		 * call from here.
> +		 */
> +
> +		/* WaVSRefCountFullforceMissDisable:chv */
> +		/* WaDSRefCountFullforceMissDisable:chv */
> +		I915_WRITE(GEN7_FF_THREAD_MODE,
> +			   I915_READ(GEN7_FF_THREAD_MODE) &
> +			   ~(GEN8_FF_DS_REF_CNT_FFME | GEN7_FF_VS_REF_CNT_FFME));
> +
> +		I915_WRITE(_3D_CHICKEN3,
> +			   _3D_CHICKEN_SDE_LIMIT_FIFO_POLY_DEPTH(2));
> +
> +		/* WaSwitchSolVfFArbitrationPriority:bdw */
> +		I915_WRITE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) | HSW_ECOCHK_ARB_PRIO_SOL);
> +	}
> +
> +	/* Restore head */
> +
> +	I915_WRITE_HEAD(ring, ring->saved_head);
> +	I915_WRITE_HEAD_CTX(ring, ctx, ring->saved_head);
> +
> +	return 0;
> +}
> +
>  static int gen8_init_rcs_context(struct drm_i915_gem_request *req)
>  {
>  	int ret;
> @@ -2021,6 +2418,10 @@ static int logical_render_ring_init(struct drm_device *dev)
>  	ring->irq_get = gen8_logical_ring_get_irq;
>  	ring->irq_put = gen8_logical_ring_put_irq;
>  	ring->emit_bb_start = gen8_emit_bb_start;
> +	ring->enable = gen8_ring_enable;
> +	ring->disable = gen8_ring_disable;
> +	ring->save = gen8_ring_save;
> +	ring->restore = gen8_ring_restore;
>  
>  	ring->dev = dev;
>  
> @@ -2073,6 +2474,10 @@ static int logical_bsd_ring_init(struct drm_device *dev)
>  	ring->irq_get = gen8_logical_ring_get_irq;
>  	ring->irq_put = gen8_logical_ring_put_irq;
>  	ring->emit_bb_start = gen8_emit_bb_start;
> +	ring->enable = gen8_ring_enable;
> +	ring->disable = gen8_ring_disable;
> +	ring->save = gen8_ring_save;
> +	ring->restore = gen8_ring_restore;
>  
>  	return logical_ring_init(dev, ring);
>  }
> @@ -2098,6 +2503,10 @@ static int logical_bsd2_ring_init(struct drm_device *dev)
>  	ring->irq_get = gen8_logical_ring_get_irq;
>  	ring->irq_put = gen8_logical_ring_put_irq;
>  	ring->emit_bb_start = gen8_emit_bb_start;
> +	ring->enable = gen8_ring_enable;
> +	ring->disable = gen8_ring_disable;
> +	ring->save = gen8_ring_save;
> +	ring->restore = gen8_ring_restore;
>  
>  	return logical_ring_init(dev, ring);
>  }
> @@ -2128,6 +2537,10 @@ static int logical_blt_ring_init(struct drm_device *dev)
>  	ring->irq_get = gen8_logical_ring_get_irq;
>  	ring->irq_put = gen8_logical_ring_put_irq;
>  	ring->emit_bb_start = gen8_emit_bb_start;
> +	ring->enable = gen8_ring_enable;
> +	ring->disable = gen8_ring_disable;
> +	ring->save = gen8_ring_save;
> +	ring->restore = gen8_ring_restore;
>  
>  	return logical_ring_init(dev, ring);
>  }
> @@ -2158,6 +2571,10 @@ static int logical_vebox_ring_init(struct drm_device *dev)
>  	ring->irq_get = gen8_logical_ring_get_irq;
>  	ring->irq_put = gen8_logical_ring_put_irq;
>  	ring->emit_bb_start = gen8_emit_bb_start;
> +	ring->enable = gen8_ring_enable;
> +	ring->disable = gen8_ring_disable;
> +	ring->save = gen8_ring_save;
> +	ring->restore = gen8_ring_restore;
>  
>  	return logical_ring_init(dev, ring);
>  }
> @@ -2587,3 +3004,127 @@ void intel_lr_context_reset(struct drm_device *dev,
>  		ringbuf->tail = 0;
>  	}
>  }
> +
> +/**
> + * intel_execlists_TDR_get_current_request() - return request currently
> + * processed by engine
> + *
> + * @ring: Engine currently running context to be returned.
> + *
> + * @req:  Output parameter containing the current request (the request at the
> + *	  head of execlist queue corresponding to the given ring). May be NULL
> + *	  if no request has been submitted to the execlist queue of this
> + *	  engine. If the req parameter passed in to the function is not NULL
> + *	  and a request is found, the request is referenced before it is
> + *	  returned. It is the responsibility of the caller to unreference it
> + *	  at the end of its life cycle.
> + *
> + * Return:
> + *	CONTEXT_SUBMISSION_STATUS_OK if request is found to be submitted and its
> + *	context is currently running on engine.
> + *
> + *	CONTEXT_SUBMISSION_STATUS_INCONSISTENT if request is found to be submitted
> + *	but its context is not in a state that is consistent with current
> + *	hardware state for the given engine. This has been observed in three cases:
> + *
> + *		1. Before the engine has switched to this context after it has
> + *		been submitted to the execlist queue.
> + *
> + *		2. After the engine has switched away from this context but
> + *		before the context has been removed from the execlist queue.
> + *
> + *		3. The driver has lost an interrupt. Typically the hardware has
> + *		gone idle but the driver still thinks the context belonging to
> + *		the request at the head of the queue is executing.
> + *
> + *	CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED if no context has been found
> + *	to be submitted to the execlist queue and if the hardware is idle.
> + */
> +enum context_submission_status
> +intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request **req)
> +{
> +	struct drm_i915_private *dev_priv;
> +	unsigned long flags;
> +	struct drm_i915_gem_request *tmpreq = NULL;
> +	struct intel_context *tmpctx = NULL;
> +	unsigned hw_context = 0;
> +	unsigned sw_context = 0;
> +	bool hw_active = false;
> +	enum context_submission_status status =
> +			CONTEXT_SUBMISSION_STATUS_UNDEFINED;
> +
> +	if (WARN_ON(!ring))
> +		return status;
> +
> +	dev_priv = ring->dev->dev_private;
> +
> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> +	spin_lock_irqsave(&ring->execlist_lock, flags);
> +	hw_context = I915_READ(RING_EXECLIST_STATUS_CTX_ID(ring));
> +
> +	hw_active = (I915_READ(RING_EXECLIST_STATUS_LO(ring)) &
> +		EXECLIST_STATUS_CURRENT_ACTIVE_ELEMENT_STATUS) ? true : false;
> +
> +	tmpreq = list_first_entry_or_null(&ring->execlist_queue,
> +		struct drm_i915_gem_request, execlist_link);
> +
> +	if (tmpreq) {
> +		sw_context = intel_execlists_ctx_id((tmpreq->ctx)->engine[ring->id].state);
> +
> +		/*
> +		 * Only acknowledge the request in the execlist queue if it's
> +		 * actually been submitted to hardware, otherwise there's the
> +		 * risk of a false inconsistency detection between the
> +		 * (unsubmitted) request and the idle hardware state.
> +		 */
> +		if (tmpreq->elsp_submitted > 0) {
> +			/*
> +			 * If the caller has not passed a non-NULL req
> +			 * parameter then it is not interested in getting a
> +			 * request reference back.  Don't temporarily grab a
> +			 * reference since holding the execlist lock is enough
> +			 * to ensure that the execlist code will hold its
> +			 * reference all throughout this function. As long as
> +			 * that reference is kept there is no need for us to
> +			 * take yet another reference.  The reason why this is
> +			 * of interest is because certain callers, such as the
> +			 * TDR hang checker, cannot grab struct_mutex before
> +			 * calling and because of that we cannot dereference
> +			 * any requests (DRM might assert if we do). Just rely
> +			 * on the execlist code to provide indirect protection.
> +			 */
> +			if (req)
> +				i915_gem_request_reference(tmpreq);
> +
> +			if (tmpreq->ctx)
> +				tmpctx = tmpreq->ctx;
> +		}
> +	}
> +
> +	if (tmpctx) {
> +		status = ((hw_context == sw_context) && hw_active) ?
> +				CONTEXT_SUBMISSION_STATUS_OK :
> +				CONTEXT_SUBMISSION_STATUS_INCONSISTENT;
> +	} else {
> +		/*
> +		 * If we don't have any queue entries and the
> +		 * EXECLIST_STATUS register points to zero we are
> +		 * clearly not processing any context right now
> +		 */
> +		WARN((hw_context || hw_active), "hw_context=%x, hardware %s!\n",
> +			hw_context, hw_active ? "not idle":"idle");
> +
> +		status = (hw_context || hw_active) ?
> +			CONTEXT_SUBMISSION_STATUS_INCONSISTENT :
> +			CONTEXT_SUBMISSION_STATUS_NONE_SUBMITTED;
> +	}
> +
> +	if (req)
> +		*req = tmpreq;
> +
> +	spin_unlock_irqrestore(&ring->execlist_lock, flags);
> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> +
> +	return status;
> +}
> diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
> index de41ad6..d9acb31 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.h
> +++ b/drivers/gpu/drm/i915/intel_lrc.h
> @@ -29,7 +29,9 @@
>  /* Execlists regs */
>  #define RING_ELSP(ring)				_MMIO((ring)->mmio_base + 0x230)
>  #define RING_EXECLIST_STATUS_LO(ring)		_MMIO((ring)->mmio_base + 0x234)
> +#define	  EXECLIST_STATUS_CURRENT_ACTIVE_ELEMENT_STATUS	(0x3 << 14)
>  #define RING_EXECLIST_STATUS_HI(ring)		_MMIO((ring)->mmio_base + 0x234 + 4)
> +#define RING_EXECLIST_STATUS_CTX_ID(ring)	RING_EXECLIST_STATUS_HI(ring)
>  #define RING_CONTEXT_CONTROL(ring)		_MMIO((ring)->mmio_base + 0x244)
>  #define	  CTX_CTRL_INHIBIT_SYN_CTX_SWITCH	(1 << 3)
>  #define	  CTX_CTRL_ENGINE_CTX_RESTORE_INHIBIT	(1 << 0)
> @@ -118,4 +120,16 @@ u32 intel_execlists_ctx_id(struct drm_i915_gem_object *ctx_obj);
>  void intel_lrc_irq_handler(struct intel_engine_cs *ring);
>  void intel_execlists_retire_requests(struct intel_engine_cs *ring);
>  
> +int intel_execlists_read_tail(struct intel_engine_cs *ring,
> +			 struct intel_context *ctx,
> +			 u32 *tail);
> +
> +int intel_execlists_write_head(struct intel_engine_cs *ring,
> +			  struct intel_context *ctx,
> +			  u32 head);
> +
> +int intel_execlists_read_head(struct intel_engine_cs *ring,
> +			 struct intel_context *ctx,
> +			 u32 *head);
> +



>  #endif /* _INTEL_LRC_H_ */
> diff --git a/drivers/gpu/drm/i915/intel_lrc_tdr.h b/drivers/gpu/drm/i915/intel_lrc_tdr.h
> new file mode 100644
> index 0000000..4520753
> --- /dev/null
> +++ b/drivers/gpu/drm/i915/intel_lrc_tdr.h
> @@ -0,0 +1,36 @@
> +/*
> + * Copyright © 2015 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
> + * DEALINGS IN THE SOFTWARE.
> + */
> +
> +#ifndef _INTEL_LRC_TDR_H_
> +#define _INTEL_LRC_TDR_H_
> +
> +/* Privileged execlist API used exclusively by TDR */
> +
> +void intel_execlists_TDR_context_resubmission(struct intel_engine_cs *ring);
> +
> +enum context_submission_status
> +intel_execlists_TDR_get_current_request(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request **req);
> +
> +#endif /* _INTEL_LRC_TDR_H_ */
> +
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index 4060acf..def0dcf 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -434,6 +434,88 @@ static void ring_write_tail(struct intel_engine_cs *ring,
>  	I915_WRITE_TAIL(ring, value);
>  }
>  
> +int intel_ring_disable(struct intel_engine_cs *ring)
> +{
> +	if (WARN_ON(!ring))
> +		return -EINVAL;
> +
> +	if (ring->disable)
> +		return ring->disable(ring);
> +
> +	DRM_ERROR("Ring disable not supported on %s\n", ring->name);
> +	return -EINVAL;
> +}
> +
> +int intel_ring_enable(struct intel_engine_cs *ring)
> +{
> +	if (WARN_ON(!ring))
> +		return -EINVAL;
> +
> +	if (ring->enable)
> +		return ring->enable(ring);
> +
> +	DRM_ERROR("Ring enable not supported on %s\n", ring->name);
> +	return -EINVAL;
> +}
> +
> +int intel_ring_save(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request *req,
> +		bool force_advance)
> +{
> +	if (WARN_ON(!ring))
> +		return -EINVAL;
> +
> +	if (ring->save)
> +		return ring->save(ring, req, force_advance);
> +
> +	DRM_ERROR("Ring save not supported on %s\n", ring->name);
> +	return -EINVAL;
> +}
> +
> +int intel_ring_restore(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request *req)
> +{
> +	if (WARN_ON(!ring))
> +		return -EINVAL;
> +
> +	if (ring->restore)
> +		return ring->restore(ring, req);
> +
> +	DRM_ERROR("Ring restore not supported on %s\n", ring->name);
> +	return -EINVAL;
> +}
> +
> +void intel_gpu_engine_reset_resample(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request *req)
> +{
> +	struct intel_ringbuffer *ringbuf;
> +	struct drm_i915_private *dev_priv;
> +
> +	if (WARN_ON(!ring))
> +		return;
> +
> +	dev_priv = ring->dev->dev_private;
> +
> +	if (i915.enable_execlists) {
> +		struct intel_context *ctx;
> +
> +		if (WARN_ON(!req))
> +			return;
> +
> +		ctx = req->ctx;
> +		ringbuf = ctx->engine[ring->id].ringbuf;
> +
> +		/*
> +		 * In gen8+ context head is restored during reset and
> +		 * we can use it as a reference to set up the new
> +		 * driver state.
> +		 */
> +		I915_READ_HEAD_CTX(ring, ctx, ringbuf->head);
> +		ringbuf->last_retired_head = -1;
> +		intel_ring_update_space(ringbuf);
> +	}
> +}
> +
>  u64 intel_ring_get_active_head(struct intel_engine_cs *ring)
>  {
>  	struct drm_i915_private *dev_priv = ring->dev->dev_private;
> @@ -629,7 +711,7 @@ static int init_ring_common(struct intel_engine_cs *ring)
>  	ringbuf->tail = I915_READ_TAIL(ring) & TAIL_ADDR;
>  	intel_ring_update_space(ringbuf);
>  
> -	memset(&ring->hangcheck, 0, sizeof(ring->hangcheck));
> +	i915_hangcheck_reinit(ring);
>  
>  out:
>  	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 7349d92..7014778 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -49,6 +49,22 @@ struct  intel_hw_status_page {
>  #define I915_READ_MODE(ring) I915_READ(RING_MI_MODE((ring)->mmio_base))
>  #define I915_WRITE_MODE(ring, val) I915_WRITE(RING_MI_MODE((ring)->mmio_base), val)
>  
> +
> +#define I915_READ_TAIL_CTX(engine, ctx, outval) \
> +	intel_execlists_read_tail((engine), \
> +				(ctx), \
> +				&(outval))
> +
> +#define I915_READ_HEAD_CTX(engine, ctx, outval) \
> +	intel_execlists_read_head((engine), \
> +				(ctx), \
> +				&(outval))
> +
> +#define I915_WRITE_HEAD_CTX(engine, ctx, val) \
> +	intel_execlists_write_head((engine), \
> +				(ctx), \
> +				(val))
> +


Don't see the benefit of all the macros.

If you look at lrc_reg_state we can throw
most if not all this register reading/writing code out.
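
e.g. the I915_READ_TAIL_CTX() call in gen8_ring_save() could then become
(sketch, assuming the context is pinned so lrc_reg_state is valid):

	u32 tail = ctx->engine[ring->id].lrc_reg_state[CTX_RING_TAIL + 1];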


>  /* seqno size is actually only a uint32, but since we plan to use MI_FLUSH_DW to
>   * do the writes, and that must have qw aligned offsets, simply pretend it's 8b.
>   */
> @@ -94,6 +110,34 @@ struct intel_ring_hangcheck {
>  	enum intel_ring_hangcheck_action action;
>  	int deadlock;
>  	u32 instdone[I915_NUM_INSTDONE_REG];
> +
> +	/*
> +	 * Last recorded ring head index.
> +	 * This is only ever a ring index, whereas the active
> +	 * head may be a graphics address in a ring buffer.
> +	 */
> +	u32 last_head;
> +
> +	/* Flag to indicate if engine reset required */
> +	atomic_t flags;
> +
> +	/* Indicates request to reset this engine */
> +#define I915_ENGINE_RESET_IN_PROGRESS (1<<0)
> +
> +	/*
> +	 * Timestamp (seconds) from when the last time
> +	 * this engine was reset.
> +	 */
> +	u32 last_engine_reset_time;
> +
> +	/*
> +	 * Number of times this engine has been
> +	 * reset since boot
> +	 */
> +	u32 reset_count;
> +
> +	/* Number of TDR hang detections */
> +	u32 tdr_count;
>  };
>  
>  struct intel_ringbuffer {
> @@ -205,6 +249,14 @@ struct  intel_engine_cs {
>  #define I915_DISPATCH_RS     0x4
>  	void		(*cleanup)(struct intel_engine_cs *ring);
>  
> +	int (*enable)(struct intel_engine_cs *ring);
> +	int (*disable)(struct intel_engine_cs *ring);
> +	int (*save)(struct intel_engine_cs *ring,
> +		    struct drm_i915_gem_request *req,
> +		    bool force_advance);
> +	int (*restore)(struct intel_engine_cs *ring,
> +		       struct drm_i915_gem_request *req);
> +
>  	/* GEN8 signal/wait table - never trust comments!
>  	 *	  signal to	signal to    signal to   signal to      signal to
>  	 *	    RCS		   VCS          BCS        VECS		 VCS2
> @@ -311,6 +363,9 @@ struct  intel_engine_cs {
>  
>  	struct intel_ring_hangcheck hangcheck;
>  
> +	/* Saved head value to be restored after reset */
> +	u32 saved_head;
> +
>  	struct {
>  		struct drm_i915_gem_object *obj;
>  		u32 gtt_offset;
> @@ -463,6 +518,15 @@ void intel_ring_update_space(struct intel_ringbuffer *ringbuf);
>  int intel_ring_space(struct intel_ringbuffer *ringbuf);
>  bool intel_ring_stopped(struct intel_engine_cs *ring);
>  
> +void intel_gpu_engine_reset_resample(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request *req);
> +int intel_ring_disable(struct intel_engine_cs *ring);
> +int intel_ring_enable(struct intel_engine_cs *ring);
> +int intel_ring_save(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request *req, bool force_advance);
> +int intel_ring_restore(struct intel_engine_cs *ring,
> +		struct drm_i915_gem_request *req);
> +
>  int __must_check intel_ring_idle(struct intel_engine_cs *ring);
>  void intel_ring_init_seqno(struct intel_engine_cs *ring, u32 seqno);
>  int intel_ring_flush_all_caches(struct drm_i915_gem_request *req);
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 2df4246..f20548c 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1623,6 +1623,153 @@ bool intel_has_gpu_reset(struct drm_device *dev)
>  	return intel_get_gpu_reset(dev) != NULL;
>  }
>  
> +static inline int wait_for_engine_reset(struct drm_i915_private *dev_priv,
> +		unsigned int grdom)
> +{

No need to inline

> +#define _CND ((__raw_i915_read32(dev_priv, GEN6_GDRST) & grdom) == 0)
> +
> +	/*
> +	 * Spin waiting for the device to ack the reset request.
> +	 * Times out after 500 us.
> +	 */
> +	return wait_for_atomic_us(_CND, 500);
> +
> +#undef _CND
> +}
> +
> +static int do_engine_reset_nolock(struct intel_engine_cs *engine)
> +{
> +	int ret = -ENODEV;
> +	struct drm_i915_private *dev_priv = engine->dev->dev_private;
> +
> +	assert_spin_locked(&dev_priv->uncore.lock);
> +
> +	switch (engine->id) {
> +	case RCS:
> +		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_RENDER);
> +		engine->hangcheck.reset_count++;
> +		ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_RENDER);
> +		break;
> +
> +	case BCS:
> +		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_BLT);
> +		engine->hangcheck.reset_count++;
> +		ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_BLT);
> +		break;
> +
> +	case VCS:
> +		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_MEDIA);
> +		engine->hangcheck.reset_count++;
> +		ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_MEDIA);
> +		break;
> +
> +	case VECS:
> +		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN6_GRDOM_VECS);
> +		engine->hangcheck.reset_count++;
> +		ret = wait_for_engine_reset(dev_priv, GEN6_GRDOM_VECS);
> +		break;
> +
> +	case VCS2:
> +		__raw_i915_write32(dev_priv, GEN6_GDRST, GEN8_GRDOM_MEDIA2);
> +		engine->hangcheck.reset_count++;
> +		ret = wait_for_engine_reset(dev_priv, GEN8_GRDOM_MEDIA2);
> +		break;
> +
> +	default:
> +		DRM_ERROR("Unexpected engine: %d\n", engine->id);
> +		break;
> +	}

  static const u32 mask[I915_NUM_RINGS] = {
    [RCS]  = GEN6_GRDOM_RENDER,
    [VCS]  = GEN6_GRDOM_MEDIA,
    [BCS]  = GEN6_GRDOM_BLT,
    [VECS] = GEN6_GRDOM_VECS,
    [VCS2] = GEN8_GRDOM_MEDIA2,
  };

  if (WARN_ON_ONCE(engine->id >= I915_NUM_RINGS))
    return -ENODEV;

  __raw_i915_write32(dev_priv, GEN6_GDRST, mask[engine->id]);
  engine->hangcheck.reset_count++;
  ret = wait_for_engine_reset(dev_priv, mask[engine->id]);


> +
> +	return ret;
> +}
> +
> +static int gen8_do_engine_reset(struct intel_engine_cs *engine)
> +{
> +	struct drm_device *dev = engine->dev;
> +	struct drm_i915_private *dev_priv = dev->dev_private;
> +	int ret = -ENODEV;
> +	unsigned long irqflags;
> +
> +	spin_lock_irqsave(&dev_priv->uncore.lock, irqflags);
> +	ret = do_engine_reset_nolock(engine);
> +	spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags);
> +
> +	if (!ret) {
> +		u32 reset_ctl = 0;
> +
> +		/*
> +		 * Confirm that reset control register back to normal
> +		 * following the reset.
> +		 */
> +		reset_ctl = I915_READ(RING_RESET_CTL(engine->mmio_base));
> +		WARN(reset_ctl & 0x3, "Reset control still active after reset! (0x%08x)\n",
> +			reset_ctl);
> +	} else {
> +		DRM_ERROR("Engine reset failed! (%d)\n", ret);
> +	}
> +
> +	return ret;
> +}
> +
> +int intel_gpu_engine_reset(struct intel_engine_cs *engine)
> +{
> +	/* Reset an individual engine */
> +	int ret = -ENODEV;
> +	struct drm_device *dev = engine->dev;
> +
> +	switch (INTEL_INFO(dev)->gen) {

You can also pass dev_priv to INTEL_INFO, and it is preferable to do so
here and in the rest of the code.

> +	case 8:
case 9: ?
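
i.e. presumably (sketch, assuming gen9 shares the gen8 reset control
mechanism):

	case 8:
	case 9:
		ret = gen8_do_engine_reset(engine);
		break;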

Thanks,
-Mika

> +		ret = gen8_do_engine_reset(engine);
> +		break;
> +	default:
> +		DRM_ERROR("Per Engine Reset not supported on Gen%d\n",
> +			  INTEL_INFO(dev)->gen);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * On gen8+ a reset request has to be issued via the reset control register
> + * before a GPU engine can be reset in order to stop the command streamer
> + * and idle the engine. This replaces the legacy way of stopping an engine
> + * by writing to the stop ring bit in the MI_MODE register.
> + */
> +int intel_request_gpu_engine_reset(struct intel_engine_cs *engine)
> +{
> +	/* Request reset for an individual engine */
> +	int ret = -ENODEV;
> +	struct drm_device *dev = engine->dev;
> +
> +	if (INTEL_INFO(dev)->gen >= 8)
> +		ret = gen8_request_engine_reset(engine);
> +	else
> +		DRM_ERROR("Reset request not supported on Gen%d\n",
> +			  INTEL_INFO(dev)->gen);
> +
> +	return ret;
> +}
> +
> +/*
> + * It is possible to back off from a previously issued reset request by simply
> + * clearing the reset request bit in the reset control register.
> + */
> +int intel_unrequest_gpu_engine_reset(struct intel_engine_cs *engine)
> +{
> +	/* Roll back reset request for an individual engine */
> +	int ret = -ENODEV;
> +	struct drm_device *dev = engine->dev;
> +
> +	if (INTEL_INFO(dev)->gen >= 8)
> +		ret = gen8_unrequest_engine_reset(engine);
> +	else
> +		DRM_ERROR("Reset unrequest not supported on Gen%d\n",
> +			  INTEL_INFO(dev)->gen);
> +
> +	return ret;
> +}
> +
>  bool intel_uncore_unclaimed_mmio(struct drm_i915_private *dev_priv)
>  {
>  	return check_for_unclaimed_mmio(dev_priv);
> -- 
> 1.9.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 07/20] drm/i915: Watchdog timeout: Hang detection integration into error handler
  2015-10-23  1:32 [PATCH 00/20] TDR/watchdog support for gen8 Tomas Elf
@ 2015-10-23  1:32 ` Tomas Elf
  0 siblings, 0 replies; 31+ messages in thread
From: Tomas Elf @ 2015-10-23  1:32 UTC (permalink / raw)
  To: Intel-GFX; +Cc: Ian Lister

This patch enables watchdog timeout hang detection as an entry point into the
driver error handler. This form of hang detection bypasses the promotion logic
normally used by the periodic hang checker and instead goes straight to the
per-engine hang recovery path.
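
For reference, a watchdog-driven caller ends up looking something like this
(sketch only; the actual IRQ handler is added in patch 08/20 of this series):

	i915_handle_error(dev, intel_ring_flag(ring), true, true,
			  "%s watchdog timeout", ring->name);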

NOTE: I don't know if Ben Widawsky had any part in this code from 3 years
ago. There have been so many people involved in this already that I am in no
position to know. If I've missed anyone's sob line please let me know.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c |  2 +-
 drivers/gpu/drm/i915/i915_drv.h     |  6 +++---
 drivers/gpu/drm/i915/i915_irq.c     | 43 ++++++++++++++++++++++---------------
 3 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 68f86cd..aa05988 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -4587,7 +4587,7 @@ i915_wedged_set(void *data, u64 val)
 
 	intel_runtime_pm_get(dev_priv);
 
-	i915_handle_error(dev, 0x0, val,
+	i915_handle_error(dev, 0x0, false, val,
 			  "Manually setting wedged to %llu", val);
 
 	intel_runtime_pm_put(dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index bbc18cc..b86d34b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2756,9 +2756,9 @@ static inline void i915_hangcheck_reinit(struct intel_engine_cs *engine)
 
 /* i915_irq.c */
 void i915_queue_hangcheck(struct drm_device *dev);
-__printf(4, 5)
-void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
-		       const char *fmt, ...);
+__printf(5, 6)
+void i915_handle_error(struct drm_device *dev, u32 engine_mask,
+		       bool watchdog, bool wedged, const char *fmt, ...);
 
 extern void intel_irq_init(struct drm_i915_private *dev_priv);
 int intel_irq_install(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index c34783a..19ab79e 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2663,6 +2663,7 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
  *			If a previous engine reset was attempted too recently
  *			or if one of the current engine resets fails we fall
  *			back to legacy full GPU reset.
+ * @watchdog: 		true = Engine hang detected by hardware watchdog.
  * @wedged: 		true = Hang detected, invoke hang recovery.
  * @fmt, ...: 		Error message describing reason for error.
  *
@@ -2674,8 +2675,8 @@ static void i915_report_and_clear_eir(struct drm_device *dev)
  * reset the associated engine. Failing that, try to fall back to legacy
  * full GPU reset recovery mode.
  */
-void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
-		       const char *fmt, ...)
+void i915_handle_error(struct drm_device *dev, u32 engine_mask,
+                       bool watchdog, bool wedged, const char *fmt, ...)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	va_list args;
@@ -2713,20 +2714,27 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
 			u32 i;
 
 			for_each_ring(engine, dev_priv, i) {
-				u32 now, last_engine_reset_timediff;
 
 				if (!(intel_ring_flag(engine) & engine_mask))
 					continue;
 
-				/* Measure the time since this engine was last reset */
-				now = get_seconds();
-				last_engine_reset_timediff =
-					now - engine->hangcheck.last_engine_reset_time;
-
-				full_reset = last_engine_reset_timediff <
-					i915.gpu_reset_promotion_time;
-
-				engine->hangcheck.last_engine_reset_time = now;
+				if (!watchdog) {
+					/* Measure the time since this engine was last reset */
+					u32 now = get_seconds();
+					u32 last_engine_reset_timediff =
+						now - engine->hangcheck.last_engine_reset_time;
+
+					full_reset = last_engine_reset_timediff <
+						i915.gpu_reset_promotion_time;
+
+					engine->hangcheck.last_engine_reset_time = now;
+				} else {
+					/*
+					 * Watchdog timeout always results
+					 * in engine reset.
+					 */
+					full_reset = false;
+				}
 
 				/*
 				 * This engine was not reset too recently - go ahead
@@ -2737,10 +2745,11 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
 				 * This can still be overridden by a global
 				 * reset e.g. if per-engine reset fails.
 				 */
-				if (!full_reset)
+				if (watchdog || !full_reset)
 					atomic_or(I915_ENGINE_RESET_IN_PROGRESS,
 						&engine->hangcheck.flags);
-				else
+
+				if (full_reset)
 					break;
 
 			} /* for_each_ring */
@@ -3079,7 +3088,7 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd)
 	 */
 	tmp = I915_READ_CTL(ring);
 	if (tmp & RING_WAIT) {
-		i915_handle_error(dev, intel_ring_flag(ring), false,
+		i915_handle_error(dev, intel_ring_flag(ring), false, false,
 				  "Kicking stuck wait on %s",
 				  ring->name);
 		I915_WRITE_CTL(ring, tmp);
@@ -3091,7 +3100,7 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd)
 		default:
 			return HANGCHECK_HUNG;
 		case 1:
-			i915_handle_error(dev, intel_ring_flag(ring), false,
+			i915_handle_error(dev, intel_ring_flag(ring), false, false,
 					  "Kicking stuck semaphore on %s",
 					  ring->name);
 			I915_WRITE_CTL(ring, tmp);
@@ -3224,7 +3233,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	}
 
 	if (engine_mask)
-		i915_handle_error(dev, engine_mask, true, "Ring hung (0x%02x)", engine_mask);
+		i915_handle_error(dev, engine_mask, false, true, "Ring hung (0x%02x)", engine_mask);
 
 	if (busy_count)
 		/* Reset timer case chip hangs without another request
-- 
1.9.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2016-01-29 14:18 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-13 17:28 [PATCH 00/20] TDR/watchdog support for gen8 Arun Siluvery
2016-01-13 17:28 ` [PATCH 01/20] drm/i915: Make i915_gem_reset_ring_status() public Arun Siluvery
2016-01-13 17:28 ` [PATCH 02/20] drm/i915: Generalise common GPU engine reset request/unrequest code Arun Siluvery
2016-01-22 11:24   ` Mika Kuoppala
2016-01-13 17:28 ` [PATCH 03/20] drm/i915: TDR / per-engine hang recovery support for gen8 Arun Siluvery
2016-01-13 21:16   ` Chris Wilson
2016-01-13 21:21   ` Chris Wilson
2016-01-29 14:16   ` Mika Kuoppala
2016-01-13 17:28 ` [PATCH 04/20] drm/i915: TDR / per-engine hang detection Arun Siluvery
2016-01-13 20:37   ` Chris Wilson
2016-01-13 17:28 ` [PATCH 05/20] drm/i915: Extending i915_gem_check_wedge to check engine reset in progress Arun Siluvery
2016-01-13 20:49   ` Chris Wilson
2016-01-13 17:28 ` [PATCH 06/20] drm/i915: Reinstate hang recovery work queue Arun Siluvery
2016-01-13 21:01   ` Chris Wilson
2016-01-13 17:28 ` [PATCH 07/20] drm/i915: Watchdog timeout: Hang detection integration into error handler Arun Siluvery
2016-01-13 21:13   ` Chris Wilson
2016-01-13 17:28 ` [PATCH 08/20] drm/i915: Watchdog timeout: IRQ handler for gen8 Arun Siluvery
2016-01-13 17:28 ` [PATCH 09/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Arun Siluvery
2016-01-13 17:28 ` [PATCH 10/20] drm/i915: Watchdog timeout: DRM kernel interface enablement Arun Siluvery
2016-01-13 17:28 ` [PATCH 11/20] drm/i915: Fake lost context event interrupts through forced CSB checking Arun Siluvery
2016-01-13 17:28 ` [PATCH 12/20] drm/i915: Debugfs interface for per-engine hang recovery Arun Siluvery
2016-01-13 17:28 ` [PATCH 13/20] drm/i915: Test infrastructure for context state inconsistency simulation Arun Siluvery
2016-01-13 17:28 ` [PATCH 14/20] drm/i915: TDR/watchdog trace points Arun Siluvery
2016-01-13 17:28 ` [PATCH 15/20] drm/i915: Port of Added scheduler support to __wait_request() calls Arun Siluvery
2016-01-13 17:28 ` [PATCH 16/20] drm/i915: Fix __i915_wait_request() behaviour during hang detection Arun Siluvery
2016-01-13 17:28 ` [PATCH 17/20] drm/i915: Extended error state with TDR count, watchdog count and engine reset count Arun Siluvery
2016-01-13 17:28 ` [PATCH 18/20] drm/i915: TDR / per-engine hang recovery kernel docs Arun Siluvery
2016-01-13 17:28 ` [PATCH 19/20] drm/i915: drm/i915 changes to simulated hangs Arun Siluvery
2016-01-13 17:28 ` [PATCH 20/20] drm/i915: Enable TDR / per-engine hang recovery Arun Siluvery
2016-01-14  8:30 ` ✗ failure: Fi.CI.BAT Patchwork
  -- strict thread matches above, loose matches on Subject: below --
2015-10-23  1:32 [PATCH 00/20] TDR/watchdog support for gen8 Tomas Elf
2015-10-23  1:32 ` [PATCH 07/20] drm/i915: Watchdog timeout: Hang detection integration into error handler Tomas Elf
