All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/9] Execlist based engine-reset
@ 2016-12-16 20:20 Michel Thierry
  2016-12-16 20:20 ` [PATCH 1/9] drm/i915: Keep i915_handle_error kerneldoc parameters together Michel Thierry
                   ` (9 more replies)
  0 siblings, 10 replies; 23+ messages in thread
From: Michel Thierry @ 2016-12-16 20:20 UTC (permalink / raw)
  To: intel-gfx

These patches are to add engine reset feature from Gen8. This is also
referred to as Timeout detection and recovery (TDR). This complements to
the full gpu reset feature available in i915 but it only allows to reset a
particular engine instead of all engines thus providing a light weight
engine reset and recovery mechanism.

This implementation is for execlist based submission only hence limited
from Gen8 onwards. For GuC based submission, additional changes can be
added later on.

Timeout detection relies on the existing hangcheck which remains the same,
main changes are to the recovery mechanism. Once we detect a hang on a
particular engine we identify the request that caused the hang, skip the
request and adjust head pointers to allow the execution to proceed
normally. After some cleanup, submissions are restarted to process
remaining work queued to that engine.

If engine reset fails to recover engine correctly then we fallback to full
gpu reset.

v2: ELSP queue request tracking and reset path changes to handle incomplete
requests during reset. Thanks to Chris Wilson for providing these patches.

v3: Let the waiter keep handling the full gpu reset if it already has the
lock; point out that GuC submission needs a different method to restart
workloads after the engine reset completes.

Arun Siluvery (6):
  drm/i915: Update i915.reset to handle engine resets
  drm/i915/tdr: Modify error handler for per engine hang recovery
  drm/i915/tdr: Add support for per engine reset recovery
  drm/i915/tdr: Add engine reset count to error state
  drm/i915/tdr: Export per-engine reset count info to debugfs
  drm/i915/tdr: Enable Engine reset and recovery support

Michel Thierry (2):
  drm/i915: Keep i915_handle_error kerneldoc parameters together
  drm/i915: Add engine reset count in get-reset-stats ioctl

Mika Kuoppala (1):
  drm/i915: Skip reset request if there is one already

 drivers/gpu/drm/i915/i915_debugfs.c     | 18 +++++++
 drivers/gpu/drm/i915/i915_drv.c         | 74 +++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_drv.h         | 15 ++++++
 drivers/gpu/drm/i915/i915_gem.c         |  2 +-
 drivers/gpu/drm/i915/i915_gem_context.c | 14 +++--
 drivers/gpu/drm/i915/i915_gpu_error.c   |  3 ++
 drivers/gpu/drm/i915/i915_irq.c         | 91 ++++++++++++++++++++++++---------
 drivers/gpu/drm/i915/i915_params.c      |  6 +--
 drivers/gpu/drm/i915/i915_params.h      |  2 +-
 drivers/gpu/drm/i915/i915_pci.c         |  5 +-
 drivers/gpu/drm/i915/intel_lrc.c        | 12 +++++
 drivers/gpu/drm/i915/intel_lrc.h        |  1 +
 drivers/gpu/drm/i915/intel_uncore.c     | 61 +++++++++++++++++++---
 include/uapi/drm/i915_drm.h             |  3 +-
 14 files changed, 266 insertions(+), 41 deletions(-)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/9] drm/i915: Keep i915_handle_error kerneldoc parameters together
  2016-12-16 20:20 [PATCH 0/9] Execlist based engine-reset Michel Thierry
@ 2016-12-16 20:20 ` Michel Thierry
  2016-12-16 21:40   ` Chris Wilson
  2016-12-19  8:22   ` Tvrtko Ursulin
  2016-12-16 20:20 ` [PATCH 2/9] drm/i915: Update i915.reset to handle engine resets Michel Thierry
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 23+ messages in thread
From: Michel Thierry @ 2016-12-16 20:20 UTC (permalink / raw)
  To: intel-gfx

And before the function description.
Tidy up from commit 14bb2c11796d70b ("drm/i915: Fix a buch of kerneldoc
warnings"), all others kerneldoc blocks look ok.

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index a0e70f5b3aad..77caeeb5ee55 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2703,12 +2703,13 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
  * i915_handle_error - handle a gpu error
  * @dev_priv: i915 device private
  * @engine_mask: mask representing engines that are hung
+ * @fmt: Error message format string
+ *
  * Do some basic checking of register state at error time and
  * dump it to the syslog.  Also call i915_capture_error_state() to make
  * sure we get a record and make it available in debugfs.  Fire a uevent
  * so userspace knows something bad happened (should trigger collection
  * of a ring dump etc.).
- * @fmt: Error message format string
  */
 void i915_handle_error(struct drm_i915_private *dev_priv,
 		       u32 engine_mask,
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/9] drm/i915: Update i915.reset to handle engine resets
  2016-12-16 20:20 [PATCH 0/9] Execlist based engine-reset Michel Thierry
  2016-12-16 20:20 ` [PATCH 1/9] drm/i915: Keep i915_handle_error kerneldoc parameters together Michel Thierry
@ 2016-12-16 20:20 ` Michel Thierry
  2016-12-16 20:20 ` [PATCH 3/9] drm/i915/tdr: Modify error handler for per engine hang recovery Michel Thierry
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Michel Thierry @ 2016-12-16 20:20 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

In preparation for engine reset work update this parameter to handle more
than one type of reset. Default at the moment is still full gpu reset.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 6 +++---
 drivers/gpu/drm/i915/i915_params.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 0e280fbd52f1..c858c4d50491 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -46,7 +46,7 @@ struct i915_params i915 __read_mostly = {
 	.prefault_disable = 0,
 	.load_detect_test = 0,
 	.force_reset_modeset_test = 0,
-	.reset = true,
+	.reset = 1,
 	.error_capture = true,
 	.invert_brightness = 0,
 	.disable_display = 0,
@@ -113,8 +113,8 @@ MODULE_PARM_DESC(vbt_sdvo_panel_type,
 	"Override/Ignore selection of SDVO panel mode in the VBT "
 	"(-2=ignore, -1=auto [default], index in VBT BIOS table)");
 
-module_param_named_unsafe(reset, i915.reset, bool, 0600);
-MODULE_PARM_DESC(reset, "Attempt GPU resets (default: true)");
+module_param_named_unsafe(reset, i915.reset, int, 0600);
+MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset [default], 2=engine reset)");
 
 #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
 module_param_named(error_capture, i915.error_capture, bool, 0600);
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index 8e433de04679..da569e20bbec 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -34,6 +34,7 @@ struct i915_params {
 	int lvds_channel_mode;
 	int panel_use_ssc;
 	int vbt_sdvo_panel_type;
+	int reset;
 	int enable_rc6;
 	int enable_dc;
 	int enable_fbc;
@@ -58,7 +59,6 @@ struct i915_params {
 	bool prefault_disable;
 	bool load_detect_test;
 	bool force_reset_modeset_test;
-	bool reset;
 	bool error_capture;
 	bool disable_display;
 	bool verbose_state_checks;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/9] drm/i915/tdr: Modify error handler for per engine hang recovery
  2016-12-16 20:20 [PATCH 0/9] Execlist based engine-reset Michel Thierry
  2016-12-16 20:20 ` [PATCH 1/9] drm/i915: Keep i915_handle_error kerneldoc parameters together Michel Thierry
  2016-12-16 20:20 ` [PATCH 2/9] drm/i915: Update i915.reset to handle engine resets Michel Thierry
@ 2016-12-16 20:20 ` Michel Thierry
  2016-12-16 20:20 ` [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery Michel Thierry
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Michel Thierry @ 2016-12-16 20:20 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

This is a preparatory patch which modifies error handler to do per engine
hang recovery. The actual patch which implements this sequence follows
later in the series. The aim is to prepare existing recovery function to
adapt to this new function where applicable (which fails at this point
because core implementation is lacking) and continue recovery using legacy
full gpu reset.

A helper function is also added to query the availability of engine
reset.

The error events behaviour that are used to notify user of reset are
adapted to engine reset such that it doesn't break users listening to these
events. In legacy we report an error event, a reset event before resetting
the gpu and a reset done event marking the completion of reset. The same
behaviour is adapted but reset event is only dispatched once even when
multiple engines are hung. Finally once reset is complete we send reset
done event as usual.

Note that this implementation of engine reset is for i915 directly
submitting to the ELSP, where the driver manages the hang detection,
recovery and resubmission. With GuC submission these tasks are shared
between driver and firmware; i915 will still responsible for detecting a
hang, and when it does it will have to request GuC to reset that Engine and
remind the firmware about the outstanding submissions.

v2: rebase, advertise engine reset availability in platform definition,
add note about GuC submission.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c     | 21 +++++++++
 drivers/gpu/drm/i915/i915_drv.h     |  3 ++
 drivers/gpu/drm/i915/i915_irq.c     | 88 +++++++++++++++++++++++++++----------
 drivers/gpu/drm/i915/i915_pci.c     |  5 ++-
 drivers/gpu/drm/i915/intel_uncore.c | 11 +++++
 5 files changed, 103 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 6428588518aa..e5688edd62cd 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1824,6 +1824,27 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	goto wakeup;
 }
 
+/**
+ * i915_reset_engine - reset GPU engine to recover from a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected.
+ * Returns zero on successful reset or otherwise an error code.
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+	int ret;
+	struct drm_i915_private *dev_priv = engine->i915;
+
+	/* FIXME: replace me with engine reset sequence */
+	ret = -ENODEV;
+
+	/* use full gpu reset to recover on error */
+	set_bit(I915_RESET_IN_PROGRESS, &dev_priv->gpu_error.flags);
+
+	return ret;
+}
+
 static int i915_pm_suspend(struct device *kdev)
 {
 	struct pci_dev *pdev = to_pci_dev(kdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index fb3c88144aa8..6b4e8ee19905 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -781,6 +781,7 @@ struct intel_csr {
 	func(has_ddi); \
 	func(has_decoupled_mmio); \
 	func(has_dp_mst); \
+	func(has_engine_reset); \
 	func(has_fbc); \
 	func(has_fpga_dbg); \
 	func(has_full_ppgtt); \
@@ -3007,6 +3008,8 @@ extern void i915_driver_unload(struct drm_device *dev);
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
+extern bool intel_has_engine_reset(struct drm_i915_private *dev_priv);
+extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 77caeeb5ee55..da619810f59e 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2594,7 +2594,8 @@ static void i915_error_wake_up(struct drm_i915_private *dev_priv)
  * Fire an error uevent so userspace can see that a hang or error
  * was detected.
  */
-static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
+static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv,
+				  u32 engine_mask)
 {
 	struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj;
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
@@ -2603,7 +2604,15 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 
 	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
 
-	DRM_DEBUG_DRIVER("resetting chip\n");
+	/*
+	 * This event needs to be sent before performing gpu reset. When
+	 * engine resets are supported we iterate through all engines and
+	 * reset hung engines individually. To keep the event dispatch
+	 * mechanism consistent with full gpu reset, this is only sent once
+	 * even when multiple engines are hung. It is also safe to move this
+	 * here because when we are in this function, we will definitely
+	 * perform gpu reset.
+	 */
 	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
 
 	/*
@@ -2614,30 +2623,55 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 	 * simulated reset via debugs, so get an RPM reference.
 	 */
 	intel_runtime_pm_get(dev_priv);
-	intel_prepare_reset(dev_priv);
 
-	do {
-		/*
-		 * All state reset _must_ be completed before we update the
-		 * reset counter, for otherwise waiters might miss the reset
-		 * pending state and not properly drop locks, resulting in
-		 * deadlocks with the reset work.
-		 */
-		if (mutex_trylock(&dev_priv->drm.struct_mutex)) {
-			i915_reset(dev_priv);
-			mutex_unlock(&dev_priv->drm.struct_mutex);
+	/* If hardware supports it (GEN8+), try engine reset first */
+	if (intel_has_engine_reset(dev_priv)) {
+		struct intel_engine_cs *engine;
+		unsigned int tmp, ret;
+
+		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
+			ret = i915_reset_engine(engine);
+			/* on failure fallback to full gpu reset for recovery */
+			if (ret)
+				break;
 		}
+	}
 
-		/* We need to wait for anyone holding the lock to wakeup */
-	} while (wait_on_bit_timeout(&dev_priv->gpu_error.flags,
-				     I915_RESET_IN_PROGRESS,
-				     TASK_UNINTERRUPTIBLE,
-				     HZ));
+	/*
+	 * If the waiter already held the struct_mutex lock, it may have already
+	 * triggered the GPU reset and the reset_in_progress can be false.
+	 */
+	if (i915_reset_in_progress(&dev_priv->gpu_error)) {
+		DRM_DEBUG_DRIVER("resetting chip\n");
+		intel_prepare_reset(dev_priv);
+
+		do {
+			/*
+			 * All state reset _must_ be completed before we update
+			 * the reset counter, for otherwise waiters might miss
+			 * the reset pending state and not properly drop locks,
+			 * resulting in deadlocks with the reset work.
+			 */
+			if (mutex_trylock(&dev_priv->drm.struct_mutex)) {
+				i915_reset(dev_priv);
+				mutex_unlock(&dev_priv->drm.struct_mutex);
+			}
+
+			/*
+			 * We need to wait for anyone holding the lock to
+			 * wakeup.
+			 */
+		} while (wait_on_bit_timeout(&dev_priv->gpu_error.flags,
+					     I915_RESET_IN_PROGRESS,
+					     TASK_UNINTERRUPTIBLE,
+					     HZ));
+
+		intel_finish_reset(dev_priv);
+	}
 
-	intel_finish_reset(dev_priv);
 	intel_runtime_pm_put(dev_priv);
 
-	if (!test_bit(I915_WEDGED, &dev_priv->gpu_error.flags))
+	if (!i915_terminally_wedged(&dev_priv->gpu_error))
 		kobject_uevent_env(kobj,
 				   KOBJ_CHANGE, reset_done_event);
 
@@ -2728,9 +2762,15 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	if (!engine_mask)
 		return;
 
-	if (test_and_set_bit(I915_RESET_IN_PROGRESS,
-			     &dev_priv->gpu_error.flags))
-		return;
+	/*
+	 * Engine reset support is only available from Gen8 onwards so if
+	 * it is not available or explicity disabled, use full gpu reset.
+	 */
+	if (!intel_has_engine_reset(dev_priv)) {
+		if (test_and_set_bit(I915_RESET_IN_PROGRESS,
+				     &dev_priv->gpu_error.flags))
+			return;
+	}
 
 	/*
 	 * Wakeup waiting processes so that the reset function
@@ -2746,7 +2786,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	 */
 	i915_error_wake_up(dev_priv);
 
-	i915_reset_and_wakeup(dev_priv);
+	i915_reset_and_wakeup(dev_priv, engine_mask);
 }
 
 /* Called from drm generic code, passed 'crtc' which
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 37244fad9b03..6e9716ae4c21 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -308,7 +308,8 @@ static const struct intel_device_info intel_haswell_info = {
 	BDW_COLORS, \
 	.has_logical_ring_contexts = 1, \
 	.has_full_48bit_ppgtt = 1, \
-	.has_64bit_reloc = 1
+	.has_64bit_reloc = 1, \
+	.has_engine_reset = 1
 
 static const struct intel_device_info intel_broadwell_info = {
 	BDW_FEATURES,
@@ -339,6 +340,7 @@ static const struct intel_device_info intel_cherryview_info = {
 	.has_gmch_display = 1,
 	.has_aliasing_ppgtt = 1,
 	.has_full_ppgtt = 1,
+	.has_engine_reset = 1,
 	.display_mmio_offset = VLV_DISPLAY_BASE,
 	GEN_CHV_PIPEOFFSETS,
 	CURSOR_OFFSETS,
@@ -390,6 +392,7 @@ static const struct intel_device_info intel_skylake_gt3_info = {
 	.has_aliasing_ppgtt = 1, \
 	.has_full_ppgtt = 1, \
 	.has_full_48bit_ppgtt = 1, \
+	.has_engine_reset = 1, \
 	GEN_DEFAULT_PIPEOFFSETS, \
 	IVB_CURSOR_OFFSETS, \
 	BDW_COLORS
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 8fc5f29e79a8..97ce324570ad 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1851,6 +1851,17 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
+/*
+ * When GuC submission is enabled, GuC manages ELSP and can initiate the
+ * engine reset too. For now, fall back to full GPU reset if it is enabled.
+ */
+bool intel_has_engine_reset(struct drm_i915_private *dev_priv)
+{
+	return (dev_priv->info.has_engine_reset &&
+		!dev_priv->guc.execbuf_client &&
+		i915.reset == 2);
+}
+
 int intel_guc_reset(struct drm_i915_private *dev_priv)
 {
 	int ret;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery
  2016-12-16 20:20 [PATCH 0/9] Execlist based engine-reset Michel Thierry
                   ` (2 preceding siblings ...)
  2016-12-16 20:20 ` [PATCH 3/9] drm/i915/tdr: Modify error handler for per engine hang recovery Michel Thierry
@ 2016-12-16 20:20 ` Michel Thierry
  2016-12-16 20:45   ` Chris Wilson
  2016-12-16 20:20 ` [PATCH 5/9] drm/i915: Skip reset request if there is one already Michel Thierry
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Michel Thierry @ 2016-12-16 20:20 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

This change implements support for per-engine reset as an initial, less
intrusive hang recovery option to be attempted before falling back to the
legacy full GPU reset recovery mode if necessary. This is only supported
from Gen8 onwards.

Hangchecker determines which engines are hung and invokes error handler to
recover from it. Error handler schedules recovery for each of those engines
that are hung. The recovery procedure is as follows,
 - identifies the request that caused the hang and it is dropped
 - force engine to idle: this is done by issuing a reset request
 - reset and re-init engine
 - restart submissions to the engine

If engine reset fails then we fall back to heavy weight full gpu reset
which resets all engines and reinitiazes complete state of HW and SW.

v2: Rebase.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c     | 56 +++++++++++++++++++++++++++++++++++--
 drivers/gpu/drm/i915/i915_drv.h     |  3 ++
 drivers/gpu/drm/i915/i915_gem.c     |  2 +-
 drivers/gpu/drm/i915/intel_lrc.c    | 12 ++++++++
 drivers/gpu/drm/i915/intel_lrc.h    |  1 +
 drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
 6 files changed, 108 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index e5688edd62cd..a034793bc246 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
  *
  * Reset a specific GPU engine. Useful if a hang is detected.
  * Returns zero on successful reset or otherwise an error code.
+ *
+ * Procedure is fairly simple:
+ *  - identifies the request that caused the hang and it is dropped
+ *  - force engine to idle: this is done by issuing a reset request
+ *  - reset engine
+ *  - restart submissions to the engine
  */
 int i915_reset_engine(struct intel_engine_cs *engine)
 {
 	int ret;
 	struct drm_i915_private *dev_priv = engine->i915;
 
-	/* FIXME: replace me with engine reset sequence */
-	ret = -ENODEV;
+	/*
+	 * We need to first idle the engine by issuing a reset request,
+	 * then perform soft reset and re-initialize hw state, for all of
+	 * this GT power need to be awake so ensure it does throughout the
+	 * process
+	 */
+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+	/*
+	 * the request that caused the hang is stuck on elsp, identify the
+	 * active request and drop it, adjust head to skip the offending
+	 * request to resume executing remaining requests in the queue.
+	 */
+	i915_gem_reset_engine(engine);
+
+	ret = intel_engine_reset_begin(engine);
+	if (ret) {
+		DRM_ERROR("Failed to disable %s\n", engine->name);
+		goto error;
+	}
+
+	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+	if (ret) {
+		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
+		intel_engine_reset_cancel(engine);
+		goto error;
+	}
+
+	ret = engine->init_hw(engine);
+	if (ret)
+		goto error;
 
+	intel_engine_reset_cancel(engine);
+	intel_execlists_restart_submission(engine);
+
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+	return 0;
+
+error:
 	/* use full gpu reset to recover on error */
 	set_bit(I915_RESET_IN_PROGRESS, &dev_priv->gpu_error.flags);
 
+	/* Engine reset is performed without taking struct_mutex, since it
+	 * failed we now fallback to full gpu reset. Wakeup any waiters
+	 * which should now see the reset_in_progress and release
+	 * struct_mutex for us to continue recovery.
+	 */
+	rcu_read_lock();
+	intel_engine_wakeup(engine);
+	rcu_read_unlock();
+
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
 	return ret;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 6b4e8ee19905..a97cc8f50ade 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3009,6 +3009,8 @@ extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
 extern bool intel_has_engine_reset(struct drm_i915_private *dev_priv);
+extern int intel_engine_reset_begin(struct intel_engine_cs *engine);
+extern int intel_engine_reset_cancel(struct intel_engine_cs *engine);
 extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
@@ -3397,6 +3399,7 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 
 void i915_gem_reset(struct drm_i915_private *dev_priv);
 void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
+void i915_gem_reset_engine(struct intel_engine_cs *engine);
 void i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
 int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
 int __must_check i915_gem_init_hw(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index f86a71d9fe37..d2e9d30bf755 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2718,7 +2718,7 @@ static void reset_request(struct drm_i915_gem_request *request)
 	memset(vaddr + head, 0, request->postfix - head);
 }
 
-static void i915_gem_reset_engine(struct intel_engine_cs *engine)
+void i915_gem_reset_engine(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *request;
 	struct i915_gem_context *incomplete_ctx;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index bc6aec619a9d..266ad7703862 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -676,6 +676,18 @@ static void execlists_submit_request(struct drm_i915_gem_request *request)
 	spin_unlock_irqrestore(&engine->timeline->lock, flags);
 }
 
+void intel_execlists_restart_submission(struct intel_engine_cs *engine)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&engine->timeline->lock, flags);
+
+	if (execlists_elsp_idle(engine))
+		tasklet_hi_schedule(&engine->irq_tasklet);
+
+	spin_unlock_irqrestore(&engine->timeline->lock, flags);
+}
+
 static struct intel_engine_cs *
 pt_lock_engine(struct i915_priotree *pt, struct intel_engine_cs *locked)
 {
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 8df97a97f18d..da1a169e5bb6 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -96,6 +96,7 @@ int intel_lr_rcs_context_setup_trtt(struct i915_gem_context *ctx);
 int intel_sanitize_enable_execlists(struct drm_i915_private *dev_priv,
 				    int enable_execlists);
 void intel_execlists_enable_submission(struct drm_i915_private *dev_priv);
+void intel_execlists_restart_submission(struct intel_engine_cs *engine);
 bool intel_execlists_idle(struct drm_i915_private *dev_priv);
 
 #endif /* _INTEL_LRC_H_ */
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 97ce324570ad..9105c166a6d5 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1758,7 +1758,7 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
 	return ret;
 }
 
-static int gen8_request_engine_reset(struct intel_engine_cs *engine)
+static int gen8_engine_reset_begin(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 	int ret;
@@ -1777,7 +1777,7 @@ static int gen8_request_engine_reset(struct intel_engine_cs *engine)
 	return ret;
 }
 
-static void gen8_unrequest_engine_reset(struct intel_engine_cs *engine)
+static void gen8_engine_reset_cancel(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 
@@ -1792,14 +1792,14 @@ static int gen8_reset_engines(struct drm_i915_private *dev_priv,
 	unsigned int tmp;
 
 	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
-		if (gen8_request_engine_reset(engine))
+		if (gen8_engine_reset_begin(engine))
 			goto not_ready;
 
 	return gen6_reset_engines(dev_priv, engine_mask);
 
 not_ready:
 	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
-		gen8_unrequest_engine_reset(engine);
+		gen8_engine_reset_cancel(engine);
 
 	return -EIO;
 }
@@ -1881,6 +1881,39 @@ int intel_guc_reset(struct drm_i915_private *dev_priv)
 	return ret;
 }
 
+/*
+ * On gen8+ a reset request has to be issued via the reset control register
+ * before a GPU engine can be reset in order to stop the command streamer
+ * and idle the engine. This replaces the legacy way of stopping an engine
+ * by writing to the stop ring bit in the MI_MODE register.
+ */
+int intel_engine_reset_begin(struct intel_engine_cs *engine)
+{
+	if (!intel_has_engine_reset(engine->i915)) {
+		DRM_ERROR("Engine Reset not supported on Gen%d\n",
+			  INTEL_INFO(engine->i915)->gen);
+		return -EINVAL;
+	}
+
+	return gen8_engine_reset_begin(engine);
+}
+
+/*
+ * It is possible to back off from a previously issued reset request by simply
+ * clearing the reset request bit in the reset control register.
+ */
+int intel_engine_reset_cancel(struct intel_engine_cs *engine)
+{
+	if (!intel_has_engine_reset(engine->i915)) {
+		DRM_ERROR("Request to clear reset not supported on Gen%d\n",
+			  INTEL_INFO(engine->i915)->gen);
+		return -EINVAL;
+	}
+
+	gen8_engine_reset_cancel(engine);
+	return 0;
+}
+
 bool intel_uncore_unclaimed_mmio(struct drm_i915_private *dev_priv)
 {
 	return check_for_unclaimed_mmio(dev_priv);
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 5/9] drm/i915: Skip reset request if there is one already
  2016-12-16 20:20 [PATCH 0/9] Execlist based engine-reset Michel Thierry
                   ` (3 preceding siblings ...)
  2016-12-16 20:20 ` [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery Michel Thierry
@ 2016-12-16 20:20 ` Michel Thierry
  2016-12-16 20:20 ` [PATCH 6/9] drm/i915/tdr: Add engine reset count to error state Michel Thierry
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Michel Thierry @ 2016-12-16 20:20 UTC (permalink / raw)
  To: intel-gfx

From: Mika Kuoppala <mika.kuoppala@linux.intel.com>

To perform engine reset we first disable engine to capture its state. This
is done by issuing a reset request. Because we are reusing existing
infrastructure, again when we actually reset an engine, reset function
checks engine mask and issues reset request again which is unnecessary. To
avoid this we check if the engine is already prepared, if so we just exit
from that point.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 9105c166a6d5..cd3102d841a4 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1761,10 +1761,15 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
 static int gen8_engine_reset_begin(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
+	const i915_reg_t reset_ctrl = RING_RESET_CTL(engine->mmio_base);
+	const u32 ready = RESET_CTL_REQUEST_RESET | RESET_CTL_READY_TO_RESET;
 	int ret;
 
-	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
-		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
+	/* If engine has been already prepared, we can shortcut here */
+	if ((I915_READ_FW(reset_ctrl) & ready) == ready)
+		return 0;
+
+	I915_WRITE_FW(reset_ctrl, _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
 
 	ret = intel_wait_for_register_fw(dev_priv,
 					 RING_RESET_CTL(engine->mmio_base),
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 6/9] drm/i915/tdr: Add engine reset count to error state
  2016-12-16 20:20 [PATCH 0/9] Execlist based engine-reset Michel Thierry
                   ` (4 preceding siblings ...)
  2016-12-16 20:20 ` [PATCH 5/9] drm/i915: Skip reset request if there is one already Michel Thierry
@ 2016-12-16 20:20 ` Michel Thierry
  2016-12-16 20:37   ` Chris Wilson
  2016-12-19  8:27   ` Tvrtko Ursulin
  2016-12-16 20:20 ` [PATCH 7/9] drm/i915/tdr: Export per-engine reset count info to debugfs Michel Thierry
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 23+ messages in thread
From: Michel Thierry @ 2016-12-16 20:20 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

Driver maintains count of how many times a given engine is reset, useful to
capture this in error state also. It gives an idea of how engine is coping
up with the workloads it is executing before this error state.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c       | 1 +
 drivers/gpu/drm/i915/i915_drv.h       | 9 +++++++++
 drivers/gpu/drm/i915/i915_gpu_error.c | 3 +++
 3 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index a034793bc246..1c706a082d60 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1877,6 +1877,7 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	intel_engine_reset_cancel(engine);
 	intel_execlists_restart_submission(engine);
 
+	dev_priv->gpu_error.engine_reset_count[engine->id]++;
 	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
 	return 0;
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a97cc8f50ade..25183762ed94 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -937,6 +937,7 @@ struct drm_i915_error_state {
 		enum intel_engine_hangcheck_action hangcheck_action;
 		struct i915_address_space *vm;
 		int num_requests;
+		u32 reset_count;
 
 		/* position of active request inside the ring */
 		u32 rq_head, rq_post, rq_tail;
@@ -1629,6 +1630,8 @@ struct i915_gpu_error {
 #define I915_RESET_IN_PROGRESS	0
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
+	unsigned long engine_reset_count[I915_NUM_ENGINES];
+
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
 	 * to release the struct_mutex for the reset to procede.
@@ -3397,6 +3400,12 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
+static inline u32 i915_engine_reset_count(struct i915_gpu_error *error,
+					  struct intel_engine_cs *engine)
+{
+	return READ_ONCE(error->engine_reset_count[engine->id]);
+}
+
 void i915_gem_reset(struct drm_i915_private *dev_priv);
 void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
 void i915_gem_reset_engine(struct intel_engine_cs *engine);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index e16037d1b0ba..f168ad873521 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -453,6 +453,7 @@ static void error_print_engine(struct drm_i915_error_state_buf *m,
 	err_printf(m, "  hangcheck action timestamp: %lu, %u ms ago\n",
 		   ee->hangcheck_timestamp,
 		   jiffies_to_msecs(jiffies - ee->hangcheck_timestamp));
+	err_printf(m, "  engine reset count: %u\n", ee->reset_count);
 
 	error_print_request(m, "  ELSP[0]: ", &ee->execlist[0]);
 	error_print_request(m, "  ELSP[1]: ", &ee->execlist[1]);
@@ -1170,6 +1171,8 @@ static void error_record_engine_registers(struct drm_i915_error_state *error,
 	ee->hangcheck_timestamp = engine->hangcheck.action_timestamp;
 	ee->hangcheck_action = engine->hangcheck.action;
 	ee->hangcheck_stalled = engine->hangcheck.stalled;
+	ee->reset_count = i915_engine_reset_count(&dev_priv->gpu_error,
+						  engine);
 
 	if (USES_PPGTT(dev_priv)) {
 		int i;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 7/9] drm/i915/tdr: Export per-engine reset count info to debugfs
  2016-12-16 20:20 [PATCH 0/9] Execlist based engine-reset Michel Thierry
                   ` (5 preceding siblings ...)
  2016-12-16 20:20 ` [PATCH 6/9] drm/i915/tdr: Add engine reset count to error state Michel Thierry
@ 2016-12-16 20:20 ` Michel Thierry
  2016-12-16 20:20 ` [PATCH 8/9] drm/i915/tdr: Enable Engine reset and recovery support Michel Thierry
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Michel Thierry @ 2016-12-16 20:20 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

A new variable is added to export the reset counts to debugfs, this
includes full gpu reset and engine reset count. This is useful for tests
where they are expected to trigger reset; these counts are checked before
and after the test to ensure the same.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 54e196d9d83e..b2a3c31ba95c 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1370,6 +1370,23 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 	return 0;
 }
 
+static int i915_reset_info(struct seq_file *m, void *unused)
+{
+	struct drm_i915_private *dev_priv = node_to_i915(m->private);
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	seq_printf(m, "full gpu reset = %u\n", i915_reset_count(error));
+
+	for_each_engine(engine, dev_priv, id) {
+		seq_printf(m, "%s = %u\n", engine->name,
+			   i915_engine_reset_count(error, engine));
+	}
+
+	return 0;
+}
+
 static int ironlake_drpc_info(struct seq_file *m)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
@@ -5408,6 +5425,7 @@ static const struct drm_info_list i915_debugfs_list[] = {
 	{"i915_guc_log_dump", i915_guc_log_dump, 0},
 	{"i915_frequency_info", i915_frequency_info, 0},
 	{"i915_hangcheck_info", i915_hangcheck_info, 0},
+	{"i915_reset_info", i915_reset_info, 0},
 	{"i915_drpc_info", i915_drpc_info, 0},
 	{"i915_emon_status", i915_emon_status, 0},
 	{"i915_ring_freq_table", i915_ring_freq_table, 0},
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 8/9] drm/i915/tdr: Enable Engine reset and recovery support
  2016-12-16 20:20 [PATCH 0/9] Execlist based engine-reset Michel Thierry
                   ` (6 preceding siblings ...)
  2016-12-16 20:20 ` [PATCH 7/9] drm/i915/tdr: Export per-engine reset count info to debugfs Michel Thierry
@ 2016-12-16 20:20 ` Michel Thierry
  2016-12-16 20:20 ` [RFC 9/9] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
  2016-12-16 20:45 ` ✗ Fi.CI.BAT: warning for Execlist based engine-reset Patchwork
  9 siblings, 0 replies; 23+ messages in thread
From: Michel Thierry @ 2016-12-16 20:20 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

This feature is made available only from Gen8, for previous gen devices
driver uses legacy full gpu reset.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index c858c4d50491..fe1cc54e82e3 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -46,7 +46,7 @@ struct i915_params i915 __read_mostly = {
 	.prefault_disable = 0,
 	.load_detect_test = 0,
 	.force_reset_modeset_test = 0,
-	.reset = 1,
+	.reset = 2,
 	.error_capture = true,
 	.invert_brightness = 0,
 	.disable_display = 0,
@@ -114,7 +114,7 @@ MODULE_PARM_DESC(vbt_sdvo_panel_type,
 	"(-2=ignore, -1=auto [default], index in VBT BIOS table)");
 
 module_param_named_unsafe(reset, i915.reset, int, 0600);
-MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset [default], 2=engine reset)");
+MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset, 2=engine reset [default])");
 
 #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
 module_param_named(error_capture, i915.error_capture, bool, 0600);
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC 9/9] drm/i915: Add engine reset count in get-reset-stats ioctl
  2016-12-16 20:20 [PATCH 0/9] Execlist based engine-reset Michel Thierry
                   ` (7 preceding siblings ...)
  2016-12-16 20:20 ` [PATCH 8/9] drm/i915/tdr: Enable Engine reset and recovery support Michel Thierry
@ 2016-12-16 20:20 ` Michel Thierry
  2016-12-16 20:45 ` ✗ Fi.CI.BAT: warning for Execlist based engine-reset Patchwork
  9 siblings, 0 replies; 23+ messages in thread
From: Michel Thierry @ 2016-12-16 20:20 UTC (permalink / raw)
  To: intel-gfx; +Cc: mesa-dev

Users/tests relying on the total reset count will start seeing a smaller
number since most of the hangs can be handled by engine reset.
Note that if reset engine x, context a running on engine y will be unaware
and unaffected.

To start the discussion, include just a total engine reset count. If it
is deemed useful, it can be extended to report each engine separately.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: mesa-dev@lists.freedesktop.org
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 14 +++++++++++---
 include/uapi/drm/i915_drm.h             |  3 ++-
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 1822d338c7aa..7f89ff21e2e5 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -1339,9 +1339,11 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_reset_stats *args = data;
 	struct i915_gem_context *ctx;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
 	int ret;
 
-	if (args->flags || args->pad)
+	if (args->flags)
 		return -EINVAL;
 
 	if (args->ctx_id == DEFAULT_CONTEXT_HANDLE && !capable(CAP_SYS_ADMIN))
@@ -1357,10 +1359,16 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 		return PTR_ERR(ctx);
 	}
 
-	if (capable(CAP_SYS_ADMIN))
+	if (capable(CAP_SYS_ADMIN)) {
 		args->reset_count = i915_reset_count(&dev_priv->gpu_error);
-	else
+		for_each_engine(engine, dev_priv, id)
+			args->engine_reset_count +=
+				i915_engine_reset_count(&dev_priv->gpu_error,
+							engine);
+	} else {
 		args->reset_count = 0;
+		args->engine_reset_count = 0;
+	}
 
 	args->batch_active = ctx->guilty_count;
 	args->batch_pending = ctx->active_count;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 1110e628c239..172ce188a8cb 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1201,7 +1201,8 @@ struct drm_i915_reset_stats {
 	/* Number of batches lost pending for execution, for this context */
 	__u32 batch_pending;
 
-	__u32 pad;
+	/* Number of engine resets since boot/module reload, for all contexts */
+	__u32 engine_reset_count;
 };
 
 struct drm_i915_gem_userptr {
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/9] drm/i915/tdr: Add engine reset count to error state
  2016-12-16 20:20 ` [PATCH 6/9] drm/i915/tdr: Add engine reset count to error state Michel Thierry
@ 2016-12-16 20:37   ` Chris Wilson
  2016-12-16 21:19     ` Michel Thierry
  2016-12-19  8:27   ` Tvrtko Ursulin
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Wilson @ 2016-12-16 20:37 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Fri, Dec 16, 2016 at 12:20:07PM -0800, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@linux.intel.com>
> 
> Driver maintains count of how many times a given engine is reset, useful to
> capture this in error state also. It gives an idea of how engine is coping
> up with the workloads it is executing before this error state.

Also i915_engine_info().
 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c       | 1 +
>  drivers/gpu/drm/i915/i915_drv.h       | 9 +++++++++
>  drivers/gpu/drm/i915/i915_gpu_error.c | 3 +++
>  3 files changed, 13 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index a034793bc246..1c706a082d60 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1877,6 +1877,7 @@ int i915_reset_engine(struct intel_engine_cs *engine)
>  	intel_engine_reset_cancel(engine);
>  	intel_execlists_restart_submission(engine);
>  
> +	dev_priv->gpu_error.engine_reset_count[engine->id]++;
>  	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>  	return 0;
>  
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index a97cc8f50ade..25183762ed94 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -937,6 +937,7 @@ struct drm_i915_error_state {
>  		enum intel_engine_hangcheck_action hangcheck_action;
>  		struct i915_address_space *vm;
>  		int num_requests;
> +		u32 reset_count;
>  
>  		/* position of active request inside the ring */
>  		u32 rq_head, rq_post, rq_tail;
> @@ -1629,6 +1630,8 @@ struct i915_gpu_error {
>  #define I915_RESET_IN_PROGRESS	0
>  #define I915_WEDGED		(BITS_PER_LONG - 1)
>  
> +	unsigned long engine_reset_count[I915_NUM_ENGINES];
> +
>  	/**
>  	 * Waitqueue to signal when a hang is detected. Used to for waiters
>  	 * to release the struct_mutex for the reset to procede.
> @@ -3397,6 +3400,12 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
>  	return READ_ONCE(error->reset_count);
>  }
>  
> +static inline u32 i915_engine_reset_count(struct i915_gpu_error *error,

That's a weird hodgepodge. We first think of intel_engine and are
confused, then we spot the reset. i915_reset_engine_count?

> +					  struct intel_engine_cs *engine)
> +{
> +	return READ_ONCE(error->engine_reset_count[engine->id]);
> +}
> +
>  void i915_gem_reset(struct drm_i915_private *dev_priv);
>  void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
>  void i915_gem_reset_engine(struct intel_engine_cs *engine);
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index e16037d1b0ba..f168ad873521 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -453,6 +453,7 @@ static void error_print_engine(struct drm_i915_error_state_buf *m,
>  	err_printf(m, "  hangcheck action timestamp: %lu, %u ms ago\n",
>  		   ee->hangcheck_timestamp,
>  		   jiffies_to_msecs(jiffies - ee->hangcheck_timestamp));
> +	err_printf(m, "  engine reset count: %u\n", ee->reset_count);
>  
>  	error_print_request(m, "  ELSP[0]: ", &ee->execlist[0]);
>  	error_print_request(m, "  ELSP[1]: ", &ee->execlist[1]);
> @@ -1170,6 +1171,8 @@ static void error_record_engine_registers(struct drm_i915_error_state *error,
>  	ee->hangcheck_timestamp = engine->hangcheck.action_timestamp;
>  	ee->hangcheck_action = engine->hangcheck.action;
>  	ee->hangcheck_stalled = engine->hangcheck.stalled;
> +	ee->reset_count = i915_engine_reset_count(&dev_priv->gpu_error,
> +						  engine);
>  
>  	if (USES_PPGTT(dev_priv)) {
>  		int i;
> -- 
> 2.11.0
> 

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery
  2016-12-16 20:20 ` [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery Michel Thierry
@ 2016-12-16 20:45   ` Chris Wilson
  2016-12-19  5:02     ` Michel Thierry
  2016-12-19 14:24     ` Mika Kuoppala
  0 siblings, 2 replies; 23+ messages in thread
From: Chris Wilson @ 2016-12-16 20:45 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@linux.intel.com>
> 
> This change implements support for per-engine reset as an initial, less
> intrusive hang recovery option to be attempted before falling back to the
> legacy full GPU reset recovery mode if necessary. This is only supported
> from Gen8 onwards.
> 
> Hangchecker determines which engines are hung and invokes error handler to
> recover from it. Error handler schedules recovery for each of those engines
> that are hung. The recovery procedure is as follows,
>  - identifies the request that caused the hang and it is dropped
>  - force engine to idle: this is done by issuing a reset request
>  - reset and re-init engine
>  - restart submissions to the engine
> 
> If engine reset fails then we fall back to heavy weight full gpu reset
> which resets all engines and reinitiazes complete state of HW and SW.
> 
> v2: Rebase.
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c     | 56 +++++++++++++++++++++++++++++++++++--
>  drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>  drivers/gpu/drm/i915/i915_gem.c     |  2 +-
>  drivers/gpu/drm/i915/intel_lrc.c    | 12 ++++++++
>  drivers/gpu/drm/i915/intel_lrc.h    |  1 +
>  drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
>  6 files changed, 108 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index e5688edd62cd..a034793bc246 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
>   *
>   * Reset a specific GPU engine. Useful if a hang is detected.
>   * Returns zero on successful reset or otherwise an error code.
> + *
> + * Procedure is fairly simple:
> + *  - identifies the request that caused the hang and it is dropped
> + *  - force engine to idle: this is done by issuing a reset request
> + *  - reset engine
> + *  - restart submissions to the engine
>   */
>  int i915_reset_engine(struct intel_engine_cs *engine)

What's the serialisation between potential callers of
i915_reset_engine()?

>  {
>  	int ret;
>  	struct drm_i915_private *dev_priv = engine->i915;
>  
> -	/* FIXME: replace me with engine reset sequence */
> -	ret = -ENODEV;
> +	/*
> +	 * We need to first idle the engine by issuing a reset request,
> +	 * then perform soft reset and re-initialize hw state, for all of
> +	 * this GT power need to be awake so ensure it does throughout the
> +	 * process
> +	 */
> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> +
> +	/*
> +	 * the request that caused the hang is stuck on elsp, identify the
> +	 * active request and drop it, adjust head to skip the offending
> +	 * request to resume executing remaining requests in the queue.
> +	 */
> +	i915_gem_reset_engine(engine);

Must freeze the engine and irqs first, before calling
i915_gem_reset_engine() (i.e. something like disable_engines_irq,
cancelling tasklet)

Eeek note that the current i915_gem_reset_engine() is lacking a
spinlock.

> +
> +	ret = intel_engine_reset_begin(engine);
> +	if (ret) {
> +		DRM_ERROR("Failed to disable %s\n", engine->name);
> +		goto error;
> +	}
> +
> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
> +	if (ret) {
> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
> +		intel_engine_reset_cancel(engine);
> +		goto error;
> +	}
> +
> +	ret = engine->init_hw(engine);
> +	if (ret)
> +		goto error;
>  
> +	intel_engine_reset_cancel(engine);
> +	intel_execlists_restart_submission(engine);

engine->init_hw(engine) *is* intel_execlists_restart_submission.

> +
> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> +	return 0;
> +
> +error:
>  	/* use full gpu reset to recover on error */
>  	set_bit(I915_RESET_IN_PROGRESS, &dev_priv->gpu_error.flags);
>  
> +	/* Engine reset is performed without taking struct_mutex, since it
> +	 * failed we now fallback to full gpu reset. Wakeup any waiters
> +	 * which should now see the reset_in_progress and release
> +	 * struct_mutex for us to continue recovery.
> +	 */
> +	rcu_read_lock();
> +	intel_engine_wakeup(engine);
> +	rcu_read_unlock();
> +
> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>  	return ret;
>  }

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* ✗ Fi.CI.BAT: warning for Execlist based engine-reset
  2016-12-16 20:20 [PATCH 0/9] Execlist based engine-reset Michel Thierry
                   ` (8 preceding siblings ...)
  2016-12-16 20:20 ` [RFC 9/9] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
@ 2016-12-16 20:45 ` Patchwork
  9 siblings, 0 replies; 23+ messages in thread
From: Patchwork @ 2016-12-16 20:45 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

== Series Details ==

Series: Execlist based engine-reset
URL   : https://patchwork.freedesktop.org/series/16936/
State : warning

== Summary ==

Series 16936v1 Execlist based engine-reset
https://patchwork.freedesktop.org/api/1.0/series/16936/revisions/1/mbox/

Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-b:
                pass       -> SKIP       (fi-bxt-j4205)

fi-bdw-5557u     total:247  pass:233  dwarn:0   dfail:0   fail:0   skip:14 
fi-bsw-n3050     total:247  pass:208  dwarn:0   dfail:0   fail:0   skip:39 
fi-bxt-j4205     total:247  pass:222  dwarn:0   dfail:0   fail:0   skip:25 
fi-bxt-t5700     total:247  pass:220  dwarn:0   dfail:0   fail:0   skip:27 
fi-byt-j1900     total:247  pass:220  dwarn:0   dfail:0   fail:0   skip:27 
fi-byt-n2820     total:247  pass:216  dwarn:0   dfail:0   fail:0   skip:31 
fi-hsw-4770      total:247  pass:228  dwarn:0   dfail:0   fail:0   skip:19 
fi-hsw-4770r     total:247  pass:228  dwarn:0   dfail:0   fail:0   skip:19 
fi-ilk-650       total:247  pass:195  dwarn:0   dfail:0   fail:0   skip:52 
fi-ivb-3520m     total:247  pass:226  dwarn:0   dfail:0   fail:0   skip:21 
fi-ivb-3770      total:247  pass:226  dwarn:0   dfail:0   fail:0   skip:21 
fi-kbl-7500u     total:247  pass:226  dwarn:0   dfail:0   fail:0   skip:21 
fi-skl-6260u     total:247  pass:234  dwarn:0   dfail:0   fail:0   skip:13 
fi-skl-6700hq    total:247  pass:227  dwarn:0   dfail:0   fail:0   skip:20 
fi-skl-6700k     total:247  pass:224  dwarn:3   dfail:0   fail:0   skip:20 
fi-skl-6770hq    total:247  pass:234  dwarn:0   dfail:0   fail:0   skip:13 
fi-snb-2520m     total:247  pass:216  dwarn:0   dfail:0   fail:0   skip:31 
fi-snb-2600      total:247  pass:215  dwarn:0   dfail:0   fail:0   skip:32 

705f1e8fef81d504f0032df8e21bdc2e74850b3a drm-tip: 2016y-12m-16d-15h-40m-02s UTC integration manifest
9a591cb drm/i915: Add engine reset count in get-reset-stats ioctl
709af8f drm/i915/tdr: Enable Engine reset and recovery support
f37aa42 drm/i915/tdr: Export per-engine reset count info to debugfs
4c1e046 drm/i915/tdr: Add engine reset count to error state
41e96eb drm/i915: Skip reset request if there is one already
d3b2dc8f4 drm/i915/tdr: Add support for per engine reset recovery
d654f09 drm/i915/tdr: Modify error handler for per engine hang recovery
db8e495 drm/i915: Update i915.reset to handle engine resets
b94f5b6 drm/i915: Keep i915_handle_error kerneldoc parameters together

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_3313/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/9] drm/i915/tdr: Add engine reset count to error state
  2016-12-16 20:37   ` Chris Wilson
@ 2016-12-16 21:19     ` Michel Thierry
  0 siblings, 0 replies; 23+ messages in thread
From: Michel Thierry @ 2016-12-16 21:19 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Mika Kuoppala


On 16/12/16 12:37, Chris Wilson wrote:
> On Fri, Dec 16, 2016 at 12:20:07PM -0800, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>
>> Driver maintains count of how many times a given engine is reset, useful to
>> capture this in error state also. It gives an idea of how engine is coping
>> up with the workloads it is executing before this error state.
>
> Also i915_engine_info().
>
I'll add it there too.

>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.c       | 1 +
>>  drivers/gpu/drm/i915/i915_drv.h       | 9 +++++++++
>>  drivers/gpu/drm/i915/i915_gpu_error.c | 3 +++
>>  3 files changed, 13 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index a034793bc246..1c706a082d60 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1877,6 +1877,7 @@ int i915_reset_engine(struct intel_engine_cs *engine)
>>  	intel_engine_reset_cancel(engine);
>>  	intel_execlists_restart_submission(engine);
>>
>> +	dev_priv->gpu_error.engine_reset_count[engine->id]++;
>>  	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>>  	return 0;
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index a97cc8f50ade..25183762ed94 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -937,6 +937,7 @@ struct drm_i915_error_state {
>>  		enum intel_engine_hangcheck_action hangcheck_action;
>>  		struct i915_address_space *vm;
>>  		int num_requests;
>> +		u32 reset_count;
>>
>>  		/* position of active request inside the ring */
>>  		u32 rq_head, rq_post, rq_tail;
>> @@ -1629,6 +1630,8 @@ struct i915_gpu_error {
>>  #define I915_RESET_IN_PROGRESS	0
>>  #define I915_WEDGED		(BITS_PER_LONG - 1)
>>
>> +	unsigned long engine_reset_count[I915_NUM_ENGINES];
>> +
>>  	/**
>>  	 * Waitqueue to signal when a hang is detected. Used to for waiters
>>  	 * to release the struct_mutex for the reset to procede.
>> @@ -3397,6 +3400,12 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
>>  	return READ_ONCE(error->reset_count);
>>  }
>>
>> +static inline u32 i915_engine_reset_count(struct i915_gpu_error *error,
>
> That's a weird hodgepodge. We first think of intel_engine and are
> confused, then we spot the reset. i915_reset_engine_count?
>

Agreed i915_reset_engine_count is better (plus it's called from 
i915_reset_info).

>> +					  struct intel_engine_cs *engine)
>> +{
>> +	return READ_ONCE(error->engine_reset_count[engine->id]);
>> +}
>> +

Then the same rename here, reset_engine_count[].

Thanks
>>  void i915_gem_reset(struct drm_i915_private *dev_priv);
>>  void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
>>  void i915_gem_reset_engine(struct intel_engine_cs *engine);
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index e16037d1b0ba..f168ad873521 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -453,6 +453,7 @@ static void error_print_engine(struct drm_i915_error_state_buf *m,
>>  	err_printf(m, "  hangcheck action timestamp: %lu, %u ms ago\n",
>>  		   ee->hangcheck_timestamp,
>>  		   jiffies_to_msecs(jiffies - ee->hangcheck_timestamp));
>> +	err_printf(m, "  engine reset count: %u\n", ee->reset_count);
>>
>>  	error_print_request(m, "  ELSP[0]: ", &ee->execlist[0]);
>>  	error_print_request(m, "  ELSP[1]: ", &ee->execlist[1]);
>> @@ -1170,6 +1171,8 @@ static void error_record_engine_registers(struct drm_i915_error_state *error,
>>  	ee->hangcheck_timestamp = engine->hangcheck.action_timestamp;
>>  	ee->hangcheck_action = engine->hangcheck.action;
>>  	ee->hangcheck_stalled = engine->hangcheck.stalled;
>> +	ee->reset_count = i915_engine_reset_count(&dev_priv->gpu_error,
>> +						  engine);
>>
>>  	if (USES_PPGTT(dev_priv)) {
>>  		int i;
>> --
>> 2.11.0
>>
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/9] drm/i915: Keep i915_handle_error kerneldoc parameters together
  2016-12-16 20:20 ` [PATCH 1/9] drm/i915: Keep i915_handle_error kerneldoc parameters together Michel Thierry
@ 2016-12-16 21:40   ` Chris Wilson
  2016-12-19  8:22   ` Tvrtko Ursulin
  1 sibling, 0 replies; 23+ messages in thread
From: Chris Wilson @ 2016-12-16 21:40 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Fri, Dec 16, 2016 at 12:20:02PM -0800, Michel Thierry wrote:
> And before the function description.
> Tidy up from commit 14bb2c11796d70b ("drm/i915: Fix a buch of kerneldoc
> warnings"), all others kerneldoc blocks look ok.
> 
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery
  2016-12-16 20:45   ` Chris Wilson
@ 2016-12-19  5:02     ` Michel Thierry
  2016-12-19  9:51       ` Chris Wilson
  2016-12-19 14:24     ` Mika Kuoppala
  1 sibling, 1 reply; 23+ messages in thread
From: Michel Thierry @ 2016-12-19  5:02 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Mika Kuoppala



On 12/16/2016 12:45 PM, Chris Wilson wrote:
> On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>
>> This change implements support for per-engine reset as an initial, less
>> intrusive hang recovery option to be attempted before falling back to the
>> legacy full GPU reset recovery mode if necessary. This is only supported
>> from Gen8 onwards.
>>
>> Hangchecker determines which engines are hung and invokes error handler to
>> recover from it. Error handler schedules recovery for each of those engines
>> that are hung. The recovery procedure is as follows,
>>  - identifies the request that caused the hang and it is dropped
>>  - force engine to idle: this is done by issuing a reset request
>>  - reset and re-init engine
>>  - restart submissions to the engine
>>
>> If engine reset fails then we fall back to heavy weight full gpu reset
>> which resets all engines and reinitiazes complete state of HW and SW.
>>
>> v2: Rebase.
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.c     | 56 +++++++++++++++++++++++++++++++++++--
>>  drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>>  drivers/gpu/drm/i915/i915_gem.c     |  2 +-
>>  drivers/gpu/drm/i915/intel_lrc.c    | 12 ++++++++
>>  drivers/gpu/drm/i915/intel_lrc.h    |  1 +
>>  drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
>>  6 files changed, 108 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index e5688edd62cd..a034793bc246 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>   *
>>   * Reset a specific GPU engine. Useful if a hang is detected.
>>   * Returns zero on successful reset or otherwise an error code.
>> + *
>> + * Procedure is fairly simple:
>> + *  - identifies the request that caused the hang and it is dropped
>> + *  - force engine to idle: this is done by issuing a reset request
>> + *  - reset engine
>> + *  - restart submissions to the engine
>>   */
>>  int i915_reset_engine(struct intel_engine_cs *engine)
>
> What's the serialisation between potential callers of
> i915_reset_engine()?
>

I haven't seen simultaneous calls happening yet, would a 
reset_engine-specific mutex be enough?

>>  {
>>  	int ret;
>>  	struct drm_i915_private *dev_priv = engine->i915;
>>
>> -	/* FIXME: replace me with engine reset sequence */
>> -	ret = -ENODEV;
>> +	/*
>> +	 * We need to first idle the engine by issuing a reset request,
>> +	 * then perform soft reset and re-initialize hw state, for all of
>> +	 * this GT power need to be awake so ensure it does throughout the
>> +	 * process
>> +	 */
>> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>> +
>> +	/*
>> +	 * the request that caused the hang is stuck on elsp, identify the
>> +	 * active request and drop it, adjust head to skip the offending
>> +	 * request to resume executing remaining requests in the queue.
>> +	 */
>> +	i915_gem_reset_engine(engine);
>
> Must freeze the engine and irqs first, before calling
> i915_gem_reset_engine() (i.e. something like disable_engines_irq,
> cancelling tasklet)
>
Will do.

> Eeek note that the current i915_gem_reset_engine() is lacking a
> spinlock.
>

The new mutex (for i915_reset_engine) should cover this.

>> +
>> +	ret = intel_engine_reset_begin(engine);
>> +	if (ret) {
>> +		DRM_ERROR("Failed to disable %s\n", engine->name);
>> +		goto error;
>> +	}
>> +
>> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
>> +	if (ret) {
>> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
>> +		intel_engine_reset_cancel(engine);
>> +		goto error;
>> +	}
>> +
>> +	ret = engine->init_hw(engine);
>> +	if (ret)
>> +		goto error;
>>
>> +	intel_engine_reset_cancel(engine);
>> +	intel_execlists_restart_submission(engine);
>
> engine->init_hw(engine) *is* intel_execlists_restart_submission.
>

I'll make these changes,
Thanks.

>> +
>> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>> +	return 0;
>> +
>> +error:
>>  	/* use full gpu reset to recover on error */
>>  	set_bit(I915_RESET_IN_PROGRESS, &dev_priv->gpu_error.flags);
>>
>> +	/* Engine reset is performed without taking struct_mutex, since it
>> +	 * failed we now fallback to full gpu reset. Wakeup any waiters
>> +	 * which should now see the reset_in_progress and release
>> +	 * struct_mutex for us to continue recovery.
>> +	 */
>> +	rcu_read_lock();
>> +	intel_engine_wakeup(engine);
>> +	rcu_read_unlock();
>> +
>> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>>  	return ret;
>>  }
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/9] drm/i915: Keep i915_handle_error kerneldoc parameters together
  2016-12-16 20:20 ` [PATCH 1/9] drm/i915: Keep i915_handle_error kerneldoc parameters together Michel Thierry
  2016-12-16 21:40   ` Chris Wilson
@ 2016-12-19  8:22   ` Tvrtko Ursulin
  1 sibling, 0 replies; 23+ messages in thread
From: Tvrtko Ursulin @ 2016-12-19  8:22 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx


On 16/12/2016 20:20, Michel Thierry wrote:
> And before the function description.
> Tidy up from commit 14bb2c11796d70b ("drm/i915: Fix a buch of kerneldoc
> warnings"), all others kerneldoc blocks look ok.
>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_irq.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index a0e70f5b3aad..77caeeb5ee55 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2703,12 +2703,13 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
>   * i915_handle_error - handle a gpu error
>   * @dev_priv: i915 device private
>   * @engine_mask: mask representing engines that are hung
> + * @fmt: Error message format string
> + *
>   * Do some basic checking of register state at error time and
>   * dump it to the syslog.  Also call i915_capture_error_state() to make
>   * sure we get a record and make it available in debugfs.  Fire a uevent
>   * so userspace knows something bad happened (should trigger collection
>   * of a ring dump etc.).
> - * @fmt: Error message format string
>   */
>  void i915_handle_error(struct drm_i915_private *dev_priv,
>  		       u32 engine_mask,
>

No idea how I managed to put it there. :(

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/9] drm/i915/tdr: Add engine reset count to error state
  2016-12-16 20:20 ` [PATCH 6/9] drm/i915/tdr: Add engine reset count to error state Michel Thierry
  2016-12-16 20:37   ` Chris Wilson
@ 2016-12-19  8:27   ` Tvrtko Ursulin
  2016-12-19 18:09     ` Michel Thierry
  1 sibling, 1 reply; 23+ messages in thread
From: Tvrtko Ursulin @ 2016-12-19  8:27 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx


On 16/12/2016 20:20, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>
> Driver maintains count of how many times a given engine is reset, useful to
> capture this in error state also. It gives an idea of how engine is coping
> up with the workloads it is executing before this error state.
>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c       | 1 +
>  drivers/gpu/drm/i915/i915_drv.h       | 9 +++++++++
>  drivers/gpu/drm/i915/i915_gpu_error.c | 3 +++
>  3 files changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index a034793bc246..1c706a082d60 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1877,6 +1877,7 @@ int i915_reset_engine(struct intel_engine_cs *engine)
>  	intel_engine_reset_cancel(engine);
>  	intel_execlists_restart_submission(engine);
>
> +	dev_priv->gpu_error.engine_reset_count[engine->id]++;
>  	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>  	return 0;
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index a97cc8f50ade..25183762ed94 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -937,6 +937,7 @@ struct drm_i915_error_state {
>  		enum intel_engine_hangcheck_action hangcheck_action;
>  		struct i915_address_space *vm;
>  		int num_requests;
> +		u32 reset_count;
>
>  		/* position of active request inside the ring */
>  		u32 rq_head, rq_post, rq_tail;
> @@ -1629,6 +1630,8 @@ struct i915_gpu_error {
>  #define I915_RESET_IN_PROGRESS	0
>  #define I915_WEDGED		(BITS_PER_LONG - 1)
>
> +	unsigned long engine_reset_count[I915_NUM_ENGINES];
> +
>  	/**
>  	 * Waitqueue to signal when a hang is detected. Used to for waiters
>  	 * to release the struct_mutex for the reset to procede.
> @@ -3397,6 +3400,12 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
>  	return READ_ONCE(error->reset_count);
>  }
>
> +static inline u32 i915_engine_reset_count(struct i915_gpu_error *error,
> +					  struct intel_engine_cs *engine)
> +{
> +	return READ_ONCE(error->engine_reset_count[engine->id]);
> +}

Accidentally spotted a type width mismatch here between u32 and unsigned 
long. Not sure which one would be best, maybe unsigned int?

Regards,

Tvrtko

> +
>  void i915_gem_reset(struct drm_i915_private *dev_priv);
>  void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
>  void i915_gem_reset_engine(struct intel_engine_cs *engine);
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index e16037d1b0ba..f168ad873521 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -453,6 +453,7 @@ static void error_print_engine(struct drm_i915_error_state_buf *m,
>  	err_printf(m, "  hangcheck action timestamp: %lu, %u ms ago\n",
>  		   ee->hangcheck_timestamp,
>  		   jiffies_to_msecs(jiffies - ee->hangcheck_timestamp));
> +	err_printf(m, "  engine reset count: %u\n", ee->reset_count);
>
>  	error_print_request(m, "  ELSP[0]: ", &ee->execlist[0]);
>  	error_print_request(m, "  ELSP[1]: ", &ee->execlist[1]);
> @@ -1170,6 +1171,8 @@ static void error_record_engine_registers(struct drm_i915_error_state *error,
>  	ee->hangcheck_timestamp = engine->hangcheck.action_timestamp;
>  	ee->hangcheck_action = engine->hangcheck.action;
>  	ee->hangcheck_stalled = engine->hangcheck.stalled;
> +	ee->reset_count = i915_engine_reset_count(&dev_priv->gpu_error,
> +						  engine);
>
>  	if (USES_PPGTT(dev_priv)) {
>  		int i;
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery
  2016-12-19  5:02     ` Michel Thierry
@ 2016-12-19  9:51       ` Chris Wilson
  2017-01-05 23:55         ` Michel Thierry
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Wilson @ 2016-12-19  9:51 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Sun, Dec 18, 2016 at 09:02:33PM -0800, Michel Thierry wrote:
> 
> 
> On 12/16/2016 12:45 PM, Chris Wilson wrote:
> >On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote:
> >>From: Arun Siluvery <arun.siluvery@linux.intel.com>
> >>
> >>This change implements support for per-engine reset as an initial, less
> >>intrusive hang recovery option to be attempted before falling back to the
> >>legacy full GPU reset recovery mode if necessary. This is only supported
> >>from Gen8 onwards.
> >>
> >>Hangchecker determines which engines are hung and invokes error handler to
> >>recover from it. Error handler schedules recovery for each of those engines
> >>that are hung. The recovery procedure is as follows,
> >> - identifies the request that caused the hang and it is dropped
> >> - force engine to idle: this is done by issuing a reset request
> >> - reset and re-init engine
> >> - restart submissions to the engine
> >>
> >>If engine reset fails then we fall back to heavy weight full gpu reset
> >>which resets all engines and reinitiazes complete state of HW and SW.
> >>
> >>v2: Rebase.
> >>
> >>Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> >>Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> >>Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> >>Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> >>---
> >> drivers/gpu/drm/i915/i915_drv.c     | 56 +++++++++++++++++++++++++++++++++++--
> >> drivers/gpu/drm/i915/i915_drv.h     |  3 ++
> >> drivers/gpu/drm/i915/i915_gem.c     |  2 +-
> >> drivers/gpu/drm/i915/intel_lrc.c    | 12 ++++++++
> >> drivers/gpu/drm/i915/intel_lrc.h    |  1 +
> >> drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
> >> 6 files changed, 108 insertions(+), 7 deletions(-)
> >>
> >>diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> >>index e5688edd62cd..a034793bc246 100644
> >>--- a/drivers/gpu/drm/i915/i915_drv.c
> >>+++ b/drivers/gpu/drm/i915/i915_drv.c
> >>@@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
> >>  *
> >>  * Reset a specific GPU engine. Useful if a hang is detected.
> >>  * Returns zero on successful reset or otherwise an error code.
> >>+ *
> >>+ * Procedure is fairly simple:
> >>+ *  - identifies the request that caused the hang and it is dropped
> >>+ *  - force engine to idle: this is done by issuing a reset request
> >>+ *  - reset engine
> >>+ *  - restart submissions to the engine
> >>  */
> >> int i915_reset_engine(struct intel_engine_cs *engine)
> >
> >What's the serialisation between potential callers of
> >i915_reset_engine()?
> >
> 
> I haven't seen simultaneous calls happening yet, would a
> reset_engine-specific mutex be enough?

That is what it more or less boils down to. But first, do we ensure we
don't declare a second hang on this engine before the first reset is
complete? Then we only a barrier between a single engine reset and a
full reset.

> >> {
> >> 	int ret;
> >> 	struct drm_i915_private *dev_priv = engine->i915;
> >>
> >>-	/* FIXME: replace me with engine reset sequence */
> >>-	ret = -ENODEV;
> >>+	/*
> >>+	 * We need to first idle the engine by issuing a reset request,
> >>+	 * then perform soft reset and re-initialize hw state, for all of
> >>+	 * this GT power need to be awake so ensure it does throughout the
> >>+	 * process
> >>+	 */
> >>+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> >>+
> >>+	/*
> >>+	 * the request that caused the hang is stuck on elsp, identify the
> >>+	 * active request and drop it, adjust head to skip the offending
> >>+	 * request to resume executing remaining requests in the queue.
> >>+	 */
> >>+	i915_gem_reset_engine(engine);
> >
> >Must freeze the engine and irqs first, before calling
> >i915_gem_reset_engine() (i.e. something like disable_engines_irq,
> >cancelling tasklet)
> >
> Will do.
> 
> >Eeek note that the current i915_gem_reset_engine() is lacking a
> >spinlock.
> >
> 
> The new mutex (for i915_reset_engine) should cover this.

No. We need to protect these lists from concurrent manipulation in the
fences. The spinlock protection there needs to be extended here, being
covered by struct_mutex is no longer sufficient protection for the
current code.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery
  2016-12-16 20:45   ` Chris Wilson
  2016-12-19  5:02     ` Michel Thierry
@ 2016-12-19 14:24     ` Mika Kuoppala
  2016-12-19 14:42       ` Chris Wilson
  1 sibling, 1 reply; 23+ messages in thread
From: Mika Kuoppala @ 2016-12-19 14:24 UTC (permalink / raw)
  To: Chris Wilson, Michel Thierry; +Cc: intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>> 
>> This change implements support for per-engine reset as an initial, less
>> intrusive hang recovery option to be attempted before falling back to the
>> legacy full GPU reset recovery mode if necessary. This is only supported
>> from Gen8 onwards.
>> 
>> Hangchecker determines which engines are hung and invokes error handler to
>> recover from it. Error handler schedules recovery for each of those engines
>> that are hung. The recovery procedure is as follows,
>>  - identifies the request that caused the hang and it is dropped
>>  - force engine to idle: this is done by issuing a reset request
>>  - reset and re-init engine
>>  - restart submissions to the engine
>> 
>> If engine reset fails then we fall back to heavy weight full gpu reset
>> which resets all engines and reinitiazes complete state of HW and SW.
>> 
>> v2: Rebase.
>> 
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.c     | 56 +++++++++++++++++++++++++++++++++++--
>>  drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>>  drivers/gpu/drm/i915/i915_gem.c     |  2 +-
>>  drivers/gpu/drm/i915/intel_lrc.c    | 12 ++++++++
>>  drivers/gpu/drm/i915/intel_lrc.h    |  1 +
>>  drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
>>  6 files changed, 108 insertions(+), 7 deletions(-)
>> 
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index e5688edd62cd..a034793bc246 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>   *
>>   * Reset a specific GPU engine. Useful if a hang is detected.
>>   * Returns zero on successful reset or otherwise an error code.
>> + *
>> + * Procedure is fairly simple:
>> + *  - identifies the request that caused the hang and it is dropped
>> + *  - force engine to idle: this is done by issuing a reset request
>> + *  - reset engine
>> + *  - restart submissions to the engine
>>   */
>>  int i915_reset_engine(struct intel_engine_cs *engine)
>
> What's the serialisation between potential callers of
> i915_reset_engine()?
>

I feel that making 'reset_in_progress' per engine feature
would clarify this and would be more fitting as now it is
the one engine that can be in reset at particular point in time,
decoupled with others.

Having global 'reset_in_progress' and then a per engine
resets just doesn't feel right.

-Mika

>>  {
>>  	int ret;
>>  	struct drm_i915_private *dev_priv = engine->i915;
>>  
>> -	/* FIXME: replace me with engine reset sequence */
>> -	ret = -ENODEV;
>> +	/*
>> +	 * We need to first idle the engine by issuing a reset request,
>> +	 * then perform soft reset and re-initialize hw state, for all of
>> +	 * this GT power need to be awake so ensure it does throughout the
>> +	 * process
>> +	 */
>> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>> +
>> +	/*
>> +	 * the request that caused the hang is stuck on elsp, identify the
>> +	 * active request and drop it, adjust head to skip the offending
>> +	 * request to resume executing remaining requests in the queue.
>> +	 */
>> +	i915_gem_reset_engine(engine);
>
> Must freeze the engine and irqs first, before calling
> i915_gem_reset_engine() (i.e. something like disable_engines_irq,
> cancelling tasklet)
>
> Eeek note that the current i915_gem_reset_engine() is lacking a
> spinlock.
>
>> +
>> +	ret = intel_engine_reset_begin(engine);
>> +	if (ret) {
>> +		DRM_ERROR("Failed to disable %s\n", engine->name);
>> +		goto error;
>> +	}
>> +
>> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
>> +	if (ret) {
>> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
>> +		intel_engine_reset_cancel(engine);
>> +		goto error;
>> +	}
>> +
>> +	ret = engine->init_hw(engine);
>> +	if (ret)
>> +		goto error;
>>  
>> +	intel_engine_reset_cancel(engine);
>> +	intel_execlists_restart_submission(engine);
>
> engine->init_hw(engine) *is* intel_execlists_restart_submission.
>
>> +
>> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>> +	return 0;
>> +
>> +error:
>>  	/* use full gpu reset to recover on error */
>>  	set_bit(I915_RESET_IN_PROGRESS, &dev_priv->gpu_error.flags);
>>  
>> +	/* Engine reset is performed without taking struct_mutex, since it
>> +	 * failed we now fallback to full gpu reset. Wakeup any waiters
>> +	 * which should now see the reset_in_progress and release
>> +	 * struct_mutex for us to continue recovery.
>> +	 */
>> +	rcu_read_lock();
>> +	intel_engine_wakeup(engine);
>> +	rcu_read_unlock();
>> +
>> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>>  	return ret;
>>  }
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery
  2016-12-19 14:24     ` Mika Kuoppala
@ 2016-12-19 14:42       ` Chris Wilson
  0 siblings, 0 replies; 23+ messages in thread
From: Chris Wilson @ 2016-12-19 14:42 UTC (permalink / raw)
  To: Mika Kuoppala; +Cc: intel-gfx

On Mon, Dec 19, 2016 at 04:24:18PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote:
> >> From: Arun Siluvery <arun.siluvery@linux.intel.com>
> >> 
> >> This change implements support for per-engine reset as an initial, less
> >> intrusive hang recovery option to be attempted before falling back to the
> >> legacy full GPU reset recovery mode if necessary. This is only supported
> >> from Gen8 onwards.
> >> 
> >> Hangchecker determines which engines are hung and invokes error handler to
> >> recover from it. Error handler schedules recovery for each of those engines
> >> that are hung. The recovery procedure is as follows,
> >>  - identifies the request that caused the hang and it is dropped
> >>  - force engine to idle: this is done by issuing a reset request
> >>  - reset and re-init engine
> >>  - restart submissions to the engine
> >> 
> >> If engine reset fails then we fall back to heavy weight full gpu reset
> >> which resets all engines and reinitiazes complete state of HW and SW.
> >> 
> >> v2: Rebase.
> >> 
> >> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> >> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> >> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> >> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> >> ---
> >>  drivers/gpu/drm/i915/i915_drv.c     | 56 +++++++++++++++++++++++++++++++++++--
> >>  drivers/gpu/drm/i915/i915_drv.h     |  3 ++
> >>  drivers/gpu/drm/i915/i915_gem.c     |  2 +-
> >>  drivers/gpu/drm/i915/intel_lrc.c    | 12 ++++++++
> >>  drivers/gpu/drm/i915/intel_lrc.h    |  1 +
> >>  drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
> >>  6 files changed, 108 insertions(+), 7 deletions(-)
> >> 
> >> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> >> index e5688edd62cd..a034793bc246 100644
> >> --- a/drivers/gpu/drm/i915/i915_drv.c
> >> +++ b/drivers/gpu/drm/i915/i915_drv.c
> >> @@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
> >>   *
> >>   * Reset a specific GPU engine. Useful if a hang is detected.
> >>   * Returns zero on successful reset or otherwise an error code.
> >> + *
> >> + * Procedure is fairly simple:
> >> + *  - identifies the request that caused the hang and it is dropped
> >> + *  - force engine to idle: this is done by issuing a reset request
> >> + *  - reset engine
> >> + *  - restart submissions to the engine
> >>   */
> >>  int i915_reset_engine(struct intel_engine_cs *engine)
> >
> > What's the serialisation between potential callers of
> > i915_reset_engine()?
> >
> 
> I feel that making 'reset_in_progress' per engine feature
> would clarify this and would be more fitting as now it is
> the one engine that can be in reset at particular point in time,
> decoupled with others.

That is not what "reset_in_progress" means. It principally means
MUTEX_BACKOFF.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 6/9] drm/i915/tdr: Add engine reset count to error state
  2016-12-19  8:27   ` Tvrtko Ursulin
@ 2016-12-19 18:09     ` Michel Thierry
  0 siblings, 0 replies; 23+ messages in thread
From: Michel Thierry @ 2016-12-19 18:09 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx



On 19/12/16 00:27, Tvrtko Ursulin wrote:
>
> On 16/12/2016 20:20, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>> @@ -937,6 +937,7 @@ struct drm_i915_error_state {
>>          enum intel_engine_hangcheck_action hangcheck_action;
>>          struct i915_address_space *vm;
>>          int num_requests;
>> +        u32 reset_count;
>>
>>          /* position of active request inside the ring */
>>          u32 rq_head, rq_post, rq_tail;
>> @@ -1629,6 +1630,8 @@ struct i915_gpu_error {
>>  #define I915_RESET_IN_PROGRESS    0
>>  #define I915_WEDGED        (BITS_PER_LONG - 1)
>>
>> +    unsigned long engine_reset_count[I915_NUM_ENGINES];
>> +
>>      /**
>>       * Waitqueue to signal when a hang is detected. Used to for waiters
>>       * to release the struct_mutex for the reset to procede.
>> @@ -3397,6 +3400,12 @@ static inline u32 i915_reset_count(struct
>> i915_gpu_error *error)
>>      return READ_ONCE(error->reset_count);
>>  }
>>
>> +static inline u32 i915_engine_reset_count(struct i915_gpu_error *error,
>> +                      struct intel_engine_cs *engine)
>> +{
>> +    return READ_ONCE(error->engine_reset_count[engine->id]);
>> +}
>
> Accidentally spotted a type width mismatch here between u32 and unsigned
> long. Not sure which one would be best, maybe unsigned int?
>
> Regards,
>

Thanks, I'll change engine_reset_count to u32...  I don't think we 
should worry too much if it overflows.

> Tvrtko
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery
  2016-12-19  9:51       ` Chris Wilson
@ 2017-01-05 23:55         ` Michel Thierry
  0 siblings, 0 replies; 23+ messages in thread
From: Michel Thierry @ 2017-01-05 23:55 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Mika Kuoppala


On 19/12/16 01:51, Chris Wilson wrote:
> On Sun, Dec 18, 2016 at 09:02:33PM -0800, Michel Thierry wrote:
>>
>>
>> On 12/16/2016 12:45 PM, Chris Wilson wrote:
>>> On Fri, Dec 16, 2016 at 12:20:05PM -0800, Michel Thierry wrote:
>>>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>>>
>>>> This change implements support for per-engine reset as an initial, less
>>>> intrusive hang recovery option to be attempted before falling back to the
>>>> legacy full GPU reset recovery mode if necessary. This is only supported
>>> >from Gen8 onwards.
>>>>
>>>> Hangchecker determines which engines are hung and invokes error handler to
>>>> recover from it. Error handler schedules recovery for each of those engines
>>>> that are hung. The recovery procedure is as follows,
>>>> - identifies the request that caused the hang and it is dropped
>>>> - force engine to idle: this is done by issuing a reset request
>>>> - reset and re-init engine
>>>> - restart submissions to the engine
>>>>
>>>> If engine reset fails then we fall back to heavy weight full gpu reset
>>>> which resets all engines and reinitiazes complete state of HW and SW.
>>>>
>>>> v2: Rebase.
>>>>
>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>>>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>>>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>>>> ---
>>>> drivers/gpu/drm/i915/i915_drv.c     | 56 +++++++++++++++++++++++++++++++++++--
>>>> drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>>>> drivers/gpu/drm/i915/i915_gem.c     |  2 +-
>>>> drivers/gpu/drm/i915/intel_lrc.c    | 12 ++++++++
>>>> drivers/gpu/drm/i915/intel_lrc.h    |  1 +
>>>> drivers/gpu/drm/i915/intel_uncore.c | 41 ++++++++++++++++++++++++---
>>>> 6 files changed, 108 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>>>> index e5688edd62cd..a034793bc246 100644
>>>> --- a/drivers/gpu/drm/i915/i915_drv.c
>>>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>>>> @@ -1830,18 +1830,70 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>>>  *
>>>>  * Reset a specific GPU engine. Useful if a hang is detected.
>>>>  * Returns zero on successful reset or otherwise an error code.
>>>> + *
>>>> + * Procedure is fairly simple:
>>>> + *  - identifies the request that caused the hang and it is dropped
>>>> + *  - force engine to idle: this is done by issuing a reset request
>>>> + *  - reset engine
>>>> + *  - restart submissions to the engine
>>>>  */
>>>> int i915_reset_engine(struct intel_engine_cs *engine)
>>>
>>> What's the serialisation between potential callers of
>>> i915_reset_engine()?
>>>
>>
>> I haven't seen simultaneous calls happening yet, would a
>> reset_engine-specific mutex be enough?
>
> That is what it more or less boils down to. But first, do we ensure we
> don't declare a second hang on this engine before the first reset is
> complete? Then we only a barrier between a single engine reset and a
> full reset.
>
>>>> {
>>>> 	int ret;
>>>> 	struct drm_i915_private *dev_priv = engine->i915;
>>>>
>>>> -	/* FIXME: replace me with engine reset sequence */
>>>> -	ret = -ENODEV;
>>>> +	/*
>>>> +	 * We need to first idle the engine by issuing a reset request,
>>>> +	 * then perform soft reset and re-initialize hw state, for all of
>>>> +	 * this GT power need to be awake so ensure it does throughout the
>>>> +	 * process
>>>> +	 */
>>>> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>>>> +
>>>> +	/*
>>>> +	 * the request that caused the hang is stuck on elsp, identify the
>>>> +	 * active request and drop it, adjust head to skip the offending
>>>> +	 * request to resume executing remaining requests in the queue.
>>>> +	 */
>>>> +	i915_gem_reset_engine(engine);
>>>
>>> Must freeze the engine and irqs first, before calling
>>> i915_gem_reset_engine() (i.e. something like disable_engines_irq,
>>> cancelling tasklet)
>>>
>> Will do.
>>
>>> Eeek note that the current i915_gem_reset_engine() is lacking a
>>> spinlock.
>>>
>>
>> The new mutex (for i915_reset_engine) should cover this.
>
> No. We need to protect these lists from concurrent manipulation in the
> fences. The spinlock protection there needs to be extended here, being
> covered by struct_mutex is no longer sufficient protection for the
> current code.

I think you already addressed the last point in "drm/i915: Prevent 
timeline updates whilst performing reset" 
(00c25e3f40083a6d5f1111955baccd287ee49258), right?

But looking at these (and the interaction with possible waiters), do you 
think it is still achievable to reset a engine without grabbing the 
struct_mutex? Then both i915_reset (chip) and i915_reset_engine can use 
the "reset_in_progress".

Thanks
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-01-05 23:55 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-16 20:20 [PATCH 0/9] Execlist based engine-reset Michel Thierry
2016-12-16 20:20 ` [PATCH 1/9] drm/i915: Keep i915_handle_error kerneldoc parameters together Michel Thierry
2016-12-16 21:40   ` Chris Wilson
2016-12-19  8:22   ` Tvrtko Ursulin
2016-12-16 20:20 ` [PATCH 2/9] drm/i915: Update i915.reset to handle engine resets Michel Thierry
2016-12-16 20:20 ` [PATCH 3/9] drm/i915/tdr: Modify error handler for per engine hang recovery Michel Thierry
2016-12-16 20:20 ` [PATCH 4/9] drm/i915/tdr: Add support for per engine reset recovery Michel Thierry
2016-12-16 20:45   ` Chris Wilson
2016-12-19  5:02     ` Michel Thierry
2016-12-19  9:51       ` Chris Wilson
2017-01-05 23:55         ` Michel Thierry
2016-12-19 14:24     ` Mika Kuoppala
2016-12-19 14:42       ` Chris Wilson
2016-12-16 20:20 ` [PATCH 5/9] drm/i915: Skip reset request if there is one already Michel Thierry
2016-12-16 20:20 ` [PATCH 6/9] drm/i915/tdr: Add engine reset count to error state Michel Thierry
2016-12-16 20:37   ` Chris Wilson
2016-12-16 21:19     ` Michel Thierry
2016-12-19  8:27   ` Tvrtko Ursulin
2016-12-19 18:09     ` Michel Thierry
2016-12-16 20:20 ` [PATCH 7/9] drm/i915/tdr: Export per-engine reset count info to debugfs Michel Thierry
2016-12-16 20:20 ` [PATCH 8/9] drm/i915/tdr: Enable Engine reset and recovery support Michel Thierry
2016-12-16 20:20 ` [RFC 9/9] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
2016-12-16 20:45 ` ✗ Fi.CI.BAT: warning for Execlist based engine-reset Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.