All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v6 00/20] Gen8+ engine-reset
@ 2017-04-18 20:23 Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 01/20] drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag Michel Thierry
                   ` (20 more replies)
  0 siblings, 21 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

These patches add the reset-engine feature from Gen8. This is also
referred to as Timeout detection and recovery (TDR). This complements to
the full gpu reset feature available in i915 but it only allows to reset a
particular engine instead of all engines thus providing a light weight
engine reset and recovery mechanism.

Thanks to recent changes merged, this implementation is now not only for
execlists, but for GuC based submission too; it is still limited from
Gen8 onwards. I have also included the changes for watchdog timeout
detection. The GuC related patches are functional, but can be seen as RFC.

Timeout detection relies on the existing hangcheck, which remains the same;
main changes are to the recovery mechanism. Once we detect a hang on a
particular engine we identify the request that caused the hang, skip the
request and adjust head pointers to allow the execution to proceed
normally. After some cleanup, submissions are restarted to process
remaining work queued to that engine.

If engine reset fails to recover engine correctly then we fallback to full
gpu reset.

We can argue about the effectiveness of reset-engine vs full reset when
more than one ring is hung, but the benefits of just resetting one engine
are reduced when the driver has to do it multiple times.

v2: ELSP queue request tracking and reset path changes to handle incomplete
requests during reset. Thanks to Chris Wilson for providing these patches.

v3: Let the waiter keep handling the full gpu reset if it already has the
lock; point out that GuC submission needs a different method to restart
workloads after the engine reset completes.

v4: Handle reset as 2 level resets, by first going to engine only and fall
backing to full/chip reset as needed, i.e. reset_engine will need the
struct_mutex.

v5: Rebased after reset flag split in 2, add GuC support, include watchdog
detection patches, addressing comments from prev RFC.

v6: Mutex-less reset engine. Updates in watchdog abi and guc whitelist &
register-restore fixes (including an old patch from Daniele).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Arun Siluvery (7):
  drm/i915: Update i915.reset to handle engine resets
  drm/i915/tdr: Modify error handler for per engine hang recovery
  drm/i915/tdr: Add support for per engine reset recovery
  drm/i915/tdr: Add engine reset count to error state
  drm/i915/tdr: Export per-engine reset count info to debugfs
  drm/i915/tdr: Enable Engine reset and recovery support
  drm/i915/guc: Provide register list to be saved/restored during engine
    reset

Daniele Ceraolo Spurio (1):
  drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder

Michel Thierry (11):
  drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag
  drm/i915: Rename gen8_(un)request_engine_reset to
    gen8_reset_engine_start/cancel
  drm/i915: Add engine reset count in get-reset-stats ioctl
  drm/i915/selftests: reset engine self tests
  drm/i915/guc: Add support for reset engine using GuC commands
  drm/i915: Watchdog timeout: Pass GuC shared data structure during
    param load
  drm/i915: Watchdog timeout: IRQ handler for gen8+
  drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  drm/i915: Watchdog timeout: Include threshold value in error state
  drm/i915: Watchdog timeout: Export media reset count from GuC to
    debugfs

Mika Kuoppala (1):
  drm/i915: Skip reset request if there is one already

 drivers/gpu/drm/i915/i915_debugfs.c              |  43 +++++++
 drivers/gpu/drm/i915/i915_drv.c                  | 104 ++++++++++++++-
 drivers/gpu/drm/i915/i915_drv.h                  |  63 ++++++++-
 drivers/gpu/drm/i915/i915_gem.c                  |  97 ++++++++------
 drivers/gpu/drm/i915/i915_gem_context.c          | 116 ++++++++++++++++-
 drivers/gpu/drm/i915/i915_gem_context.h          |   4 +
 drivers/gpu/drm/i915/i915_gem_request.c          |   2 +-
 drivers/gpu/drm/i915/i915_gpu_error.c            |  14 +-
 drivers/gpu/drm/i915/i915_guc_submission.c       | 128 +++++++++++++++++--
 drivers/gpu/drm/i915/i915_irq.c                  |  43 ++++++-
 drivers/gpu/drm/i915/i915_params.c               |   6 +-
 drivers/gpu/drm/i915/i915_params.h               |   2 +-
 drivers/gpu/drm/i915/i915_pci.c                  |   5 +-
 drivers/gpu/drm/i915/i915_reg.h                  |   6 +
 drivers/gpu/drm/i915/intel_guc_fwif.h            |  27 +++-
 drivers/gpu/drm/i915/intel_guc_loader.c          |  11 ++
 drivers/gpu/drm/i915/intel_hangcheck.c           |  13 +-
 drivers/gpu/drm/i915/intel_lrc.c                 | 156 ++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h          |  24 ++++
 drivers/gpu/drm/i915/intel_uc.h                  |   3 +
 drivers/gpu/drm/i915/intel_uncore.c              |  43 ++++++-
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 147 +++++++++++++++++++++
 include/uapi/drm/i915_drm.h                      |   7 +-
 23 files changed, 972 insertions(+), 92 deletions(-)

-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v6 01/20] drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 02/20] drm/i915: Rename gen8_(un)request_engine_reset to gen8_reset_engine_start/cancel Michel Thierry
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

It has been replaced by I915_RESET_BACKOFF / I915_RESET_HANDOFF.

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index b5c81de102e8..e06af46f5a57 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1568,7 +1568,7 @@ struct i915_gpu_error {
 	 *
 	 * This is a counter which gets incremented when reset is triggered,
 	 *
-	 * Before the reset commences, the I915_RESET_IN_PROGRESS bit is set
+	 * Before the reset commences, the I915_RESET_BACKOFF bit is set
 	 * meaning that any waiters holding onto the struct_mutex should
 	 * relinquish the lock immediately in order for the reset to start.
 	 *
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 02/20] drm/i915: Rename gen8_(un)request_engine_reset to gen8_reset_engine_start/cancel
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 01/20] drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-27  8:13   ` Chris Wilson
  2017-04-18 20:23 ` [PATCH v6 03/20] drm/i915: Update i915.reset to handle engine resets Michel Thierry
                   ` (18 subsequent siblings)
  20 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

As all other functions related to resetting engines are using
reset_engine.

v2: remove _request_ and use start/cancel instead (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index a10d8863b0a9..07a722f74fa1 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1683,7 +1683,7 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
 	return ret;
 }
 
-static int gen8_request_engine_reset(struct intel_engine_cs *engine)
+static int gen8_reset_engine_start(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 	int ret;
@@ -1702,7 +1702,7 @@ static int gen8_request_engine_reset(struct intel_engine_cs *engine)
 	return ret;
 }
 
-static void gen8_unrequest_engine_reset(struct intel_engine_cs *engine)
+static void gen8_reset_engine_cancel(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 
@@ -1717,14 +1717,14 @@ static int gen8_reset_engines(struct drm_i915_private *dev_priv,
 	unsigned int tmp;
 
 	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
-		if (gen8_request_engine_reset(engine))
+		if (gen8_reset_engine_start(engine))
 			goto not_ready;
 
 	return gen6_reset_engines(dev_priv, engine_mask);
 
 not_ready:
 	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
-		gen8_unrequest_engine_reset(engine);
+		gen8_reset_engine_cancel(engine);
 
 	return -EIO;
 }
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 03/20] drm/i915: Update i915.reset to handle engine resets
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 01/20] drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 02/20] drm/i915: Rename gen8_(un)request_engine_reset to gen8_reset_engine_start/cancel Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 04/20] drm/i915/tdr: Modify error handler for per engine hang recovery Michel Thierry
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

In preparation for engine reset work update this parameter to handle more
than one type of reset. Default at the moment is still full gpu reset.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 6 +++---
 drivers/gpu/drm/i915/i915_params.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index b6a7e363d076..045cadb77285 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -46,7 +46,7 @@ struct i915_params i915 __read_mostly = {
 	.prefault_disable = 0,
 	.load_detect_test = 0,
 	.force_reset_modeset_test = 0,
-	.reset = true,
+	.reset = 1,
 	.error_capture = true,
 	.invert_brightness = 0,
 	.disable_display = 0,
@@ -115,8 +115,8 @@ MODULE_PARM_DESC(vbt_sdvo_panel_type,
 	"Override/Ignore selection of SDVO panel mode in the VBT "
 	"(-2=ignore, -1=auto [default], index in VBT BIOS table)");
 
-module_param_named_unsafe(reset, i915.reset, bool, 0600);
-MODULE_PARM_DESC(reset, "Attempt GPU resets (default: true)");
+module_param_named_unsafe(reset, i915.reset, int, 0600);
+MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset [default], 2=engine reset)");
 
 #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
 module_param_named(error_capture, i915.error_capture, bool, 0600);
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index 34148cc8637c..febbfdbd30bd 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -51,6 +51,7 @@
 	func(int, use_mmio_flip); \
 	func(int, mmio_debug); \
 	func(int, edp_vswing); \
+	func(int, reset); \
 	func(unsigned int, inject_load_failure); \
 	/* leave bools at the end to not create holes */ \
 	func(bool, alpha_support); \
@@ -60,7 +61,6 @@
 	func(bool, prefault_disable); \
 	func(bool, load_detect_test); \
 	func(bool, force_reset_modeset_test); \
-	func(bool, reset); \
 	func(bool, error_capture); \
 	func(bool, disable_display); \
 	func(bool, verbose_state_checks); \
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 04/20] drm/i915/tdr: Modify error handler for per engine hang recovery
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (2 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 03/20] drm/i915: Update i915.reset to handle engine resets Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 21:40   ` Chris Wilson
  2017-04-18 20:23 ` [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery Michel Thierry
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

This is a preparatory patch which modifies error handler to do per engine
hang recovery. The actual patch which implements this sequence follows
later in the series. The aim is to prepare existing recovery function to
adapt to this new function where applicable (which fails at this point
because core implementation is lacking) and continue recovery using legacy
full gpu reset.

A helper function is also added to query the availability of engine
reset.

The error events behaviour that are used to notify user of reset are
adapted to engine reset such that it doesn't break users listening to these
events. In legacy we report an error event, a reset event before resetting
the gpu and a reset done event marking the completion of reset. The same
behaviour is adapted but reset event is only dispatched once even when
multiple engines are hung. Finally once reset is complete we send reset
done event as usual.

Note that this implementation of engine reset is for i915 directly
submitting to the ELSP, where the driver manages the hang detection,
recovery and resubmission. With GuC submission these tasks are shared
between driver and firmware; i915 will still responsible for detecting a
hang, and when it does it will have to request GuC to reset that Engine and
remind the firmware about the outstanding submissions. This will be
added in different patch.

v2: rebase, advertise engine reset availability in platform definition,
add note about GuC submission.
v3: s/*engine_reset*/*reset_engine*/. (Chris)
Handle reset as 2 level resets, by first going to engine only and fall
backing to full/chip reset as needed, i.e. reset_engine will need the
struct_mutex.
v4: Pass the engine mask to i915_reset. (Chris)
v5: Rebase, update selftests.
v6: Rebase, prepare for mutex-less reset engine.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c         | 15 +++++++++++++++
 drivers/gpu/drm/i915/i915_drv.h         |  6 ++++++
 drivers/gpu/drm/i915/i915_irq.c         | 30 +++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_pci.c         |  5 ++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h | 16 ++++++++++++++++
 drivers/gpu/drm/i915/intel_uncore.c     | 11 +++++++++++
 6 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index cc7393e65e99..e03d0643dbd6 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1800,6 +1800,8 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	if (!test_bit(I915_RESET_HANDOFF, &error->flags))
 		return;
 
+	DRM_DEBUG_DRIVER("resetting chip\n");
+
 	/* Clear any previous failed attempts at recovery. Time to try again. */
 	if (!i915_gem_unset_wedged(dev_priv))
 		goto wakeup;
@@ -1863,6 +1865,19 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	goto finish;
 }
 
+/**
+ * i915_reset_engine - reset GPU engine to recover from a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected.
+ * Returns zero on successful reset or otherwise an error code.
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+	/* FIXME: replace me with engine reset sequence */
+	return -ENODEV;
+}
+
 static int i915_pm_suspend(struct device *kdev)
 {
 	struct pci_dev *pdev = to_pci_dev(kdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e06af46f5a57..7bc5f552add7 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -814,6 +814,7 @@ struct intel_csr {
 	func(has_ddi); \
 	func(has_decoupled_mmio); \
 	func(has_dp_mst); \
+	func(has_reset_engine); \
 	func(has_fbc); \
 	func(has_fpga_dbg); \
 	func(has_full_ppgtt); \
@@ -1616,6 +1617,9 @@ struct i915_gpu_error {
 #define I915_RESET_HANDOFF	1
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
+	/* if available, engine-specific reset is tried before full gpu reset */
+	u32 reset_engine_mask;
+
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
 	 * to release the struct_mutex for the reset to procede.
@@ -3019,6 +3023,8 @@ extern void i915_driver_unload(struct drm_device *dev);
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
+extern int i915_reset_engine(struct intel_engine_cs *engine);
+extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index fd97fe00cd0d..ab1d77ab0977 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2645,12 +2645,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
 	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
 	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
+	u32 engine_mask = dev_priv->gpu_error.reset_engine_mask;
 
 	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
 
-	DRM_DEBUG_DRIVER("resetting chip\n");
+	/*
+	 * This event needs to be sent before performing gpu reset. When
+	 * engine resets are supported we iterate through all engines and
+	 * reset hung engines individually. To keep the event dispatch
+	 * mechanism consistent with full gpu reset, this is only sent once
+	 * even when multiple engines are hung. It is also safe to move this
+	 * here because when we are in this function, we will definitely
+	 * perform gpu reset.
+	 */
 	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
 
+	/* try engine reset first, and continue if fails; look mom, no mutex! */
+	if (intel_has_reset_engine(dev_priv) && !(engine_mask & (engine_mask - 1))) {
+		struct intel_engine_cs *engine =
+			dev_priv->engine[intel_engineid_from_flag(engine_mask)];
+
+		if (i915_reset_engine(engine) == 0)
+			goto finish;
+
+		DRM_ERROR("reset %s failed, promoting to full gpu reset\n",
+			  engine->name);
+	}
+
 	intel_prepare_reset(dev_priv);
 
 	set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
@@ -2680,6 +2701,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 		kobject_uevent_env(kobj,
 				   KOBJ_CHANGE, reset_done_event);
 
+finish:
 	/*
 	 * Note: The wake_up also serves as a memory barrier so that
 	 * waiters see the updated value of the dev_priv->gpu_error.
@@ -2781,6 +2803,12 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 			     &dev_priv->gpu_error.flags))
 		goto out;
 
+	/*
+	 * Save which engines need reset; if engine support is available,
+	 * we can just reset the hung engines.
+	 */
+	dev_priv->gpu_error.reset_engine_mask = engine_mask;
+
 	i915_reset_and_wakeup(dev_priv);
 
 out:
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index f87b0c4e564d..d5002b55cbd8 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -313,7 +313,8 @@ static const struct intel_device_info intel_haswell_info = {
 	BDW_COLORS, \
 	.has_logical_ring_contexts = 1, \
 	.has_full_48bit_ppgtt = 1, \
-	.has_64bit_reloc = 1
+	.has_64bit_reloc = 1, \
+	.has_reset_engine = 1
 
 static const struct intel_device_info intel_broadwell_info = {
 	BDW_FEATURES,
@@ -345,6 +346,7 @@ static const struct intel_device_info intel_cherryview_info = {
 	.has_gmch_display = 1,
 	.has_aliasing_ppgtt = 1,
 	.has_full_ppgtt = 1,
+	.has_reset_engine = 1,
 	.display_mmio_offset = VLV_DISPLAY_BASE,
 	GEN_CHV_PIPEOFFSETS,
 	CURSOR_OFFSETS,
@@ -394,6 +396,7 @@ static const struct intel_device_info intel_skylake_gt3_info = {
 	.has_aliasing_ppgtt = 1, \
 	.has_full_ppgtt = 1, \
 	.has_full_48bit_ppgtt = 1, \
+	.has_reset_engine = 1, \
 	GEN_DEFAULT_PIPEOFFSETS, \
 	IVB_CURSOR_OFFSETS, \
 	BDW_COLORS
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 00d36aa4e26d..6e5577dc1f7a 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -442,6 +442,22 @@ intel_engine_flag(const struct intel_engine_cs *engine)
 	return BIT(engine->id);
 }
 
+/* works only for engine_mask with 1 engine bit set */
+static inline unsigned int
+intel_engineid_from_flag(unsigned engine_mask)
+{
+	unsigned int engine_id = 0;
+
+	GEM_BUG_ON(engine_mask & (engine_mask - 1));
+
+	while (engine_mask >>= 1)
+		engine_id++;
+
+	GEM_BUG_ON(engine_id >= I915_NUM_ENGINES);
+
+	return engine_id;
+}
+
 static inline u32
 intel_read_status_page(struct intel_engine_cs *engine, int reg)
 {
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 07a722f74fa1..ab5bdd110ac3 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1776,6 +1776,17 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
+/*
+ * When GuC submission is enabled, GuC manages ELSP and can initiate the
+ * engine reset too. For now, fall back to full GPU reset if it is enabled.
+ */
+bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
+{
+	return (dev_priv->info.has_reset_engine &&
+		!dev_priv->guc.execbuf_client &&
+		i915.reset == 2);
+}
+
 int intel_guc_reset(struct drm_i915_private *dev_priv)
 {
 	int ret;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (3 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 04/20] drm/i915/tdr: Modify error handler for per engine hang recovery Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-19 10:49   ` Chris Wilson
  2017-04-18 20:23 ` [PATCH v6 06/20] drm/i915: Skip reset request if there is one already Michel Thierry
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

This change implements support for per-engine reset as an initial, less
intrusive hang recovery option to be attempted before falling back to the
legacy full GPU reset recovery mode if necessary. This is only supported
from Gen8 onwards.

Hangchecker determines which engines are hung and invokes error handler to
recover from it. Error handler schedules recovery for each of those engines
that are hung. The recovery procedure is as follows,
 - identifies the request that caused the hang and it is dropped
 - force engine to idle: this is done by issuing a reset request
 - reset and re-init engine
 - restart submissions to the engine

If engine reset fails then we fall back to heavy weight full gpu reset
which resets all engines and reinitiazes complete state of HW and SW.

v2: Rebase.
v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
calling i915_gem_reset_engine (Chris).
v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
reuse the function for reset_engine.
v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
v6: Clean up reset_engine function to not require mutex, i.e. no need to call
revoke/restore_fences and _retire_requests (Chris).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c         | 76 ++++++++++++++++++++++++--
 drivers/gpu/drm/i915/i915_drv.h         | 12 +++-
 drivers/gpu/drm/i915/i915_gem.c         | 97 +++++++++++++++++++--------------
 drivers/gpu/drm/i915/i915_gem_request.c |  2 +-
 drivers/gpu/drm/i915/intel_uncore.c     | 20 +++++++
 5 files changed, 158 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index e03d0643dbd6..634893cd93b3 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1810,7 +1810,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
 
 	pr_notice("drm/i915: Resetting chip after gpu hang\n");
 	disable_irq(dev_priv->drm.irq);
-	ret = i915_gem_reset_prepare(dev_priv);
+	ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
 	if (ret) {
 		DRM_ERROR("GPU recovery failed\n");
 		intel_gpu_reset(dev_priv, ALL_ENGINES);
@@ -1852,7 +1852,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	i915_queue_hangcheck(dev_priv);
 
 finish:
-	i915_gem_reset_finish(dev_priv);
+	i915_gem_reset_finish(dev_priv, ALL_ENGINES);
 	enable_irq(dev_priv->drm.irq);
 
 wakeup:
@@ -1871,11 +1871,79 @@ void i915_reset(struct drm_i915_private *dev_priv)
  *
  * Reset a specific GPU engine. Useful if a hang is detected.
  * Returns zero on successful reset or otherwise an error code.
+ *
+ * Procedure is:
+ *  - identifies the request that caused the hang and it is dropped
+ *  - force engine to idle: this is done by issuing a reset request
+ *  - reset engine
+ *  - restart submissions to the engine
  */
 int i915_reset_engine(struct intel_engine_cs *engine)
 {
-	/* FIXME: replace me with engine reset sequence */
-	return -ENODEV;
+	int ret;
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+
+	GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
+
+	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
+
+	/*
+	 * We need to first idle the engine by issuing a reset request,
+	 * then perform soft reset and re-initialize hw state, for all of
+	 * this GT power need to be awake so ensure it does throughout the
+	 * process
+	 */
+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+	disable_irq(dev_priv->drm.irq);
+	ret = i915_gem_reset_prepare_engine(engine);
+	if (ret) {
+		DRM_ERROR("Previous reset failed - promote to full reset\n");
+		goto error;
+	}
+
+	/*
+	 * the request that caused the hang is stuck on elsp, identify the
+	 * active request and drop it, adjust head to skip the offending
+	 * request to resume executing remaining requests in the queue.
+	 */
+	i915_gem_reset_engine(engine);
+
+	/* forcing engine to idle */
+	ret = intel_reset_engine_start(engine);
+	if (ret) {
+		DRM_ERROR("Failed to disable %s\n", engine->name);
+		goto error;
+	}
+
+	/* finally, reset engine */
+	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+	if (ret) {
+		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
+		intel_reset_engine_cancel(engine);
+		goto error;
+	}
+
+	/* be sure the request reset bit gets cleared */
+	intel_reset_engine_cancel(engine);
+
+	i915_gem_reset_finish_engine(engine);
+
+	/* replay remaining requests in the queue */
+	ret = engine->init_hw(engine);
+	if (ret)
+		goto error;
+
+wakeup:
+	enable_irq(dev_priv->drm.irq);
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+	wake_up_bit(&error->flags, I915_RESET_HANDOFF);
+	return ret;
+
+error:
+	/* use full gpu reset to recover on error */
+	goto wakeup;
 }
 
 static int i915_pm_suspend(struct device *kdev)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 7bc5f552add7..eb3e5bcda478 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3025,6 +3025,8 @@ extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
 extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
+extern int intel_reset_engine_start(struct intel_engine_cs *engine);
+extern void intel_reset_engine_cancel(struct intel_engine_cs *engine);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
@@ -3413,7 +3415,6 @@ int __must_check i915_gem_set_global_seqno(struct drm_device *dev, u32 seqno);
 
 struct drm_i915_gem_request *
 i915_gem_find_active_request(struct intel_engine_cs *engine);
-
 void i915_gem_retire_requests(struct drm_i915_private *dev_priv);
 
 static inline bool i915_reset_backoff(struct i915_gpu_error *error)
@@ -3441,11 +3442,16 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
-int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
+int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
+int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
+			   unsigned int engine_mask);
 void i915_gem_reset(struct drm_i915_private *dev_priv);
-void i915_gem_reset_finish(struct drm_i915_private *dev_priv);
+void i915_gem_reset_finish_engine(struct intel_engine_cs *engine);
+void i915_gem_reset_finish(struct drm_i915_private *dev_priv,
+			   unsigned int engine_mask);
 void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
 bool i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
+void i915_gem_reset_engine(struct intel_engine_cs *engine);
 
 void i915_gem_init_mmio(struct drm_i915_private *i915);
 int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 33fb11cc5acc..bce38062f94e 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2793,48 +2793,57 @@ static bool engine_stalled(struct intel_engine_cs *engine)
 	return true;
 }
 
-int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
+/* Ensure irq handler finishes, and not run again. */
+int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 {
-	struct intel_engine_cs *engine;
-	enum intel_engine_id id;
+	struct drm_i915_gem_request *request;
 	int err = 0;
 
-	/* Ensure irq handler finishes, and not run again. */
-	for_each_engine(engine, dev_priv, id) {
-		struct drm_i915_gem_request *request;
-
-		/* Prevent the signaler thread from updating the request
-		 * state (by calling dma_fence_signal) as we are processing
-		 * the reset. The write from the GPU of the seqno is
-		 * asynchronous and the signaler thread may see a different
-		 * value to us and declare the request complete, even though
-		 * the reset routine have picked that request as the active
-		 * (incomplete) request. This conflict is not handled
-		 * gracefully!
-		 */
-		kthread_park(engine->breadcrumbs.signaler);
-
-		/* Prevent request submission to the hardware until we have
-		 * completed the reset in i915_gem_reset_finish(). If a request
-		 * is completed by one engine, it may then queue a request
-		 * to a second via its engine->irq_tasklet *just* as we are
-		 * calling engine->init_hw() and also writing the ELSP.
-		 * Turning off the engine->irq_tasklet until the reset is over
-		 * prevents the race.
-		 */
-		tasklet_kill(&engine->irq_tasklet);
-		tasklet_disable(&engine->irq_tasklet);
 
-		if (engine->irq_seqno_barrier)
-			engine->irq_seqno_barrier(engine);
+	/* Prevent the signaler thread from updating the request
+	 * state (by calling dma_fence_signal) as we are processing
+	 * the reset. The write from the GPU of the seqno is
+	 * asynchronous and the signaler thread may see a different
+	 * value to us and declare the request complete, even though
+	 * the reset routine have picked that request as the active
+	 * (incomplete) request. This conflict is not handled
+	 * gracefully!
+	 */
+	kthread_park(engine->breadcrumbs.signaler);
+
+	/* Prevent request submission to the hardware until we have
+	 * completed the reset in i915_gem_reset_finish(). If a request
+	 * is completed by one engine, it may then queue a request
+	 * to a second via its engine->irq_tasklet *just* as we are
+	 * calling engine->init_hw() and also writing the ELSP.
+	 * Turning off the engine->irq_tasklet until the reset is over
+	 * prevents the race.
+	 */
+	tasklet_kill(&engine->irq_tasklet);
+	tasklet_disable(&engine->irq_tasklet);
 
-		if (engine_stalled(engine)) {
-			request = i915_gem_find_active_request(engine);
-			if (request && request->fence.error == -EIO)
-				err = -EIO; /* Previous reset failed! */
-		}
+	if (engine->irq_seqno_barrier)
+		engine->irq_seqno_barrier(engine);
+
+	if (engine_stalled(engine)) {
+		request = i915_gem_find_active_request(engine);
+		if (request && request->fence.error == -EIO)
+			err = -EIO; /* Previous reset failed! */
 	}
 
+	return err;
+}
+
+int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
+			   unsigned int engine_mask)
+{
+	struct intel_engine_cs *engine;
+	unsigned int tmp;
+	int err = 0;
+
+	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
+		err = i915_gem_reset_prepare_engine(engine);
+
 	i915_gem_revoke_fences(dev_priv);
 
 	return err;
@@ -2920,7 +2929,7 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 	return guilty;
 }
 
-static void i915_gem_reset_engine(struct intel_engine_cs *engine)
+void i915_gem_reset_engine(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *request;
 
@@ -2966,16 +2975,22 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
 	}
 }
 
-void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
+void i915_gem_reset_finish_engine(struct intel_engine_cs *engine)
+{
+	tasklet_enable(&engine->irq_tasklet);
+	kthread_unpark(engine->breadcrumbs.signaler);
+}
+
+void i915_gem_reset_finish(struct drm_i915_private *dev_priv,
+			   unsigned int engine_mask)
 {
 	struct intel_engine_cs *engine;
-	enum intel_engine_id id;
+	unsigned int tmp;
 
 	lockdep_assert_held(&dev_priv->drm.struct_mutex);
 
-	for_each_engine(engine, dev_priv, id) {
-		tasklet_enable(&engine->irq_tasklet);
-		kthread_unpark(engine->breadcrumbs.signaler);
+	for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
+		i915_gem_reset_finish_engine(engine);
 	}
 }
 
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 095cccc2e8b2..085405e55a29 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -1205,7 +1205,7 @@ long i915_wait_request(struct drm_i915_gem_request *req,
 	return timeout;
 }
 
-static void engine_retire_requests(struct intel_engine_cs *engine)
+void engine_retire_requests(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *request, *next;
 	u32 seqno = intel_engine_get_seqno(engine);
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index ab5bdd110ac3..3ebba6b2dd74 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1801,6 +1801,26 @@ int intel_guc_reset(struct drm_i915_private *dev_priv)
 	return ret;
 }
 
+/*
+ * On gen8+ a reset request has to be issued via the reset control register
+ * before a GPU engine can be reset in order to stop the command streamer
+ * and idle the engine. This replaces the legacy way of stopping an engine
+ * by writing to the stop ring bit in the MI_MODE register.
+ */
+int intel_reset_engine_start(struct intel_engine_cs *engine)
+{
+	return gen8_reset_engine_start(engine);
+}
+
+/*
+ * It is possible to back off from a previously issued reset request by simply
+ * clearing the reset request bit in the reset control register.
+ */
+void intel_reset_engine_cancel(struct intel_engine_cs *engine)
+{
+	gen8_reset_engine_cancel(engine);
+}
+
 bool intel_uncore_unclaimed_mmio(struct drm_i915_private *dev_priv)
 {
 	return check_for_unclaimed_mmio(dev_priv);
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 06/20] drm/i915: Skip reset request if there is one already
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (4 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 07/20] drm/i915/tdr: Add engine reset count to error state Michel Thierry
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

From: Mika Kuoppala <mika.kuoppala@linux.intel.com>

To perform engine reset we first disable engine to capture its state. This
is done by issuing a reset request. Because we are reusing existing
infrastructure, again when we actually reset an engine, reset function
checks engine mask and issues reset request again which is unnecessary. To
avoid this we check if the engine is already prepared, if so we just exit
from that point.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 3ebba6b2dd74..120fb440bb8b 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1686,10 +1686,15 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
 static int gen8_reset_engine_start(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
+	const i915_reg_t reset_ctrl = RING_RESET_CTL(engine->mmio_base);
+	const u32 ready = RESET_CTL_REQUEST_RESET | RESET_CTL_READY_TO_RESET;
 	int ret;
 
-	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
-		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
+	/* If engine has been already prepared, we can shortcut here */
+	if ((I915_READ_FW(reset_ctrl) & ready) == ready)
+		return 0;
+
+	I915_WRITE_FW(reset_ctrl, _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
 
 	ret = intel_wait_for_register_fw(dev_priv,
 					 RING_RESET_CTL(engine->mmio_base),
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 07/20] drm/i915/tdr: Add engine reset count to error state
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (5 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 06/20] drm/i915: Skip reset request if there is one already Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 08/20] drm/i915/tdr: Export per-engine reset count info to debugfs Michel Thierry
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

Driver maintains count of how many times a given engine is reset, useful to
capture this in error state also. It gives an idea of how engine is coping
up with the workloads it is executing before this error state.

A follow-up patch will provide this information in debugfs.

v2: s/engine_reset/reset_engine/ (Chris)
    Define count as unsigned int (Tvrtko)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c       | 2 ++
 drivers/gpu/drm/i915/i915_drv.h       | 8 ++++++++
 drivers/gpu/drm/i915/i915_gpu_error.c | 3 +++
 3 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 634893cd93b3..974be1fa77f9 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1935,6 +1935,8 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	if (ret)
 		goto error;
 
+	error->reset_engine_count[engine->id]++;
+
 wakeup:
 	enable_irq(dev_priv->drm.irq);
 	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index eb3e5bcda478..71c34f15be64 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -982,6 +982,7 @@ struct i915_gpu_state {
 		enum intel_engine_hangcheck_action hangcheck_action;
 		struct i915_address_space *vm;
 		int num_requests;
+		u32 reset_count;
 
 		/* position of active request inside the ring */
 		u32 rq_head, rq_post, rq_tail;
@@ -1619,6 +1620,7 @@ struct i915_gpu_error {
 
 	/* if available, engine-specific reset is tried before full gpu reset */
 	u32 reset_engine_mask;
+	u32 reset_engine_count[I915_NUM_ENGINES];
 
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
@@ -3442,6 +3444,12 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
+static inline u32 i915_reset_engine_count(struct i915_gpu_error *error,
+					  struct intel_engine_cs *engine)
+{
+	return READ_ONCE(error->reset_engine_count[engine->id]);
+}
+
 int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
 			   unsigned int engine_mask);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 14e2064b7653..a2ffb1ef2cfa 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -463,6 +463,7 @@ static void error_print_engine(struct drm_i915_error_state_buf *m,
 	err_printf(m, "  hangcheck action timestamp: %lu, %u ms ago\n",
 		   ee->hangcheck_timestamp,
 		   jiffies_to_msecs(jiffies - ee->hangcheck_timestamp));
+	err_printf(m, "  engine reset count: %u\n", ee->reset_count);
 
 	error_print_request(m, "  ELSP[0]: ", &ee->execlist[0]);
 	error_print_request(m, "  ELSP[1]: ", &ee->execlist[1]);
@@ -1244,6 +1245,8 @@ static void error_record_engine_registers(struct i915_gpu_state *error,
 	ee->hangcheck_timestamp = engine->hangcheck.action_timestamp;
 	ee->hangcheck_action = engine->hangcheck.action;
 	ee->hangcheck_stalled = engine->hangcheck.stalled;
+	ee->reset_count = i915_reset_engine_count(&dev_priv->gpu_error,
+						  engine);
 
 	if (USES_PPGTT(dev_priv)) {
 		int i;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 08/20] drm/i915/tdr: Export per-engine reset count info to debugfs
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (6 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 07/20] drm/i915/tdr: Add engine reset count to error state Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 09/20] drm/i915/tdr: Enable Engine reset and recovery support Michel Thierry
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

A new variable is added to export the reset counts to debugfs, this
includes full gpu reset and engine reset count. This is useful for tests
where they are expected to trigger reset; these counts are checked before
and after the test to ensure the same.

v2: Include reset engine count in i915_engine_info too (Chris).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 870c470177b5..6444c1a9bd22 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1403,6 +1403,23 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 	return 0;
 }
 
+static int i915_reset_info(struct seq_file *m, void *unused)
+{
+	struct drm_i915_private *dev_priv = node_to_i915(m->private);
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	seq_printf(m, "full gpu reset = %u\n", i915_reset_count(error));
+
+	for_each_engine(engine, dev_priv, id) {
+		seq_printf(m, "%s = %u\n", engine->name,
+			   i915_reset_engine_count(error, engine));
+	}
+
+	return 0;
+}
+
 static int ironlake_drpc_info(struct seq_file *m)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
@@ -3242,6 +3259,7 @@ static int i915_display_info(struct seq_file *m, void *unused)
 static int i915_engine_info(struct seq_file *m, void *unused)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 
@@ -3265,6 +3283,8 @@ static int i915_engine_info(struct seq_file *m, void *unused)
 			   engine->hangcheck.seqno,
 			   jiffies_to_msecs(jiffies - engine->hangcheck.action_timestamp),
 			   engine->timeline->inflight_seqnos);
+		seq_printf(m, "\tReset count: %d\n",
+			   i915_reset_engine_count(error, engine));
 
 		rcu_read_lock();
 
@@ -4777,6 +4797,7 @@ static const struct drm_info_list i915_debugfs_list[] = {
 	{"i915_huc_load_status", i915_huc_load_status_info, 0},
 	{"i915_frequency_info", i915_frequency_info, 0},
 	{"i915_hangcheck_info", i915_hangcheck_info, 0},
+	{"i915_reset_info", i915_reset_info, 0},
 	{"i915_drpc_info", i915_drpc_info, 0},
 	{"i915_emon_status", i915_emon_status, 0},
 	{"i915_ring_freq_table", i915_ring_freq_table, 0},
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 09/20] drm/i915/tdr: Enable Engine reset and recovery support
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (7 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 08/20] drm/i915/tdr: Export per-engine reset count info to debugfs Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 10/20] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

This feature is made available only from Gen8, for previous gen devices
driver uses legacy full gpu reset.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 045cadb77285..14e2c2e57f96 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -46,7 +46,7 @@ struct i915_params i915 __read_mostly = {
 	.prefault_disable = 0,
 	.load_detect_test = 0,
 	.force_reset_modeset_test = 0,
-	.reset = 1,
+	.reset = 2,
 	.error_capture = true,
 	.invert_brightness = 0,
 	.disable_display = 0,
@@ -116,7 +116,7 @@ MODULE_PARM_DESC(vbt_sdvo_panel_type,
 	"(-2=ignore, -1=auto [default], index in VBT BIOS table)");
 
 module_param_named_unsafe(reset, i915.reset, int, 0600);
-MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset [default], 2=engine reset)");
+MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset, 2=engine reset [default])");
 
 #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
 module_param_named(error_capture, i915.error_capture, bool, 0600);
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 10/20] drm/i915: Add engine reset count in get-reset-stats ioctl
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (8 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 09/20] drm/i915/tdr: Enable Engine reset and recovery support Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 11/20] drm/i915/selftests: reset engine self tests Michel Thierry
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

Users/tests relying on the total reset count will start seeing a smaller
number since most of the hangs can be handled by engine reset.
Note that if reset engine x, context a running on engine y will be unaware
and unaffected.

To start the discussion, include just a total engine reset count. If it
is deemed useful, it can be extended to report each engine separately.

v2: s/engine_reset/reset_engine/, use union in uapi to not break compatibility.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 14 +++++++++++---
 include/uapi/drm/i915_drm.h             |  6 +++++-
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 8bd0c4966913..edbed85a1c88 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -1133,9 +1133,11 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_reset_stats *args = data;
 	struct i915_gem_context *ctx;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
 	int ret;
 
-	if (args->flags || args->pad)
+	if (args->flags)
 		return -EINVAL;
 
 	if (args->ctx_id == DEFAULT_CONTEXT_HANDLE && !capable(CAP_SYS_ADMIN))
@@ -1151,10 +1153,16 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 		return PTR_ERR(ctx);
 	}
 
-	if (capable(CAP_SYS_ADMIN))
+	if (capable(CAP_SYS_ADMIN)) {
 		args->reset_count = i915_reset_count(&dev_priv->gpu_error);
-	else
+		for_each_engine(engine, dev_priv, id)
+			args->reset_engine_count +=
+				i915_reset_engine_count(&dev_priv->gpu_error,
+							engine);
+	} else {
 		args->reset_count = 0;
+		args->reset_engine_count = 0;
+	}
 
 	args->batch_active = ctx->guilty_count;
 	args->batch_pending = ctx->active_count;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index f24a80d2d42e..fadedefba6db 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1278,7 +1278,11 @@ struct drm_i915_reset_stats {
 	/* Number of batches lost pending for execution, for this context */
 	__u32 batch_pending;
 
-	__u32 pad;
+	union {
+		__u32 pad;
+		/* Engine resets since boot/module reload, for all contexts */
+		__u32 reset_engine_count;
+	};
 };
 
 struct drm_i915_gem_userptr {
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 11/20] drm/i915/selftests: reset engine self tests
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (9 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 10/20] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 12/20] drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder Michel Thierry
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

Check that we can reset specific engines, also check the fallback to
full reset if something didn't work.

v2: rebase.

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 147 +++++++++++++++++++++++
 1 file changed, 147 insertions(+)

diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index aa31d6c0cdfb..f64fa0e4bb40 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -322,6 +322,56 @@ static int igt_global_reset(void *arg)
 	return err;
 }
 
+static int igt_reset_engine(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	unsigned int reset_count, reset_engine_count;
+	int err = 0;
+
+	/* Check that we can issue a global GPU and engine reset */
+
+	if (!intel_has_gpu_reset(i915))
+		return 0;
+
+	if (!intel_has_reset_engine(i915))
+		return 0;
+
+	set_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+
+	for_each_engine(engine, i915, id) {
+		reset_count = i915_reset_count(&i915->gpu_error);
+		reset_engine_count = i915_reset_engine_count(&i915->gpu_error,
+							     engine);
+
+		err = i915_reset_engine(engine);
+		if (err) {
+			pr_err("i915_reset_engine failed\n");
+			break;
+		}
+
+		if (i915_reset_count(&i915->gpu_error) != reset_count) {
+			pr_err("Full GPU reset recorded! (engine reset expected)\n");
+			err = -EINVAL;
+			break;
+		}
+
+		if (i915_reset_engine_count(&i915->gpu_error, engine) ==
+		    reset_engine_count) {
+			pr_err("No %s engine reset recorded!\n", engine->name);
+			err = -EINVAL;
+			break;
+		}
+	}
+
+	clear_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+	if (i915_terminally_wedged(&i915->gpu_error))
+		err = -EIO;
+
+	return err;
+}
+
 static u32 fake_hangcheck(struct drm_i915_gem_request *rq)
 {
 	u32 reset_count;
@@ -526,13 +576,110 @@ static int igt_reset_queue(void *arg)
 	return err;
 }
 
+static int igt_render_engine_reset_fallback(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct intel_engine_cs *engine = i915->engine[RCS];
+	struct hang h;
+	struct drm_i915_gem_request *rq;
+	unsigned int reset_count, reset_engine_count;
+	int err = 0;
+
+	/* Check that we can issue a global GPU and engine reset */
+
+	if (!intel_has_gpu_reset(i915))
+		return 0;
+
+	if (!intel_has_reset_engine(i915))
+		return 0;
+
+	set_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+	mutex_lock(&i915->drm.struct_mutex);
+
+	err = hang_init(&h, i915);
+	if (err)
+		goto unlock;
+
+	rq = hang_create_request(&h, engine, i915->kernel_context);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto fini;
+	}
+
+	i915_gem_request_get(rq);
+	__i915_add_request(rq, true);
+
+	/* make reset engine fail */
+	rq->fence.error = -EIO;
+
+	if (!wait_for_hang(&h, rq)) {
+		pr_err("Failed to start request %x\n", rq->fence.seqno);
+		err = -EIO;
+		goto fini;
+	}
+
+	reset_engine_count = i915_reset_engine_count(&i915->gpu_error, engine);
+	reset_count = fake_hangcheck(rq);
+
+	err = i915_reset_engine(engine);
+	if (err) {
+		pr_err("i915_reset_engine failed\n");
+		goto fini;
+	}
+
+	if (i915_reset_engine_count(&i915->gpu_error, engine) !=
+	    reset_engine_count) {
+		pr_err("render engine reset recorded! (full reset expected)\n");
+		err = -EINVAL;
+		goto fini;
+	}
+
+	if (i915_reset_count(&i915->gpu_error) == reset_count) {
+		pr_err("No full GPU reset recorded!\n");
+		err = -EINVAL;
+		goto fini;
+	}
+
+	/*
+	 * by using fence.error = -EIO, full reset sets the wedged flag, do one
+	 * more full reset to re-enable the hw.
+	 */
+	if (i915_terminally_wedged(&i915->gpu_error)) {
+		rq->fence.error = 0;
+
+		set_bit(I915_RESET_HANDOFF, &i915->gpu_error.flags);
+		i915_reset(i915);
+		GEM_BUG_ON(test_bit(I915_RESET_HANDOFF,
+				    &i915->gpu_error.flags));
+
+		if (i915_reset_count(&i915->gpu_error) == reset_count) {
+			pr_err("No full GPU reset recorded!\n");
+			err = -EINVAL;
+			goto fini;
+		}
+	}
+
+fini:
+	hang_fini(&h);
+unlock:
+	mutex_unlock(&i915->drm.struct_mutex);
+	clear_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+
+	if (i915_terminally_wedged(&i915->gpu_error))
+		return -EIO;
+
+	return err;
+}
+
 int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
 {
 	static const struct i915_subtest tests[] = {
 		SUBTEST(igt_hang_sanitycheck),
 		SUBTEST(igt_global_reset),
+		SUBTEST(igt_reset_engine),
 		SUBTEST(igt_wait_reset),
 		SUBTEST(igt_reset_queue),
+		SUBTEST(igt_render_engine_reset_fallback),
 	};
 
 	if (!intel_has_gpu_reset(i915))
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 12/20] drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (10 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 11/20] drm/i915/selftests: reset engine self tests Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 13/20] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

From: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

The mmio_start offset for the whitelist is the first FORCE_TO_NONPRIV
register the GuC can use to restore the provided whitelist when an
engine reset via GuC (which we still don't support) is triggered.

We're currently adding the mmio_base of the engine to the absolute
address of the RCS version of the register, which results in the wrong
offset. Fix it by using the definition we already have instead of
re-defining it in the GuC FW header.

Also add a comment to avoid future issues with FORCE_TO_NONPRIV
registers, which are also used by the workaround framework.

v2: improve comment (Michal), move comment about save/restore because it
    is not related to the mmio_white_list field.

v3: rebase/resurrect.

Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Cc: Oscar Mateo <oscar.mateo@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com> (v2)
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_guc_submission.c | 11 +++++++++--
 drivers/gpu/drm/i915/intel_guc_fwif.h      |  1 -
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 1642fff9cf13..1ea36a88d2fb 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1036,10 +1036,17 @@ static int guc_ads_create(struct intel_guc *guc)
 	/* MMIO reg state */
 	for_each_engine(engine, dev_priv, id) {
 		blob->reg_state.white_list[engine->guc_id].mmio_start =
-			engine->mmio_base + GUC_MMIO_WHITE_LIST_START;
+			i915_mmio_reg_offset(RING_FORCE_TO_NONPRIV(engine->mmio_base, 0));
 
-		/* Nothing to be saved or restored for now. */
+		/*
+		 * Note: if the GuC whitelist management is enabled, the values
+		 * should be filled using the workaround framework to avoid
+		 * inconsistencies with the handling of FORCE_TO_NONPRIV
+		 * registers.
+		 */
 		blob->reg_state.white_list[engine->guc_id].count = 0;
+
+		/* Nothing to be saved or restored for now. */
 	}
 
 	/*
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index 6156845641a3..e6f8079df94a 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -394,7 +394,6 @@ struct guc_policies {
 #define GUC_REGSET_SAVE_CURRENT_VALUE	0x10
 
 #define GUC_REGSET_MAX_REGISTERS	25
-#define GUC_MMIO_WHITE_LIST_START	0x24d0
 #define GUC_MMIO_WHITE_LIST_MAX		12
 #define GUC_S3_SAVE_SPACE_PAGES		10
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 13/20] drm/i915/guc: Provide register list to be saved/restored during engine reset
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (11 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 12/20] drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-19  0:26   ` Daniele Ceraolo Spurio
  2017-04-18 20:23 ` [PATCH v6 14/20] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

GuC expects a list of registers from the driver which are saved/restored
during engine reset. The type of value to be saved is controlled by
flags. We provide a minimal set of registers that we want GuC to save and
restore. This is not an issue in case of engine reset as driver initializes
most of them following an engine reset, but in case of media reset (aka
watchdog reset) which is completely internal to GuC (including resubmission
of hung workload), it is necessary to provide this list, otherwise GuC won't
be able to schedule further workloads after a reset. This is the minimal
set of registers identified for things to work as expected but if we see
any new issues, this register list can be expanded.

v2: REGSET_MASKED is too difficult for GuC, use REGSET_SAVE_DEFAULT_VALUE
and current value from RING_MODE reg instead; no need to preserve
head/tail either, be extra paranoid and save whitelisted registers (Daniele).

Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_guc_submission.c | 60 +++++++++++++++++++++++++++++-
 1 file changed, 58 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 1ea36a88d2fb..d772718861df 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1003,6 +1003,24 @@ static void guc_policies_init(struct guc_policies *policies)
 	policies->is_valid = 1;
 }
 
+/*
+ * In this macro it is highly unlikely to exceed max value but even if we did
+ * it is not an error so just throw a warning and continue. Only side effect
+ * in continuing further means some registers won't be added to save/restore
+ * list.
+ */
+#define GUC_ADD_MMIO_REG_ADS(node, reg_addr, _flags, defvalue)		\
+	do {								\
+		u32 __count = node->number_of_registers;		\
+		if (WARN_ON(__count >= GUC_REGSET_MAX_REGISTERS))	\
+			continue;					\
+		node->registers[__count].offset = reg_addr.reg;		\
+		node->registers[__count].flags = (_flags);		\
+		if (defvalue)						\
+			node->registers[__count].value = (defvalue);	\
+		node->number_of_registers++;				\
+	} while (0)
+
 static int guc_ads_create(struct intel_guc *guc)
 {
 	struct drm_i915_private *dev_priv = guc_to_i915(guc);
@@ -1016,6 +1034,7 @@ static int guc_ads_create(struct intel_guc *guc)
 		u8 reg_state_buffer[GUC_S3_SAVE_SPACE_PAGES * PAGE_SIZE];
 	} __packed *blob;
 	struct intel_engine_cs *engine;
+	struct i915_workarounds *workarounds = &dev_priv->workarounds;
 	enum intel_engine_id id;
 	u32 base;
 
@@ -1035,6 +1054,39 @@ static int guc_ads_create(struct intel_guc *guc)
 
 	/* MMIO reg state */
 	for_each_engine(engine, dev_priv, id) {
+		u32 i;
+		struct guc_mmio_regset *eng_reg =
+			&blob->reg_state.engine_reg[engine->guc_id];
+
+		/*
+		 * Provide a list of registers to be saved/restored during gpu
+		 * reset. This is mainly required for Media reset (aka watchdog
+		 * timeout) which is completely under the control of GuC
+		 * (resubmission of hung workload is handled inside GuC).
+		 */
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_HWS_PGA(engine->mmio_base),
+				     GUC_REGSET_ENGINERESET |
+				     GUC_REGSET_SAVE_CURRENT_VALUE, 0);
+
+		/*
+		 * Workaround the guc issue with masked registers, note that
+		 * at this point guc submission is still disabled and the mode
+		 * register doesnt have the irq_steering bit set, which we
+		 * need to fwd irqs to GuC.
+		 */
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_MODE_GEN7(engine),
+				     GUC_REGSET_ENGINERESET |
+				     GUC_REGSET_SAVE_DEFAULT_VALUE,
+				     I915_READ(RING_MODE_GEN7(engine)) |
+				     GFX_INTERRUPT_STEERING | (0xFFFF<<16));
+
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_IMR(engine->mmio_base),
+				     GUC_REGSET_ENGINERESET |
+				     GUC_REGSET_SAVE_CURRENT_VALUE, 0);
+
+		DRM_DEBUG_DRIVER("%s register save/restore count: %u\n",
+				 engine->name, eng_reg->number_of_registers);
+
 		blob->reg_state.white_list[engine->guc_id].mmio_start =
 			i915_mmio_reg_offset(RING_FORCE_TO_NONPRIV(engine->mmio_base, 0));
 
@@ -1044,9 +1096,13 @@ static int guc_ads_create(struct intel_guc *guc)
 		 * inconsistencies with the handling of FORCE_TO_NONPRIV
 		 * registers.
 		 */
-		blob->reg_state.white_list[engine->guc_id].count = 0;
+		blob->reg_state.white_list[engine->guc_id].count =
+					workarounds->hw_whitelist_count[id];
 
-		/* Nothing to be saved or restored for now. */
+		for (i = 0; i < workarounds->hw_whitelist_count[id]; i++) {
+			blob->reg_state.white_list[engine->guc_id].offsets[i] =
+				I915_READ(RING_FORCE_TO_NONPRIV(engine->mmio_base, i));
+		}
 	}
 
 	/*
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 14/20] drm/i915/guc: Add support for reset engine using GuC commands
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (12 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 13/20] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-19 10:27   ` Chris Wilson
  2017-04-18 20:23 ` [PATCH v6 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

This patch adds per engine reset and recovery (TDR) support when GuC is
used to submit workloads to GPU.

In the case of i915 directly submission to ELSP, driver manages hang
detection, recovery and resubmission. With GuC submission these tasks
are shared between driver and GuC. i915 is still responsible for detecting
a hang, and when it does it only requests GuC to reset that Engine. GuC
internally manages acquiring forcewake and idling the engine before actually
resetting it.

Once the reset is successful, i915 takes over again and handles resubmission.
The scheduler in i915 knows which requests are pending so after resetting
a engine, pending workloads/requests are resubmitted again.

v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the
non-guc funtion names.

Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            | 43 +++++++++++++++++---------
 drivers/gpu/drm/i915/i915_drv.h            |  1 +
 drivers/gpu/drm/i915/i915_guc_submission.c | 48 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_guc_fwif.h      |  6 ++++
 drivers/gpu/drm/i915/intel_lrc.c           |  5 ++--
 drivers/gpu/drm/i915/intel_uc.h            |  1 +
 drivers/gpu/drm/i915/intel_uncore.c        |  5 ----
 7 files changed, 88 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 974be1fa77f9..b7e2fa8a0036 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1910,23 +1910,34 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	 */
 	i915_gem_reset_engine(engine);
 
-	/* forcing engine to idle */
-	ret = intel_reset_engine_start(engine);
-	if (ret) {
-		DRM_ERROR("Failed to disable %s\n", engine->name);
-		goto error;
-	}
+	if (!dev_priv->guc.execbuf_client) {
+		/* forcing engine to idle */
+		ret = intel_reset_engine_start(engine);
+		if (ret) {
+			DRM_ERROR("Failed to disable %s\n", engine->name);
+			goto error;
+		}
 
-	/* finally, reset engine */
-	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
-	if (ret) {
-		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
+		/* finally, reset engine */
+		ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+		if (ret) {
+			DRM_ERROR("Failed to reset %s, ret=%d\n",
+				  engine->name, ret);
+			intel_reset_engine_cancel(engine);
+			goto error;
+		}
+
+		/* be sure the request reset bit gets cleared */
 		intel_reset_engine_cancel(engine);
-		goto error;
-	}
 
-	/* be sure the request reset bit gets cleared */
-	intel_reset_engine_cancel(engine);
+	} else {
+		ret = i915_guc_reset_engine(engine);
+		if (ret) {
+			DRM_ERROR("GuC failed to reset %s, ret=%d\n",
+				  engine->name, ret);
+			goto error;
+		}
+	}
 
 	i915_gem_reset_finish_engine(engine);
 
@@ -1935,6 +1946,10 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	if (ret)
 		goto error;
 
+	/* for guc too */
+	if (dev_priv->guc.execbuf_client)
+		i915_guc_submission_reenable_engine(engine);
+
 	error->reset_engine_count[engine->id]++;
 
 wakeup:
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 71c34f15be64..5f2345fbff44 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3029,6 +3029,7 @@ extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_reset_engine_start(struct intel_engine_cs *engine);
 extern void intel_reset_engine_cancel(struct intel_engine_cs *engine);
+extern int i915_guc_reset_engine(struct intel_engine_cs *engine);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index d772718861df..c8067aeab6f4 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1338,6 +1338,25 @@ void i915_guc_submission_disable(struct drm_i915_private *dev_priv)
 	guc->execbuf_client = NULL;
 }
 
+void i915_guc_submission_reenable_engine(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct intel_guc *guc = &dev_priv->guc;
+	struct i915_guc_client *client = guc->execbuf_client;
+	const int wqi_size = sizeof(struct guc_wq_item);
+	struct drm_i915_gem_request *rq;
+
+	GEM_BUG_ON(!client);
+	intel_guc_sample_forcewake(guc);
+
+	spin_lock_irq(&engine->timeline->lock);
+	list_for_each_entry(rq, &engine->timeline->requests, link) {
+		guc_client_update_wq_rsvd(client, wqi_size);
+		__i915_guc_submit(rq);
+	}
+	spin_unlock_irq(&engine->timeline->lock);
+}
+
 /**
  * intel_guc_suspend() - notify GuC entering suspend state
  * @dev_priv:	i915 device private
@@ -1389,3 +1408,32 @@ int intel_guc_resume(struct drm_i915_private *dev_priv)
 
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
+
+int i915_guc_reset_engine(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct intel_guc *guc = &dev_priv->guc;
+	struct i915_gem_context *ctx;
+	u32 data[7];
+
+	if (!i915.enable_guc_submission)
+		return 0;
+
+	ctx = dev_priv->kernel_context;
+
+	/*
+	 * The affected context report is populated by GuC and is provided
+	 * to the driver using the shared page. We request for it but don't
+	 * use it as scheduler has all of these details.
+	 */
+	data[0] = INTEL_GUC_ACTION_REQUEST_ENGINE_RESET;
+	data[1] = engine->guc_id;
+	data[2] = INTEL_GUC_RESET_OPTION_REPORT_AFFECTED_CONTEXTS;
+	data[3] = 0;
+	data[4] = 0;
+	data[5] = guc->execbuf_client->stage_id;
+	/* first page is shared data with GuC */
+	data[6] = guc_ggtt_offset(ctx->engine[RCS].state);
+
+	return intel_guc_send(guc, data, ARRAY_SIZE(data));
+}
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index e6f8079df94a..081f2cf614e6 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -505,6 +505,7 @@ union guc_log_control {
 /* This Action will be programmed in C180 - SOFT_SCRATCH_O_REG */
 enum intel_guc_action {
 	INTEL_GUC_ACTION_DEFAULT = 0x0,
+	INTEL_GUC_ACTION_REQUEST_ENGINE_RESET = 0x3,
 	INTEL_GUC_ACTION_SAMPLE_FORCEWAKE = 0x6,
 	INTEL_GUC_ACTION_ALLOCATE_DOORBELL = 0x10,
 	INTEL_GUC_ACTION_DEALLOCATE_DOORBELL = 0x20,
@@ -518,6 +519,11 @@ enum intel_guc_action {
 	INTEL_GUC_ACTION_LIMIT
 };
 
+/* Reset engine options */
+enum action_engine_reset_options {
+	INTEL_GUC_RESET_OPTION_REPORT_AFFECTED_CONTEXTS = 0x10,
+};
+
 /*
  * The GuC sends its response to a command by overwriting the
  * command in SS0. The response is distinguishable from a command
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 7df278fe492e..6295760098a1 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1176,14 +1176,15 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
 
 	/* After a GPU reset, we may have requests to replay */
 	clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
-	if (!i915.enable_guc_submission && !execlists_elsp_idle(engine)) {
+	if (!execlists_elsp_idle(engine)) {
 		DRM_DEBUG_DRIVER("Restarting %s from requests [0x%x, 0x%x]\n",
 				 engine->name,
 				 port_seqno(&engine->execlist_port[0]),
 				 port_seqno(&engine->execlist_port[1]));
 		engine->execlist_port[0].count = 0;
 		engine->execlist_port[1].count = 0;
-		execlists_submit_ports(engine);
+		if (!dev_priv->guc.execbuf_client)
+			execlists_submit_ports(engine);
 	}
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/intel_uc.h b/drivers/gpu/drm/i915/intel_uc.h
index 2f0229da20bb..a348d08ce227 100644
--- a/drivers/gpu/drm/i915/intel_uc.h
+++ b/drivers/gpu/drm/i915/intel_uc.h
@@ -247,6 +247,7 @@ int i915_guc_wq_reserve(struct drm_i915_gem_request *rq);
 void i915_guc_wq_unreserve(struct drm_i915_gem_request *request);
 void i915_guc_submission_disable(struct drm_i915_private *dev_priv);
 void i915_guc_submission_fini(struct drm_i915_private *dev_priv);
+void i915_guc_submission_reenable_engine(struct intel_engine_cs *engine);
 struct i915_vma *intel_guc_allocate_vma(struct intel_guc *guc, u32 size);
 
 /* intel_guc_log.c */
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 120fb440bb8b..778813ac7628 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1781,14 +1781,9 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
-/*
- * When GuC submission is enabled, GuC manages ELSP and can initiate the
- * engine reset too. For now, fall back to full GPU reset if it is enabled.
- */
 bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
 {
 	return (dev_priv->info.has_reset_engine &&
-		!dev_priv->guc.execbuf_client &&
 		i915.reset == 2);
 }
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (13 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 14/20] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 21:18   ` Daniele Ceraolo Spurio
  2017-04-18 20:23 ` [PATCH v6 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

For watchdog / media reset, the firmware must know the address of the shared
data page (the first page of the default context).

This information should be in DWORD 9 of the GUC_CTL structure.

v2: Use guc_ggtt_offset (Chris).
Store the ggtt offset of the default ctx as we needed for
suspend/resume/reset (Daniele).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_guc_submission.c | 21 ++++++---------------
 drivers/gpu/drm/i915/intel_guc_fwif.h      |  2 +-
 drivers/gpu/drm/i915/intel_guc_loader.c    | 11 +++++++++++
 drivers/gpu/drm/i915/intel_uc.h            |  2 ++
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index c8067aeab6f4..58d6c570a188 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1364,7 +1364,6 @@ void i915_guc_submission_reenable_engine(struct intel_engine_cs *engine)
 int intel_guc_suspend(struct drm_i915_private *dev_priv)
 {
 	struct intel_guc *guc = &dev_priv->guc;
-	struct i915_gem_context *ctx;
 	u32 data[3];
 
 	if (guc->fw.load_status != INTEL_UC_FIRMWARE_SUCCESS)
@@ -1372,13 +1371,11 @@ int intel_guc_suspend(struct drm_i915_private *dev_priv)
 
 	gen9_disable_guc_interrupts(dev_priv);
 
-	ctx = dev_priv->kernel_context;
-
 	data[0] = INTEL_GUC_ACTION_ENTER_S_STATE;
 	/* any value greater than GUC_POWER_D0 */
 	data[1] = GUC_POWER_D1;
-	/* first page is shared data with GuC */
-	data[2] = guc_ggtt_offset(ctx->engine[RCS].state);
+	/* first page of default ctx is shared data with GuC */
+	data[2] = guc->shared_data_offset;
 
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
@@ -1390,7 +1387,6 @@ int intel_guc_suspend(struct drm_i915_private *dev_priv)
 int intel_guc_resume(struct drm_i915_private *dev_priv)
 {
 	struct intel_guc *guc = &dev_priv->guc;
-	struct i915_gem_context *ctx;
 	u32 data[3];
 
 	if (guc->fw.load_status != INTEL_UC_FIRMWARE_SUCCESS)
@@ -1399,12 +1395,10 @@ int intel_guc_resume(struct drm_i915_private *dev_priv)
 	if (i915.guc_log_level >= 0)
 		gen9_enable_guc_interrupts(dev_priv);
 
-	ctx = dev_priv->kernel_context;
-
 	data[0] = INTEL_GUC_ACTION_EXIT_S_STATE;
 	data[1] = GUC_POWER_D0;
-	/* first page is shared data with GuC */
-	data[2] = guc_ggtt_offset(ctx->engine[RCS].state);
+	/* first page of default ctx is shared data with GuC */
+	data[2] = guc->shared_data_offset;
 
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
@@ -1413,14 +1407,11 @@ int i915_guc_reset_engine(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 	struct intel_guc *guc = &dev_priv->guc;
-	struct i915_gem_context *ctx;
 	u32 data[7];
 
 	if (!i915.enable_guc_submission)
 		return 0;
 
-	ctx = dev_priv->kernel_context;
-
 	/*
 	 * The affected context report is populated by GuC and is provided
 	 * to the driver using the shared page. We request for it but don't
@@ -1432,8 +1423,8 @@ int i915_guc_reset_engine(struct intel_engine_cs *engine)
 	data[3] = 0;
 	data[4] = 0;
 	data[5] = guc->execbuf_client->stage_id;
-	/* first page is shared data with GuC */
-	data[6] = guc_ggtt_offset(ctx->engine[RCS].state);
+	/* first page of default ctx is shared data with GuC */
+	data[6] = guc->shared_data_offset;
 
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index 081f2cf614e6..a2d0cba2f8b9 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -135,7 +135,7 @@
 #define   GUC_ADS_ADDR_SHIFT		11
 #define   GUC_ADS_ADDR_MASK		0xfffff800
 
-#define GUC_CTL_RSRVD			9
+#define GUC_CTL_SHARED_DATA		9
 
 #define GUC_CTL_MAX_DWORDS		(SOFT_SCRATCH_COUNT - 2) /* [1..14] */
 
diff --git a/drivers/gpu/drm/i915/intel_guc_loader.c b/drivers/gpu/drm/i915/intel_guc_loader.c
index d9045b6e897b..8cd5c2bf9510 100644
--- a/drivers/gpu/drm/i915/intel_guc_loader.c
+++ b/drivers/gpu/drm/i915/intel_guc_loader.c
@@ -108,6 +108,7 @@ static void guc_params_init(struct drm_i915_private *dev_priv)
 {
 	struct intel_guc *guc = &dev_priv->guc;
 	u32 params[GUC_CTL_MAX_DWORDS];
+	struct i915_gem_context *ctx;
 	int i;
 
 	memset(&params, 0, sizeof(params));
@@ -156,6 +157,16 @@ static void guc_params_init(struct drm_i915_private *dev_priv)
 		params[GUC_CTL_FEATURE] &= ~GUC_CTL_DISABLE_SCHEDULER;
 	}
 
+	/*
+	 * For watchdog / media reset, GuC must know the address of the shared
+	 * data page, which is the first page of the default context.
+	 * We will also use this page in several places (suspend/resume/reset),
+	 * so save the ggtt offset.
+	 */
+	ctx = dev_priv->kernel_context;
+	guc->shared_data_offset = guc_ggtt_offset(ctx->engine[RCS].state);
+	params[GUC_CTL_SHARED_DATA] = guc->shared_data_offset;
+
 	I915_WRITE(SOFT_SCRATCH(0), 0);
 
 	for (i = 0; i < GUC_CTL_MAX_DWORDS; i++)
diff --git a/drivers/gpu/drm/i915/intel_uc.h b/drivers/gpu/drm/i915/intel_uc.h
index a348d08ce227..c686f20082d7 100644
--- a/drivers/gpu/drm/i915/intel_uc.h
+++ b/drivers/gpu/drm/i915/intel_uc.h
@@ -195,6 +195,8 @@ struct intel_guc {
 	DECLARE_BITMAP(doorbell_bitmap, GUC_NUM_DOORBELLS);
 	uint32_t db_cacheline;		/* Cyclic counter mod pagesize	*/
 
+	uint32_t shared_data_offset;	/* First page of default ctx */
+
 	/* Action status & statistics */
 	uint64_t action_count;		/* Total commands issued	*/
 	uint32_t action_cmd;		/* Last command word		*/
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (14 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-19 10:20   ` Chris Wilson
  2017-04-18 20:23 ` [PATCH v6 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

*** General ***

Watchdog timeout (or "media engine reset") is a feature that allows
userland applications to enable hang detection on individual batch buffers.
The detection mechanism itself is mostly bound to the hardware and the only
thing that the driver needs to do to support this form of hang detection
is to implement the interrupt handling support as well as watchdog command
emission before and after the emitted batch buffer start instruction in the
ring buffer.

The principle of the hang detection mechanism is as follows:

1. Once the decision has been made to enable watchdog timeout for a
particular batch buffer and the driver is in the process of emitting the
batch buffer start instruction into the ring buffer it also emits a
watchdog timer start instruction before and a watchdog timer cancellation
instruction after the batch buffer start instruction in the ring buffer.

2. Once the GPU execution reaches the watchdog timer start instruction
the hardware watchdog counter is started by the hardware. The counter
keeps counting until either reaching a previously configured threshold
value or the timer cancellation instruction is executed.

2a. If the counter reaches the threshold value the hardware fires a
watchdog interrupt that is picked up by the watchdog interrupt handler.
This means that a hang has been detected and the driver needs to deal with
it the same way it would deal with a engine hang detected by the periodic
hang checker. The only difference between the two is that we already blamed
the active request (to ensure an engine reset).

2b. If the batch buffer completes and the execution reaches the watchdog
cancellation instruction before the watchdog counter reaches its
threshold value the watchdog is cancelled and nothing more comes of it.
No hang is detected.

Note about future interaction with preemption: Preemption could happen
in a command sequence prior to watchdog counter getting disabled,
resulting in watchdog being triggered following preemption. The driver will
need to explicitly disable the watchdog counter as part of the
preemption sequence.

*** This patch introduces: ***

1. IRQ handler code for watchdog timeout allowing direct hang recovery
based on hardware-driven hang detection, which then integrates directly
with the hang recovery path. This is independent of having per-engine reset
or just full gpu reset.

2. Watchdog specific register information.

Currently the render engine and all available media engines support
watchdog timeout (VECS is only supported in GEN9). The specifications elude
to the BCS engine being supported but that is currently not supported by
this commit.

Note that the value to stop the counter is different between render and
non-render engines in GEN8; GEN9 onwards it's the same.

v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
against false-positives.

v3: Don't use high priority tasklet, use engine_last_submit while
checking for false-positives. From GEN9 onwards, the stop counter bit is
the same for all engines.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  4 ++
 drivers/gpu/drm/i915/i915_irq.c         | 13 ++++++-
 drivers/gpu/drm/i915/i915_reg.h         |  6 +++
 drivers/gpu/drm/i915/intel_hangcheck.c  | 13 +++++--
 drivers/gpu/drm/i915/intel_lrc.c        | 69 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |  4 ++
 6 files changed, 104 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 5f2345fbff44..203f2112dd18 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1608,6 +1608,9 @@ struct i915_gpu_error {
 	 * inspect the bit and do the reset directly, otherwise the worker
 	 * waits for the struct_mutex.
 	 *
+	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
+	 * I915_RESET_WATCHDOG to report the hang detection cause accurately.
+	 *
 	 * #I915_WEDGED - If reset fails and we can no longer use the GPU,
 	 * we set the #I915_WEDGED bit. Prior to command submission, e.g.
 	 * i915_gem_request_alloc(), this bit is checked and the sequence
@@ -1616,6 +1619,7 @@ struct i915_gpu_error {
 	unsigned long flags;
 #define I915_RESET_BACKOFF	0
 #define I915_RESET_HANDOFF	1
+#define I915_RESET_WATCHDOG	2
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
 	/* if available, engine-specific reset is tried before full gpu reset */
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index ab1d77ab0977..39ed432db19e 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1370,6 +1370,10 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
 
 	if (tasklet)
 		tasklet_hi_schedule(&engine->irq_tasklet);
+
+	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
+		tasklet_schedule(&engine->watchdog_tasklet);
+	}
 }
 
 static irqreturn_t gen8_gt_irq_ack(struct drm_i915_private *dev_priv,
@@ -3456,12 +3460,15 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	uint32_t gt_interrupts[] = {
 		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT,
 		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
-			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
+			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
 		0,
 		GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT
@@ -3470,6 +3477,10 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	if (HAS_L3_DPF(dev_priv))
 		gt_interrupts[0] |= GT_RENDER_L3_PARITY_ERROR_INTERRUPT;
 
+	/* VECS watchdog is only available in skl+ */
+	if (INTEL_GEN(dev_priv) >= 9)
+		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
+
 	dev_priv->pm_ier = 0x0;
 	dev_priv->pm_imr = ~dev_priv->pm_ier;
 	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 4c72adae368b..c6a47648f99c 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1908,6 +1908,11 @@ enum skl_disp_power_wells {
 #define RING_START(base)	_MMIO((base)+0x38)
 #define RING_CTL(base)		_MMIO((base)+0x3c)
 #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
+#define RING_CNTR(base)        _MMIO((base) + 0x178)
+#define   GEN8_WATCHDOG_ENABLE		0
+#define   GEN8_WATCHDOG_DISABLE	1
+#define   GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 & non-render only */
+#define RING_THRESH(base)      _MMIO((base) + 0x17C)
 #define RING_SYNC_0(base)	_MMIO((base)+0x40)
 #define RING_SYNC_1(base)	_MMIO((base)+0x44)
 #define RING_SYNC_2(base)	_MMIO((base)+0x48)
@@ -2386,6 +2391,7 @@ enum skl_disp_power_wells {
 #define GT_BSD_USER_INTERRUPT			(1 << 12)
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
 #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
+#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
 #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
 #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index dce742243ba6..0e9272c97096 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -388,7 +388,8 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
 
 static void hangcheck_declare_hang(struct drm_i915_private *i915,
 				   unsigned int hung,
-				   unsigned int stuck)
+				   unsigned int stuck,
+				   unsigned int watchdog)
 {
 	struct intel_engine_cs *engine;
 	char msg[80];
@@ -401,7 +402,8 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
 	if (stuck != hung)
 		hung &= ~stuck;
 	len = scnprintf(msg, sizeof(msg),
-			"%s on ", stuck == hung ? "No progress" : "Hang");
+			"%s on ", watchdog ? "Watchdog timeout" :
+				  stuck == hung ? "No progress" : "Hang");
 	for_each_engine_masked(engine, i915, hung, tmp)
 		len += scnprintf(msg + len, sizeof(msg) - len,
 				 "%s, ", engine->name);
@@ -425,7 +427,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			     gpu_error.hangcheck_work.work);
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
-	unsigned int hung = 0, stuck = 0;
+	unsigned int hung = 0, stuck = 0, watchdog = 0;
 	int busy_count = 0;
 
 	if (!i915.enable_hangcheck)
@@ -437,6 +439,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	if (i915_terminally_wedged(&dev_priv->gpu_error))
 		return;
 
+	if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags))
+		watchdog = 1;
+
 	/* As enabling the GPU requires fairly extensive mmio access,
 	 * periodically arm the mmio checker to see if we are triggering
 	 * any invalid access.
@@ -463,7 +468,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	}
 
 	if (hung)
-		hangcheck_declare_hang(dev_priv, hung, stuck);
+		hangcheck_declare_hang(dev_priv, hung, stuck, watchdog);
 
 	/* Reset timer in case GPU hangs without another request being added */
 	if (busy_count)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 6295760098a1..2263b9fb9b50 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1461,6 +1461,53 @@ static int gen8_emit_flush_render(struct drm_i915_gem_request *request,
 	return 0;
 }
 
+/* From GEN9 onwards, all engines use the same RING_CNTR format */
+static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
+{
+	if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9)
+		return GEN8_WATCHDOG_DISABLE;
+	else
+		return GEN8_XCS_WATCHDOG_DISABLE;
+}
+
+#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
+static void gen8_watchdog_irq_handler(unsigned long data)
+{
+	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
+	struct drm_i915_private *dev_priv = engine->i915;
+	u32 current_seqno;
+
+	intel_uncore_forcewake_get(dev_priv, engine->fw_domains);
+
+	/* Stop the counter to prevent further timeout interrupts */
+	I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));
+
+	current_seqno = intel_engine_get_seqno(engine);
+
+	/* did the request complete after the timer expired? */
+	if (intel_engine_last_submit(engine) == current_seqno)
+		goto fw_put;
+
+	if (engine->hangcheck.watchdog == current_seqno) {
+		/* Make sure the active request will be marked as guilty */
+		engine->hangcheck.stalled = true;
+		engine->hangcheck.seqno = intel_engine_get_seqno(engine);
+
+		/* And try to run the hangcheck_work as soon as possible */
+		set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
+		queue_delayed_work(system_long_wq,
+				   &dev_priv->gpu_error.hangcheck_work, 0);
+	} else {
+		engine->hangcheck.watchdog = current_seqno;
+		/* Re-start the counter, if really hung, it will expire again */
+		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
+		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
+	}
+
+fw_put:
+	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
+}
+
 /*
  * Reserve space for 2 NOOPs at the end of each request to be
  * used as a workaround for not being allowed to do lite
@@ -1554,6 +1601,9 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
 	if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->irq_tasklet.state)))
 		tasklet_kill(&engine->irq_tasklet);
 
+	if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->watchdog_tasklet.state)))
+		tasklet_kill(&engine->watchdog_tasklet);
+
 	dev_priv = engine->i915;
 
 	if (engine->buffer) {
@@ -1612,6 +1662,22 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
 	unsigned shift = engine->irq_shift;
 	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
 	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
+
+	switch (engine->id) {
+	default:
+		/* BCS engine does not support hw watchdog */
+		break;
+	case RCS:
+	case VCS:
+	case VCS2:
+		engine->irq_keep_mask |= (GT_GEN8_WATCHDOG_INTERRUPT << shift);
+		break;
+	case VECS:
+		if (INTEL_GEN(engine->i915) >= 9)
+			engine->irq_keep_mask |=
+				(GT_GEN8_WATCHDOG_INTERRUPT << shift);
+		break;
+	}
 }
 
 static int
@@ -1660,6 +1726,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
 	tasklet_init(&engine->irq_tasklet,
 		     intel_lrc_irq_handler, (unsigned long)engine);
 
+	tasklet_init(&engine->watchdog_tasklet,
+		     gen8_watchdog_irq_handler, (unsigned long)engine);
+
 	logical_ring_default_vfuncs(engine);
 	logical_ring_default_irqs(engine);
 }
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 6e5577dc1f7a..927f0c06cbe6 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -128,6 +128,7 @@ struct intel_instdone {
 struct intel_engine_hangcheck {
 	u64 acthd;
 	u32 seqno;
+	u32 watchdog;
 	enum intel_engine_hangcheck_action action;
 	unsigned long action_timestamp;
 	int deadlock;
@@ -409,6 +410,9 @@ struct intel_engine_cs {
 
 	struct intel_engine_hangcheck hangcheck;
 
+	/* watchdog_tasklet: stop counter and re-schedule hangcheck_work asap */
+	struct tasklet_struct watchdog_tasklet;
+
 	bool needs_cmd_parser;
 
 	/*
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (15 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 21:20   ` Chris Wilson
  2017-04-18 20:23 ` [PATCH v6 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

Emit the required commands into the ring buffer for starting and
stopping the watchdog timer before/after batch buffer start during
batch buffer submission.

v2: Support watchdog threshold per context engine, merge lri commands,
and move watchdog commands emission to emit_bb_start. Request space of
combined start_watchdog, bb_start and stop_watchdog to avoid any error
after emitting bb_start.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.h |  4 ++
 drivers/gpu/drm/i915/intel_lrc.c        | 81 ++++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  4 ++
 3 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 4af2ab94558b..88700bdbb4e1 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -150,6 +150,10 @@ struct i915_gem_context {
 		u32 *lrc_reg_state;
 		u64 lrc_desc;
 		int pin_count;
+		/** watchdog_threshold: hw watchdog threshold value,
+		 * in clock counts
+		 */
+		u32 watchdog_threshold;
 		bool initialised;
 	} engine[I915_NUM_ENGINES];
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 2263b9fb9b50..7a202e73ce9b 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1310,6 +1310,8 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
 			      const unsigned int flags)
 {
 	u32 *cs;
+	u32 num_dwords;
+	bool watchdog_running = false;
 	int ret;
 
 	/* Don't rely in hw updating PDPs, specially in lite-restore.
@@ -1329,10 +1331,29 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
 		req->ctx->ppgtt->pd_dirty_rings &= ~intel_engine_flag(req->engine);
 	}
 
-	cs = intel_ring_begin(req, 4);
+	/* bb_start only */
+	num_dwords = 4;
+
+	/* check if watchdog will be required */
+	if (req->ctx->engine[req->engine->id].watchdog_threshold != 0) {
+		if (!req->engine->emit_start_watchdog ||
+		    !req->engine->emit_stop_watchdog)
+			return -EINVAL;
+
+		/* + start_watchdog (6) + stop_watchdog (4) */
+		num_dwords += 10;
+		watchdog_running = true;
+	}
+
+	cs = intel_ring_begin(req, num_dwords);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
+	if (watchdog_running) {
+		/* Start watchdog timer */
+		cs = req->engine->emit_start_watchdog(req, cs);
+	}
+
 	/* FIXME(BDW): Address space and security selectors. */
 	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
 		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8)) |
@@ -1340,8 +1361,13 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
 	*cs++ = lower_32_bits(offset);
 	*cs++ = upper_32_bits(offset);
 	*cs++ = MI_NOOP;
-	intel_ring_advance(req, cs);
 
+	if (watchdog_running) {
+		/* Cancel watchdog timer */
+		cs = req->engine->emit_stop_watchdog(req, cs);
+	}
+
+	intel_ring_advance(req, cs);
 	return 0;
 }
 
@@ -1508,6 +1534,49 @@ static void gen8_watchdog_irq_handler(unsigned long data)
 	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
 }
 
+static u32 *gen8_emit_start_watchdog(struct drm_i915_gem_request *req, u32 *cs)
+{
+	struct intel_engine_cs *engine = req->engine;
+	struct i915_gem_context *ctx = req->ctx;
+	struct intel_context *ce = &ctx->engine[engine->id];
+
+	/* XXX: no watchdog support in BCS engine */
+	GEM_BUG_ON(engine->id == BCS);
+
+	/*
+	 * watchdog register must never be programmed to zero. This would
+	 * cause the watchdog counter to exceed and not allow the engine to
+	 * go into IDLE state
+	 */
+	GEM_BUG_ON(ce->watchdog_threshold == 0);
+
+	/* Set counter period */
+	*cs++ = MI_LOAD_REGISTER_IMM(2);
+	*cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base));
+	*cs++ = ce->watchdog_threshold;
+	/* Start counter */
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = GEN8_WATCHDOG_ENABLE;
+	*cs++ = MI_NOOP;
+
+	return cs;
+}
+
+static u32 *gen8_emit_stop_watchdog(struct drm_i915_gem_request *req, u32 *cs)
+{
+	struct intel_engine_cs *engine = req->engine;
+
+	/* XXX: no watchdog support in BCS engine */
+	GEM_BUG_ON(engine->id == BCS);
+
+	*cs++ = MI_LOAD_REGISTER_IMM(2);
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = get_watchdog_disable(engine);
+	*cs++ = MI_NOOP;
+
+	return cs;
+}
+
 /*
  * Reserve space for 2 NOOPs at the end of each request to be
  * used as a workaround for not being allowed to do lite
@@ -1776,6 +1845,8 @@ int logical_render_ring_init(struct intel_engine_cs *engine)
 	engine->emit_flush = gen8_emit_flush_render;
 	engine->emit_breadcrumb = gen8_emit_breadcrumb_render;
 	engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_render_sz;
+	engine->emit_start_watchdog = gen8_emit_start_watchdog;
+	engine->emit_stop_watchdog = gen8_emit_stop_watchdog;
 
 	ret = intel_engine_create_scratch(engine, PAGE_SIZE);
 	if (ret)
@@ -1799,6 +1870,12 @@ int logical_xcs_ring_init(struct intel_engine_cs *engine)
 {
 	logical_ring_setup(engine);
 
+	/* BCS engine does not have a watchdog-expired irq */
+	if (engine->id != BCS) {
+		engine->emit_start_watchdog = gen8_emit_start_watchdog;
+		engine->emit_stop_watchdog = gen8_emit_stop_watchdog;
+	}
+
 	return logical_ring_init(engine);
 }
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 927f0c06cbe6..0392bd5fd300 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -279,6 +279,10 @@ struct intel_engine_cs {
 
 	int		(*emit_flush)(struct drm_i915_gem_request *request,
 				      u32 mode);
+	u32 *		(*emit_start_watchdog)(struct drm_i915_gem_request *req,
+					       u32 *cs);
+	u32 *		(*emit_stop_watchdog)(struct drm_i915_gem_request *req,
+					      u32 *cs);
 #define EMIT_INVALIDATE	BIT(0)
 #define EMIT_FLUSH	BIT(1)
 #define EMIT_BARRIER	(EMIT_INVALIDATE | EMIT_FLUSH)
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (16 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-19 16:56   ` Jeff McGee
                     ` (2 more replies)
  2017-04-18 20:23 ` [PATCH v6 19/20] drm/i915: Watchdog timeout: Include threshold value in error state Michel Thierry
                   ` (2 subsequent siblings)
  20 siblings, 3 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

Final enablement patch for GPU hang detection using watchdog timeout.
Using the gem_context_setparam ioctl, users can specify the desired
timeout value in microseconds, and the driver will do the conversion to
'timestamps'.

The recommended default watchdog threshold for video engines is 60000 us,
since this has been _empirically determined_ to be a good compromise for
low-latency requirements and low rate of false positives. The default
register value is ~106000us and the theoretical max value (all 1s) is
353 seconds.

Note, UABI engine ids and i915 engine ids are different, and this patch
uses the i915 ones. Some kind of mapping table [1] is required if we
decide to use the UABI engine ids.

[1] http://patchwork.freedesktop.org/patch/msgid/20170329135831.30254-2-chris@chris-wilson.co.uk

v2: Fixed get api to return values in microseconds. Threshold updated to
be per context engine. Check for u32 overflow. Capture ctx threshold
value in error state.

v3: Add a way to get array size, short-cut to disable all thresholds,
return EFAULT / EINVAL as needed. Move the capture of the threshold
value in the error state into a new patch. BXT has a different
timestamp base (because why not?).

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  29 +++++++++
 drivers/gpu/drm/i915/i915_gem_context.c | 102 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.c        |   5 +-
 include/uapi/drm/i915_drm.h             |   1 +
 4 files changed, 135 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 203f2112dd18..f65a236fddef 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3574,6 +3574,35 @@ i915_gem_context_lookup_timeline(struct i915_gem_context *ctx,
 	return &vm->timeline.engine[engine->id];
 }
 
+/*
+ * BDW & SKL+ Timestamp timer resolution = 0.080 uSec,
+ * or 12500000 counts per second, or ~12 counts per microsecond.
+ *
+ * But Broxton Timestamp timer resolution is different, 0.052 uSec,
+ * or 19200000 counts per second, or ~19 counts per microsecond.
+ */
+#define SKL_TIMESTAMP_CNTS_PER_USEC 12
+#define BXT_TIMESTAMP_CNTS_PER_USEC 19
+#define TIMESTAMP_CNTS_PER_USEC(dev_priv) (IS_BROXTON(dev_priv) ? \
+					   BXT_TIMESTAMP_CNTS_PER_USEC : \
+					   SKL_TIMESTAMP_CNTS_PER_USEC)
+static inline u32
+watchdog_to_us(struct drm_i915_private *dev_priv, u32 value_in_clock_counts)
+{
+	return value_in_clock_counts / TIMESTAMP_CNTS_PER_USEC(dev_priv);
+}
+
+static inline u32
+watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us)
+{
+	u64 threshold = value_in_us * TIMESTAMP_CNTS_PER_USEC(dev_priv);
+
+	if (overflows_type(threshold, u32))
+		return -EINVAL;
+
+	return threshold;
+}
+
 int i915_perf_open_ioctl(struct drm_device *dev, void *data,
 			 struct drm_file *file);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index edbed85a1c88..85a6467a25a6 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -422,6 +422,102 @@ i915_gem_context_create_gvt(struct drm_device *dev)
 	return ctx;
 }
 
+/* Return the timer count threshold in microseconds. */
+int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *dev_priv = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	u32 threshold_in_us[I915_NUM_ENGINES];
+
+	if (args->size == 0)
+		goto out;
+
+	if (args->size < sizeof(threshold_in_us))
+		return -EFAULT;
+
+	if (!dev_priv->engine[VCS]->emit_start_watchdog)
+		return -ENODEV;
+
+	for_each_engine(engine, dev_priv, id) {
+		struct intel_context *ce = &ctx->engine[id];
+
+		threshold_in_us[id] = watchdog_to_us(dev_priv,
+						     ce->watchdog_threshold);
+	}
+
+	mutex_unlock(&dev_priv->drm.struct_mutex);
+	if (__copy_to_user(u64_to_user_ptr(args->value),
+			   &threshold_in_us,
+			   sizeof(threshold_in_us))) {
+		mutex_lock(&dev_priv->drm.struct_mutex);
+		return -EFAULT;
+	}
+	mutex_lock(&dev_priv->drm.struct_mutex);
+
+out:
+	args->size = sizeof(threshold_in_us);
+
+	return 0;
+}
+
+/*
+ * Based on time out value in microseconds (us) calculate
+ * timer count thresholds needed based on core frequency.
+ * Watchdog can be disabled by setting it to 0.
+ */
+int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *dev_priv = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	u32 threshold[I915_NUM_ENGINES];
+
+	memset(&threshold, 0, sizeof(threshold));
+
+	/* shortcut to disable in all engines */
+	if (args->size == 0)
+		goto set_watchdog;
+
+	if (args->size < sizeof(threshold))
+		return -EFAULT;
+
+	if (!dev_priv->engine[VCS]->emit_start_watchdog)
+		return -ENODEV;
+
+	mutex_unlock(&dev_priv->drm.struct_mutex);
+	if (copy_from_user(&threshold,
+			   u64_to_user_ptr(args->value),
+			   sizeof(threshold))) {
+		mutex_lock(&dev_priv->drm.struct_mutex);
+		return -EFAULT;
+	}
+	mutex_lock(&dev_priv->drm.struct_mutex);
+
+	/* not supported in blitter engine */
+	if (threshold[BCS] != 0)
+		return -EINVAL;
+
+	for_each_engine(engine, dev_priv, id) {
+		threshold[id] = watchdog_to_clock_counts(dev_priv,
+							 threshold[id]);
+
+		if (threshold[id] == -EINVAL)
+			return -EINVAL;
+	}
+
+set_watchdog:
+	for_each_engine(engine, dev_priv, id) {
+		struct intel_context *ce = &ctx->engine[id];
+
+		ce->watchdog_threshold = threshold[id];
+	}
+
+	return 0;
+}
+
 int i915_gem_context_init(struct drm_i915_private *dev_priv)
 {
 	struct i915_gem_context *ctx;
@@ -1061,6 +1157,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_BANNABLE:
 		args->value = i915_gem_context_is_bannable(ctx);
 		break;
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_get_watchdog(ctx, args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -1118,6 +1217,9 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		else
 			i915_gem_context_clear_bannable(ctx);
 		break;
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_set_watchdog(ctx, args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 7a202e73ce9b..f4e05531a9a0 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1496,7 +1496,7 @@ static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
 		return GEN8_XCS_WATCHDOG_DISABLE;
 }
 
-#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
+#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)
 static void gen8_watchdog_irq_handler(unsigned long data)
 {
 	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
@@ -1526,7 +1526,8 @@ static void gen8_watchdog_irq_handler(unsigned long data)
 	} else {
 		engine->hangcheck.watchdog = current_seqno;
 		/* Re-start the counter, if really hung, it will expire again */
-		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
+		I915_WRITE_FW(RING_THRESH(engine->mmio_base),
+			      GEN8_WATCHDOG_1000US(dev_priv));
 		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
 	}
 
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index fadedefba6db..18bc0ec618dd 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1308,6 +1308,7 @@ struct drm_i915_gem_context_param {
 #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
 #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
 #define I915_CONTEXT_PARAM_BANNABLE	0x5
+#define I915_CONTEXT_PARAM_WATCHDOG	0x6
 	__u64 value;
 };
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 19/20] drm/i915: Watchdog timeout: Include threshold value in error state
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (17 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 20:23 ` [PATCH v6 20/20] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
  2017-04-18 20:44 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev2) Patchwork
  20 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

Save the watchdog threshold (in us) as part of the engine state.

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h       |  1 +
 drivers/gpu/drm/i915/i915_gpu_error.c | 11 +++++++----
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index f65a236fddef..02ee26199cfe 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1022,6 +1022,7 @@ struct i915_gpu_state {
 			int ban_score;
 			int active;
 			int guilty;
+			int watchdog_threshold;
 		} context;
 
 		struct drm_i915_error_object {
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index a2ffb1ef2cfa..1b1a49bc0c3c 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -388,9 +388,10 @@ static void error_print_context(struct drm_i915_error_state_buf *m,
 				const char *header,
 				const struct drm_i915_error_context *ctx)
 {
-	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, ban score %d guilty %d active %d\n",
+	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, ban score %d guilty %d active %d, watchdog %dus\n",
 		   header, ctx->comm, ctx->pid, ctx->handle, ctx->hw_id,
-		   ctx->ban_score, ctx->guilty, ctx->active);
+		   ctx->ban_score, ctx->guilty, ctx->active,
+		   watchdog_to_us(m->i915, ctx->watchdog_threshold));
 }
 
 static void error_print_engine(struct drm_i915_error_state_buf *m,
@@ -1344,7 +1345,8 @@ static void error_record_engine_execlists(struct intel_engine_cs *engine,
 }
 
 static void record_context(struct drm_i915_error_context *e,
-			   struct i915_gem_context *ctx)
+			   struct i915_gem_context *ctx,
+			   u32 engine_id)
 {
 	if (ctx->pid) {
 		struct task_struct *task;
@@ -1363,6 +1365,7 @@ static void record_context(struct drm_i915_error_context *e,
 	e->ban_score = ctx->ban_score;
 	e->guilty = ctx->guilty_count;
 	e->active = ctx->active_count;
+	e->watchdog_threshold =	ctx->engine[engine_id].watchdog_threshold;
 }
 
 static void request_record_user_bo(struct drm_i915_gem_request *request,
@@ -1426,7 +1429,7 @@ static void i915_gem_record_rings(struct drm_i915_private *dev_priv,
 			ee->vm = request->ctx->ppgtt ?
 				&request->ctx->ppgtt->base : &ggtt->base;
 
-			record_context(&ee->context, request->ctx);
+			record_context(&ee->context, request->ctx, engine->id);
 
 			/* We need to copy these to an anonymous buffer
 			 * as the simplest method to avoid being overwritten
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v6 20/20] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (18 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 19/20] drm/i915: Watchdog timeout: Include threshold value in error state Michel Thierry
@ 2017-04-18 20:23 ` Michel Thierry
  2017-04-18 20:44 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev2) Patchwork
  20 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 20:23 UTC (permalink / raw)
  To: intel-gfx

From firmware v8.8, GuC provides the count of media engine resets
(watchdog timeout). This information is available in the GuC shared
context data struct, which resides in the first page of the default
(kernel) lrc context.

Since GuC handled engine resets are transparent for kernel and user,
provide a simple debugfs entry to see the number of times media reset
has happened.

v2: Remove unnecessary struct_mutex, _get_dirty_page and kmap_atomic;
use READ_ONCE. (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c   | 22 ++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_guc_fwif.h | 18 ++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 6444c1a9bd22..35ce771c8b8f 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1403,6 +1403,26 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 	return 0;
 }
 
+static u32 i915_watchdog_reset_count(struct drm_i915_private *dev_priv)
+{
+	struct i915_gem_context *ctx;
+	struct page *page;
+	struct guc_shared_ctx_data *guc_shared_data;
+	u32 guc_media_reset_count;
+
+	if (!i915.enable_guc_submission)
+		return 0;
+
+	ctx = dev_priv->kernel_context;
+	page = i915_gem_object_get_page(ctx->engine[RCS].state->obj,
+					LRC_GUCSHR_PN);
+	guc_shared_data = kmap(page);
+	guc_media_reset_count = READ_ONCE(guc_shared_data->media_reset_count);
+	kunmap(page);
+
+	return guc_media_reset_count;
+}
+
 static int i915_reset_info(struct seq_file *m, void *unused)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
@@ -1411,6 +1431,8 @@ static int i915_reset_info(struct seq_file *m, void *unused)
 	enum intel_engine_id id;
 
 	seq_printf(m, "full gpu reset = %u\n", i915_reset_count(error));
+	seq_printf(m, "GuC watchdog/media reset = %u\n",
+		   i915_watchdog_reset_count(dev_priv));
 
 	for_each_engine(engine, dev_priv, id) {
 		seq_printf(m, "%s = %u\n", engine->name,
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index a2d0cba2f8b9..e45987f7aa50 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -502,6 +502,24 @@ union guc_log_control {
 	u32 value;
 } __packed;
 
+/* GuC Shared Context Data Struct */
+struct guc_shared_ctx_data {
+	u32 addr_of_last_preempted_data_low;
+	u32 addr_of_last_preempted_data_high;
+	u32 addr_of_last_preempted_data_high_tmp;
+	u32 padding;
+	u32 is_mapped_to_proxy;
+	u32 proxy_ctx_id;
+	u32 engine_reset_ctx_id;
+	u32 media_reset_count;
+	u32 reserved[8];
+	u32 uk_last_ctx_switch_reason;
+	u32 was_reset;
+	u32 lrca_gpu_addr;
+	u32 execlist_ctx;
+	u32 reserved1[32];
+} __packed;
+
 /* This Action will be programmed in C180 - SOFT_SCRATCH_O_REG */
 enum intel_guc_action {
 	INTEL_GUC_ACTION_DEFAULT = 0x0,
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev2)
  2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
                   ` (19 preceding siblings ...)
  2017-04-18 20:23 ` [PATCH v6 20/20] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
@ 2017-04-18 20:44 ` Patchwork
  20 siblings, 0 replies; 50+ messages in thread
From: Patchwork @ 2017-04-18 20:44 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

== Series Details ==

Series: Gen8+ engine-reset (rev2)
URL   : https://patchwork.freedesktop.org/series/21868/
State : success

== Summary ==

Series 21868v2 Gen8+ engine-reset
https://patchwork.freedesktop.org/api/1.0/series/21868/revisions/2/mbox/

Test gem_exec_suspend:
        Subgroup basic-s4-devices:
                pass       -> DMESG-WARN (fi-kbl-7560u) fdo#100125

fdo#100125 https://bugs.freedesktop.org/show_bug.cgi?id=100125

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time:428s
fi-bdw-gvtdvm    total:278  pass:256  dwarn:8   dfail:0   fail:0   skip:14  time:420s
fi-bsw-n3050     total:278  pass:242  dwarn:0   dfail:0   fail:0   skip:36  time:570s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time:506s
fi-bxt-t5700     total:278  pass:258  dwarn:0   dfail:0   fail:0   skip:20  time:535s
fi-byt-j1900     total:278  pass:254  dwarn:0   dfail:0   fail:0   skip:24  time:482s
fi-byt-n2820     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:474s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:408s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:401s
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50  time:419s
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:482s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:467s
fi-kbl-7500u     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:467s
fi-kbl-7560u     total:278  pass:267  dwarn:1   dfail:0   fail:0   skip:10  time:566s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:449s
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17  time:572s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time:456s
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:498s
fi-skl-gvtdvm    total:278  pass:265  dwarn:0   dfail:0   fail:0   skip:13  time:430s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:536s
fi-snb-2600      total:278  pass:249  dwarn:0   dfail:0   fail:0   skip:29  time:417s

7204beb80dcdfd1f8b1cff6a448301407f91a99c drm-tip: 2017y-04m-18d-16h-09m-11s UTC integration manifest
5fcdfc4 drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs
785721e drm/i915: Watchdog timeout: Include threshold value in error state
a4f2785 drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
afa9e30 drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
c4bbdfb drm/i915: Watchdog timeout: IRQ handler for gen8+
3521c7c drm/i915: Watchdog timeout: Pass GuC shared data structure during param load
085d20b drm/i915/guc: Add support for reset engine using GuC commands
9d8ec0d drm/i915/guc: Provide register list to be saved/restored during engine reset
4f69f95 drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder
6bafaca7 drm/i915/selftests: reset engine self tests
f61701f drm/i915: Add engine reset count in get-reset-stats ioctl
49b446f drm/i915/tdr: Enable Engine reset and recovery support
b3d3f4d drm/i915/tdr: Export per-engine reset count info to debugfs
0e7cbfe drm/i915/tdr: Add engine reset count to error state
51d0a90 drm/i915: Skip reset request if there is one already
f5dacd2 drm/i915/tdr: Add support for per engine reset recovery
ea5a776 drm/i915/tdr: Modify error handler for per engine hang recovery
46ab0e7 drm/i915: Update i915.reset to handle engine resets
9cdf282 drm/i915: Rename gen8_(un)request_engine_reset to gen8_reset_engine_start/cancel
ac75580 drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4514/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load
  2017-04-18 20:23 ` [PATCH v6 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
@ 2017-04-18 21:18   ` Daniele Ceraolo Spurio
  0 siblings, 0 replies; 50+ messages in thread
From: Daniele Ceraolo Spurio @ 2017-04-18 21:18 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx



On 18/04/17 13:23, Michel Thierry wrote:
> For watchdog / media reset, the firmware must know the address of the shared
> data page (the first page of the default context).
>
> This information should be in DWORD 9 of the GUC_CTL structure.
>
> v2: Use guc_ggtt_offset (Chris).
> Store the ggtt offset of the default ctx as we needed for
> suspend/resume/reset (Daniele).
>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Regards,
Daniele
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2017-04-18 20:23 ` [PATCH v6 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
@ 2017-04-18 21:20   ` Chris Wilson
  2017-04-18 21:36     ` Michel Thierry
  0 siblings, 1 reply; 50+ messages in thread
From: Chris Wilson @ 2017-04-18 21:20 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Tue, Apr 18, 2017 at 01:23:32PM -0700, Michel Thierry wrote:
> @@ -1329,10 +1331,29 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
>  		req->ctx->ppgtt->pd_dirty_rings &= ~intel_engine_flag(req->engine);
>  	}
>  
> -	cs = intel_ring_begin(req, 4);
> +	/* bb_start only */
> +	num_dwords = 4;
> +
> +	/* check if watchdog will be required */
> +	if (req->ctx->engine[req->engine->id].watchdog_threshold != 0) {
> +		if (!req->engine->emit_start_watchdog ||
> +		    !req->engine->emit_stop_watchdog)
> +			return -EINVAL;

This is still a bug in the context setparam to get to this point without
a watchdog.

> +
> +		/* + start_watchdog (6) + stop_watchdog (4) */
> +		num_dwords += 10;
> +		watchdog_running = true;
> +	}
> +static u32 *gen8_emit_stop_watchdog(struct drm_i915_gem_request *req, u32 *cs)
> +{
> +	struct intel_engine_cs *engine = req->engine;
> +
> +	/* XXX: no watchdog support in BCS engine */
> +	GEM_BUG_ON(engine->id == BCS);
> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(2);
> +	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
> +	*cs++ = get_watchdog_disable(engine);
> +	*cs++ = MI_NOOP;

Oops.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2017-04-18 21:20   ` Chris Wilson
@ 2017-04-18 21:36     ` Michel Thierry
  2017-04-18 23:06       ` Chris Wilson
  0 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 21:36 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 18/04/17 14:20, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 01:23:32PM -0700, Michel Thierry wrote:
>> @@ -1329,10 +1331,29 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
>>  		req->ctx->ppgtt->pd_dirty_rings &= ~intel_engine_flag(req->engine);
>>  	}
>>
>> -	cs = intel_ring_begin(req, 4);
>> +	/* bb_start only */
>> +	num_dwords = 4;
>> +
>> +	/* check if watchdog will be required */
>> +	if (req->ctx->engine[req->engine->id].watchdog_threshold != 0) {
>> +		if (!req->engine->emit_start_watchdog ||
>> +		    !req->engine->emit_stop_watchdog)
>> +			return -EINVAL;
>
> This is still a bug in the context setparam to get to this point without
> a watchdog.
>

This can't happen (threshold != 0 && no emit_watchdog func), 
i915_gem_context_set_watchdog returns -ENODEV if vcs's 
emit_start_watchdog is not defined (the assumption is if the vcs has it, 
rcs does too).

I can remove it, if that's what you mean.

But re i915_gem_context_set_watchdog, I think maybe it should return 
ENODEV when there's no watchdog and the user is trying to get the array 
size (args->size == 0), and don't give false hopes.

>> +
>> +		/* + start_watchdog (6) + stop_watchdog (4) */
>> +		num_dwords += 10;
>> +		watchdog_running = true;
>> +	}
>> +static u32 *gen8_emit_stop_watchdog(struct drm_i915_gem_request *req, u32 *cs)
>> +{
>> +	struct intel_engine_cs *engine = req->engine;
>> +
>> +	/* XXX: no watchdog support in BCS engine */
>> +	GEM_BUG_ON(engine->id == BCS);
>> +
>> +	*cs++ = MI_LOAD_REGISTER_IMM(2);
>> +	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
>> +	*cs++ = get_watchdog_disable(engine);
>> +	*cs++ = MI_NOOP;
>
> Oops.

_context_set_watchdog also rejects if threshold[BCS] != 0.

> -Chris
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 04/20] drm/i915/tdr: Modify error handler for per engine hang recovery
  2017-04-18 20:23 ` [PATCH v6 04/20] drm/i915/tdr: Modify error handler for per engine hang recovery Michel Thierry
@ 2017-04-18 21:40   ` Chris Wilson
  2017-04-18 22:01     ` Michel Thierry
  0 siblings, 1 reply; 50+ messages in thread
From: Chris Wilson @ 2017-04-18 21:40 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Tue, Apr 18, 2017 at 01:23:19PM -0700, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@linux.intel.com>
> 
> This is a preparatory patch which modifies error handler to do per engine
> hang recovery. The actual patch which implements this sequence follows
> later in the series. The aim is to prepare existing recovery function to
> adapt to this new function where applicable (which fails at this point
> because core implementation is lacking) and continue recovery using legacy
> full gpu reset.
> 
> A helper function is also added to query the availability of engine
> reset.
> 
> The error events behaviour that are used to notify user of reset are
> adapted to engine reset such that it doesn't break users listening to these
> events. In legacy we report an error event, a reset event before resetting
> the gpu and a reset done event marking the completion of reset. The same
> behaviour is adapted but reset event is only dispatched once even when
> multiple engines are hung. Finally once reset is complete we send reset
> done event as usual.
> 
> Note that this implementation of engine reset is for i915 directly
> submitting to the ELSP, where the driver manages the hang detection,
> recovery and resubmission. With GuC submission these tasks are shared
> between driver and firmware; i915 will still responsible for detecting a
> hang, and when it does it will have to request GuC to reset that Engine and
> remind the firmware about the outstanding submissions. This will be
> added in different patch.
> 
> v2: rebase, advertise engine reset availability in platform definition,
> add note about GuC submission.
> v3: s/*engine_reset*/*reset_engine*/. (Chris)
> Handle reset as 2 level resets, by first going to engine only and fall
> backing to full/chip reset as needed, i.e. reset_engine will need the
> struct_mutex.
> v4: Pass the engine mask to i915_reset. (Chris)
> v5: Rebase, update selftests.
> v6: Rebase, prepare for mutex-less reset engine.

I'm not sure if there is any trace of the original patch left. Certainly
this is the first that has come close to making me happy and looks like
it might actually work. :)

> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index e06af46f5a57..7bc5f552add7 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -814,6 +814,7 @@ struct intel_csr {
>  	func(has_ddi); \
>  	func(has_decoupled_mmio); \
>  	func(has_dp_mst); \
> +	func(has_reset_engine); \
>  	func(has_fbc); \
>  	func(has_fpga_dbg); \
>  	func(has_full_ppgtt); \
> @@ -1616,6 +1617,9 @@ struct i915_gpu_error {
>  #define I915_RESET_HANDOFF	1
>  #define I915_WEDGED		(BITS_PER_LONG - 1)
>  
> +	/* if available, engine-specific reset is tried before full gpu reset */
> +	u32 reset_engine_mask;

I want to convince myself that storing this here is sensible. My
expectation was that this would be passed along as a function parameter,
so I'll need to go back and see why that doesn't work.

>  	/**
>  	 * Waitqueue to signal when a hang is detected. Used to for waiters
>  	 * to release the struct_mutex for the reset to procede.
> @@ -3019,6 +3023,8 @@ extern void i915_driver_unload(struct drm_device *dev);
>  extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
>  extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
>  extern void i915_reset(struct drm_i915_private *dev_priv);
> +extern int i915_reset_engine(struct intel_engine_cs *engine);
> +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
>  extern int intel_guc_reset(struct drm_i915_private *dev_priv);
>  extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
>  extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index fd97fe00cd0d..ab1d77ab0977 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2645,12 +2645,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>  	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>  	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
> +	u32 engine_mask = dev_priv->gpu_error.reset_engine_mask;
>  
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>  
> -	DRM_DEBUG_DRIVER("resetting chip\n");
> +	/*
> +	 * This event needs to be sent before performing gpu reset. When
> +	 * engine resets are supported we iterate through all engines and
> +	 * reset hung engines individually. To keep the event dispatch
> +	 * mechanism consistent with full gpu reset, this is only sent once
> +	 * even when multiple engines are hung. It is also safe to move this
> +	 * here because when we are in this function, we will definitely
> +	 * perform gpu reset.
> +	 */
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>  
> +	/* try engine reset first, and continue if fails; look mom, no mutex! */
> +	if (intel_has_reset_engine(dev_priv) && !(engine_mask & (engine_mask - 1))) {

if (has_reset_engine() && is_power_of_2(engine_mask)) {
	engine = dev_priv->engine[ilog2(engine_mask)];

Not worth iterating over the engine_mask?

for_each_bit() {
	if (i915_reset_engine(engine) == 0)
		dev_prv->gpu_error.reset_engine_mask &= ~intel_engine_flag()
if (reset_engine_mask)
	DRM_WARN("per-engine reset failed, promoting to full gpu_reset\n");

> +		struct intel_engine_cs *engine =
> +			dev_priv->engine[intel_engineid_from_flag(engine_mask)];
> +
> +		if (i915_reset_engine(engine) == 0)
> +			goto finish;
> +
> +		DRM_ERROR("reset %s failed, promoting to full gpu reset\n",
> +			  engine->name);

Justify an *ERROR*. Is it a catastrophe? Do we have a recovery
mechanism? A DRM_WARN is appropriate as something went wrong at the
hw/driver level, but it should not be the end of the world - unlike the
global reset itself failing.

> +	}
> +
>  	intel_prepare_reset(dev_priv);

I am much happier now that we circumvent the big hammer global/display
reset (and so this series now can remain blissfully ignorant of the
display reset regression ;)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 04/20] drm/i915/tdr: Modify error handler for per engine hang recovery
  2017-04-18 21:40   ` Chris Wilson
@ 2017-04-18 22:01     ` Michel Thierry
  2017-04-18 23:13       ` Chris Wilson
  0 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 22:01 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Mika Kuoppala



On 18/04/17 14:40, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 01:23:19PM -0700, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>
>> This is a preparatory patch which modifies error handler to do per engine
>> hang recovery. The actual patch which implements this sequence follows
>> later in the series. The aim is to prepare existing recovery function to
>> adapt to this new function where applicable (which fails at this point
>> because core implementation is lacking) and continue recovery using legacy
>> full gpu reset.
>>
>> A helper function is also added to query the availability of engine
>> reset.
>>
>> The error events behaviour that are used to notify user of reset are
>> adapted to engine reset such that it doesn't break users listening to these
>> events. In legacy we report an error event, a reset event before resetting
>> the gpu and a reset done event marking the completion of reset. The same
>> behaviour is adapted but reset event is only dispatched once even when
>> multiple engines are hung. Finally once reset is complete we send reset
>> done event as usual.
>>
>> Note that this implementation of engine reset is for i915 directly
>> submitting to the ELSP, where the driver manages the hang detection,
>> recovery and resubmission. With GuC submission these tasks are shared
>> between driver and firmware; i915 will still responsible for detecting a
>> hang, and when it does it will have to request GuC to reset that Engine and
>> remind the firmware about the outstanding submissions. This will be
>> added in different patch.
>>
>> v2: rebase, advertise engine reset availability in platform definition,
>> add note about GuC submission.
>> v3: s/*engine_reset*/*reset_engine*/. (Chris)
>> Handle reset as 2 level resets, by first going to engine only and fall
>> backing to full/chip reset as needed, i.e. reset_engine will need the
>> struct_mutex.
>> v4: Pass the engine mask to i915_reset. (Chris)
>> v5: Rebase, update selftests.
>> v6: Rebase, prepare for mutex-less reset engine.
>
> I'm not sure if there is any trace of the original patch left. Certainly
> this is the first that has come close to making me happy and looks like
> it might actually work. :)
>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index e06af46f5a57..7bc5f552add7 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -814,6 +814,7 @@ struct intel_csr {
>>  	func(has_ddi); \
>>  	func(has_decoupled_mmio); \
>>  	func(has_dp_mst); \
>> +	func(has_reset_engine); \
>>  	func(has_fbc); \
>>  	func(has_fpga_dbg); \
>>  	func(has_full_ppgtt); \
>> @@ -1616,6 +1617,9 @@ struct i915_gpu_error {
>>  #define I915_RESET_HANDOFF	1
>>  #define I915_WEDGED		(BITS_PER_LONG - 1)
>>
>> +	/* if available, engine-specific reset is tried before full gpu reset */
>> +	u32 reset_engine_mask;
>
> I want to convince myself that storing this here is sensible. My
> expectation was that this would be passed along as a function parameter,
> so I'll need to go back and see why that doesn't work.
>

This is a left-over from the previous version, the engine mask can be 
passed as a parameter to i915_reset_and_wakeup.

>>  	/**
>>  	 * Waitqueue to signal when a hang is detected. Used to for waiters
>>  	 * to release the struct_mutex for the reset to procede.
>> @@ -3019,6 +3023,8 @@ extern void i915_driver_unload(struct drm_device *dev);
>>  extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
>>  extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
>>  extern void i915_reset(struct drm_i915_private *dev_priv);
>> +extern int i915_reset_engine(struct intel_engine_cs *engine);
>> +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
>>  extern int intel_guc_reset(struct drm_i915_private *dev_priv);
>>  extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
>>  extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>> index fd97fe00cd0d..ab1d77ab0977 100644
>> --- a/drivers/gpu/drm/i915/i915_irq.c
>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>> @@ -2645,12 +2645,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>>  	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>>  	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
>> +	u32 engine_mask = dev_priv->gpu_error.reset_engine_mask;
>>
>>  	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>>
>> -	DRM_DEBUG_DRIVER("resetting chip\n");
>> +	/*
>> +	 * This event needs to be sent before performing gpu reset. When
>> +	 * engine resets are supported we iterate through all engines and
>> +	 * reset hung engines individually. To keep the event dispatch
>> +	 * mechanism consistent with full gpu reset, this is only sent once
>> +	 * even when multiple engines are hung. It is also safe to move this
>> +	 * here because when we are in this function, we will definitely
>> +	 * perform gpu reset.
>> +	 */
>>  	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>>
>> +	/* try engine reset first, and continue if fails; look mom, no mutex! */
>> +	if (intel_has_reset_engine(dev_priv) && !(engine_mask & (engine_mask - 1))) {
>
> if (has_reset_engine() && is_power_of_2(engine_mask)) {
> 	engine = dev_priv->engine[ilog2(engine_mask)];
>
> Not worth iterating over the engine_mask?
>
> for_each_bit() {
> 	if (i915_reset_engine(engine) == 0)
> 		dev_prv->gpu_error.reset_engine_mask &= ~intel_engine_flag()
> if (reset_engine_mask)
> 	DRM_WARN("per-engine reset failed, promoting to full gpu_reset\n");
>

Time-wise, I'm not so sure about the benefits of 2 reset_engines vs full 
gpu reset. But it can be done... and thanks for the ilog2 tip.

>> +		struct intel_engine_cs *engine =
>> +			dev_priv->engine[intel_engineid_from_flag(engine_mask)];
>> +
>> +		if (i915_reset_engine(engine) == 0)
>> +			goto finish;
>> +
>> +		DRM_ERROR("reset %s failed, promoting to full gpu reset\n",
>> +			  engine->name);
>
> Justify an *ERROR*. Is it a catastrophe? Do we have a recovery
> mechanism? A DRM_WARN is appropriate as something went wrong at the
> hw/driver level, but it should not be the end of the world - unlike the
> global reset itself failing.
>

I'll change it to DRM_WARN. I haven't been able to reproduce it (only 
when I change the code on purpose, or with the live test - 
igt_render_engine_reset_fallback)

>> +	}
>> +
>>  	intel_prepare_reset(dev_priv);
>
> I am much happier now that we circumvent the big hammer global/display
> reset (and so this series now can remain blissfully ignorant of the
> display reset regression ;)
> -Chris
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2017-04-18 21:36     ` Michel Thierry
@ 2017-04-18 23:06       ` Chris Wilson
  2017-04-18 23:11         ` Michel Thierry
  0 siblings, 1 reply; 50+ messages in thread
From: Chris Wilson @ 2017-04-18 23:06 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Tue, Apr 18, 2017 at 02:36:14PM -0700, Michel Thierry wrote:
> 
> On 18/04/17 14:20, Chris Wilson wrote:
> >On Tue, Apr 18, 2017 at 01:23:32PM -0700, Michel Thierry wrote:
> >>@@ -1329,10 +1331,29 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
> >> 		req->ctx->ppgtt->pd_dirty_rings &= ~intel_engine_flag(req->engine);
> >> 	}
> >>
> >>-	cs = intel_ring_begin(req, 4);
> >>+	/* bb_start only */
> >>+	num_dwords = 4;
> >>+
> >>+	/* check if watchdog will be required */
> >>+	if (req->ctx->engine[req->engine->id].watchdog_threshold != 0) {
> >>+		if (!req->engine->emit_start_watchdog ||
> >>+		    !req->engine->emit_stop_watchdog)
> >>+			return -EINVAL;
> >
> >This is still a bug in the context setparam to get to this point without
> >a watchdog.
> >
> 
> This can't happen (threshold != 0 && no emit_watchdog func),
> i915_gem_context_set_watchdog returns -ENODEV if vcs's
> emit_start_watchdog is not defined (the assumption is if the vcs has
> it, rcs does too).
> 
> I can remove it, if that's what you mean.

Yes, we shouldn't be setting the watchdog threshold if the watchdog is
not available. GEM_BUG_ON() would be fine. Throwing a very, very late
EINVAL is disconcerting.
 
> But re i915_gem_context_set_watchdog, I think maybe it should return
> ENODEV when there's no watchdog and the user is trying to get the
> array size (args->size == 0), and don't give false hopes.

Seems reasonable.

> >>+
> >>+		/* + start_watchdog (6) + stop_watchdog (4) */
> >>+		num_dwords += 10;
> >>+		watchdog_running = true;
> >>+	}
> >>+static u32 *gen8_emit_stop_watchdog(struct drm_i915_gem_request *req, u32 *cs)
> >>+{
> >>+	struct intel_engine_cs *engine = req->engine;
> >>+
> >>+	/* XXX: no watchdog support in BCS engine */
> >>+	GEM_BUG_ON(engine->id == BCS);
> >>+
> >>+	*cs++ = MI_LOAD_REGISTER_IMM(2);
> >>+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
> >>+	*cs++ = get_watchdog_disable(engine);
> >>+	*cs++ = MI_NOOP;
> >
> >Oops.
> 
> _context_set_watchdog also rejects if threshold[BCS] != 0.

LRI(2), but only setting one register not two.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2017-04-18 23:06       ` Chris Wilson
@ 2017-04-18 23:11         ` Michel Thierry
  0 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-18 23:11 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 18/04/17 16:06, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 02:36:14PM -0700, Michel Thierry wrote:
>>
>> On 18/04/17 14:20, Chris Wilson wrote:
>>> On Tue, Apr 18, 2017 at 01:23:32PM -0700, Michel Thierry wrote:
>>>> @@ -1329,10 +1331,29 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
>>>> 		req->ctx->ppgtt->pd_dirty_rings &= ~intel_engine_flag(req->engine);
>>>> 	}
>>>>
>>>> -	cs = intel_ring_begin(req, 4);
>>>> +	/* bb_start only */
>>>> +	num_dwords = 4;
>>>> +
>>>> +	/* check if watchdog will be required */
>>>> +	if (req->ctx->engine[req->engine->id].watchdog_threshold != 0) {
>>>> +		if (!req->engine->emit_start_watchdog ||
>>>> +		    !req->engine->emit_stop_watchdog)
>>>> +			return -EINVAL;
>>>
>>> This is still a bug in the context setparam to get to this point without
>>> a watchdog.
>>>
>>
>> This can't happen (threshold != 0 && no emit_watchdog func),
>> i915_gem_context_set_watchdog returns -ENODEV if vcs's
>> emit_start_watchdog is not defined (the assumption is if the vcs has
>> it, rcs does too).
>>
>> I can remove it, if that's what you mean.
>
> Yes, we shouldn't be setting the watchdog threshold if the watchdog is
> not available. GEM_BUG_ON() would be fine. Throwing a very, very late
> EINVAL is disconcerting.
>
>> But re i915_gem_context_set_watchdog, I think maybe it should return
>> ENODEV when there's no watchdog and the user is trying to get the
>> array size (args->size == 0), and don't give false hopes.
>
> Seems reasonable.
>
>>>> +
>>>> +		/* + start_watchdog (6) + stop_watchdog (4) */
>>>> +		num_dwords += 10;
>>>> +		watchdog_running = true;
>>>> +	}
>>>> +static u32 *gen8_emit_stop_watchdog(struct drm_i915_gem_request *req, u32 *cs)
>>>> +{
>>>> +	struct intel_engine_cs *engine = req->engine;
>>>> +
>>>> +	/* XXX: no watchdog support in BCS engine */
>>>> +	GEM_BUG_ON(engine->id == BCS);
>>>> +
>>>> +	*cs++ = MI_LOAD_REGISTER_IMM(2);
>>>> +	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
>>>> +	*cs++ = get_watchdog_disable(engine);
>>>> +	*cs++ = MI_NOOP;
>>>
>>> Oops.
>>
>> _context_set_watchdog also rejects if threshold[BCS] != 0.
>
> LRI(2), but only setting one register not two.

Oh... proof that most of the time I only copy+paste stuff.

> -Chris
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 04/20] drm/i915/tdr: Modify error handler for per engine hang recovery
  2017-04-18 22:01     ` Michel Thierry
@ 2017-04-18 23:13       ` Chris Wilson
  0 siblings, 0 replies; 50+ messages in thread
From: Chris Wilson @ 2017-04-18 23:13 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Tue, Apr 18, 2017 at 03:01:02PM -0700, Michel Thierry wrote:
> 
> 
> On 18/04/17 14:40, Chris Wilson wrote:
> >>+	/* try engine reset first, and continue if fails; look mom, no mutex! */
> >>+	if (intel_has_reset_engine(dev_priv) && !(engine_mask & (engine_mask - 1))) {
> >
> >if (has_reset_engine() && is_power_of_2(engine_mask)) {
> >	engine = dev_priv->engine[ilog2(engine_mask)];
> >
> >Not worth iterating over the engine_mask?
> >
> >for_each_bit() {
> >	if (i915_reset_engine(engine) == 0)
> >		dev_prv->gpu_error.reset_engine_mask &= ~intel_engine_flag()
> >if (reset_engine_mask)
> >	DRM_WARN("per-engine reset failed, promoting to full gpu_reset\n");
> >
> 
> Time-wise, I'm not so sure about the benefits of 2 reset_engines vs
> full gpu reset. But it can be done... and thanks for the ilog2 tip.

Hmm, looping isn't quite right. You would need to pass down the
engine_mask so all engines were paused (in prepare), then reset, then
restarted atomically. Will think a bit more about that.

The immediate benefit is avoiding the struct_mutex + display futzing
more often.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 13/20] drm/i915/guc: Provide register list to be saved/restored during engine reset
  2017-04-18 20:23 ` [PATCH v6 13/20] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
@ 2017-04-19  0:26   ` Daniele Ceraolo Spurio
  2017-04-19  0:44     ` Michel Thierry
  0 siblings, 1 reply; 50+ messages in thread
From: Daniele Ceraolo Spurio @ 2017-04-19  0:26 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx, Argenziano, Antonio



On 18/04/17 13:23, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>
> GuC expects a list of registers from the driver which are saved/restored
> during engine reset. The type of value to be saved is controlled by
> flags. We provide a minimal set of registers that we want GuC to save and
> restore. This is not an issue in case of engine reset as driver initializes
> most of them following an engine reset, but in case of media reset (aka
> watchdog reset) which is completely internal to GuC (including resubmission
> of hung workload), it is necessary to provide this list, otherwise GuC won't
> be able to schedule further workloads after a reset. This is the minimal
> set of registers identified for things to work as expected but if we see
> any new issues, this register list can be expanded.
>
> v2: REGSET_MASKED is too difficult for GuC, use REGSET_SAVE_DEFAULT_VALUE
> and current value from RING_MODE reg instead; no need to preserve
> head/tail either, be extra paranoid and save whitelisted registers (Daniele).
>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_guc_submission.c | 60 +++++++++++++++++++++++++++++-
>  1 file changed, 58 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
> index 1ea36a88d2fb..d772718861df 100644
> --- a/drivers/gpu/drm/i915/i915_guc_submission.c
> +++ b/drivers/gpu/drm/i915/i915_guc_submission.c
> @@ -1003,6 +1003,24 @@ static void guc_policies_init(struct guc_policies *policies)
>  	policies->is_valid = 1;
>  }
>
> +/*
> + * In this macro it is highly unlikely to exceed max value but even if we did
> + * it is not an error so just throw a warning and continue. Only side effect
> + * in continuing further means some registers won't be added to save/restore
> + * list.
> + */
> +#define GUC_ADD_MMIO_REG_ADS(node, reg_addr, _flags, defvalue)		\
> +	do {								\
> +		u32 __count = node->number_of_registers;		\
> +		if (WARN_ON(__count >= GUC_REGSET_MAX_REGISTERS))	\
> +			continue;					\
> +		node->registers[__count].offset = reg_addr.reg;		\
> +		node->registers[__count].flags = (_flags);		\
> +		if (defvalue)						\
> +			node->registers[__count].value = (defvalue);	\
> +		node->number_of_registers++;				\
> +	} while (0)
> +
>  static int guc_ads_create(struct intel_guc *guc)
>  {
>  	struct drm_i915_private *dev_priv = guc_to_i915(guc);
> @@ -1016,6 +1034,7 @@ static int guc_ads_create(struct intel_guc *guc)
>  		u8 reg_state_buffer[GUC_S3_SAVE_SPACE_PAGES * PAGE_SIZE];
>  	} __packed *blob;
>  	struct intel_engine_cs *engine;
> +	struct i915_workarounds *workarounds = &dev_priv->workarounds;
>  	enum intel_engine_id id;
>  	u32 base;
>
> @@ -1035,6 +1054,39 @@ static int guc_ads_create(struct intel_guc *guc)
>
>  	/* MMIO reg state */
>  	for_each_engine(engine, dev_priv, id) {
> +		u32 i;
> +		struct guc_mmio_regset *eng_reg =
> +			&blob->reg_state.engine_reg[engine->guc_id];
> +
> +		/*
> +		 * Provide a list of registers to be saved/restored during gpu
> +		 * reset. This is mainly required for Media reset (aka watchdog
> +		 * timeout) which is completely under the control of GuC
> +		 * (resubmission of hung workload is handled inside GuC).
> +		 */
> +		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_HWS_PGA(engine->mmio_base),
> +				     GUC_REGSET_ENGINERESET |
> +				     GUC_REGSET_SAVE_CURRENT_VALUE, 0);
> +
> +		/*
> +		 * Workaround the guc issue with masked registers, note that
> +		 * at this point guc submission is still disabled and the mode
> +		 * register doesnt have the irq_steering bit set, which we
> +		 * need to fwd irqs to GuC.
> +		 */
> +		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_MODE_GEN7(engine),
> +				     GUC_REGSET_ENGINERESET |
> +				     GUC_REGSET_SAVE_DEFAULT_VALUE,
> +				     I915_READ(RING_MODE_GEN7(engine)) |
> +				     GFX_INTERRUPT_STEERING | (0xFFFF<<16));
> +
> +		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_IMR(engine->mmio_base),
> +				     GUC_REGSET_ENGINERESET |
> +				     GUC_REGSET_SAVE_CURRENT_VALUE, 0);
> +

I might just be too paranoid, but I think that we also have to add the 
registers that we use for WAs via mmio (i.e. not using an LRI in the 
ringbuffer). I did a quick test for the registers in the 
gen9_init_workarounds and skl_init_workarounds functions that we write 
using I915_WRITE and it looks like some of them lose their value after 
and RCS media reset:

  REG	   WA BITS	VAL PRE-MR	VAL POST-MR
0x20D4	(1<<2)		0x00000004	0x00000000
0xb11c	(1<<2)		0x00000005	0x00000001
0x4090	(1<<8) (1<<25)	0x02000100	0x02000100
0xb118	(1<<21)		0x40600000	0x40400000
0x20e0	(1<<14)		0x00004000	0x00000000
0xB004	(1<<7)		0x29124180	0x29124100

For some reason the media reset count in debugfs is still showing zero 
though, so I might be doing something wrong. Can you double check what 
happens to those registers?

Thanks,
Daniele

> +		DRM_DEBUG_DRIVER("%s register save/restore count: %u\n",
> +				 engine->name, eng_reg->number_of_registers);
> +
>  		blob->reg_state.white_list[engine->guc_id].mmio_start =
>  			i915_mmio_reg_offset(RING_FORCE_TO_NONPRIV(engine->mmio_base, 0));
>
> @@ -1044,9 +1096,13 @@ static int guc_ads_create(struct intel_guc *guc)
>  		 * inconsistencies with the handling of FORCE_TO_NONPRIV
>  		 * registers.
>  		 */
> -		blob->reg_state.white_list[engine->guc_id].count = 0;
> +		blob->reg_state.white_list[engine->guc_id].count =
> +					workarounds->hw_whitelist_count[id];
>
> -		/* Nothing to be saved or restored for now. */
> +		for (i = 0; i < workarounds->hw_whitelist_count[id]; i++) {
> +			blob->reg_state.white_list[engine->guc_id].offsets[i] =
> +				I915_READ(RING_FORCE_TO_NONPRIV(engine->mmio_base, i));
> +		}
>  	}
>
>  	/*
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 13/20] drm/i915/guc: Provide register list to be saved/restored during engine reset
  2017-04-19  0:26   ` Daniele Ceraolo Spurio
@ 2017-04-19  0:44     ` Michel Thierry
  0 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-19  0:44 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio, intel-gfx, Argenziano, Antonio

On 18/04/17 17:26, Daniele Ceraolo Spurio wrote:
>
>
> On 18/04/17 13:23, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>
>> GuC expects a list of registers from the driver which are saved/restored
>> during engine reset. The type of value to be saved is controlled by
>> flags. We provide a minimal set of registers that we want GuC to save and
>> restore. This is not an issue in case of engine reset as driver
>> initializes
>> most of them following an engine reset, but in case of media reset (aka
>> watchdog reset) which is completely internal to GuC (including
>> resubmission
>> of hung workload), it is necessary to provide this list, otherwise GuC
>> won't
>> be able to schedule further workloads after a reset. This is the minimal
>> set of registers identified for things to work as expected but if we see
>> any new issues, this register list can be expanded.
>>
>> v2: REGSET_MASKED is too difficult for GuC, use REGSET_SAVE_DEFAULT_VALUE
>> and current value from RING_MODE reg instead; no need to preserve
>> head/tail either, be extra paranoid and save whitelisted registers
>> (Daniele).
>>
>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_guc_submission.c | 60
>> +++++++++++++++++++++++++++++-
>>  1 file changed, 58 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c
>> b/drivers/gpu/drm/i915/i915_guc_submission.c
>> index 1ea36a88d2fb..d772718861df 100644
>> --- a/drivers/gpu/drm/i915/i915_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/i915_guc_submission.c
>> @@ -1003,6 +1003,24 @@ static void guc_policies_init(struct
>> guc_policies *policies)
>>      policies->is_valid = 1;
>>  }
>>
>> +/*
>> + * In this macro it is highly unlikely to exceed max value but even
>> if we did
>> + * it is not an error so just throw a warning and continue. Only side
>> effect
>> + * in continuing further means some registers won't be added to
>> save/restore
>> + * list.
>> + */
>> +#define GUC_ADD_MMIO_REG_ADS(node, reg_addr, _flags, defvalue)        \
>> +    do {                                \
>> +        u32 __count = node->number_of_registers;        \
>> +        if (WARN_ON(__count >= GUC_REGSET_MAX_REGISTERS))    \
>> +            continue;                    \
>> +        node->registers[__count].offset = reg_addr.reg;        \
>> +        node->registers[__count].flags = (_flags);        \
>> +        if (defvalue)                        \
>> +            node->registers[__count].value = (defvalue);    \
>> +        node->number_of_registers++;                \
>> +    } while (0)
>> +
>>  static int guc_ads_create(struct intel_guc *guc)
>>  {
>>      struct drm_i915_private *dev_priv = guc_to_i915(guc);
>> @@ -1016,6 +1034,7 @@ static int guc_ads_create(struct intel_guc *guc)
>>          u8 reg_state_buffer[GUC_S3_SAVE_SPACE_PAGES * PAGE_SIZE];
>>      } __packed *blob;
>>      struct intel_engine_cs *engine;
>> +    struct i915_workarounds *workarounds = &dev_priv->workarounds;
>>      enum intel_engine_id id;
>>      u32 base;
>>
>> @@ -1035,6 +1054,39 @@ static int guc_ads_create(struct intel_guc *guc)
>>
>>      /* MMIO reg state */
>>      for_each_engine(engine, dev_priv, id) {
>> +        u32 i;
>> +        struct guc_mmio_regset *eng_reg =
>> +            &blob->reg_state.engine_reg[engine->guc_id];
>> +
>> +        /*
>> +         * Provide a list of registers to be saved/restored during gpu
>> +         * reset. This is mainly required for Media reset (aka watchdog
>> +         * timeout) which is completely under the control of GuC
>> +         * (resubmission of hung workload is handled inside GuC).
>> +         */
>> +        GUC_ADD_MMIO_REG_ADS(eng_reg, RING_HWS_PGA(engine->mmio_base),
>> +                     GUC_REGSET_ENGINERESET |
>> +                     GUC_REGSET_SAVE_CURRENT_VALUE, 0);
>> +
>> +        /*
>> +         * Workaround the guc issue with masked registers, note that
>> +         * at this point guc submission is still disabled and the mode
>> +         * register doesnt have the irq_steering bit set, which we
>> +         * need to fwd irqs to GuC.
>> +         */
>> +        GUC_ADD_MMIO_REG_ADS(eng_reg, RING_MODE_GEN7(engine),
>> +                     GUC_REGSET_ENGINERESET |
>> +                     GUC_REGSET_SAVE_DEFAULT_VALUE,
>> +                     I915_READ(RING_MODE_GEN7(engine)) |
>> +                     GFX_INTERRUPT_STEERING | (0xFFFF<<16));
>> +
>> +        GUC_ADD_MMIO_REG_ADS(eng_reg, RING_IMR(engine->mmio_base),
>> +                     GUC_REGSET_ENGINERESET |
>> +                     GUC_REGSET_SAVE_CURRENT_VALUE, 0);
>> +
>
> I might just be too paranoid, but I think that we also have to add the
> registers that we use for WAs via mmio (i.e. not using an LRI in the
> ringbuffer). I did a quick test for the registers in the
> gen9_init_workarounds and skl_init_workarounds functions that we write
> using I915_WRITE and it looks like some of them lose their value after
> and RCS media reset:
>
>  REG       WA BITS    VAL PRE-MR    VAL POST-MR
> 0x20D4    (1<<2)        0x00000004    0x00000000
> 0xb11c    (1<<2)        0x00000005    0x00000001
> 0x4090    (1<<8) (1<<25)    0x02000100    0x02000100
> 0xb118    (1<<21)        0x40600000    0x40400000
> 0x20e0    (1<<14)        0x00004000    0x00000000
> 0xB004    (1<<7)        0x29124180    0x29124100
>

Indeed. Kind of makes sense since i915 won't be aware of the reset at 
all (compared to the non-guc version). Btw it only seems to happen if 
the watchdog is triggered in the render engine.

> For some reason the media reset count in debugfs is still showing zero
> though, so I might be doing something wrong. Can you double check what
> happens to those registers?
>

You need fw version 9.x+

> Thanks,
> Daniele
>
>> +        DRM_DEBUG_DRIVER("%s register save/restore count: %u\n",
>> +                 engine->name, eng_reg->number_of_registers);
>> +
>>          blob->reg_state.white_list[engine->guc_id].mmio_start =
>>
>> i915_mmio_reg_offset(RING_FORCE_TO_NONPRIV(engine->mmio_base, 0));
>>
>> @@ -1044,9 +1096,13 @@ static int guc_ads_create(struct intel_guc *guc)
>>           * inconsistencies with the handling of FORCE_TO_NONPRIV
>>           * registers.
>>           */
>> -        blob->reg_state.white_list[engine->guc_id].count = 0;
>> +        blob->reg_state.white_list[engine->guc_id].count =
>> +                    workarounds->hw_whitelist_count[id];
>>
>> -        /* Nothing to be saved or restored for now. */
>> +        for (i = 0; i < workarounds->hw_whitelist_count[id]; i++) {
>> +            blob->reg_state.white_list[engine->guc_id].offsets[i] =
>> +                I915_READ(RING_FORCE_TO_NONPRIV(engine->mmio_base, i));
>> +        }
>>      }
>>
>>      /*
>>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-04-18 20:23 ` [PATCH v6 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
@ 2017-04-19 10:20   ` Chris Wilson
  2017-04-19 17:11     ` Michel Thierry
  0 siblings, 1 reply; 50+ messages in thread
From: Chris Wilson @ 2017-04-19 10:20 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Tue, Apr 18, 2017 at 01:23:31PM -0700, Michel Thierry wrote:
> *** General ***
> 
> Watchdog timeout (or "media engine reset") is a feature that allows
> userland applications to enable hang detection on individual batch buffers.
> The detection mechanism itself is mostly bound to the hardware and the only
> thing that the driver needs to do to support this form of hang detection
> is to implement the interrupt handling support as well as watchdog command
> emission before and after the emitted batch buffer start instruction in the
> ring buffer.
> 
> The principle of the hang detection mechanism is as follows:
> 
> 1. Once the decision has been made to enable watchdog timeout for a
> particular batch buffer and the driver is in the process of emitting the
> batch buffer start instruction into the ring buffer it also emits a
> watchdog timer start instruction before and a watchdog timer cancellation
> instruction after the batch buffer start instruction in the ring buffer.
> 
> 2. Once the GPU execution reaches the watchdog timer start instruction
> the hardware watchdog counter is started by the hardware. The counter
> keeps counting until either reaching a previously configured threshold
> value or the timer cancellation instruction is executed.
> 
> 2a. If the counter reaches the threshold value the hardware fires a
> watchdog interrupt that is picked up by the watchdog interrupt handler.
> This means that a hang has been detected and the driver needs to deal with
> it the same way it would deal with a engine hang detected by the periodic
> hang checker. The only difference between the two is that we already blamed
> the active request (to ensure an engine reset).
> 
> 2b. If the batch buffer completes and the execution reaches the watchdog
> cancellation instruction before the watchdog counter reaches its
> threshold value the watchdog is cancelled and nothing more comes of it.
> No hang is detected.
> 
> Note about future interaction with preemption: Preemption could happen
> in a command sequence prior to watchdog counter getting disabled,
> resulting in watchdog being triggered following preemption. The driver will
> need to explicitly disable the watchdog counter as part of the
> preemption sequence.

Does MI_ARB_ON_OFF do the trick? Shouldn't we basically be only turning
preemption on for the user buffers as it just causes hassle if we allow
preemption in our preamble + breadcrumb. (And there's little point in
preempting in the flushes.)

> *** This patch introduces: ***
> 
> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
> based on hardware-driven hang detection, which then integrates directly
> with the hang recovery path. This is independent of having per-engine reset
> or just full gpu reset.
> 
> 2. Watchdog specific register information.
> 
> Currently the render engine and all available media engines support
> watchdog timeout (VECS is only supported in GEN9). The specifications elude
> to the BCS engine being supported but that is currently not supported by
> this commit.
> 
> Note that the value to stop the counter is different between render and
> non-render engines in GEN8; GEN9 onwards it's the same.

Should mention the choice to piggyback the current hangcheck + capture
scheme.

> +	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
> +		tasklet_schedule(&engine->watchdog_tasklet);
> +	}

Kill unwanted braces.

> +#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
> +static void gen8_watchdog_irq_handler(unsigned long data)
> +{
> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> +	struct drm_i915_private *dev_priv = engine->i915;
> +	u32 current_seqno;
> +
> +	intel_uncore_forcewake_get(dev_priv, engine->fw_domains);
> +
> +	/* Stop the counter to prevent further timeout interrupts */
> +	I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));
> +
> +	current_seqno = intel_engine_get_seqno(engine);
> +
> +	/* did the request complete after the timer expired? */
> +	if (intel_engine_last_submit(engine) == current_seqno)
> +		goto fw_put;
> +
> +	if (engine->hangcheck.watchdog == current_seqno) {
> +		/* Make sure the active request will be marked as guilty */
> +		engine->hangcheck.stalled = true;
> +		engine->hangcheck.seqno = intel_engine_get_seqno(engine);

Use current_seqno again. intel_engine_get_seqno() may have just changed.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 14/20] drm/i915/guc: Add support for reset engine using GuC commands
  2017-04-18 20:23 ` [PATCH v6 14/20] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
@ 2017-04-19 10:27   ` Chris Wilson
  2017-04-19 23:22     ` Michel Thierry
  0 siblings, 1 reply; 50+ messages in thread
From: Chris Wilson @ 2017-04-19 10:27 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Tue, Apr 18, 2017 at 01:23:29PM -0700, Michel Thierry wrote:
> This patch adds per engine reset and recovery (TDR) support when GuC is
> used to submit workloads to GPU.
> 
> In the case of i915 directly submission to ELSP, driver manages hang
> detection, recovery and resubmission. With GuC submission these tasks
> are shared between driver and GuC. i915 is still responsible for detecting
> a hang, and when it does it only requests GuC to reset that Engine. GuC
> internally manages acquiring forcewake and idling the engine before actually
> resetting it.
> 
> Once the reset is successful, i915 takes over again and handles resubmission.
> The scheduler in i915 knows which requests are pending so after resetting
> a engine, pending workloads/requests are resubmitted again.
> 
> v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the
> non-guc funtion names.
> 
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 7df278fe492e..6295760098a1 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1176,14 +1176,15 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
>  
>  	/* After a GPU reset, we may have requests to replay */
>  	clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
> -	if (!i915.enable_guc_submission && !execlists_elsp_idle(engine)) {
> +	if (!execlists_elsp_idle(engine)) {
>  		DRM_DEBUG_DRIVER("Restarting %s from requests [0x%x, 0x%x]\n",
>  				 engine->name,
>  				 port_seqno(&engine->execlist_port[0]),
>  				 port_seqno(&engine->execlist_port[1]));
>  		engine->execlist_port[0].count = 0;
>  		engine->execlist_port[1].count = 0;
> -		execlists_submit_ports(engine);
> +		if (!dev_priv->guc.execbuf_client)
> +			execlists_submit_ports(engine);

Not sure what you were intending to do here as this only resets the
submission count -- which is not used by guc dequeue. Some merit in the
making the code look similar, certainly adds the dbg message but I think
it is unrelated to the rest of the patch.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery
  2017-04-18 20:23 ` [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery Michel Thierry
@ 2017-04-19 10:49   ` Chris Wilson
  2017-04-19 13:49     ` Chris Wilson
  2017-04-21  0:17     ` Michel Thierry
  0 siblings, 2 replies; 50+ messages in thread
From: Chris Wilson @ 2017-04-19 10:49 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Tue, Apr 18, 2017 at 01:23:20PM -0700, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@linux.intel.com>
> 
> This change implements support for per-engine reset as an initial, less
> intrusive hang recovery option to be attempted before falling back to the
> legacy full GPU reset recovery mode if necessary. This is only supported
> from Gen8 onwards.
> 
> Hangchecker determines which engines are hung and invokes error handler to
> recover from it. Error handler schedules recovery for each of those engines
> that are hung. The recovery procedure is as follows,
>  - identifies the request that caused the hang and it is dropped
>  - force engine to idle: this is done by issuing a reset request
>  - reset and re-init engine
>  - restart submissions to the engine
> 
> If engine reset fails then we fall back to heavy weight full gpu reset
> which resets all engines and reinitiazes complete state of HW and SW.
> 
> v2: Rebase.
> v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
> calling i915_gem_reset_engine (Chris).
> v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
> reuse the function for reset_engine.
> v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
> v6: Clean up reset_engine function to not require mutex, i.e. no need to call
> revoke/restore_fences and _retire_requests (Chris).
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c         | 76 ++++++++++++++++++++++++--
>  drivers/gpu/drm/i915/i915_drv.h         | 12 +++-
>  drivers/gpu/drm/i915/i915_gem.c         | 97 +++++++++++++++++++--------------
>  drivers/gpu/drm/i915/i915_gem_request.c |  2 +-
>  drivers/gpu/drm/i915/intel_uncore.c     | 20 +++++++
>  5 files changed, 158 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index e03d0643dbd6..634893cd93b3 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1810,7 +1810,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  
>  	pr_notice("drm/i915: Resetting chip after gpu hang\n");
>  	disable_irq(dev_priv->drm.irq);
> -	ret = i915_gem_reset_prepare(dev_priv);
> +	ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
>  	if (ret) {
>  		DRM_ERROR("GPU recovery failed\n");
>  		intel_gpu_reset(dev_priv, ALL_ENGINES);
> @@ -1852,7 +1852,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  	i915_queue_hangcheck(dev_priv);
>  
>  finish:
> -	i915_gem_reset_finish(dev_priv);
> +	i915_gem_reset_finish(dev_priv, ALL_ENGINES);
>  	enable_irq(dev_priv->drm.irq);
>  
>  wakeup:
> @@ -1871,11 +1871,79 @@ void i915_reset(struct drm_i915_private *dev_priv)
>   *
>   * Reset a specific GPU engine. Useful if a hang is detected.
>   * Returns zero on successful reset or otherwise an error code.
> + *
> + * Procedure is:
> + *  - identifies the request that caused the hang and it is dropped
> + *  - force engine to idle: this is done by issuing a reset request
> + *  - reset engine
> + *  - restart submissions to the engine
>   */
>  int i915_reset_engine(struct intel_engine_cs *engine)
>  {
> -	/* FIXME: replace me with engine reset sequence */
> -	return -ENODEV;
> +	int ret;
> +	struct drm_i915_private *dev_priv = engine->i915;
> +	struct i915_gpu_error *error = &dev_priv->gpu_error;
> +
> +	GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
> +
> +	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
> +
> +	/*
> +	 * We need to first idle the engine by issuing a reset request,
> +	 * then perform soft reset and re-initialize hw state, for all of
> +	 * this GT power need to be awake so ensure it does throughout the
> +	 * process
> +	 */
> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);

Hmm, what path did we take to get here without taking rpm? I thought I
had fixed the last offender.

> +	disable_irq(dev_priv->drm.irq);

I am 99% certain that we don't need to disable_irq() now for per-engine
reset... I'd keep it in the global reset as simple paranoia.

> +	ret = i915_gem_reset_prepare_engine(engine);
> +	if (ret) {
> +		DRM_ERROR("Previous reset failed - promote to full reset\n");
> +		goto error;
> +	}
> +
> +	/*
> +	 * the request that caused the hang is stuck on elsp, identify the
> +	 * active request and drop it, adjust head to skip the offending
> +	 * request to resume executing remaining requests in the queue.
> +	 */

Hmm. Interesting. This relies on i915_gem_retire_requests() (i.e.
struct_mutex) to skip replaying innocent requests, but here we should be
asserting that we do have the hung request.

i.e.
request = i915_gem_find_active_request(engine);
if (!request)
	goto skip.

Bonus points for tying that into i915_gem_reset_prepare_engine() so that
we only seach for the active_request once.

> +	i915_gem_reset_engine(engine);

Where does skip_request() get called?

> +
> +	/* forcing engine to idle */
> +	ret = intel_reset_engine_start(engine);
> +	if (ret) {
> +		DRM_ERROR("Failed to disable %s\n", engine->name);
> +		goto error;
> +	}
> +
> +	/* finally, reset engine */
> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
> +	if (ret) {
> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
> +		intel_reset_engine_cancel(engine);
> +		goto error;
> +	}
> +
> +	/* be sure the request reset bit gets cleared */
> +	intel_reset_engine_cancel(engine);
> +
> +	i915_gem_reset_finish_engine(engine);
> +
> +	/* replay remaining requests in the queue */
> +	ret = engine->init_hw(engine);
> +	if (ret)
> +		goto error;
> +
> +wakeup:
> +	enable_irq(dev_priv->drm.irq);
> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> +	wake_up_bit(&error->flags, I915_RESET_HANDOFF);

No handoff here anymore.

> +	return ret;
> +
> +error:
> +	/* use full gpu reset to recover on error */
> +	goto wakeup;
>  }

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery
  2017-04-19 10:49   ` Chris Wilson
@ 2017-04-19 13:49     ` Chris Wilson
  2017-04-21  0:17     ` Michel Thierry
  1 sibling, 0 replies; 50+ messages in thread
From: Chris Wilson @ 2017-04-19 13:49 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx, Mika Kuoppala

On Wed, Apr 19, 2017 at 11:49:26AM +0100, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 01:23:20PM -0700, Michel Thierry wrote:
> > +	ret = i915_gem_reset_prepare_engine(engine);
> > +	if (ret) {
> > +		DRM_ERROR("Previous reset failed - promote to full reset\n");
> > +		goto error;
> > +	}
> > +
> > +	/*
> > +	 * the request that caused the hang is stuck on elsp, identify the
> > +	 * active request and drop it, adjust head to skip the offending
> > +	 * request to resume executing remaining requests in the queue.
> > +	 */
> 
> Hmm. Interesting. This relies on i915_gem_retire_requests() (i.e.
> struct_mutex) to skip replaying innocent requests, but here we should be
> asserting that we do have the hung request.
> 
> i.e.
> request = i915_gem_find_active_request(engine);
> if (!request)
> 	goto skip.

Forgot to mention this should include a "pardoned" check, i.e. that the
active request still matches the watchdog seqno.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-04-18 20:23 ` [PATCH v6 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
@ 2017-04-19 16:56   ` Jeff McGee
  2017-04-19 17:07   ` Daniele Ceraolo Spurio
  2017-04-20  1:09   ` Michel Thierry
  2 siblings, 0 replies; 50+ messages in thread
From: Jeff McGee @ 2017-04-19 16:56 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Tue, Apr 18, 2017 at 01:23:33PM -0700, Michel Thierry wrote:
> Final enablement patch for GPU hang detection using watchdog timeout.
> Using the gem_context_setparam ioctl, users can specify the desired
> timeout value in microseconds, and the driver will do the conversion to
> 'timestamps'.
> 
> The recommended default watchdog threshold for video engines is 60000 us,
> since this has been _empirically determined_ to be a good compromise for
> low-latency requirements and low rate of false positives. The default
> register value is ~106000us and the theoretical max value (all 1s) is
> 353 seconds.
> 
> Note, UABI engine ids and i915 engine ids are different, and this patch
> uses the i915 ones. Some kind of mapping table [1] is required if we
> decide to use the UABI engine ids.
> 
> [1] http://patchwork.freedesktop.org/patch/msgid/20170329135831.30254-2-chris@chris-wilson.co.uk
> 
> v2: Fixed get api to return values in microseconds. Threshold updated to
> be per context engine. Check for u32 overflow. Capture ctx threshold
> value in error state.
> 
> v3: Add a way to get array size, short-cut to disable all thresholds,
> return EFAULT / EINVAL as needed. Move the capture of the threshold
> value in the error state into a new patch. BXT has a different
> timestamp base (because why not?).
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         |  29 +++++++++
>  drivers/gpu/drm/i915/i915_gem_context.c | 102 ++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/intel_lrc.c        |   5 +-
>  include/uapi/drm/i915_drm.h             |   1 +
>  4 files changed, 135 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 203f2112dd18..f65a236fddef 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -3574,6 +3574,35 @@ i915_gem_context_lookup_timeline(struct i915_gem_context *ctx,
>  	return &vm->timeline.engine[engine->id];
>  }
>  
> +/*
> + * BDW & SKL+ Timestamp timer resolution = 0.080 uSec,
> + * or 12500000 counts per second, or ~12 counts per microsecond.
> + *
> + * But Broxton Timestamp timer resolution is different, 0.052 uSec,
> + * or 19200000 counts per second, or ~19 counts per microsecond.
> + */
> +#define SKL_TIMESTAMP_CNTS_PER_USEC 12
> +#define BXT_TIMESTAMP_CNTS_PER_USEC 19
> +#define TIMESTAMP_CNTS_PER_USEC(dev_priv) (IS_BROXTON(dev_priv) ? \
> +					   BXT_TIMESTAMP_CNTS_PER_USEC : \
> +					   SKL_TIMESTAMP_CNTS_PER_USEC)
> +static inline u32
> +watchdog_to_us(struct drm_i915_private *dev_priv, u32 value_in_clock_counts)
> +{
> +	return value_in_clock_counts / TIMESTAMP_CNTS_PER_USEC(dev_priv);
> +}
> +
> +static inline u32
> +watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us)
> +{
> +	u64 threshold = value_in_us * TIMESTAMP_CNTS_PER_USEC(dev_priv);
> +
> +	if (overflows_type(threshold, u32))
> +		return -EINVAL;
> +
> +	return threshold;
> +}
> +
>  int i915_perf_open_ioctl(struct drm_device *dev, void *data,
>  			 struct drm_file *file);
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index edbed85a1c88..85a6467a25a6 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -422,6 +422,102 @@ i915_gem_context_create_gvt(struct drm_device *dev)
>  	return ctx;
>  }
>  
> +/* Return the timer count threshold in microseconds. */
> +int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
> +				  struct drm_i915_gem_context_param *args)
> +{
> +	struct drm_i915_private *dev_priv = ctx->i915;
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	u32 threshold_in_us[I915_NUM_ENGINES];
> +
> +	if (args->size == 0)
> +		goto out;
> +
> +	if (args->size < sizeof(threshold_in_us))
> +		return -EFAULT;
> +
> +	if (!dev_priv->engine[VCS]->emit_start_watchdog)
> +		return -ENODEV;
> +
> +	for_each_engine(engine, dev_priv, id) {
> +		struct intel_context *ce = &ctx->engine[id];
> +
> +		threshold_in_us[id] = watchdog_to_us(dev_priv,
> +						     ce->watchdog_threshold);
> +	}
> +
> +	mutex_unlock(&dev_priv->drm.struct_mutex);
> +	if (__copy_to_user(u64_to_user_ptr(args->value),
> +			   &threshold_in_us,
> +			   sizeof(threshold_in_us))) {
> +		mutex_lock(&dev_priv->drm.struct_mutex);
> +		return -EFAULT;
> +	}
> +	mutex_lock(&dev_priv->drm.struct_mutex);
> +
> +out:
> +	args->size = sizeof(threshold_in_us);
> +
> +	return 0;
> +}
> +
> +/*
> + * Based on time out value in microseconds (us) calculate
> + * timer count thresholds needed based on core frequency.
> + * Watchdog can be disabled by setting it to 0.
> + */
> +int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
> +				  struct drm_i915_gem_context_param *args)
> +{
> +	struct drm_i915_private *dev_priv = ctx->i915;
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	u32 threshold[I915_NUM_ENGINES];
> +
> +	memset(&threshold, 0, sizeof(threshold));
> +
memset(threshold, 0, sizeof(threshold));

> +	/* shortcut to disable in all engines */
> +	if (args->size == 0)
> +		goto set_watchdog;
> +
> +	if (args->size < sizeof(threshold))
> +		return -EFAULT;
> +
> +	if (!dev_priv->engine[VCS]->emit_start_watchdog)
> +		return -ENODEV;
> +
> +	mutex_unlock(&dev_priv->drm.struct_mutex);
> +	if (copy_from_user(&threshold,
if (copy_from_user(threshold,

> +			   u64_to_user_ptr(args->value),
> +			   sizeof(threshold))) {
> +		mutex_lock(&dev_priv->drm.struct_mutex);
> +		return -EFAULT;
> +	}
> +	mutex_lock(&dev_priv->drm.struct_mutex);
> +
> +	/* not supported in blitter engine */
> +	if (threshold[BCS] != 0)
> +		return -EINVAL;
> +
> +	for_each_engine(engine, dev_priv, id) {
> +		threshold[id] = watchdog_to_clock_counts(dev_priv,
> +							 threshold[id]);
> +
> +		if (threshold[id] == -EINVAL)
> +			return -EINVAL;
> +	}
> +
> +set_watchdog:
> +	for_each_engine(engine, dev_priv, id) {
> +		struct intel_context *ce = &ctx->engine[id];
> +
> +		ce->watchdog_threshold = threshold[id];
> +	}
> +
> +	return 0;
> +}
> +
>  int i915_gem_context_init(struct drm_i915_private *dev_priv)
>  {
>  	struct i915_gem_context *ctx;
> @@ -1061,6 +1157,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  	case I915_CONTEXT_PARAM_BANNABLE:
>  		args->value = i915_gem_context_is_bannable(ctx);
>  		break;
> +	case I915_CONTEXT_PARAM_WATCHDOG:
> +		ret = i915_gem_context_get_watchdog(ctx, args);
> +		break;
>  	default:
>  		ret = -EINVAL;
>  		break;
> @@ -1118,6 +1217,9 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>  		else
>  			i915_gem_context_clear_bannable(ctx);
>  		break;
> +	case I915_CONTEXT_PARAM_WATCHDOG:
> +		ret = i915_gem_context_set_watchdog(ctx, args);
> +		break;
>  	default:
>  		ret = -EINVAL;
>  		break;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 7a202e73ce9b..f4e05531a9a0 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1496,7 +1496,7 @@ static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
>  		return GEN8_XCS_WATCHDOG_DISABLE;
>  }
>  
> -#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
> +#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)
>  static void gen8_watchdog_irq_handler(unsigned long data)
>  {
>  	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> @@ -1526,7 +1526,8 @@ static void gen8_watchdog_irq_handler(unsigned long data)
>  	} else {
>  		engine->hangcheck.watchdog = current_seqno;
>  		/* Re-start the counter, if really hung, it will expire again */
> -		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
> +		I915_WRITE_FW(RING_THRESH(engine->mmio_base),
> +			      GEN8_WATCHDOG_1000US(dev_priv));
>  		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
>  	}
>  
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index fadedefba6db..18bc0ec618dd 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1308,6 +1308,7 @@ struct drm_i915_gem_context_param {
>  #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
>  #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
>  #define I915_CONTEXT_PARAM_BANNABLE	0x5
> +#define I915_CONTEXT_PARAM_WATCHDOG	0x6
>  	__u64 value;
>  };
>  
> -- 
> 2.11.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-04-18 20:23 ` [PATCH v6 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
  2017-04-19 16:56   ` Jeff McGee
@ 2017-04-19 17:07   ` Daniele Ceraolo Spurio
  2017-04-20  1:09   ` Michel Thierry
  2 siblings, 0 replies; 50+ messages in thread
From: Daniele Ceraolo Spurio @ 2017-04-19 17:07 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx



On 18/04/17 13:23, Michel Thierry wrote:
> Final enablement patch for GPU hang detection using watchdog timeout.
> Using the gem_context_setparam ioctl, users can specify the desired
> timeout value in microseconds, and the driver will do the conversion to
> 'timestamps'.
>
> The recommended default watchdog threshold for video engines is 60000 us,
> since this has been _empirically determined_ to be a good compromise for
> low-latency requirements and low rate of false positives. The default
> register value is ~106000us and the theoretical max value (all 1s) is
> 353 seconds.
>
> Note, UABI engine ids and i915 engine ids are different, and this patch
> uses the i915 ones. Some kind of mapping table [1] is required if we
> decide to use the UABI engine ids.
>
> [1] http://patchwork.freedesktop.org/patch/msgid/20170329135831.30254-2-chris@chris-wilson.co.uk

Now that we have class and instance definitions, could it be worth to 
use those instead to index the engines? If we pair it with the engine 
discovery ioctl that Tvrtko proposed we could have something that is 
reasonably future-proof.

Thanks,
Daniele
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-04-19 10:20   ` Chris Wilson
@ 2017-04-19 17:11     ` Michel Thierry
  2017-04-19 17:51       ` Chris Wilson
  0 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-19 17:11 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 19/04/17 03:20, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 01:23:31PM -0700, Michel Thierry wrote:
>> *** General ***
>>
>> Watchdog timeout (or "media engine reset") is a feature that allows
>> userland applications to enable hang detection on individual batch buffers.
>> The detection mechanism itself is mostly bound to the hardware and the only
>> thing that the driver needs to do to support this form of hang detection
>> is to implement the interrupt handling support as well as watchdog command
>> emission before and after the emitted batch buffer start instruction in the
>> ring buffer.
>>
>> The principle of the hang detection mechanism is as follows:
>>
>> 1. Once the decision has been made to enable watchdog timeout for a
>> particular batch buffer and the driver is in the process of emitting the
>> batch buffer start instruction into the ring buffer it also emits a
>> watchdog timer start instruction before and a watchdog timer cancellation
>> instruction after the batch buffer start instruction in the ring buffer.
>>
>> 2. Once the GPU execution reaches the watchdog timer start instruction
>> the hardware watchdog counter is started by the hardware. The counter
>> keeps counting until either reaching a previously configured threshold
>> value or the timer cancellation instruction is executed.
>>
>> 2a. If the counter reaches the threshold value the hardware fires a
>> watchdog interrupt that is picked up by the watchdog interrupt handler.
>> This means that a hang has been detected and the driver needs to deal with
>> it the same way it would deal with a engine hang detected by the periodic
>> hang checker. The only difference between the two is that we already blamed
>> the active request (to ensure an engine reset).
>>
>> 2b. If the batch buffer completes and the execution reaches the watchdog
>> cancellation instruction before the watchdog counter reaches its
>> threshold value the watchdog is cancelled and nothing more comes of it.
>> No hang is detected.
>>
>> Note about future interaction with preemption: Preemption could happen
>> in a command sequence prior to watchdog counter getting disabled,
>> resulting in watchdog being triggered following preemption. The driver will
>> need to explicitly disable the watchdog counter as part of the
>> preemption sequence.
>
> Does MI_ARB_ON_OFF do the trick? Shouldn't we basically be only turning
> preemption on for the user buffers as it just causes hassle if we allow
> preemption in our preamble + breadcrumb. (And there's little point in
> preempting in the flushes.)
>

Mid-batch?
The watchdog counter is not aware of MI_ARB_ON_OFF (or any other cmd) 
and would keep running / expire. We could call emit_stop_watchdog 
unconditionally to prevent this.

>> *** This patch introduces: ***
>>
>> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
>> based on hardware-driven hang detection, which then integrates directly
>> with the hang recovery path. This is independent of having per-engine reset
>> or just full gpu reset.
>>
>> 2. Watchdog specific register information.
>>
>> Currently the render engine and all available media engines support
>> watchdog timeout (VECS is only supported in GEN9). The specifications elude
>> to the BCS engine being supported but that is currently not supported by
>> this commit.
>>
>> Note that the value to stop the counter is different between render and
>> non-render engines in GEN8; GEN9 onwards it's the same.
>
> Should mention the choice to piggyback the current hangcheck + capture
> scheme.
>
>> +	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
>> +		tasklet_schedule(&engine->watchdog_tasklet);
>> +	}
>
> Kill unwanted braces.
>
>> +#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
>> +static void gen8_watchdog_irq_handler(unsigned long data)
>> +{
>> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
>> +	struct drm_i915_private *dev_priv = engine->i915;
>> +	u32 current_seqno;
>> +
>> +	intel_uncore_forcewake_get(dev_priv, engine->fw_domains);
>> +
>> +	/* Stop the counter to prevent further timeout interrupts */
>> +	I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));
>> +
>> +	current_seqno = intel_engine_get_seqno(engine);
>> +
>> +	/* did the request complete after the timer expired? */
>> +	if (intel_engine_last_submit(engine) == current_seqno)
>> +		goto fw_put;
>> +
>> +	if (engine->hangcheck.watchdog == current_seqno) {
>> +		/* Make sure the active request will be marked as guilty */
>> +		engine->hangcheck.stalled = true;
>> +		engine->hangcheck.seqno = intel_engine_get_seqno(engine);
>
> Use current_seqno again. intel_engine_get_seqno() may have just changed.
> -Chris
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-04-19 17:11     ` Michel Thierry
@ 2017-04-19 17:51       ` Chris Wilson
  2017-04-19 18:13         ` Michel Thierry
  0 siblings, 1 reply; 50+ messages in thread
From: Chris Wilson @ 2017-04-19 17:51 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Wed, Apr 19, 2017 at 10:11:37AM -0700, Michel Thierry wrote:
> 
> 
> On 19/04/17 03:20, Chris Wilson wrote:
> >On Tue, Apr 18, 2017 at 01:23:31PM -0700, Michel Thierry wrote:
> >>*** General ***
> >>
> >>Watchdog timeout (or "media engine reset") is a feature that allows
> >>userland applications to enable hang detection on individual batch buffers.
> >>The detection mechanism itself is mostly bound to the hardware and the only
> >>thing that the driver needs to do to support this form of hang detection
> >>is to implement the interrupt handling support as well as watchdog command
> >>emission before and after the emitted batch buffer start instruction in the
> >>ring buffer.
> >>
> >>The principle of the hang detection mechanism is as follows:
> >>
> >>1. Once the decision has been made to enable watchdog timeout for a
> >>particular batch buffer and the driver is in the process of emitting the
> >>batch buffer start instruction into the ring buffer it also emits a
> >>watchdog timer start instruction before and a watchdog timer cancellation
> >>instruction after the batch buffer start instruction in the ring buffer.
> >>
> >>2. Once the GPU execution reaches the watchdog timer start instruction
> >>the hardware watchdog counter is started by the hardware. The counter
> >>keeps counting until either reaching a previously configured threshold
> >>value or the timer cancellation instruction is executed.
> >>
> >>2a. If the counter reaches the threshold value the hardware fires a
> >>watchdog interrupt that is picked up by the watchdog interrupt handler.
> >>This means that a hang has been detected and the driver needs to deal with
> >>it the same way it would deal with a engine hang detected by the periodic
> >>hang checker. The only difference between the two is that we already blamed
> >>the active request (to ensure an engine reset).
> >>
> >>2b. If the batch buffer completes and the execution reaches the watchdog
> >>cancellation instruction before the watchdog counter reaches its
> >>threshold value the watchdog is cancelled and nothing more comes of it.
> >>No hang is detected.
> >>
> >>Note about future interaction with preemption: Preemption could happen
> >>in a command sequence prior to watchdog counter getting disabled,
> >>resulting in watchdog being triggered following preemption. The driver will
> >>need to explicitly disable the watchdog counter as part of the
> >>preemption sequence.
> >
> >Does MI_ARB_ON_OFF do the trick? Shouldn't we basically be only turning
> >preemption on for the user buffers as it just causes hassle if we allow
> >preemption in our preamble + breadcrumb. (And there's little point in
> >preempting in the flushes.)
> >
> 
> Mid-batch?
> The watchdog counter is not aware of MI_ARB_ON_OFF (or any other
> cmd) and would keep running / expire. We could call
> emit_stop_watchdog unconditionally to prevent this.

No, I was thinking of the opposite where we had preemption after the
batch. Completely missed the point of the watchdog being abled for the
low priority batch then being inherited by the high priority batch - and
vice versa that the watchdog counter would not be restored on the
context switch back. Does suggest that the watchdog should really be
part of the context image...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-04-19 17:51       ` Chris Wilson
@ 2017-04-19 18:13         ` Michel Thierry
  0 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-19 18:13 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 19/04/17 10:51, Chris Wilson wrote:
> On Wed, Apr 19, 2017 at 10:11:37AM -0700, Michel Thierry wrote:
>>
>>
>> On 19/04/17 03:20, Chris Wilson wrote:
>>> On Tue, Apr 18, 2017 at 01:23:31PM -0700, Michel Thierry wrote:
>>>> *** General ***
>>>>
>>>> Watchdog timeout (or "media engine reset") is a feature that allows
>>>> userland applications to enable hang detection on individual batch buffers.
>>>> The detection mechanism itself is mostly bound to the hardware and the only
>>>> thing that the driver needs to do to support this form of hang detection
>>>> is to implement the interrupt handling support as well as watchdog command
>>>> emission before and after the emitted batch buffer start instruction in the
>>>> ring buffer.
>>>>
>>>> The principle of the hang detection mechanism is as follows:
>>>>
>>>> 1. Once the decision has been made to enable watchdog timeout for a
>>>> particular batch buffer and the driver is in the process of emitting the
>>>> batch buffer start instruction into the ring buffer it also emits a
>>>> watchdog timer start instruction before and a watchdog timer cancellation
>>>> instruction after the batch buffer start instruction in the ring buffer.
>>>>
>>>> 2. Once the GPU execution reaches the watchdog timer start instruction
>>>> the hardware watchdog counter is started by the hardware. The counter
>>>> keeps counting until either reaching a previously configured threshold
>>>> value or the timer cancellation instruction is executed.
>>>>
>>>> 2a. If the counter reaches the threshold value the hardware fires a
>>>> watchdog interrupt that is picked up by the watchdog interrupt handler.
>>>> This means that a hang has been detected and the driver needs to deal with
>>>> it the same way it would deal with a engine hang detected by the periodic
>>>> hang checker. The only difference between the two is that we already blamed
>>>> the active request (to ensure an engine reset).
>>>>
>>>> 2b. If the batch buffer completes and the execution reaches the watchdog
>>>> cancellation instruction before the watchdog counter reaches its
>>>> threshold value the watchdog is cancelled and nothing more comes of it.
>>>> No hang is detected.
>>>>
>>>> Note about future interaction with preemption: Preemption could happen
>>>> in a command sequence prior to watchdog counter getting disabled,
>>>> resulting in watchdog being triggered following preemption. The driver will
>>>> need to explicitly disable the watchdog counter as part of the
>>>> preemption sequence.
>>>
>>> Does MI_ARB_ON_OFF do the trick? Shouldn't we basically be only turning
>>> preemption on for the user buffers as it just causes hassle if we allow
>>> preemption in our preamble + breadcrumb. (And there's little point in
>>> preempting in the flushes.)
>>>
>>
>> Mid-batch?
>> The watchdog counter is not aware of MI_ARB_ON_OFF (or any other
>> cmd) and would keep running / expire. We could call
>> emit_stop_watchdog unconditionally to prevent this.
>
> No, I was thinking of the opposite where we had preemption after the
> batch. Completely missed the point of the watchdog being abled for the
> low priority batch then being inherited by the high priority batch - and
> vice versa that the watchdog counter would not be restored on the
> context switch back. Does suggest that the watchdog should really be
> part of the context image...

RING_CNTR (0x2178) & RING_THRESH (0x217c) are part of the context image, 
but there's still the issue of the ctx restore being slower (or maybe 
it's a lite-restore).

And the 'counter' isn't part of the image; when the pre-empted batch 
resumes, the counter will re-start from 0.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 14/20] drm/i915/guc: Add support for reset engine using GuC commands
  2017-04-19 10:27   ` Chris Wilson
@ 2017-04-19 23:22     ` Michel Thierry
  2017-04-20  9:05       ` Chris Wilson
  0 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-19 23:22 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 19/04/17 03:27, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 01:23:29PM -0700, Michel Thierry wrote:
>> This patch adds per engine reset and recovery (TDR) support when GuC is
>> used to submit workloads to GPU.
>>
>> In the case of i915 directly submission to ELSP, driver manages hang
>> detection, recovery and resubmission. With GuC submission these tasks
>> are shared between driver and GuC. i915 is still responsible for detecting
>> a hang, and when it does it only requests GuC to reset that Engine. GuC
>> internally manages acquiring forcewake and idling the engine before actually
>> resetting it.
>>
>> Once the reset is successful, i915 takes over again and handles resubmission.
>> The scheduler in i915 knows which requests are pending so after resetting
>> a engine, pending workloads/requests are resubmitted again.
>>
>> v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the
>> non-guc funtion names.
>>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 7df278fe492e..6295760098a1 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -1176,14 +1176,15 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
>>
>>  	/* After a GPU reset, we may have requests to replay */
>>  	clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
>> -	if (!i915.enable_guc_submission && !execlists_elsp_idle(engine)) {
>> +	if (!execlists_elsp_idle(engine)) {
>>  		DRM_DEBUG_DRIVER("Restarting %s from requests [0x%x, 0x%x]\n",
>>  				 engine->name,
>>  				 port_seqno(&engine->execlist_port[0]),
>>  				 port_seqno(&engine->execlist_port[1]));
>>  		engine->execlist_port[0].count = 0;
>>  		engine->execlist_port[1].count = 0;
>> -		execlists_submit_ports(engine);
>> +		if (!dev_priv->guc.execbuf_client)
>> +			execlists_submit_ports(engine);
>
> Not sure what you were intending to do here as this only resets the
> submission count -- which is not used by guc dequeue. Some merit in the
> making the code look similar, certainly adds the dbg message but I think
> it is unrelated to the rest of the patch.

Yes, it only keeps the same debug message (originally added to check it 
was taking the right path). I can remove if you think it doesn't provide 
anything useful.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-04-18 20:23 ` [PATCH v6 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
  2017-04-19 16:56   ` Jeff McGee
  2017-04-19 17:07   ` Daniele Ceraolo Spurio
@ 2017-04-20  1:09   ` Michel Thierry
  2017-04-20  8:52     ` Chris Wilson
  2 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-20  1:09 UTC (permalink / raw)
  To: intel-gfx

On 18/04/17 13:23, Michel Thierry wrote:
> Final enablement patch for GPU hang detection using watchdog timeout.
> Using the gem_context_setparam ioctl, users can specify the desired
> timeout value in microseconds, and the driver will do the conversion to
> 'timestamps'.
>
> The recommended default watchdog threshold for video engines is 60000 us,
> since this has been _empirically determined_ to be a good compromise for
> low-latency requirements and low rate of false positives. The default
> register value is ~106000us and the theoretical max value (all 1s) is
> 353 seconds.
>
> Note, UABI engine ids and i915 engine ids are different, and this patch
> uses the i915 ones. Some kind of mapping table [1] is required if we
> decide to use the UABI engine ids.
>
> [1] http://patchwork.freedesktop.org/patch/msgid/20170329135831.30254-2-chris@chris-wilson.co.uk
>
> v2: Fixed get api to return values in microseconds. Threshold updated to
> be per context engine. Check for u32 overflow. Capture ctx threshold
> value in error state.
>
> v3: Add a way to get array size, short-cut to disable all thresholds,
> return EFAULT / EINVAL as needed. Move the capture of the threshold
> value in the error state into a new patch. BXT has a different
> timestamp base (because why not?).
>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         |  29 +++++++++
>  drivers/gpu/drm/i915/i915_gem_context.c | 102 ++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/intel_lrc.c        |   5 +-
>  include/uapi/drm/i915_drm.h             |   1 +
>  4 files changed, 135 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 203f2112dd18..f65a236fddef 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -3574,6 +3574,35 @@ i915_gem_context_lookup_timeline(struct i915_gem_context *ctx,
>         return &vm->timeline.engine[engine->id];
>  }
>
> +/*
> + * BDW & SKL+ Timestamp timer resolution = 0.080 uSec,
> + * or 12500000 counts per second, or ~12 counts per microsecond.
> + *
> + * But Broxton Timestamp timer resolution is different, 0.052 uSec,
> + * or 19200000 counts per second, or ~19 counts per microsecond.
> + */
> +#define SKL_TIMESTAMP_CNTS_PER_USEC 12
> +#define BXT_TIMESTAMP_CNTS_PER_USEC 19
> +#define TIMESTAMP_CNTS_PER_USEC(dev_priv) (IS_BROXTON(dev_priv) ? \
> +                                          BXT_TIMESTAMP_CNTS_PER_USEC : \
> +                                          SKL_TIMESTAMP_CNTS_PER_USEC)
> +static inline u32
> +watchdog_to_us(struct drm_i915_private *dev_priv, u32 value_in_clock_counts)
> +{
> +       return value_in_clock_counts / TIMESTAMP_CNTS_PER_USEC(dev_priv);
> +}
> +
> +static inline u32
> +watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us)
> +{
> +       u64 threshold = value_in_us * TIMESTAMP_CNTS_PER_USEC(dev_priv);
> +
> +       if (overflows_type(threshold, u32))
> +               return -EINVAL;
> +
> +       return threshold;
> +}
> +
>  int i915_perf_open_ioctl(struct drm_device *dev, void *data,
>                          struct drm_file *file);
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index edbed85a1c88..85a6467a25a6 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -422,6 +422,102 @@ i915_gem_context_create_gvt(struct drm_device *dev)
>         return ctx;
>  }
>
> +/* Return the timer count threshold in microseconds. */
> +int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
> +                                 struct drm_i915_gem_context_param *args)
> +{
> +       struct drm_i915_private *dev_priv = ctx->i915;
> +       struct intel_engine_cs *engine;
> +       enum intel_engine_id id;
> +       u32 threshold_in_us[I915_NUM_ENGINES];
> +
> +       if (args->size == 0)
> +               goto out;
> +
> +       if (args->size < sizeof(threshold_in_us))
> +               return -EFAULT;
> +
> +       if (!dev_priv->engine[VCS]->emit_start_watchdog)
> +               return -ENODEV;
> +
> +       for_each_engine(engine, dev_priv, id) {
> +               struct intel_context *ce = &ctx->engine[id];
> +
> +               threshold_in_us[id] = watchdog_to_us(dev_priv,
> +                                                    ce->watchdog_threshold);
> +       }
> +
> +       mutex_unlock(&dev_priv->drm.struct_mutex);
> +       if (__copy_to_user(u64_to_user_ptr(args->value),
> +                          &threshold_in_us,
> +                          sizeof(threshold_in_us))) {
> +               mutex_lock(&dev_priv->drm.struct_mutex);
> +               return -EFAULT;
> +       }
> +       mutex_lock(&dev_priv->drm.struct_mutex);
> +
> +out:
> +       args->size = sizeof(threshold_in_us);
> +
> +       return 0;
> +}
> +
> +/*
> + * Based on time out value in microseconds (us) calculate
> + * timer count thresholds needed based on core frequency.
> + * Watchdog can be disabled by setting it to 0.
> + */
> +int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
> +                                 struct drm_i915_gem_context_param *args)
> +{
> +       struct drm_i915_private *dev_priv = ctx->i915;
> +       struct intel_engine_cs *engine;
> +       enum intel_engine_id id;
> +       u32 threshold[I915_NUM_ENGINES];
> +
> +       memset(&threshold, 0, sizeof(threshold));
> +
> +       /* shortcut to disable in all engines */
> +       if (args->size == 0)
> +               goto set_watchdog;
> +
> +       if (args->size < sizeof(threshold))
> +               return -EFAULT;
> +
> +       if (!dev_priv->engine[VCS]->emit_start_watchdog)
> +               return -ENODEV;
> +
> +       mutex_unlock(&dev_priv->drm.struct_mutex);
> +       if (copy_from_user(&threshold,
> +                          u64_to_user_ptr(args->value),
> +                          sizeof(threshold))) {
> +               mutex_lock(&dev_priv->drm.struct_mutex);
> +               return -EFAULT;
> +       }
> +       mutex_lock(&dev_priv->drm.struct_mutex);
> +
> +       /* not supported in blitter engine */
> +       if (threshold[BCS] != 0)
> +               return -EINVAL;
> +
> +       for_each_engine(engine, dev_priv, id) {
> +               threshold[id] = watchdog_to_clock_counts(dev_priv,
> +                                                        threshold[id]);
> +
> +               if (threshold[id] == -EINVAL)
> +                       return -EINVAL;
> +       }
> +
> +set_watchdog:
> +       for_each_engine(engine, dev_priv, id) {
> +               struct intel_context *ce = &ctx->engine[id];
> +
> +               ce->watchdog_threshold = threshold[id];
> +       }
> +
> +       return 0;
> +}
> +
>  int i915_gem_context_init(struct drm_i915_private *dev_priv)
>  {
>         struct i915_gem_context *ctx;

This patch is missing:

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c 
b/drivers/gpu/drm/i915/i915_gem_context.c
index c1013af0b910..a8bdea43a217 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -1135,7 +1135,7 @@ int i915_gem_context_getparam_ioctl(struct 
drm_device *dev, void *data,
  		return PTR_ERR(ctx);
  	}

-	args->size = 0;
+	args->size = (args->param != I915_CONTEXT_PARAM_WATCHDOG) ? 0 : 
args->size;
  	switch (args->param) {
  	case I915_CONTEXT_PARAM_BAN_PERIOD:
  		ret = -EINVAL;

Or there will be no way to get the current thresholds (chunk was missed 
due to some TRTT code nearby). I'll be sure to include it in the next 
version.

> @@ -1061,6 +1157,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>         case I915_CONTEXT_PARAM_BANNABLE:
>                 args->value = i915_gem_context_is_bannable(ctx);
>                 break;
> +       case I915_CONTEXT_PARAM_WATCHDOG:
> +               ret = i915_gem_context_get_watchdog(ctx, args);
> +               break;
>         default:
>                 ret = -EINVAL;
>                 break;
> @@ -1118,6 +1217,9 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>                 else
>                         i915_gem_context_clear_bannable(ctx);
>                 break;
> +       case I915_CONTEXT_PARAM_WATCHDOG:
> +               ret = i915_gem_context_set_watchdog(ctx, args);
> +               break;
>         default:
>                 ret = -EINVAL;
>                 break;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 7a202e73ce9b..f4e05531a9a0 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1496,7 +1496,7 @@ static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
>                 return GEN8_XCS_WATCHDOG_DISABLE;
>  }
>
> -#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
> +#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)
>  static void gen8_watchdog_irq_handler(unsigned long data)
>  {
>         struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> @@ -1526,7 +1526,8 @@ static void gen8_watchdog_irq_handler(unsigned long data)
>         } else {
>                 engine->hangcheck.watchdog = current_seqno;
>                 /* Re-start the counter, if really hung, it will expire again */
> -               I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
> +               I915_WRITE_FW(RING_THRESH(engine->mmio_base),
> +                             GEN8_WATCHDOG_1000US(dev_priv));
>                 I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
>         }
>
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index fadedefba6db..18bc0ec618dd 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1308,6 +1308,7 @@ struct drm_i915_gem_context_param {
>  #define I915_CONTEXT_PARAM_GTT_SIZE    0x3
>  #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE    0x4
>  #define I915_CONTEXT_PARAM_BANNABLE    0x5
> +#define I915_CONTEXT_PARAM_WATCHDOG    0x6
>         __u64 value;
>  };
>
> --
> 2.11.0
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-04-20  1:09   ` Michel Thierry
@ 2017-04-20  8:52     ` Chris Wilson
  2017-04-20 17:19       ` Michel Thierry
  0 siblings, 1 reply; 50+ messages in thread
From: Chris Wilson @ 2017-04-20  8:52 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Wed, Apr 19, 2017 at 06:09:00PM -0700, Michel Thierry wrote:
> This patch is missing:
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
> b/drivers/gpu/drm/i915/i915_gem_context.c
> index c1013af0b910..a8bdea43a217 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -1135,7 +1135,7 @@ int i915_gem_context_getparam_ioctl(struct
> drm_device *dev, void *data,
>  		return PTR_ERR(ctx);
>  	}
> 
> -	args->size = 0;
> +	args->size = (args->param != I915_CONTEXT_PARAM_WATCHDOG) ? 0 :
> args->size;
>  	switch (args->param) {
>  	case I915_CONTEXT_PARAM_BAN_PERIOD:
>  		ret = -EINVAL;
> 
> Or there will be no way to get the current thresholds (chunk was
> missed due to some TRTT code nearby). I'll be sure to include it in
> the next version.

No. It is always preset to 0. The PARAM should set it to the actual
struct size (it would write) and *not* the user's size.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 14/20] drm/i915/guc: Add support for reset engine using GuC commands
  2017-04-19 23:22     ` Michel Thierry
@ 2017-04-20  9:05       ` Chris Wilson
  0 siblings, 0 replies; 50+ messages in thread
From: Chris Wilson @ 2017-04-20  9:05 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Wed, Apr 19, 2017 at 04:22:43PM -0700, Michel Thierry wrote:
> On 19/04/17 03:27, Chris Wilson wrote:
> >On Tue, Apr 18, 2017 at 01:23:29PM -0700, Michel Thierry wrote:
> >>This patch adds per engine reset and recovery (TDR) support when GuC is
> >>used to submit workloads to GPU.
> >>
> >>In the case of i915 directly submission to ELSP, driver manages hang
> >>detection, recovery and resubmission. With GuC submission these tasks
> >>are shared between driver and GuC. i915 is still responsible for detecting
> >>a hang, and when it does it only requests GuC to reset that Engine. GuC
> >>internally manages acquiring forcewake and idling the engine before actually
> >>resetting it.
> >>
> >>Once the reset is successful, i915 takes over again and handles resubmission.
> >>The scheduler in i915 knows which requests are pending so after resetting
> >>a engine, pending workloads/requests are resubmitted again.
> >>
> >>v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the
> >>non-guc funtion names.
> >>
> >>Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> >>Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> >>Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> >>---
> >>diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >>index 7df278fe492e..6295760098a1 100644
> >>--- a/drivers/gpu/drm/i915/intel_lrc.c
> >>+++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>@@ -1176,14 +1176,15 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
> >>
> >> 	/* After a GPU reset, we may have requests to replay */
> >> 	clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
> >>-	if (!i915.enable_guc_submission && !execlists_elsp_idle(engine)) {
> >>+	if (!execlists_elsp_idle(engine)) {
> >> 		DRM_DEBUG_DRIVER("Restarting %s from requests [0x%x, 0x%x]\n",
> >> 				 engine->name,
> >> 				 port_seqno(&engine->execlist_port[0]),
> >> 				 port_seqno(&engine->execlist_port[1]));
> >> 		engine->execlist_port[0].count = 0;
> >> 		engine->execlist_port[1].count = 0;
> >>-		execlists_submit_ports(engine);
> >>+		if (!dev_priv->guc.execbuf_client)
> >>+			execlists_submit_ports(engine);
> >
> >Not sure what you were intending to do here as this only resets the
> >submission count -- which is not used by guc dequeue. Some merit in the
> >making the code look similar, certainly adds the dbg message but I think
> >it is unrelated to the rest of the patch.
> 
> Yes, it only keeps the same debug message (originally added to check
> it was taking the right path). I can remove if you think it doesn't
> provide anything useful.

Just a small patch by itself, it is only a distraction to the larger
patch.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-04-20  8:52     ` Chris Wilson
@ 2017-04-20 17:19       ` Michel Thierry
  0 siblings, 0 replies; 50+ messages in thread
From: Michel Thierry @ 2017-04-20 17:19 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 20/04/17 01:52, Chris Wilson wrote:
> On Wed, Apr 19, 2017 at 06:09:00PM -0700, Michel Thierry wrote:
>> This patch is missing:
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
>> b/drivers/gpu/drm/i915/i915_gem_context.c
>> index c1013af0b910..a8bdea43a217 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -1135,7 +1135,7 @@ int i915_gem_context_getparam_ioctl(struct
>> drm_device *dev, void *data,
>>  		return PTR_ERR(ctx);
>>  	}
>>
>> -	args->size = 0;
>> +	args->size = (args->param != I915_CONTEXT_PARAM_WATCHDOG) ? 0 :
>> args->size;
>>  	switch (args->param) {
>>  	case I915_CONTEXT_PARAM_BAN_PERIOD:
>>  		ret = -EINVAL;
>>
>> Or there will be no way to get the current thresholds (chunk was
>> missed due to some TRTT code nearby). I'll be sure to include it in
>> the next version.
>
> No. It is always preset to 0. The PARAM should set it to the actual
> struct size (it would write) and *not* the user's size.
> -Chris
>

Ok, then I'll change the shortcut in get_watchdog, because as it is you 
can query the size, but not the thresholds.

int i915_gem_context_get_watchdog()
{
...
	if (args->size == 0)
		goto out;
...
out:
	args->size = sizeof(threshold_in_us);

	return 0;
}
}
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery
  2017-04-19 10:49   ` Chris Wilson
  2017-04-19 13:49     ` Chris Wilson
@ 2017-04-21  0:17     ` Michel Thierry
  2017-04-24 21:22       ` Michel Thierry
  1 sibling, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-21  0:17 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Mika Kuoppala

On 19/04/17 03:49, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 01:23:20PM -0700, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>
>> This change implements support for per-engine reset as an initial, less
>> intrusive hang recovery option to be attempted before falling back to the
>> legacy full GPU reset recovery mode if necessary. This is only supported
>> from Gen8 onwards.
>>
>> Hangchecker determines which engines are hung and invokes error handler to
>> recover from it. Error handler schedules recovery for each of those engines
>> that are hung. The recovery procedure is as follows,
>>  - identifies the request that caused the hang and it is dropped
>>  - force engine to idle: this is done by issuing a reset request
>>  - reset and re-init engine
>>  - restart submissions to the engine
>>
>> If engine reset fails then we fall back to heavy weight full gpu reset
>> which resets all engines and reinitiazes complete state of HW and SW.
>>
>> v2: Rebase.
>> v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
>> calling i915_gem_reset_engine (Chris).
>> v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
>> reuse the function for reset_engine.
>> v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
>> v6: Clean up reset_engine function to not require mutex, i.e. no need to call
>> revoke/restore_fences and _retire_requests (Chris).
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.c         | 76 ++++++++++++++++++++++++--
>>  drivers/gpu/drm/i915/i915_drv.h         | 12 +++-
>>  drivers/gpu/drm/i915/i915_gem.c         | 97 +++++++++++++++++++--------------
>>  drivers/gpu/drm/i915/i915_gem_request.c |  2 +-
>>  drivers/gpu/drm/i915/intel_uncore.c     | 20 +++++++
>>  5 files changed, 158 insertions(+), 49 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index e03d0643dbd6..634893cd93b3 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1810,7 +1810,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>
>>  	pr_notice("drm/i915: Resetting chip after gpu hang\n");
>>  	disable_irq(dev_priv->drm.irq);
>> -	ret = i915_gem_reset_prepare(dev_priv);
>> +	ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
>>  	if (ret) {
>>  		DRM_ERROR("GPU recovery failed\n");
>>  		intel_gpu_reset(dev_priv, ALL_ENGINES);
>> @@ -1852,7 +1852,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>  	i915_queue_hangcheck(dev_priv);
>>
>>  finish:
>> -	i915_gem_reset_finish(dev_priv);
>> +	i915_gem_reset_finish(dev_priv, ALL_ENGINES);
>>  	enable_irq(dev_priv->drm.irq);
>>
>>  wakeup:
>> @@ -1871,11 +1871,79 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>   *
>>   * Reset a specific GPU engine. Useful if a hang is detected.
>>   * Returns zero on successful reset or otherwise an error code.
>> + *
>> + * Procedure is:
>> + *  - identifies the request that caused the hang and it is dropped
>> + *  - force engine to idle: this is done by issuing a reset request
>> + *  - reset engine
>> + *  - restart submissions to the engine
>>   */
>>  int i915_reset_engine(struct intel_engine_cs *engine)
>>  {
>> -	/* FIXME: replace me with engine reset sequence */
>> -	return -ENODEV;
>> +	int ret;
>> +	struct drm_i915_private *dev_priv = engine->i915;
>> +	struct i915_gpu_error *error = &dev_priv->gpu_error;
>> +
>> +	GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
>> +
>> +	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
>> +
>> +	/*
>> +	 * We need to first idle the engine by issuing a reset request,
>> +	 * then perform soft reset and re-initialize hw state, for all of
>> +	 * this GT power need to be awake so ensure it does throughout the
>> +	 * process
>> +	 */
>> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>
> Hmm, what path did we take to get here without taking rpm? I thought I
> had fixed the last offender.
>

Too many rebases... As you say, this is no longer needed after 
1604a86d08053 "drm/i915: Extend rpm wakelock during i915_handle_error()"

>> +	disable_irq(dev_priv->drm.irq);
>
> I am 99% certain that we don't need to disable_irq() now for per-engine
> reset... I'd keep it in the global reset as simple paranoia.
>

100% correct.

>> +	ret = i915_gem_reset_prepare_engine(engine);
>> +	if (ret) {
>> +		DRM_ERROR("Previous reset failed - promote to full reset\n");
>> +		goto error;
>> +	}
>> +
>> +	/*
>> +	 * the request that caused the hang is stuck on elsp, identify the
>> +	 * active request and drop it, adjust head to skip the offending
>> +	 * request to resume executing remaining requests in the queue.
>> +	 */
>
> Hmm. Interesting. This relies on i915_gem_retire_requests() (i.e.
> struct_mutex) to skip replaying innocent requests, but here we should be
> asserting that we do have the hung request.
>
> i.e.
> request = i915_gem_find_active_request(engine);
> if (!request)
> 	goto skip.
>
> Bonus points for tying that into i915_gem_reset_prepare_engine() so that
> we only seach for the active_request once.
>

What about something like this?

int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
...
         if (engine_stalled(engine)) {
                 request = i915_gem_find_active_request(engine);
+               if (!request)
+                       err = -ECANCELED;
+
                 if (request && request->fence.error == -EIO)
                         err = -EIO; /* Previous reset failed! */
         }

and

int i915_reset_engine(struct intel_engine_cs *engine)
...
         DRM_DEBUG_DRIVER("resetting %s\n", engine->name);

         ret = i915_gem_reset_prepare_engine(engine);
-       if (ret) {
+       if (ret == -ECANCELED) {
+               DRM_INFO("no active request found, skip reset\n");
+               goto skip;
+       } else  if (ret) {
                 DRM_ERROR("Previous reset failed - promote to full 
reset\n");
                 goto out;
...

+skip:
+       i915_gem_reset_finish_engine(engine);
+       goto out;
...

I'll still have to keep the request (if found) and pass it to 
i915_gem_reset_engine, to avoid the extra find_active_request.

>> +	i915_gem_reset_engine(engine);
>
> Where does skip_request() get called?
>

Sorry, I didn't get this, skip_request happens under this path:
i915_gem_reset_engine
--> i915_gem_reset_request
-----> if (guilty)

>> +
>> +	/* forcing engine to idle */
>> +	ret = intel_reset_engine_start(engine);
>> +	if (ret) {
>> +		DRM_ERROR("Failed to disable %s\n", engine->name);
>> +		goto error;
>> +	}
>> +
>> +	/* finally, reset engine */
>> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
>> +	if (ret) {
>> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
>> +		intel_reset_engine_cancel(engine);
>> +		goto error;
>> +	}
>> +
>> +	/* be sure the request reset bit gets cleared */
>> +	intel_reset_engine_cancel(engine);
>> +
>> +	i915_gem_reset_finish_engine(engine);
>> +
>> +	/* replay remaining requests in the queue */
>> +	ret = engine->init_hw(engine);
>> +	if (ret)
>> +		goto error;
>> +
>> +wakeup:
>> +	enable_irq(dev_priv->drm.irq);
>> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>> +	wake_up_bit(&error->flags, I915_RESET_HANDOFF);
>
> No handoff here anymore.
>

All these are leftovers from the previous version.

>> +	return ret;
>> +
>> +error:
>> +	/* use full gpu reset to recover on error */
>> +	goto wakeup;
>>  }
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery
  2017-04-21  0:17     ` Michel Thierry
@ 2017-04-24 21:22       ` Michel Thierry
  2017-04-25  9:42         ` Chris Wilson
  0 siblings, 1 reply; 50+ messages in thread
From: Michel Thierry @ 2017-04-24 21:22 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 20/04/17 17:17, Michel Thierry wrote:
>> Hmm. Interesting. This relies on i915_gem_retire_requests() (i.e.
>> struct_mutex) to skip replaying innocent requests, but here we should be
>> asserting that we do have the hung request.
>>
>> i.e.
>> request = i915_gem_find_active_request(engine);
>> if (!request)
>>     goto skip.
>>
>> Bonus points for tying that into i915_gem_reset_prepare_engine() so that
>> we only seach for the active_request once.
>>

Will this do it?
https://patchwork.freedesktop.org/patch/152494/  (ignore the DRM_ERROR I 
still have to change)

I'm not sure about reusing the active request in full-reset (what if we 
have more than one engine hung?).

Thanks
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery
  2017-04-24 21:22       ` Michel Thierry
@ 2017-04-25  9:42         ` Chris Wilson
  0 siblings, 0 replies; 50+ messages in thread
From: Chris Wilson @ 2017-04-25  9:42 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Mon, Apr 24, 2017 at 02:22:21PM -0700, Michel Thierry wrote:
> 
> 
> On 20/04/17 17:17, Michel Thierry wrote:
> >>Hmm. Interesting. This relies on i915_gem_retire_requests() (i.e.
> >>struct_mutex) to skip replaying innocent requests, but here we should be
> >>asserting that we do have the hung request.
> >>
> >>i.e.
> >>request = i915_gem_find_active_request(engine);
> >>if (!request)
> >>    goto skip.
> >>
> >>Bonus points for tying that into i915_gem_reset_prepare_engine() so that
> >>we only seach for the active_request once.
> >>
> 
> Will this do it?
> https://patchwork.freedesktop.org/patch/152494/  (ignore the
> DRM_ERROR I still have to change)

That's a little uglier than I was hoping for :(
 
> I'm not sure about reusing the active request in full-reset (what if
> we have more than one engine hung?).

We can have more than one active request, just don't forget a
if (i915_gem_request_completed(engine->hangcheck.hung_request))
	/* skip the reset! */
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v6 02/20] drm/i915: Rename gen8_(un)request_engine_reset to gen8_reset_engine_start/cancel
  2017-04-18 20:23 ` [PATCH v6 02/20] drm/i915: Rename gen8_(un)request_engine_reset to gen8_reset_engine_start/cancel Michel Thierry
@ 2017-04-27  8:13   ` Chris Wilson
  0 siblings, 0 replies; 50+ messages in thread
From: Chris Wilson @ 2017-04-27  8:13 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Tue, Apr 18, 2017 at 01:23:17PM -0700, Michel Thierry wrote:
> As all other functions related to resetting engines are using
> reset_engine.
> 
> v2: remove _request_ and use start/cancel instead (Chris)
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>

Picked up the first pair of trivial fixes, thanks.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2017-04-27  8:13 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-18 20:23 [PATCH v6 00/20] Gen8+ engine-reset Michel Thierry
2017-04-18 20:23 ` [PATCH v6 01/20] drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag Michel Thierry
2017-04-18 20:23 ` [PATCH v6 02/20] drm/i915: Rename gen8_(un)request_engine_reset to gen8_reset_engine_start/cancel Michel Thierry
2017-04-27  8:13   ` Chris Wilson
2017-04-18 20:23 ` [PATCH v6 03/20] drm/i915: Update i915.reset to handle engine resets Michel Thierry
2017-04-18 20:23 ` [PATCH v6 04/20] drm/i915/tdr: Modify error handler for per engine hang recovery Michel Thierry
2017-04-18 21:40   ` Chris Wilson
2017-04-18 22:01     ` Michel Thierry
2017-04-18 23:13       ` Chris Wilson
2017-04-18 20:23 ` [PATCH v6 05/20] drm/i915/tdr: Add support for per engine reset recovery Michel Thierry
2017-04-19 10:49   ` Chris Wilson
2017-04-19 13:49     ` Chris Wilson
2017-04-21  0:17     ` Michel Thierry
2017-04-24 21:22       ` Michel Thierry
2017-04-25  9:42         ` Chris Wilson
2017-04-18 20:23 ` [PATCH v6 06/20] drm/i915: Skip reset request if there is one already Michel Thierry
2017-04-18 20:23 ` [PATCH v6 07/20] drm/i915/tdr: Add engine reset count to error state Michel Thierry
2017-04-18 20:23 ` [PATCH v6 08/20] drm/i915/tdr: Export per-engine reset count info to debugfs Michel Thierry
2017-04-18 20:23 ` [PATCH v6 09/20] drm/i915/tdr: Enable Engine reset and recovery support Michel Thierry
2017-04-18 20:23 ` [PATCH v6 10/20] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
2017-04-18 20:23 ` [PATCH v6 11/20] drm/i915/selftests: reset engine self tests Michel Thierry
2017-04-18 20:23 ` [PATCH v6 12/20] drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder Michel Thierry
2017-04-18 20:23 ` [PATCH v6 13/20] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
2017-04-19  0:26   ` Daniele Ceraolo Spurio
2017-04-19  0:44     ` Michel Thierry
2017-04-18 20:23 ` [PATCH v6 14/20] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
2017-04-19 10:27   ` Chris Wilson
2017-04-19 23:22     ` Michel Thierry
2017-04-20  9:05       ` Chris Wilson
2017-04-18 20:23 ` [PATCH v6 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
2017-04-18 21:18   ` Daniele Ceraolo Spurio
2017-04-18 20:23 ` [PATCH v6 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
2017-04-19 10:20   ` Chris Wilson
2017-04-19 17:11     ` Michel Thierry
2017-04-19 17:51       ` Chris Wilson
2017-04-19 18:13         ` Michel Thierry
2017-04-18 20:23 ` [PATCH v6 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
2017-04-18 21:20   ` Chris Wilson
2017-04-18 21:36     ` Michel Thierry
2017-04-18 23:06       ` Chris Wilson
2017-04-18 23:11         ` Michel Thierry
2017-04-18 20:23 ` [PATCH v6 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
2017-04-19 16:56   ` Jeff McGee
2017-04-19 17:07   ` Daniele Ceraolo Spurio
2017-04-20  1:09   ` Michel Thierry
2017-04-20  8:52     ` Chris Wilson
2017-04-20 17:19       ` Michel Thierry
2017-04-18 20:23 ` [PATCH v6 19/20] drm/i915: Watchdog timeout: Include threshold value in error state Michel Thierry
2017-04-18 20:23 ` [PATCH v6 20/20] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
2017-04-18 20:44 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev2) Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.