* [PATCH v5 00/18] Gen8+ engine-reset
@ 2017-03-25  1:29 Michel Thierry
  2017-03-25  1:29 ` [PATCH v5 01/18] drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag Michel Thierry
                   ` (18 more replies)
  0 siblings, 19 replies; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:29 UTC (permalink / raw)
  To: intel-gfx

These patches add the reset-engine feature for Gen8+. This is also
referred to as Timeout Detection and Recovery (TDR). It complements the
full GPU reset feature available in i915, but resets only a particular
engine instead of all engines, thus providing a lightweight engine reset
and recovery mechanism.

Thanks to recently merged changes, this implementation now covers not
only execlists but also GuC-based submission; it is still limited to
Gen8 onwards. I have also included the changes for watchdog timeout
detection. The GuC-related patches can be seen as RFC.

Timeout detection relies on the existing hangcheck, which remains the same;
the main changes are to the recovery mechanism. Once we detect a hang on a
particular engine, we identify the request that caused the hang, skip that
request, and adjust the head pointers to allow execution to proceed
normally. After some cleanup, submissions are restarted to process the
remaining work queued to that engine.

If the engine reset fails to recover the engine correctly, we fall back to
a full GPU reset.

We can argue about the effectiveness of reset-engine vs. full reset when
more than one engine is hung, but the benefit of resetting just one engine
is reduced when the driver has to do it multiple times.
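As a rough illustration of the promotion policy described above (plain C
model with hypothetical names, not the driver code; in i915 this lives in
i915_reset()/i915_reset_engine()): an engine reset is attempted only when
the platform supports it and exactly one engine bit is set, otherwise the
recovery is promoted to a full GPU reset:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two-level reset policy; not actual i915 code. */
enum reset_kind { RESET_ENGINE_ONLY, RESET_FULL_GPU };

/* True when exactly one bit is set, i.e. exactly one engine is hung. */
static bool single_engine(uint32_t engine_mask)
{
	return engine_mask && !(engine_mask & (engine_mask - 1));
}

static enum reset_kind choose_reset(uint32_t engine_mask,
				    bool has_reset_engine,
				    bool engine_reset_failed)
{
	if (has_reset_engine && single_engine(engine_mask) &&
	    !engine_reset_failed)
		return RESET_ENGINE_ONLY;

	/* Multiple hung engines, no platform support, or a failed
	 * engine-reset attempt: promote to full/chip reset. */
	return RESET_FULL_GPU;
}
```

The `mask & (mask - 1)` test clears the lowest set bit, so it is zero
exactly when at most one engine is flagged as hung.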

Note - with GuC submission enabled, Fi.CI.BAT reports a regression in
gem_ringfill@basic-default-hang. However, local testing shows it is only
reproducible while running IGT v1.7; IGT v1.8 or newer does not show the
problem.

v2: ELSP queue request tracking and reset path changes to handle incomplete
requests during reset. Thanks to Chris Wilson for providing these patches.

v3: Let the waiter keep handling the full gpu reset if it already has the
lock; point out that GuC submission needs a different method to restart
workloads after the engine reset completes.

v4: Handle reset as a two-level process, first trying an engine-only reset
and falling back to full/chip reset as needed, i.e. reset_engine will need
the struct_mutex.

v5: Rebased after the reset flag was split in two; add GuC support, include
the watchdog detection patches, and address comments from the previous RFC.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Arun Siluvery (7):
  drm/i915: Update i915.reset to handle engine resets
  drm/i915/tdr: Modify error handler for per engine hang recovery
  drm/i915/tdr: Add support for per engine reset recovery
  drm/i915/tdr: Add engine reset count to error state
  drm/i915/tdr: Export per-engine reset count info to debugfs
  drm/i915/tdr: Enable Engine reset and recovery support
  drm/i915/guc: Provide register list to be saved/restored during engine
    reset

Michel Thierry (10):
  drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag
  drm/i915: Rename gen8_(un)request_engine_reset to
    gen8_(un)request_reset_engine
  drm/i915: Add engine reset count in get-reset-stats ioctl
  drm/i915/selftests: reset engine self tests
  drm/i915/guc: Add support for reset engine using GuC commands
  drm/i915: Watchdog timeout: Pass GuC shared data structure during
    param load
  drm/i915: Watchdog timeout: IRQ handler for gen8+
  drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  drm/i915: Watchdog timeout: Export media reset count from GuC to
    debugfs

Mika Kuoppala (1):
  drm/i915: Skip reset request if there is one already

 drivers/gpu/drm/i915/i915_debugfs.c              |  49 +++++++
 drivers/gpu/drm/i915/i915_drv.c                  | 150 +++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_drv.h                  |  34 ++++-
 drivers/gpu/drm/i915/i915_gem.c                  |  16 ++-
 drivers/gpu/drm/i915/i915_gem_context.c          |  92 +++++++++++++-
 drivers/gpu/drm/i915/i915_gem_context.h          |  24 ++++
 drivers/gpu/drm/i915/i915_gem_request.c          |   5 +-
 drivers/gpu/drm/i915/i915_gpu_error.c            |  14 +-
 drivers/gpu/drm/i915/i915_guc_submission.c       |  98 +++++++++++++-
 drivers/gpu/drm/i915/i915_irq.c                  |  32 ++++-
 drivers/gpu/drm/i915/i915_params.c               |   6 +-
 drivers/gpu/drm/i915/i915_params.h               |   2 +-
 drivers/gpu/drm/i915/i915_pci.c                  |   5 +-
 drivers/gpu/drm/i915/i915_reg.h                  |   6 +
 drivers/gpu/drm/i915/intel_guc_fwif.h            |  26 +++-
 drivers/gpu/drm/i915/intel_guc_loader.c          |   8 ++
 drivers/gpu/drm/i915/intel_hangcheck.c           |  13 +-
 drivers/gpu/drm/i915/intel_lrc.c                 | 155 ++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h          |  24 ++++
 drivers/gpu/drm/i915/intel_uc.h                  |   1 +
 drivers/gpu/drm/i915/intel_uncore.c              |  43 ++++++-
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 148 +++++++++++++++++++++-
 include/uapi/drm/i915_drm.h                      |   7 +-
 23 files changed, 906 insertions(+), 52 deletions(-)

-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH v5 01/18] drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
@ 2017-03-25  1:29 ` Michel Thierry
  2017-03-25  1:29 ` [PATCH v5 02/18] drm/i915: Rename gen8_(un)request_engine_reset to gen8_(un)request_reset_engine Michel Thierry
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:29 UTC (permalink / raw)
  To: intel-gfx

It has been replaced by I915_RESET_BACKOFF / I915_RESET_HANDOFF.

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e8250bb79d12..6f19f5467f74 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1588,7 +1588,7 @@ struct i915_gpu_error {
 	 *
 	 * This is a counter which gets incremented when reset is triggered,
 	 *
-	 * Before the reset commences, the I915_RESET_IN_PROGRESS bit is set
+	 * Before the reset commences, the I915_RESET_BACKOFF bit is set
 	 * meaning that any waiters holding onto the struct_mutex should
 	 * relinquish the lock immediately in order for the reset to start.
 	 *
-- 
2.11.0


* [PATCH v5 02/18] drm/i915: Rename gen8_(un)request_engine_reset to gen8_(un)request_reset_engine
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
  2017-03-25  1:29 ` [PATCH v5 01/18] drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag Michel Thierry
@ 2017-03-25  1:29 ` Michel Thierry
  2017-03-25  9:01   ` Chris Wilson
  2017-03-25  1:29 ` [PATCH v5 03/18] drm/i915: Update i915.reset to handle engine resets Michel Thierry
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:29 UTC (permalink / raw)
  To: intel-gfx

All other functions related to resetting engines use the reset_engine
naming, so rename these two for consistency.

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 6d1ea26b2493..5f815a100a8d 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1658,7 +1658,7 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
 	return ret;
 }
 
-static int gen8_request_engine_reset(struct intel_engine_cs *engine)
+static int gen8_request_reset_engine(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 	int ret;
@@ -1677,7 +1677,7 @@ static int gen8_request_engine_reset(struct intel_engine_cs *engine)
 	return ret;
 }
 
-static void gen8_unrequest_engine_reset(struct intel_engine_cs *engine)
+static void gen8_unrequest_reset_engine(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 
@@ -1692,14 +1692,14 @@ static int gen8_reset_engines(struct drm_i915_private *dev_priv,
 	unsigned int tmp;
 
 	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
-		if (gen8_request_engine_reset(engine))
+		if (gen8_request_reset_engine(engine))
 			goto not_ready;
 
 	return gen6_reset_engines(dev_priv, engine_mask);
 
 not_ready:
 	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
-		gen8_unrequest_engine_reset(engine);
+		gen8_unrequest_reset_engine(engine);
 
 	return -EIO;
 }
-- 
2.11.0


* [PATCH v5 03/18] drm/i915: Update i915.reset to handle engine resets
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
  2017-03-25  1:29 ` [PATCH v5 01/18] drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag Michel Thierry
  2017-03-25  1:29 ` [PATCH v5 02/18] drm/i915: Rename gen8_(un)request_engine_reset to gen8_(un)request_reset_engine Michel Thierry
@ 2017-03-25  1:29 ` Michel Thierry
  2017-03-25  1:29 ` [PATCH v5 04/18] drm/i915/tdr: Modify error handler for per engine hang recovery Michel Thierry
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:29 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

In preparation for the engine reset work, update this parameter to handle
more than one type of reset. The default at the moment is still full GPU
reset.
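A minimal sketch of the widened parameter's semantics (hypothetical
helpers, not the driver's code; the actual consumer is the
`i915.reset == 2` check added later in the series), matching the
0/1/2 meanings in the new MODULE_PARM_DESC:

```c
#include <stdbool.h>

/* Hypothetical interpretation of i915.reset after the bool -> int change:
 * 0 = GPU resets disabled, 1 = full GPU reset only (default),
 * 2 = per-engine reset, with full reset as fallback.
 */
static bool reset_allowed(int reset_param)
{
	return reset_param >= 1;
}

static bool engine_reset_allowed(int reset_param)
{
	/* Mirrors the driver's "i915.reset == 2" opt-in. */
	return reset_param == 2;
}
```

Usage would be e.g. `modprobe i915 reset=2` to opt in to the
engine-reset path.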

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 6 +++---
 drivers/gpu/drm/i915/i915_params.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index b6a7e363d076..045cadb77285 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -46,7 +46,7 @@ struct i915_params i915 __read_mostly = {
 	.prefault_disable = 0,
 	.load_detect_test = 0,
 	.force_reset_modeset_test = 0,
-	.reset = true,
+	.reset = 1,
 	.error_capture = true,
 	.invert_brightness = 0,
 	.disable_display = 0,
@@ -115,8 +115,8 @@ MODULE_PARM_DESC(vbt_sdvo_panel_type,
 	"Override/Ignore selection of SDVO panel mode in the VBT "
 	"(-2=ignore, -1=auto [default], index in VBT BIOS table)");
 
-module_param_named_unsafe(reset, i915.reset, bool, 0600);
-MODULE_PARM_DESC(reset, "Attempt GPU resets (default: true)");
+module_param_named_unsafe(reset, i915.reset, int, 0600);
+MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset [default], 2=engine reset)");
 
 #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
 module_param_named(error_capture, i915.error_capture, bool, 0600);
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index 34148cc8637c..febbfdbd30bd 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -51,6 +51,7 @@
 	func(int, use_mmio_flip); \
 	func(int, mmio_debug); \
 	func(int, edp_vswing); \
+	func(int, reset); \
 	func(unsigned int, inject_load_failure); \
 	/* leave bools at the end to not create holes */ \
 	func(bool, alpha_support); \
@@ -60,7 +61,6 @@
 	func(bool, prefault_disable); \
 	func(bool, load_detect_test); \
 	func(bool, force_reset_modeset_test); \
-	func(bool, reset); \
 	func(bool, error_capture); \
 	func(bool, disable_display); \
 	func(bool, verbose_state_checks); \
-- 
2.11.0


* [PATCH v5 04/18] drm/i915/tdr: Modify error handler for per engine hang recovery
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (2 preceding siblings ...)
  2017-03-25  1:29 ` [PATCH v5 03/18] drm/i915: Update i915.reset to handle engine resets Michel Thierry
@ 2017-03-25  1:29 ` Michel Thierry
  2017-03-25  9:10   ` Chris Wilson
  2017-03-25  1:29 ` [PATCH v5 05/18] drm/i915/tdr: Add support for per engine reset recovery Michel Thierry
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:29 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

This is a preparatory patch which modifies the error handler to do per-engine
hang recovery. The actual patch implementing this sequence follows later in
the series. The aim is to prepare the existing recovery function to attempt
this new path where applicable (which fails at this point because the core
implementation is lacking) and continue recovery using the legacy full GPU
reset.

A helper function is also added to query the availability of engine
reset.

The error events used to notify userspace of a reset are adapted to engine
reset so that they don't break users listening to them. In the legacy path
we report an error event, a reset event before resetting the GPU, and a
reset-done event marking the completion of the reset. The same behaviour is
kept, but the reset event is dispatched only once even when multiple
engines are hung. Finally, once the reset is complete, we send the
reset-done event as usual.
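The event ordering above can be modelled like this (hypothetical C sketch,
not the actual kobject_uevent_env() calls): one error event, one reset
event before any recovery, and one reset-done event after it, regardless
of how many engines were hung:

```c
/* Hypothetical model of the uevent dispatch guarantee described above. */
struct reset_events {
	int error;	/* "something went wrong" notification */
	int reset;	/* sent before recovery starts */
	int reset_done;	/* sent after recovery completes */
};

static struct reset_events dispatch_for_hang(unsigned int nr_hung_engines)
{
	struct reset_events ev = {0, 0, 0};

	ev.error++;	/* reported once, as in the legacy path */
	ev.reset++;	/* dispatched once, even for multiple engines */
	while (nr_hung_engines--)
		;	/* per-engine recovery emits no extra events */
	ev.reset_done++;	/* completion marker, as in the legacy path */
	return ev;
}
```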

Note that this implementation of engine reset is for i915 directly
submitting to the ELSP, where the driver manages hang detection, recovery
and resubmission. With GuC submission these tasks are shared between driver
and firmware; i915 will still be responsible for detecting a hang, and when
it does it will have to request the GuC to reset that engine and remind the
firmware about the outstanding submissions. This will be added in a
different patch.

v2: rebase, advertise engine reset availability in platform definition,
add note about GuC submission.
v3: s/*engine_reset*/*reset_engine*/. (Chris)
Handle reset as a two-level process, first trying an engine-only reset and
falling back to full/chip reset as needed, i.e. reset_engine will need the
struct_mutex.
v4: Pass the engine mask to i915_reset. (Chris)
v5: Rebase, update selftests.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c                  | 49 +++++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_drv.h                  |  7 +++-
 drivers/gpu/drm/i915/i915_gem_request.c          |  3 +-
 drivers/gpu/drm/i915/i915_irq.c                  | 19 ++++++++-
 drivers/gpu/drm/i915/i915_pci.c                  |  5 ++-
 drivers/gpu/drm/i915/intel_ringbuffer.h          | 16 ++++++++
 drivers/gpu/drm/i915/intel_uncore.c              | 11 ++++++
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c |  4 +-
 8 files changed, 105 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 6d9944a00b7d..c7eb43deb33e 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1794,7 +1794,7 @@ static int i915_resume_switcheroo(struct drm_device *dev)
 }
 
 /**
- * i915_reset - reset chip after a hang
+ * i915_reset_chip - reset chip after a hang
  * @dev_priv: device private to reset
  *
  * Reset the chip.  Useful if a hang is detected. Marks the device as wedged
@@ -1810,7 +1810,7 @@ static int i915_resume_switcheroo(struct drm_device *dev)
  *   - re-init interrupt state
  *   - re-init display
  */
-void i915_reset(struct drm_i915_private *dev_priv)
+void i915_reset_chip(struct drm_i915_private *dev_priv)
 {
 	struct i915_gpu_error *error = &dev_priv->gpu_error;
 	int ret;
@@ -1821,6 +1821,8 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	if (!test_bit(I915_RESET_HANDOFF, &error->flags))
 		return;
 
+	DRM_DEBUG_DRIVER("resetting chip\n");
+
 	/* Clear any previous failed attempts at recovery. Time to try again. */
 	if (!i915_gem_unset_wedged(dev_priv))
 		goto wakeup;
@@ -1884,6 +1886,49 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	goto finish;
 }
 
+/**
+ * i915_reset_engine - reset GPU engine to recover from a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected.
+ * Returns zero on successful reset or otherwise an error code.
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+	/* FIXME: replace me with engine reset sequence */
+	return -ENODEV;
+}
+
+/**
+ * i915_reset - start either engine or full GPU reset to recover from a hang
+ * @dev_priv: device private
+ * @engine_mask: mask representing engines that are hung
+ *
+ * Wrapper function to initiate a GPU reset. When platform supports it, attempt
+ * to reset the hung engine only. If engine reset fails (or is not supported),
+ * reset the full GPU. If more than one engine is hung, the speed gains of
+ * reset_engine are negligible, thus promote to full reset.
+ *
+ * Caller must hold the struct_mutex.
+ */
+void i915_reset(struct drm_i915_private *dev_priv, u32 engine_mask)
+{
+	/* try engine reset first */
+	if (intel_has_reset_engine(dev_priv) &&
+	    !(engine_mask & (engine_mask - 1))) {
+		struct intel_engine_cs *engine =
+			dev_priv->engine[intel_engineid_from_flag(engine_mask)];
+
+		if (i915_reset_engine(engine))
+			goto reset_chip;
+
+		return;
+	}
+
+reset_chip:
+	i915_reset_chip(dev_priv);
+}
+
 static int i915_pm_suspend(struct device *kdev)
 {
 	struct pci_dev *pdev = to_pci_dev(kdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 6f19f5467f74..b02393f33753 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -832,6 +832,7 @@ struct intel_csr {
 	func(has_ddi); \
 	func(has_decoupled_mmio); \
 	func(has_dp_mst); \
+	func(has_reset_engine); \
 	func(has_fbc); \
 	func(has_fpga_dbg); \
 	func(has_full_ppgtt); \
@@ -1636,6 +1637,9 @@ struct i915_gpu_error {
 #define I915_RESET_HANDOFF	1
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
+	/* if available, engine-specific reset is tried before full gpu reset */
+	u32 reset_engine_mask;
+
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
 	 * to release the struct_mutex for the reset to procede.
@@ -3038,7 +3042,8 @@ extern int i915_driver_load(struct pci_dev *pdev,
 extern void i915_driver_unload(struct drm_device *dev);
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
-extern void i915_reset(struct drm_i915_private *dev_priv);
+extern void i915_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
+extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index dbfa9db2419d..5ecb5a7a0c64 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -1016,7 +1016,8 @@ static bool __i915_wait_request_check_and_reset(struct drm_i915_gem_request *req
 		return false;
 
 	__set_current_state(TASK_RUNNING);
-	i915_reset(request->i915);
+	i915_reset(request->i915,
+		   request->i915->gpu_error.reset_engine_mask);
 	return true;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 3d2e6232f17a..87e76ef589b1 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2647,7 +2647,15 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 
 	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
 
-	DRM_DEBUG_DRIVER("resetting chip\n");
+	/*
+	 * This event needs to be sent before performing gpu reset. When
+	 * engine resets are supported we iterate through all engines and
+	 * reset hung engines individually. To keep the event dispatch
+	 * mechanism consistent with full gpu reset, this is only sent once
+	 * even when multiple engines are hung. It is also safe to move this
+	 * here because when we are in this function, we will definitely
+	 * perform gpu reset.
+	 */
 	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
 
 	intel_prepare_reset(dev_priv);
@@ -2663,7 +2671,8 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 		 * deadlocks with the reset work.
 		 */
 		if (mutex_trylock(&dev_priv->drm.struct_mutex)) {
-			i915_reset(dev_priv);
+			i915_reset(dev_priv,
+				   dev_priv->gpu_error.reset_engine_mask);
 			mutex_unlock(&dev_priv->drm.struct_mutex);
 		}
 
@@ -2780,6 +2789,12 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 			     &dev_priv->gpu_error.flags))
 		goto out;
 
+	/*
+	 * Save which engines need reset; if engine support is available,
+	 * we can just reset the hung engines.
+	 */
+	dev_priv->gpu_error.reset_engine_mask = engine_mask;
+
 	i915_reset_and_wakeup(dev_priv);
 
 out:
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 732101ed57fb..da8b82acd055 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -308,7 +308,8 @@ static const struct intel_device_info intel_haswell_info = {
 	BDW_COLORS, \
 	.has_logical_ring_contexts = 1, \
 	.has_full_48bit_ppgtt = 1, \
-	.has_64bit_reloc = 1
+	.has_64bit_reloc = 1, \
+	.has_reset_engine = 1
 
 static const struct intel_device_info intel_broadwell_info = {
 	BDW_FEATURES,
@@ -340,6 +341,7 @@ static const struct intel_device_info intel_cherryview_info = {
 	.has_gmch_display = 1,
 	.has_aliasing_ppgtt = 1,
 	.has_full_ppgtt = 1,
+	.has_reset_engine = 1,
 	.display_mmio_offset = VLV_DISPLAY_BASE,
 	GEN_CHV_PIPEOFFSETS,
 	CURSOR_OFFSETS,
@@ -389,6 +391,7 @@ static const struct intel_device_info intel_skylake_gt3_info = {
 	.has_aliasing_ppgtt = 1, \
 	.has_full_ppgtt = 1, \
 	.has_full_48bit_ppgtt = 1, \
+	.has_reset_engine = 1, \
 	GEN_DEFAULT_PIPEOFFSETS, \
 	IVB_CURSOR_OFFSETS, \
 	BDW_COLORS
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 2ecb41788fb6..e8faf2c34c97 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -438,6 +438,22 @@ intel_engine_flag(const struct intel_engine_cs *engine)
 	return 1 << engine->id;
 }
 
+/* works only for engine_mask with 1 engine bit set */
+static inline unsigned
+intel_engineid_from_flag(unsigned engine_mask)
+{
+	unsigned engine_id = 0;
+
+	GEM_BUG_ON(engine_mask & (engine_mask - 1));
+
+	while (engine_mask >>= 1)
+		engine_id++;
+
+	GEM_BUG_ON(engine_id >= I915_NUM_ENGINES);
+
+	return engine_id;
+}
+
 static inline void
 intel_flush_status_page(struct intel_engine_cs *engine, int reg)
 {
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 5f815a100a8d..be51b5a0689a 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1751,6 +1751,17 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
+/*
+ * When GuC submission is enabled, GuC manages ELSP and can initiate the
+ * engine reset too. For now, fall back to full GPU reset if it is enabled.
+ */
+bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
+{
+	return (dev_priv->info.has_reset_engine &&
+		!dev_priv->guc.execbuf_client &&
+		i915.reset == 2);
+}
+
 int intel_guc_reset(struct drm_i915_private *dev_priv)
 {
 	int ret;
diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index 6ec7c731a267..76d5f78c1724 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -307,7 +307,7 @@ static int igt_global_reset(void *arg)
 	mutex_lock(&i915->drm.struct_mutex);
 	reset_count = i915_reset_count(&i915->gpu_error);
 
-	i915_reset(i915);
+	i915_reset(i915, ALL_ENGINES);
 
 	if (i915_reset_count(&i915->gpu_error) == reset_count) {
 		pr_err("No GPU reset recorded!\n");
@@ -472,7 +472,7 @@ static int igt_reset_queue(void *arg)
 
 			reset_count = fake_hangcheck(prev);
 
-			i915_reset(i915);
+			i915_reset(i915, ALL_ENGINES);
 
 			GEM_BUG_ON(test_bit(I915_RESET_HANDOFF,
 					    &i915->gpu_error.flags));
-- 
2.11.0


* [PATCH v5 05/18] drm/i915/tdr: Add support for per engine reset recovery
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (3 preceding siblings ...)
  2017-03-25  1:29 ` [PATCH v5 04/18] drm/i915/tdr: Modify error handler for per engine hang recovery Michel Thierry
@ 2017-03-25  1:29 ` Michel Thierry
  2017-03-25  1:29 ` [PATCH v5 06/18] drm/i915: Skip reset request if there is one already Michel Thierry
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:29 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

This change implements support for per-engine reset as an initial, less
intrusive hang-recovery option, to be attempted before falling back to the
legacy full GPU reset if necessary. This is only supported from Gen8
onwards.

The hang checker determines which engines are hung and invokes the error
handler to recover from it. The error handler schedules recovery for each
of the hung engines. The recovery procedure is as follows:
 - identify the request that caused the hang and drop it
 - force the engine to idle: this is done by issuing a reset request
 - reset and re-init the engine
 - restart submissions to the engine

If the engine reset fails, we fall back to the heavyweight full GPU reset,
which resets all engines and reinitializes the complete HW and SW state.
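The ordered steps above can be sketched as a small state machine
(hypothetical C model; the real sequence is i915_reset_engine() in the
diff below). Each step must succeed before the next runs, and any failure
promotes the recovery to a full GPU reset:

```c
/* Hypothetical model of the per-engine recovery order; not i915 code. */
enum recovery_step {
	DROP_REQUEST,		/* identify and skip the hung request */
	IDLE_ENGINE,		/* issue a reset request to quiesce it */
	RESET_ENGINE,		/* soft-reset and re-init the engine */
	RESTART_SUBMISSION,	/* replay remaining queued requests */
	DONE,
	FULL_GPU_RESET		/* fallback on any failure */
};

static enum recovery_step next_step(enum recovery_step cur, int ok)
{
	if (!ok)
		return FULL_GPU_RESET;	/* promote to heavyweight reset */

	switch (cur) {
	case DROP_REQUEST:	 return IDLE_ENGINE;
	case IDLE_ENGINE:	 return RESET_ENGINE;
	case RESET_ENGINE:	 return RESTART_SUBMISSION;
	case RESTART_SUBMISSION: return DONE;
	default:		 return cur;
	}
}
```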

v2: Rebase.
v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
calling i915_gem_reset_engine (Chris).
v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
reuse the function for reset_engine.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c         | 88 +++++++++++++++++++++++++++++++--
 drivers/gpu/drm/i915/i915_drv.h         | 11 +++--
 drivers/gpu/drm/i915/i915_gem.c         | 16 +++---
 drivers/gpu/drm/i915/i915_gem_request.c |  2 +-
 drivers/gpu/drm/i915/intel_uncore.c     | 20 ++++++++
 5 files changed, 122 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index c7eb43deb33e..9c7bdcf03c13 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1831,7 +1831,7 @@ void i915_reset_chip(struct drm_i915_private *dev_priv)
 
 	pr_notice("drm/i915: Resetting chip after gpu hang\n");
 	disable_irq(dev_priv->drm.irq);
-	ret = i915_gem_reset_prepare(dev_priv);
+	ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
 	if (ret) {
 		DRM_ERROR("GPU recovery failed\n");
 		intel_gpu_reset(dev_priv, ALL_ENGINES);
@@ -1873,7 +1873,7 @@ void i915_reset_chip(struct drm_i915_private *dev_priv)
 	i915_queue_hangcheck(dev_priv);
 
 finish:
-	i915_gem_reset_finish(dev_priv);
+	i915_gem_reset_finish(dev_priv, ALL_ENGINES);
 	enable_irq(dev_priv->drm.irq);
 
 wakeup:
@@ -1892,11 +1892,91 @@ void i915_reset_chip(struct drm_i915_private *dev_priv)
  *
  * Reset a specific GPU engine. Useful if a hang is detected.
  * Returns zero on successful reset or otherwise an error code.
+ *
+ * Caller must hold the struct_mutex.
+ *
+ * Procedure is:
+ *  - identifies the request that caused the hang and it is dropped
+ *  - force engine to idle: this is done by issuing a reset request
+ *  - reset engine
+ *  - restart submissions to the engine
  */
 int i915_reset_engine(struct intel_engine_cs *engine)
 {
-	/* FIXME: replace me with engine reset sequence */
-	return -ENODEV;
+	int ret;
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+
+	lockdep_assert_held(&dev_priv->drm.struct_mutex);
+	GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
+
+	if (!test_and_clear_bit(I915_RESET_HANDOFF, &error->flags))
+		return 0;
+
+	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
+
+	/*
+	 * We need to first idle the engine by issuing a reset request,
+	 * then perform soft reset and re-initialize hw state, for all of
+	 * this GT power need to be awake so ensure it does throughout the
+	 * process
+	 */
+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+	disable_irq(dev_priv->drm.irq);
+	ret = i915_gem_reset_prepare(dev_priv, intel_engine_flag(engine));
+	if (ret) {
+		DRM_ERROR("Previous reset failed - promote to full reset\n");
+		goto error;
+	}
+
+	if (dev_priv->gt.active_requests)
+		engine_retire_requests(engine);
+
+	/*
+	 * the request that caused the hang is stuck on elsp, identify the
+	 * active request and drop it, adjust head to skip the offending
+	 * request to resume executing remaining requests in the queue.
+	 */
+	i915_gem_reset_engine(engine);
+
+	/* forcing engine to idle */
+	ret = intel_request_reset_engine(engine);
+	if (ret) {
+		DRM_ERROR("Failed to disable %s\n", engine->name);
+		goto error;
+	}
+
+	/* finally, reset engine */
+	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+	if (ret) {
+		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
+		intel_unrequest_reset_engine(engine);
+		goto error;
+	}
+
+	/* be sure the request reset bit gets cleared */
+	intel_unrequest_reset_engine(engine);
+
+	/* i915_gem_reset_prepare revoked the fences */
+	i915_gem_restore_fences(dev_priv);
+	i915_gem_reset_finish(dev_priv, intel_engine_flag(engine));
+
+	/* replay remaining requests in the queue */
+	ret = engine->init_hw(engine);
+	if (ret)
+		goto error;
+
+wakeup:
+	enable_irq(dev_priv->drm.irq);
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
+	wake_up_bit(&error->flags, I915_RESET_HANDOFF);
+	return ret;
+
+error:
+	/* use full gpu reset to recover on error */
+	set_bit(I915_RESET_HANDOFF, &error->flags);
+	goto wakeup;
 }
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index b02393f33753..67201ecd98e5 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3044,6 +3044,8 @@ extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
+extern int intel_request_reset_engine(struct intel_engine_cs *engine);
+extern void intel_unrequest_reset_engine(struct intel_engine_cs *engine);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
@@ -3420,7 +3422,7 @@ int __must_check i915_gem_set_global_seqno(struct drm_device *dev, u32 seqno);
 
 struct drm_i915_gem_request *
 i915_gem_find_active_request(struct intel_engine_cs *engine);
-
+void engine_retire_requests(struct intel_engine_cs *engine);
 void i915_gem_retire_requests(struct drm_i915_private *dev_priv);
 
 static inline bool i915_reset_backoff(struct i915_gpu_error *error)
@@ -3448,11 +3450,14 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
-int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
+int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
+			   unsigned int engine_mask);
 void i915_gem_reset(struct drm_i915_private *dev_priv);
-void i915_gem_reset_finish(struct drm_i915_private *dev_priv);
+void i915_gem_reset_finish(struct drm_i915_private *dev_priv,
+			   unsigned int engine_mask);
 void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
 bool i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
+void i915_gem_reset_engine(struct intel_engine_cs *engine);
 
 void i915_gem_init_mmio(struct drm_i915_private *i915);
 int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 9515daea6358..ba0bdae30d5a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2741,14 +2741,15 @@ static bool engine_stalled(struct intel_engine_cs *engine)
 	return true;
 }
 
-int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
+int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
+			   unsigned int engine_mask)
 {
 	struct intel_engine_cs *engine;
-	enum intel_engine_id id;
+	unsigned int tmp;
 	int err = 0;
 
 	/* Ensure irq handler finishes, and not run again. */
-	for_each_engine(engine, dev_priv, id) {
+	for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
 		struct drm_i915_gem_request *request;
 
 		/* Prevent the signaler thread from updating the request
@@ -2868,7 +2869,7 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 	return guilty;
 }
 
-static void i915_gem_reset_engine(struct intel_engine_cs *engine)
+void i915_gem_reset_engine(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *request;
 
@@ -2914,14 +2915,15 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
 	}
 }
 
-void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
+void i915_gem_reset_finish(struct drm_i915_private *dev_priv,
+			   unsigned int engine_mask)
 {
 	struct intel_engine_cs *engine;
-	enum intel_engine_id id;
+	unsigned int tmp;
 
 	lockdep_assert_held(&dev_priv->drm.struct_mutex);
 
-	for_each_engine(engine, dev_priv, id) {
+	for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
 		tasklet_enable(&engine->irq_tasklet);
 		kthread_unpark(engine->breadcrumbs.signaler);
 	}
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 5ecb5a7a0c64..180bd84614fc 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -1177,7 +1177,7 @@ long i915_wait_request(struct drm_i915_gem_request *req,
 	return timeout;
 }
 
-static void engine_retire_requests(struct intel_engine_cs *engine)
+void engine_retire_requests(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *request, *next;
 	u32 seqno = intel_engine_get_seqno(engine);
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index be51b5a0689a..596b99dbbff8 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1781,6 +1781,26 @@ int intel_guc_reset(struct drm_i915_private *dev_priv)
 	return ret;
 }
 
+/*
+ * On gen8+ a reset request has to be issued via the reset control register
+ * before a GPU engine can be reset in order to stop the command streamer
+ * and idle the engine. This replaces the legacy way of stopping an engine
+ * by writing to the stop ring bit in the MI_MODE register.
+ */
+int intel_request_reset_engine(struct intel_engine_cs *engine)
+{
+	return gen8_request_reset_engine(engine);
+}
+
+/*
+ * It is possible to back off from a previously issued reset request by simply
+ * clearing the reset request bit in the reset control register.
+ */
+void intel_unrequest_reset_engine(struct intel_engine_cs *engine)
+{
+	gen8_unrequest_reset_engine(engine);
+}
+
 bool intel_uncore_unclaimed_mmio(struct drm_i915_private *dev_priv)
 {
 	return check_for_unclaimed_mmio(dev_priv);
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 41+ messages in thread
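The recovery policy implemented by the function above (try a per-engine reset, promote to a full GPU reset when any step fails, as in the `error:` path that sets `I915_RESET_HANDOFF`) can be modeled as a small standalone sketch. All names and the counter layout here are illustrative, not the driver's actual bookkeeping:

```c
#include <assert.h>

/* Toy model of the per-engine recovery policy: counters stand in for
 * the driver's reset statistics, -5 stands in for -EIO. */
struct gpu_error_state {
	unsigned int full_reset_count;
	unsigned int engine_reset_count[4];	/* one slot per engine */
};

static int do_engine_reset(int should_fail)
{
	return should_fail ? -5 : 0;
}

/* Try a per-engine reset first; on failure, fall back to a full GPU
 * reset. Either way the device ends up recovered. */
static int recover(struct gpu_error_state *e, int engine_id, int engine_fails)
{
	if (do_engine_reset(engine_fails) == 0) {
		e->engine_reset_count[engine_id]++;
		return 0;
	}
	e->full_reset_count++;	/* fallback: full GPU reset */
	return 0;
}
```

The point of the sketch is the ordering: the cheap, engine-local path is attempted first and the heavyweight full reset only counts when the engine path could not complete.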

* [PATCH v5 06/18] drm/i915: Skip reset request if there is one already
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (4 preceding siblings ...)
  2017-03-25  1:29 ` [PATCH v5 05/18] drm/i915/tdr: Add support for per engine reset recovery Michel Thierry
@ 2017-03-25  1:29 ` Michel Thierry
  2017-03-25  1:29 ` [PATCH v5 07/18] drm/i915/tdr: Add engine reset count to error state Michel Thierry
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:29 UTC (permalink / raw)
  To: intel-gfx

From: Mika Kuoppala <mika.kuoppala@linux.intel.com>

To perform an engine reset we first disable the engine so that its state
can be captured; this is done by issuing a reset request. Because we are
reusing the existing reset infrastructure, the reset function later checks
the engine mask and issues the reset request a second time, which is
unnecessary. To avoid this, check whether the engine is already prepared
and, if so, return early.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 596b99dbbff8..9886d7bd11ba 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1661,10 +1661,15 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
 static int gen8_request_reset_engine(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
+	const i915_reg_t reset_ctrl = RING_RESET_CTL(engine->mmio_base);
+	const u32 ready = RESET_CTL_REQUEST_RESET | RESET_CTL_READY_TO_RESET;
 	int ret;
 
-	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
-		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
+	/* If the engine has already been prepared, we can shortcut here */
+	if ((I915_READ_FW(reset_ctrl) & ready) == ready)
+		return 0;
+
+	I915_WRITE_FW(reset_ctrl, _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
 
 	ret = intel_wait_for_register_fw(dev_priv,
 					 RING_RESET_CTL(engine->mmio_base),
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread
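The shortcut added above relies on two conventions worth spelling out: i915's masked registers, where the upper 16 bits of a write select which of the lower 16 bits are actually touched, and the pair of RESET_CTL bits that are both set once an engine has acknowledged a pending reset request. A minimal userspace sketch of that logic follows; the bit positions are assumptions for illustration, not taken from the hardware spec:

```c
#include <assert.h>
#include <stdint.h>

/* Masked-register write value: mask in the high word, value in the
 * low word, so unrelated bits of the register are left alone. */
static inline uint32_t masked_bit_enable(uint32_t bit)
{
	return (bit << 16) | bit;
}

/* Illustrative bit positions (assumed, not from the PRM). */
#define RESET_CTL_REQUEST_RESET  (1u << 0)
#define RESET_CTL_READY_TO_RESET (1u << 1)

/* Mirrors the patch's shortcut: the engine is already prepared when
 * both the request and ready bits read back as set. */
static int engine_ready_for_reset(uint32_t reset_ctl)
{
	const uint32_t ready = RESET_CTL_REQUEST_RESET | RESET_CTL_READY_TO_RESET;

	return (reset_ctl & ready) == ready;
}
```

With this model, the early return in `gen8_request_reset_engine()` is simply "skip the masked write and the register poll when `engine_ready_for_reset()` already holds".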

* [PATCH v5 07/18] drm/i915/tdr: Add engine reset count to error state
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (5 preceding siblings ...)
  2017-03-25  1:29 ` [PATCH v5 06/18] drm/i915: Skip reset request if there is one already Michel Thierry
@ 2017-03-25  1:29 ` Michel Thierry
  2017-03-25  1:30 ` [PATCH v5 08/18] drm/i915/tdr: Export per-engine reset count info to debugfs Michel Thierry
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:29 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

The driver maintains a count of how many times each engine has been reset,
and it is useful to capture this in the error state as well: it gives an
idea of how the engine was coping with the workloads it was executing
before this error state was captured.

A follow-up patch will provide this information in debugfs.

v2: s/engine_reset/reset_engine/ (Chris)
    Define count as unsigned int (Tvrtko)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c       | 2 ++
 drivers/gpu/drm/i915/i915_drv.h       | 8 ++++++++
 drivers/gpu/drm/i915/i915_gpu_error.c | 3 +++
 3 files changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 9c7bdcf03c13..a111b39bbc12 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1967,6 +1967,8 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	if (ret)
 		goto error;
 
+	error->reset_engine_count[engine->id]++;
+
 wakeup:
 	enable_irq(dev_priv->drm.irq);
 	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 67201ecd98e5..fbb4f200756a 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -999,6 +999,7 @@ struct i915_gpu_state {
 		enum intel_engine_hangcheck_action hangcheck_action;
 		struct i915_address_space *vm;
 		int num_requests;
+		u32 reset_count;
 
 		/* position of active request inside the ring */
 		u32 rq_head, rq_post, rq_tail;
@@ -1639,6 +1640,7 @@ struct i915_gpu_error {
 
 	/* if available, engine-specific reset is tried before full gpu reset */
 	u32 reset_engine_mask;
+	u32 reset_engine_count[I915_NUM_ENGINES];
 
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
@@ -3450,6 +3452,12 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
+static inline u32 i915_reset_engine_count(struct i915_gpu_error *error,
+					  struct intel_engine_cs *engine)
+{
+	return READ_ONCE(error->reset_engine_count[engine->id]);
+}
+
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
 			   unsigned int engine_mask);
 void i915_gem_reset(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 5099b3599c34..5d015bcc7484 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -463,6 +463,7 @@ static void error_print_engine(struct drm_i915_error_state_buf *m,
 	err_printf(m, "  hangcheck action timestamp: %lu, %u ms ago\n",
 		   ee->hangcheck_timestamp,
 		   jiffies_to_msecs(jiffies - ee->hangcheck_timestamp));
+	err_printf(m, "  engine reset count: %u\n", ee->reset_count);
 
 	error_print_request(m, "  ELSP[0]: ", &ee->execlist[0]);
 	error_print_request(m, "  ELSP[1]: ", &ee->execlist[1]);
@@ -1236,6 +1237,8 @@ static void error_record_engine_registers(struct i915_gpu_state *error,
 	ee->hangcheck_timestamp = engine->hangcheck.action_timestamp;
 	ee->hangcheck_action = engine->hangcheck.action;
 	ee->hangcheck_stalled = engine->hangcheck.stalled;
+	ee->reset_count = i915_reset_engine_count(&dev_priv->gpu_error,
+						  engine);
 
 	if (USES_PPGTT(dev_priv)) {
 		int i;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v5 08/18] drm/i915/tdr: Export per-engine reset count info to debugfs
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (6 preceding siblings ...)
  2017-03-25  1:29 ` [PATCH v5 07/18] drm/i915/tdr: Add engine reset count to error state Michel Thierry
@ 2017-03-25  1:30 ` Michel Thierry
  2017-03-25  1:30 ` [PATCH v5 09/18] drm/i915/tdr: Enable Engine reset and recovery support Michel Thierry
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:30 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

A new debugfs entry is added to export the reset counts; this covers both
the full gpu reset count and the per-engine reset counts. It is useful for
tests that are expected to trigger a reset: the counts can be checked
before and after the test to confirm that the reset happened.

v2: Include reset engine count in i915_engine_info too (Chris).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 3ebdbc798f33..8db850541e03 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1400,6 +1400,23 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 	return 0;
 }
 
+static int i915_reset_info(struct seq_file *m, void *unused)
+{
+	struct drm_i915_private *dev_priv = node_to_i915(m->private);
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	seq_printf(m, "full gpu reset = %u\n", i915_reset_count(error));
+
+	for_each_engine(engine, dev_priv, id) {
+		seq_printf(m, "%s = %u\n", engine->name,
+			   i915_reset_engine_count(error, engine));
+	}
+
+	return 0;
+}
+
 static int ironlake_drpc_info(struct seq_file *m)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
@@ -3241,6 +3258,7 @@ static int i915_display_info(struct seq_file *m, void *unused)
 static int i915_engine_info(struct seq_file *m, void *unused)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 
@@ -3264,6 +3282,8 @@ static int i915_engine_info(struct seq_file *m, void *unused)
 			   engine->hangcheck.seqno,
 			   jiffies_to_msecs(jiffies - engine->hangcheck.action_timestamp),
 			   engine->timeline->inflight_seqnos);
+		seq_printf(m, "\tReset count: %d\n",
+			   i915_reset_engine_count(error, engine));
 
 		rcu_read_lock();
 
@@ -4774,6 +4794,7 @@ static const struct drm_info_list i915_debugfs_list[] = {
 	{"i915_huc_load_status", i915_huc_load_status_info, 0},
 	{"i915_frequency_info", i915_frequency_info, 0},
 	{"i915_hangcheck_info", i915_hangcheck_info, 0},
+	{"i915_reset_info", i915_reset_info, 0},
 	{"i915_drpc_info", i915_drpc_info, 0},
 	{"i915_emon_status", i915_emon_status, 0},
 	{"i915_ring_freq_table", i915_ring_freq_table, 0},
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v5 09/18] drm/i915/tdr: Enable Engine reset and recovery support
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (7 preceding siblings ...)
  2017-03-25  1:30 ` [PATCH v5 08/18] drm/i915/tdr: Export per-engine reset count info to debugfs Michel Thierry
@ 2017-03-25  1:30 ` Michel Thierry
  2017-03-25  1:30 ` [PATCH v5 10/18] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:30 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

This feature is only available from Gen8 onwards; on earlier devices the
driver falls back to the legacy full gpu reset.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 045cadb77285..14e2c2e57f96 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -46,7 +46,7 @@ struct i915_params i915 __read_mostly = {
 	.prefault_disable = 0,
 	.load_detect_test = 0,
 	.force_reset_modeset_test = 0,
-	.reset = 1,
+	.reset = 2,
 	.error_capture = true,
 	.invert_brightness = 0,
 	.disable_display = 0,
@@ -116,7 +116,7 @@ MODULE_PARM_DESC(vbt_sdvo_panel_type,
 	"(-2=ignore, -1=auto [default], index in VBT BIOS table)");
 
 module_param_named_unsafe(reset, i915.reset, int, 0600);
-MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset [default], 2=engine reset)");
+MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset, 2=engine reset [default])");
 
 #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
 module_param_named(error_capture, i915.error_capture, bool, 0600);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread
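The modparam change above can be read as a small policy table: 0 disables resets, 1 allows only the full gpu reset, and 2 (the new default) additionally enables per-engine reset, which is itself gated on Gen8+. A sketch of that gating, with illustrative names rather than the driver's actual predicates:

```c
#include <assert.h>

/* Illustrative mapping of the i915.reset modparam values. */
enum reset_mode {
	RESET_DISABLED = 0,
	RESET_FULL = 1,
	RESET_ENGINE = 2,
};

static int has_gpu_reset(int reset_param)
{
	return reset_param >= RESET_FULL;
}

static int has_reset_engine(int reset_param, int gen)
{
	/* engine reset is a Gen8+ feature and needs reset=2 */
	return gen >= 8 && reset_param >= RESET_ENGINE;
}
```

Note how `reset=2` still implies full-reset capability: the engine path is an additional, preferred strategy, not a replacement.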

* [PATCH v5 10/18] drm/i915: Add engine reset count in get-reset-stats ioctl
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (8 preceding siblings ...)
  2017-03-25  1:30 ` [PATCH v5 09/18] drm/i915/tdr: Enable Engine reset and recovery support Michel Thierry
@ 2017-03-25  1:30 ` Michel Thierry
  2017-03-25  1:30 ` [PATCH v5 11/18] drm/i915/selftests: reset engine self tests Michel Thierry
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:30 UTC (permalink / raw)
  To: intel-gfx; +Cc: mesa-dev

Users/tests relying on the total reset count will start seeing a smaller
number, since most hangs can now be handled by an engine reset.
Note that if we reset engine x, a context running on engine y will be
unaware and unaffected.

To start the discussion, include just a total engine reset count. If it
is deemed useful, it can be extended to report each engine separately.

v2: s/engine_reset/reset_engine/, use union in uapi to not break compatibility.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: mesa-dev@lists.freedesktop.org
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 14 +++++++++++---
 include/uapi/drm/i915_drm.h             |  6 +++++-
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 8bd0c4966913..edbed85a1c88 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -1133,9 +1133,11 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_reset_stats *args = data;
 	struct i915_gem_context *ctx;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
 	int ret;
 
-	if (args->flags || args->pad)
+	if (args->flags)
 		return -EINVAL;
 
 	if (args->ctx_id == DEFAULT_CONTEXT_HANDLE && !capable(CAP_SYS_ADMIN))
@@ -1151,10 +1153,16 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 		return PTR_ERR(ctx);
 	}
 
-	if (capable(CAP_SYS_ADMIN))
+	if (capable(CAP_SYS_ADMIN)) {
 		args->reset_count = i915_reset_count(&dev_priv->gpu_error);
-	else
+		for_each_engine(engine, dev_priv, id)
+			args->reset_engine_count +=
+				i915_reset_engine_count(&dev_priv->gpu_error,
+							engine);
+	} else {
 		args->reset_count = 0;
+		args->reset_engine_count = 0;
+	}
 
 	args->batch_active = ctx->guilty_count;
 	args->batch_pending = ctx->active_count;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 3554495bef13..f083931a7809 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1263,7 +1263,11 @@ struct drm_i915_reset_stats {
 	/* Number of batches lost pending for execution, for this context */
 	__u32 batch_pending;
 
-	__u32 pad;
+	union {
+		__u32 pad;
+		/* Engine resets since boot/module reload, for all contexts */
+		__u32 reset_engine_count;
+	};
 };
 
 struct drm_i915_gem_userptr {
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread
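The accounting done in the ioctl above can be summarized in a few lines: a privileged caller sees the global reset count plus the sum of all per-engine reset counts, while an unprivileged caller sees zeros. The field names below mirror the uapi union, but the surrounding types are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for struct drm_i915_reset_stats. */
struct reset_stats {
	uint32_t reset_count;
	uint32_t reset_engine_count;
};

static void fill_reset_stats(struct reset_stats *args, int is_admin,
			     uint32_t global, const uint32_t *per_engine,
			     int num_engines)
{
	int i;

	args->reset_count = 0;
	args->reset_engine_count = 0;
	if (!is_admin)
		return;		/* non-admin callers only get zeros */

	args->reset_count = global;
	for (i = 0; i < num_engines; i++)
		args->reset_engine_count += per_engine[i];
}
```

This matches the commit message's framing: only a total engine reset count is reported for now, with per-engine reporting left as a possible extension.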

* [PATCH v5 11/18] drm/i915/selftests: reset engine self tests
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (9 preceding siblings ...)
  2017-03-25  1:30 ` [PATCH v5 10/18] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
@ 2017-03-25  1:30 ` Michel Thierry
  2017-03-25  9:50   ` Chris Wilson
  2017-03-25  1:30 ` [PATCH v5 12/18] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:30 UTC (permalink / raw)
  To: intel-gfx

Check that we can reset specific engines, and also check the fallback to
full gpu reset when the engine reset does not work.

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 144 +++++++++++++++++++++++
 1 file changed, 144 insertions(+)

diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index 76d5f78c1724..91b78e638494 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -323,6 +323,56 @@ static int igt_global_reset(void *arg)
 	return err;
 }
 
+static int igt_reset_engine(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	unsigned int reset_count, reset_engine_count;
+	int err = 0;
+
+	/* Check that we can issue a global GPU and engine reset */
+
+	if (!intel_has_gpu_reset(i915))
+		return 0;
+
+	if (!intel_has_reset_engine(i915))
+		return 0;
+
+	set_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+	mutex_lock(&i915->drm.struct_mutex);
+
+	for_each_engine(engine, i915, id) {
+		reset_count = i915_reset_count(&i915->gpu_error);
+		reset_engine_count = i915_reset_engine_count(&i915->gpu_error,
+							     engine);
+
+		set_bit(I915_RESET_HANDOFF, &i915->gpu_error.flags);
+		i915_reset(i915, intel_engine_flag(engine));
+		GEM_BUG_ON(test_bit(I915_RESET_HANDOFF, &i915->gpu_error.flags));
+
+		if (i915_reset_count(&i915->gpu_error) != reset_count) {
+			pr_err("Full GPU reset recorded! (engine reset expected)\n");
+			err = -EINVAL;
+			break;
+		}
+
+		if (i915_reset_engine_count(&i915->gpu_error, engine) ==
+		    reset_engine_count) {
+			pr_err("No %s engine reset recorded!\n", engine->name);
+			err = -EINVAL;
+			break;
+		}
+	}
+	mutex_unlock(&i915->drm.struct_mutex);
+
+	clear_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+	if (i915_terminally_wedged(&i915->gpu_error))
+		err = -EIO;
+
+	return err;
+}
+
 static u32 fake_hangcheck(struct drm_i915_gem_request *rq)
 {
 	u32 reset_count;
@@ -527,13 +577,107 @@ static int igt_reset_queue(void *arg)
 	return err;
 }
 
+static int igt_render_engine_reset_fallback(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct intel_engine_cs *engine = i915->engine[RCS];
+	struct hang h;
+	struct drm_i915_gem_request *rq;
+	unsigned int reset_count, reset_engine_count;
+	int err = 0;
+
+	/* Check that we can issue a global GPU and engine reset */
+
+	if (!intel_has_gpu_reset(i915))
+		return 0;
+
+	if (!intel_has_reset_engine(i915))
+		return 0;
+
+	set_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+	mutex_lock(&i915->drm.struct_mutex);
+
+	err = hang_init(&h, i915);
+	if (err)
+		goto unlock;
+
+	rq = hang_create_request(&h, engine, i915->kernel_context);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto fini;
+	}
+
+	i915_gem_request_get(rq);
+	__i915_add_request(rq, true);
+
+	/* make reset engine fail */
+	rq->fence.error = -EIO;
+
+	if (!wait_for_hang(&h, rq)) {
+		pr_err("Failed to start request %x\n", rq->fence.seqno);
+		err = -EIO;
+		goto fini;
+	}
+
+	reset_engine_count = i915_reset_engine_count(&i915->gpu_error, engine);
+	reset_count = fake_hangcheck(rq);
+
+	i915_reset(i915, intel_engine_flag(engine));
+	GEM_BUG_ON(test_bit(I915_RESET_HANDOFF, &i915->gpu_error.flags));
+
+	if (i915_reset_engine_count(&i915->gpu_error, engine) !=
+	    reset_engine_count) {
+		pr_err("render engine reset recorded! (full reset expected)\n");
+		err = -EINVAL;
+		goto fini;
+	}
+
+	if (i915_reset_count(&i915->gpu_error) == reset_count) {
+		pr_err("No full GPU reset recorded!\n");
+		err = -EINVAL;
+		goto fini;
+	}
+
+	/*
+	 * Because fence.error = -EIO caused the full reset to set the
+	 * wedged flag, do one more full reset to re-enable the hw.
+	 */
+	if (i915_terminally_wedged(&i915->gpu_error)) {
+		rq->fence.error = 0;
+
+		set_bit(I915_RESET_HANDOFF, &i915->gpu_error.flags);
+		i915_reset(i915, ALL_ENGINES);
+		GEM_BUG_ON(test_bit(I915_RESET_HANDOFF,
+				    &i915->gpu_error.flags));
+
+		if (i915_reset_count(&i915->gpu_error) == reset_count) {
+			pr_err("No full GPU reset recorded!\n");
+			err = -EINVAL;
+			goto fini;
+		}
+	}
+
+fini:
+	hang_fini(&h);
+unlock:
+	mutex_unlock(&i915->drm.struct_mutex);
+	clear_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+
+	if (i915_terminally_wedged(&i915->gpu_error))
+		return -EIO;
+
+	return err;
+}
+
 int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
 {
 	static const struct i915_subtest tests[] = {
 		SUBTEST(igt_hang_sanitycheck),
 		SUBTEST(igt_global_reset),
+		SUBTEST(igt_reset_engine),
 		SUBTEST(igt_wait_reset),
 		SUBTEST(igt_reset_queue),
+		SUBTEST(igt_render_engine_reset_fallback),
 	};
 
 	if (!intel_has_gpu_reset(i915))
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread
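At its core, the `igt_reset_engine` loop above boils down to one before/after counter check: the per-engine counter must advance while the full-reset counter must not. That check can be isolated into a tiny helper (a sketch; the selftest itself reports with `pr_err` rather than returning directly):

```c
#include <assert.h>

/* Returns 0 when exactly an engine reset was recorded between the two
 * counter samples; -22 stands in for -EINVAL. */
static int verify_engine_reset_only(unsigned int full_before,
				    unsigned int full_after,
				    unsigned int engine_before,
				    unsigned int engine_after)
{
	if (full_after != full_before)
		return -22;	/* unexpected full GPU reset recorded */
	if (engine_after == engine_before)
		return -22;	/* no engine reset recorded */
	return 0;
}
```

The fallback selftest inverts the expectation: after forcing the engine path to fail, the full-reset counter must advance and the per-engine counter must stay put.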

* [PATCH v5 12/18] drm/i915/guc: Provide register list to be saved/restored during engine reset
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (10 preceding siblings ...)
  2017-03-25  1:30 ` [PATCH v5 11/18] drm/i915/selftests: reset engine self tests Michel Thierry
@ 2017-03-25  1:30 ` Michel Thierry
  2017-04-14  0:16   ` Daniele Ceraolo Spurio
  2017-03-25  1:30 ` [PATCH v5 13/18] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:30 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

GuC expects the driver to provide a list of registers that are
saved/restored during an engine reset; the type of value saved for each
register is controlled by flags. We provide the minimal set of registers
that we want GuC to save and restore. This is not an issue for an engine
reset, since the driver re-initializes most of them after the reset, but
for a media reset (aka watchdog reset), which is completely internal to
GuC (including resubmission of the hung workload), the list is required;
otherwise GuC won't be able to schedule further workloads after a reset.
This is the minimal set of registers identified for things to work as
expected, but the list can be expanded if we see any new issues.

Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_guc_submission.c | 50 +++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 991e76e10f82..2445af96aa71 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1015,6 +1015,22 @@ static void guc_policies_init(struct guc_policies *policies)
 	policies->is_valid = 1;
 }
 
+/*
+ * It is highly unlikely that this macro will exceed the maximum register
+ * count, but even if it does it is not an error, so just emit a warning and
+ * continue. The only side effect of continuing is that some registers won't
+ * be added to the save/restore list.
+ */
+#define GUC_ADD_MMIO_REG_ADS(node, reg_addr, _flags)			\
+	do {								\
+		u32 __count = node->number_of_registers;		\
+		if (WARN_ON(__count >= GUC_REGSET_MAX_REGISTERS))	\
+			continue;					\
+		node->registers[__count].offset = reg_addr.reg;		\
+		node->registers[__count].flags = (_flags);		\
+		node->number_of_registers++;				\
+	} while (0)
+
 static int guc_ads_create(struct intel_guc *guc)
 {
 	struct drm_i915_private *dev_priv = guc_to_i915(guc);
@@ -1047,10 +1063,42 @@ static int guc_ads_create(struct intel_guc *guc)
 
 	/* MMIO reg state */
 	for_each_engine(engine, dev_priv, id) {
+		u32 flags;
+		struct guc_mmio_regset *eng_reg =
+			&blob->reg_state.engine_reg[engine->guc_id];
+
+		/*
+		 * Provide a list of registers to be saved/restored during gpu
+		 * reset. This is mainly required for Media reset (aka watchdog
+		 * timeout) which is completely under the control of GuC
+		 * (resubmission of hung workload is handled inside GuC).
+		 */
+		flags = GUC_REGSET_POWERCYCLE |
+			GUC_REGSET_ENGINERESET |
+			GUC_REGSET_SAVE_CURRENT_VALUE;
+
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_HEAD(engine->mmio_base),
+				     flags);
+
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_TAIL(engine->mmio_base),
+				     flags);
+
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_HWS_PGA(engine->mmio_base),
+				     flags);
+
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_MODE_GEN7(engine),
+				     (flags | GUC_REGSET_MASKED));
+
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_IMR(engine->mmio_base),
+				     flags);
+
+		DRM_DEBUG_DRIVER("%s register save/restore count: %u\n",
+				 engine->name, eng_reg->number_of_registers);
+
+		 /* Nothing to be white listed for now. */
 		blob->reg_state.white_list[engine->guc_id].mmio_start =
 			engine->mmio_base + GUC_MMIO_WHITE_LIST_START;
 
-		/* Nothing to be saved or restored for now. */
 		blob->reg_state.white_list[engine->guc_id].count = 0;
 	}
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 41+ messages in thread
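The `GUC_ADD_MMIO_REG_ADS` macro above implements a warn-and-skip policy: overflowing the register list is not fatal, the extra register is simply never saved/restored. The same policy as a plain function, with an illustrative capacity and return value in place of the kernel's `WARN_ON` and `continue`:

```c
#include <assert.h>
#include <stdint.h>

#define REGSET_MAX 4	/* illustrative stand-in for GUC_REGSET_MAX_REGISTERS */

struct mmio_reg {
	uint32_t offset;
	uint32_t flags;
};

struct mmio_regset {
	uint32_t count;
	struct mmio_reg regs[REGSET_MAX];
};

/* Append one register to the save/restore list. Returns 0 on success,
 * -1 when the list is full and the entry is dropped (the kernel macro
 * would WARN_ON and continue instead). */
static int regset_add(struct mmio_regset *set, uint32_t offset, uint32_t flags)
{
	if (set->count >= REGSET_MAX)
		return -1;
	set->regs[set->count].offset = offset;
	set->regs[set->count].flags = flags;
	set->count++;
	return 0;
}
```

The bounds check before the store is the whole point: without it, the sixth register in a five-slot array would corrupt adjacent state in the shared ADS page.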

* [PATCH v5 13/18] drm/i915/guc: Add support for reset engine using GuC commands
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (11 preceding siblings ...)
  2017-03-25  1:30 ` [PATCH v5 12/18] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
@ 2017-03-25  1:30 ` Michel Thierry
  2017-03-25  1:30 ` [PATCH v5 14/18] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:30 UTC (permalink / raw)
  To: intel-gfx

This patch adds per engine reset and recovery (TDR) support when GuC is
used to submit workloads to GPU.

When i915 submits directly to the ELSP, the driver manages hang
detection, recovery and resubmission. With GuC submission these tasks
are shared between the driver and GuC. i915 is still responsible for
detecting a hang, and when it does it only requests GuC to reset that
engine. GuC internally handles acquiring forcewake and idling the engine
before actually resetting it.

Once the reset is successful, i915 takes over again and handles resubmission.
The scheduler in i915 knows which requests are pending, so after resetting
an engine, the pending workloads/requests are resubmitted.
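As a rough illustration of this split, here is a user-space sketch of the
decision flow (purely illustrative: the struct and helper names below are
invented; the real code paths are intel_request_reset_engine() /
intel_gpu_reset() for execlists and i915_guc_request_engine_reset() for GuC):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the per-engine recovery split described above.
 * All names here are made up; this is not driver code.
 */
enum reset_result { RESET_OK, RESET_FALLBACK_FULL_GPU };

struct engine_model {
	bool guc_submission;	/* a GuC execbuf client exists */
	bool direct_reset_ok;	/* would the direct engine reset succeed? */
	bool guc_reset_ok;	/* would the GuC reset action succeed? */
};

enum reset_result reset_engine_model(const struct engine_model *e)
{
	if (!e->guc_submission) {
		/* i915 owns the sequence: idle the engine, then reset it */
		if (!e->direct_reset_ok)
			return RESET_FALLBACK_FULL_GPU;
	} else {
		/* GuC idles and resets; i915 only sends the reset action */
		if (!e->guc_reset_ok)
			return RESET_FALLBACK_FULL_GPU;
	}
	/* on success, i915 resubmits the pending requests itself */
	return RESET_OK;
}
```

Either way, a failed engine reset falls back to the full GPU reset path.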

Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            | 43 +++++++++++++++++---------
 drivers/gpu/drm/i915/i915_drv.h            |  1 +
 drivers/gpu/drm/i915/i915_guc_submission.c | 48 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_guc_fwif.h      |  6 ++++
 drivers/gpu/drm/i915/intel_lrc.c           |  5 ++--
 drivers/gpu/drm/i915/intel_uc.h            |  1 +
 drivers/gpu/drm/i915/intel_uncore.c        |  5 ----
 7 files changed, 88 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index a111b39bbc12..3da0e7146ff8 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1940,23 +1940,34 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	 */
 	i915_gem_reset_engine(engine);
 
-	/* forcing engine to idle */
-	ret = intel_request_reset_engine(engine);
-	if (ret) {
-		DRM_ERROR("Failed to disable %s\n", engine->name);
-		goto error;
-	}
+	if (!dev_priv->guc.execbuf_client) {
+		/* forcing engine to idle */
+		ret = intel_request_reset_engine(engine);
+		if (ret) {
+			DRM_ERROR("Failed to disable %s\n", engine->name);
+			goto error;
+		}
 
-	/* finally, reset engine */
-	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
-	if (ret) {
-		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
+		/* finally, reset engine */
+		ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+		if (ret) {
+			DRM_ERROR("Failed to reset %s, ret=%d\n",
+				  engine->name, ret);
+			intel_unrequest_reset_engine(engine);
+			goto error;
+		}
+
+		/* be sure the request reset bit gets cleared */
 		intel_unrequest_reset_engine(engine);
-		goto error;
-	}
 
-	/* be sure the request reset bit gets cleared */
-	intel_unrequest_reset_engine(engine);
+	} else {
+		ret = i915_guc_request_engine_reset(engine);
+		if (ret) {
+			DRM_ERROR("GuC failed to reset %s, ret=%d\n",
+				  engine->name, ret);
+			goto error;
+		}
+	}
 
 	/* i915_gem_reset_prepare revoked the fences */
 	i915_gem_restore_fences(dev_priv);
@@ -1967,6 +1978,10 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	if (ret)
 		goto error;
 
+	/* for guc too */
+	if (dev_priv->guc.execbuf_client)
+		i915_guc_submission_reenable_engine(engine);
+
 	error->reset_engine_count[engine->id]++;
 
 wakeup:
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index fbb4f200756a..d5c12ddd35b3 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3048,6 +3048,7 @@ extern void i915_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_request_reset_engine(struct intel_engine_cs *engine);
 extern void intel_unrequest_reset_engine(struct intel_engine_cs *engine);
+extern int i915_guc_request_engine_reset(struct intel_engine_cs *engine);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 2445af96aa71..fc21ec733f93 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1335,6 +1335,25 @@ void i915_guc_submission_disable(struct drm_i915_private *dev_priv)
 	guc->execbuf_client = NULL;
 }
 
+void i915_guc_submission_reenable_engine(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct intel_guc *guc = &dev_priv->guc;
+	struct i915_guc_client *client = guc->execbuf_client;
+	const int wqi_size = sizeof(struct guc_wq_item);
+	struct drm_i915_gem_request *rq;
+
+	GEM_BUG_ON(!client);
+	intel_guc_sample_forcewake(guc);
+
+	spin_lock_irq(&engine->timeline->lock);
+	list_for_each_entry(rq, &engine->timeline->requests, link) {
+		guc_client_update_wq_rsvd(client, wqi_size);
+		__i915_guc_submit(rq);
+	}
+	spin_unlock_irq(&engine->timeline->lock);
+}
+
 /**
  * intel_guc_suspend() - notify GuC entering suspend state
  * @dev_priv:	i915 device private
@@ -1386,3 +1405,32 @@ int intel_guc_resume(struct drm_i915_private *dev_priv)
 
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
+
+int i915_guc_request_engine_reset(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct intel_guc *guc = &dev_priv->guc;
+	struct i915_gem_context *ctx;
+	u32 data[7];
+
+	if (!i915.enable_guc_submission)
+		return 0;
+
+	ctx = dev_priv->kernel_context;
+
+	/*
+	 * The affected context report is populated by GuC and is provided
+	 * to the driver using the shared page. We request it but don't
+	 * use it, as the scheduler has all of these details.
+	 */
+	data[0] = INTEL_GUC_ACTION_REQUEST_ENGINE_RESET;
+	data[1] = engine->guc_id;
+	data[2] = INTEL_GUC_RESET_OPTION_REPORT_AFFECTED_CONTEXTS;
+	data[3] = 0;
+	data[4] = 0;
+	data[5] = guc->execbuf_client->stage_id;
+	/* first page is shared data with GuC */
+	data[6] = guc_ggtt_offset(ctx->engine[RCS].state);
+
+	return intel_guc_send(guc, data, ARRAY_SIZE(data));
+}
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index cb36cbf3818f..b627206b8f56 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -506,6 +506,7 @@ union guc_log_control {
 /* This Action will be programmed in C180 - SOFT_SCRATCH_O_REG */
 enum intel_guc_action {
 	INTEL_GUC_ACTION_DEFAULT = 0x0,
+	INTEL_GUC_ACTION_REQUEST_ENGINE_RESET = 0x3,
 	INTEL_GUC_ACTION_SAMPLE_FORCEWAKE = 0x6,
 	INTEL_GUC_ACTION_ALLOCATE_DOORBELL = 0x10,
 	INTEL_GUC_ACTION_DEALLOCATE_DOORBELL = 0x20,
@@ -519,6 +520,11 @@ enum intel_guc_action {
 	INTEL_GUC_ACTION_LIMIT
 };
 
+/* Reset engine options */
+enum action_engine_reset_options {
+	INTEL_GUC_RESET_OPTION_REPORT_AFFECTED_CONTEXTS = 0x10,
+};
+
 /*
  * The GuC sends its response to a command by overwriting the
  * command in SS0. The response is distinguishable from a command
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index dd0e9d587852..bc224a24ddad 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1187,14 +1187,15 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
 
 	/* After a GPU reset, we may have requests to replay */
 	clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
-	if (!i915.enable_guc_submission && !execlists_elsp_idle(engine)) {
+	if (!execlists_elsp_idle(engine)) {
 		DRM_DEBUG_DRIVER("Restarting %s from requests [0x%x, 0x%x]\n",
 				 engine->name,
 				 port_seqno(&engine->execlist_port[0]),
 				 port_seqno(&engine->execlist_port[1]));
 		engine->execlist_port[0].count = 0;
 		engine->execlist_port[1].count = 0;
-		execlists_submit_ports(engine);
+		if (!dev_priv->guc.execbuf_client)
+			execlists_submit_ports(engine);
 	}
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/intel_uc.h b/drivers/gpu/drm/i915/intel_uc.h
index 6cf2d14fa0dc..4fb93f682b9c 100644
--- a/drivers/gpu/drm/i915/intel_uc.h
+++ b/drivers/gpu/drm/i915/intel_uc.h
@@ -221,6 +221,7 @@ int i915_guc_wq_reserve(struct drm_i915_gem_request *rq);
 void i915_guc_wq_unreserve(struct drm_i915_gem_request *request);
 void i915_guc_submission_disable(struct drm_i915_private *dev_priv);
 void i915_guc_submission_fini(struct drm_i915_private *dev_priv);
+void i915_guc_submission_reenable_engine(struct intel_engine_cs *engine);
 struct i915_vma *intel_guc_allocate_vma(struct intel_guc *guc, u32 size);
 
 /* intel_guc_log.c */
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 9886d7bd11ba..533c86f41092 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1756,14 +1756,9 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
-/*
- * When GuC submission is enabled, GuC manages ELSP and can initiate the
- * engine reset too. For now, fall back to full GPU reset if it is enabled.
- */
 bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
 {
 	return (dev_priv->info.has_reset_engine &&
-		!dev_priv->guc.execbuf_client &&
 		i915.reset == 2);
 }
 
-- 
2.11.0


* [PATCH v5 14/18] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (12 preceding siblings ...)
  2017-03-25  1:30 ` [PATCH v5 13/18] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
@ 2017-03-25  1:30 ` Michel Thierry
  2017-03-25  9:15   ` Chris Wilson
  2017-04-17 21:28   ` Daniele Ceraolo Spurio
  2017-03-25  1:30 ` [PATCH v5 15/18] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
                   ` (4 subsequent siblings)
  18 siblings, 2 replies; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:30 UTC (permalink / raw)
  To: intel-gfx

For watchdog / media reset, the firmware must know the address of the shared
data page (the first page of the default context).

This information is passed in DWORD 9 of the GUC_CTL structure.
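A minimal user-space sketch of the parameter-block setup (the two constants
mirror intel_guc_fwif.h as modified by this patch; guc_params_init_model()
is a made-up stand-in for the real guc_params_init()):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* values taken from intel_guc_fwif.h */
#define GUC_CTL_MAX_DWORDS	14	/* SOFT_SCRATCH_COUNT - 2 */
#define GUC_CTL_SHARED_DATA	9	/* was GUC_CTL_RSRVD */

/*
 * Model only: zero the GUC_CTL dwords and place the GGTT offset of the
 * shared data page (first page of the default context) in DWORD 9.
 */
void guc_params_init_model(uint32_t params[GUC_CTL_MAX_DWORDS],
			   uint32_t shared_page_ggtt_offset)
{
	memset(params, 0, GUC_CTL_MAX_DWORDS * sizeof(*params));
	params[GUC_CTL_SHARED_DATA] = shared_page_ggtt_offset;
}
```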

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/intel_guc_fwif.h   | 2 +-
 drivers/gpu/drm/i915/intel_guc_loader.c | 8 ++++++++
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index b627206b8f56..5db3def5f74e 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -135,7 +135,7 @@
 #define   GUC_ADS_ADDR_SHIFT		11
 #define   GUC_ADS_ADDR_MASK		0xfffff800
 
-#define GUC_CTL_RSRVD			9
+#define GUC_CTL_SHARED_DATA		9
 
 #define GUC_CTL_MAX_DWORDS		(SOFT_SCRATCH_COUNT - 2) /* [1..14] */
 
diff --git a/drivers/gpu/drm/i915/intel_guc_loader.c b/drivers/gpu/drm/i915/intel_guc_loader.c
index 7d92321f8731..afa584864cb5 100644
--- a/drivers/gpu/drm/i915/intel_guc_loader.c
+++ b/drivers/gpu/drm/i915/intel_guc_loader.c
@@ -119,6 +119,7 @@ static void guc_params_init(struct drm_i915_private *dev_priv)
 {
 	struct intel_guc *guc = &dev_priv->guc;
 	u32 params[GUC_CTL_MAX_DWORDS];
+	struct i915_gem_context *ctx;
 	int i;
 
 	memset(&params, 0, sizeof(params));
@@ -167,6 +168,13 @@ static void guc_params_init(struct drm_i915_private *dev_priv)
 		params[GUC_CTL_FEATURE] &= ~GUC_CTL_DISABLE_SCHEDULER;
 	}
 
+	/*
+	 * For watchdog / media reset, GuC must know the address of the shared
+	 * data page, which is the first page of the default context.
+	 */
+	ctx = dev_priv->kernel_context;
+	params[GUC_CTL_SHARED_DATA] = i915_ggtt_offset(ctx->engine[RCS].state);
+
 	I915_WRITE(SOFT_SCRATCH(0), 0);
 
 	for (i = 0; i < GUC_CTL_MAX_DWORDS; i++)
-- 
2.11.0


* [PATCH v5 15/18] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (13 preceding siblings ...)
  2017-03-25  1:30 ` [PATCH v5 14/18] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
@ 2017-03-25  1:30 ` Michel Thierry
  2017-03-25  9:26   ` Chris Wilson
  2017-03-25  1:30 ` [PATCH v5 16/18] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:30 UTC (permalink / raw)
  To: intel-gfx

*** General ***

Watchdog timeout (or "media engine reset") is a feature that allows
userland applications to enable hang detection on individual batch buffers.
The detection mechanism itself is mostly handled by the hardware; to support
this form of hang detection the driver only needs to implement the interrupt
handling and emit the watchdog commands before and after the batch buffer
start instruction in the ring buffer.

The principle of the hang detection mechanism is as follows:

1. Once the decision has been made to enable watchdog timeout for a
particular batch buffer, the driver, while emitting the batch buffer start
instruction into the ring buffer, also emits a watchdog timer start
instruction before it and a watchdog timer cancellation instruction after
it.

2. Once GPU execution reaches the watchdog timer start instruction, the
hardware starts the watchdog counter. The counter keeps counting until it
either reaches a previously configured threshold value or the timer
cancellation instruction is executed.

2a. If the counter reaches the threshold value, the hardware fires a
watchdog interrupt that is picked up by the watchdog interrupt handler.
This means that a hang has been detected and the driver needs to deal with
it the same way it would deal with an engine hang detected by the periodic
hang checker. The only difference between the two is that we have already
blamed the active request (to ensure an engine reset).

2b. If the batch buffer completes and execution reaches the watchdog
cancellation instruction before the counter reaches its threshold value,
the watchdog is cancelled and nothing more comes of it. No hang is
detected.
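The counter behaviour in steps 1-2b can be modelled with a small user-space
sketch (a toy state machine only, not the hardware interface; every name
below is invented):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* toy model of the hardware watchdog sequence from steps 1-2b */
struct watchdog {
	bool running;
	uint32_t count;
	uint32_t threshold;
	bool fired;		/* interrupt raised */
};

void wd_start(struct watchdog *wd, uint32_t threshold)
{
	wd->running = true;	/* step 1: start instruction executed */
	wd->count = 0;
	wd->threshold = threshold;
	wd->fired = false;
}

void wd_cancel(struct watchdog *wd)
{
	wd->running = false;	/* step 2b: batch finished in time */
}

void wd_tick(struct watchdog *wd)	/* one timer-clock step */
{
	if (wd->running && ++wd->count >= wd->threshold) {
		wd->fired = true;	/* step 2a: hang detected */
		wd->running = false;
	}
}
```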

Note about future interaction with preemption: preemption could happen
in a command sequence before the watchdog counter is disabled, resulting
in the watchdog being triggered after preemption. The driver will need to
explicitly disable the watchdog counter as part of the preemption
sequence.

*** This patch introduces: ***

1. IRQ handler code for watchdog timeout allowing direct hang recovery
based on hardware-driven hang detection, which then integrates directly
with the hang recovery path. This is independent of having per-engine reset
or just full gpu reset.

2. Watchdog specific register information.

Currently the render engine and all available media engines support
watchdog timeout (VECS only from Gen9). The specifications allude to the
BCS engine also being supported, but that is not enabled by this commit.

Note that the value to stop the counter is different between render and
non-render engines.

v2: Move the irq handler to a tasklet; re-arm the watchdog a second time
to check against false positives.
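The v2 false-positive filter can be sketched as a user-space model of the
logic in gen8_watchdog_irq_handler() (the struct below is invented; in the
driver the state lives in struct intel_engine_hangcheck):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* model of the double-expiry check: same seqno twice means no progress */
struct engine_wd_state {
	uint32_t last_watchdog_seqno;
	bool hang_declared;
	bool rearmed;
};

void watchdog_expired(struct engine_wd_state *s, uint32_t current_seqno)
{
	if (s->last_watchdog_seqno == current_seqno) {
		/* no progress since the previous expiry: treat as a hang */
		s->hang_declared = true;
		s->rearmed = false;
	} else {
		/* progress was made; remember the seqno and re-arm */
		s->last_watchdog_seqno = current_seqno;
		s->rearmed = true;
	}
}
```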

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  4 +++
 drivers/gpu/drm/i915/i915_irq.c         | 13 ++++++-
 drivers/gpu/drm/i915/i915_reg.h         |  6 ++++
 drivers/gpu/drm/i915/intel_hangcheck.c  | 13 ++++---
 drivers/gpu/drm/i915/intel_lrc.c        | 64 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |  4 +++
 6 files changed, 99 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d5c12ddd35b3..b43c37a911bb 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1628,6 +1628,9 @@ struct i915_gpu_error {
 	 * inspect the bit and do the reset directly, otherwise the worker
 	 * waits for the struct_mutex.
 	 *
+	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
+	 * I915_RESET_WATCHDOG to report the hang detection cause accurately.
+	 *
 	 * #I915_WEDGED - If reset fails and we can no longer use the GPU,
 	 * we set the #I915_WEDGED bit. Prior to command submission, e.g.
 	 * i915_gem_request_alloc(), this bit is checked and the sequence
@@ -1636,6 +1639,7 @@ struct i915_gpu_error {
 	unsigned long flags;
 #define I915_RESET_BACKOFF	0
 #define I915_RESET_HANDOFF	1
+#define I915_RESET_WATCHDOG	2
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
 	/* if available, engine-specific reset is tried before full gpu reset */
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 87e76ef589b1..d484cbc561eb 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1369,6 +1369,10 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
 
 	if (tasklet)
 		tasklet_hi_schedule(&engine->irq_tasklet);
+
+	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
+		tasklet_hi_schedule(&engine->watchdog_tasklet);
+	}
 }
 
 static irqreturn_t gen8_gt_irq_ack(struct drm_i915_private *dev_priv,
@@ -3442,12 +3446,15 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	uint32_t gt_interrupts[] = {
 		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT,
 		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
-			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
+			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
 		0,
 		GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT
@@ -3456,6 +3463,10 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	if (HAS_L3_DPF(dev_priv))
 		gt_interrupts[0] |= GT_RENDER_L3_PARITY_ERROR_INTERRUPT;
 
+	/* VECS watchdog is only available in skl+ */
+	if (INTEL_GEN(dev_priv) >= 9)
+		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
+
 	dev_priv->pm_ier = 0x0;
 	dev_priv->pm_imr = ~dev_priv->pm_ier;
 	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 04c8f69fcc62..89f5191b2635 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1900,6 +1900,11 @@ enum skl_disp_power_wells {
 #define RING_START(base)	_MMIO((base)+0x38)
 #define RING_CTL(base)		_MMIO((base)+0x3c)
 #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
+#define RING_CNTR(base)        _MMIO((base) + 0x178)
+#define   GEN8_WATCHDOG_ENABLE		0
+#define   GEN8_RCS_WATCHDOG_DISABLE	1
+#define   GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF
+#define RING_THRESH(base)      _MMIO((base) + 0x17C)
 #define RING_SYNC_0(base)	_MMIO((base)+0x40)
 #define RING_SYNC_1(base)	_MMIO((base)+0x44)
 #define RING_SYNC_2(base)	_MMIO((base)+0x48)
@@ -2378,6 +2383,7 @@ enum skl_disp_power_wells {
 #define GT_BSD_USER_INTERRUPT			(1 << 12)
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
 #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
+#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
 #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
 #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index dce742243ba6..0e9272c97096 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -388,7 +388,8 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
 
 static void hangcheck_declare_hang(struct drm_i915_private *i915,
 				   unsigned int hung,
-				   unsigned int stuck)
+				   unsigned int stuck,
+				   unsigned int watchdog)
 {
 	struct intel_engine_cs *engine;
 	char msg[80];
@@ -401,7 +402,8 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
 	if (stuck != hung)
 		hung &= ~stuck;
 	len = scnprintf(msg, sizeof(msg),
-			"%s on ", stuck == hung ? "No progress" : "Hang");
+			"%s on ", watchdog ? "Watchdog timeout" :
+				  stuck == hung ? "No progress" : "Hang");
 	for_each_engine_masked(engine, i915, hung, tmp)
 		len += scnprintf(msg + len, sizeof(msg) - len,
 				 "%s, ", engine->name);
@@ -425,7 +427,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			     gpu_error.hangcheck_work.work);
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
-	unsigned int hung = 0, stuck = 0;
+	unsigned int hung = 0, stuck = 0, watchdog = 0;
 	int busy_count = 0;
 
 	if (!i915.enable_hangcheck)
@@ -437,6 +439,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	if (i915_terminally_wedged(&dev_priv->gpu_error))
 		return;
 
+	if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags))
+		watchdog = 1;
+
 	/* As enabling the GPU requires fairly extensive mmio access,
 	 * periodically arm the mmio checker to see if we are triggering
 	 * any invalid access.
@@ -463,7 +468,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	}
 
 	if (hung)
-		hangcheck_declare_hang(dev_priv, hung, stuck);
+		hangcheck_declare_hang(dev_priv, hung, stuck, watchdog);
 
 	/* Reset timer in case GPU hangs without another request being added */
 	if (busy_count)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index bc224a24ddad..73f8fbdcf1fb 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1470,6 +1470,48 @@ static int gen8_emit_flush_render(struct drm_i915_gem_request *request,
 	return 0;
 }
 
+#define GEN8_WATCHDOG_1000US 0x2ee0 /* XXX: temp, replace with helper function */
+static void gen8_watchdog_irq_handler(unsigned long data)
+{
+	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
+	struct drm_i915_private *dev_priv = engine->i915;
+	u32 watchdog_disable, current_seqno;
+
+	intel_uncore_forcewake_get(dev_priv, engine->fw_domains);
+
+	if (engine->id == RCS)
+		watchdog_disable = GEN8_RCS_WATCHDOG_DISABLE;
+	else
+		watchdog_disable = GEN8_XCS_WATCHDOG_DISABLE;
+
+	/* Stop the counter to prevent further timeout interrupts */
+	I915_WRITE_FW(RING_CNTR(engine->mmio_base), watchdog_disable);
+
+	/* false-positive, request completed after the timer expired */
+	if (intel_engine_is_idle(engine))
+		goto fw_put;
+
+	current_seqno = intel_engine_get_seqno(engine);
+	if (engine->hangcheck.last_watchdog_seqno == current_seqno) {
+		/* Make sure the active request will be marked as guilty */
+		engine->hangcheck.stalled = true;
+		engine->hangcheck.seqno = intel_engine_get_seqno(engine);
+
+		/* And try to run the hangcheck_work as soon as possible */
+		set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
+		queue_delayed_work(system_long_wq,
+				   &dev_priv->gpu_error.hangcheck_work, 0);
+	} else {
+		engine->hangcheck.last_watchdog_seqno = current_seqno;
+		/* Re-start the counter, if really hung, it will expire again */
+		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
+		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
+	}
+
+fw_put:
+	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
+}
+
 /*
  * Reserve space for 2 NOOPs at the end of each request to be
  * used as a workaround for not being allowed to do lite
@@ -1563,6 +1605,9 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
 	if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->irq_tasklet.state)))
 		tasklet_kill(&engine->irq_tasklet);
 
+	if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->watchdog_tasklet.state)))
+		tasklet_kill(&engine->watchdog_tasklet);
+
 	dev_priv = engine->i915;
 
 	if (engine->buffer) {
@@ -1621,6 +1666,22 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
 	unsigned shift = engine->irq_shift;
 	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
 	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
+
+	switch (engine->id) {
+	default:
+		/* BCS engine does not support hw watchdog */
+		break;
+	case RCS:
+	case VCS:
+	case VCS2:
+		engine->irq_keep_mask |= (GT_GEN8_WATCHDOG_INTERRUPT << shift);
+		break;
+	case VECS:
+		if (INTEL_GEN(engine->i915) >= 9)
+			engine->irq_keep_mask |=
+				(GT_GEN8_WATCHDOG_INTERRUPT << shift);
+		break;
+	}
 }
 
 static int
@@ -1669,6 +1730,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
 	tasklet_init(&engine->irq_tasklet,
 		     intel_lrc_irq_handler, (unsigned long)engine);
 
+	tasklet_init(&engine->watchdog_tasklet,
+		     gen8_watchdog_irq_handler, (unsigned long)engine);
+
 	logical_ring_default_vfuncs(engine);
 	logical_ring_default_irqs(engine);
 }
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index e8faf2c34c97..fffe69f5aed2 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -128,6 +128,7 @@ struct intel_instdone {
 struct intel_engine_hangcheck {
 	u64 acthd;
 	u32 seqno;
+	u32 last_watchdog_seqno;
 	enum intel_engine_hangcheck_action action;
 	unsigned long action_timestamp;
 	int deadlock;
@@ -405,6 +406,9 @@ struct intel_engine_cs {
 
 	struct intel_engine_hangcheck hangcheck;
 
+	/* watchdog_tasklet: stop counter and re-schedule hangcheck_work asap */
+	struct tasklet_struct watchdog_tasklet;
+
 	bool needs_cmd_parser;
 
 	/*
-- 
2.11.0


* [PATCH v5 16/18] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (14 preceding siblings ...)
  2017-03-25  1:30 ` [PATCH v5 15/18] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
@ 2017-03-25  1:30 ` Michel Thierry
  2017-03-25  9:46   ` Chris Wilson
  2017-03-25  1:30 ` [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
                   ` (2 subsequent siblings)
  18 siblings, 1 reply; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:30 UTC (permalink / raw)
  To: intel-gfx

Emit the required commands into the ring buffer for starting and
stopping the watchdog timer before/after batch buffer start during
batch buffer submission.

v2: Support a watchdog threshold per context engine, merge the LRI
commands, and move watchdog command emission to emit_bb_start. Request
space for the combined start_watchdog, bb_start and stop_watchdog to avoid
any error after emitting bb_start.
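The combined dword budget (4 dwords for bb_start alone; start_watchdog adds
6 and stop_watchdog adds 4) can be sketched as a tiny stand-alone helper
(illustrative only, not the driver function):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the ring-space request made by gen8_emit_bb_start():
 * bb_start alone needs 4 dwords; with the watchdog enabled we also
 * reserve 6 dwords to start the counter and 4 to stop it, so a failure
 * in intel_ring_begin() cannot occur after bb_start is emitted.
 */
unsigned int bb_start_dwords(bool watchdog_enabled)
{
	unsigned int n = 4;	/* MI_BATCH_BUFFER_START + addr + NOOP */

	if (watchdog_enabled)
		n += 6 + 4;	/* start_watchdog + stop_watchdog */
	return n;
}
```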

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.h |  4 ++
 drivers/gpu/drm/i915/intel_lrc.c        | 86 ++++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  4 ++
 3 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 4af2ab94558b..88700bdbb4e1 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -150,6 +150,10 @@ struct i915_gem_context {
 		u32 *lrc_reg_state;
 		u64 lrc_desc;
 		int pin_count;
+		/** watchdog_threshold: hw watchdog threshold value,
+		 * in clock counts
+		 */
+		u32 watchdog_threshold;
 		bool initialised;
 	} engine[I915_NUM_ENGINES];
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 73f8fbdcf1fb..2736f642dc76 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1319,6 +1319,8 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
 			      const unsigned int flags)
 {
 	u32 *cs;
+	u32 num_dwords;
+	bool watchdog_running = false;
 	int ret;
 
 	/* Don't rely in hw updating PDPs, specially in lite-restore.
@@ -1338,10 +1340,29 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
 		req->ctx->ppgtt->pd_dirty_rings &= ~intel_engine_flag(req->engine);
 	}
 
-	cs = intel_ring_begin(req, 4);
+	/* bb_start only */
+	num_dwords = 4;
+
+	/* check if watchdog will be required */
+	if (req->ctx->engine[req->engine->id].watchdog_threshold != 0) {
+		if (!req->engine->emit_start_watchdog ||
+		    !req->engine->emit_stop_watchdog)
+			return -EINVAL;
+
+		/* start_watchdog (6) + bb_start (4) + stop_watchdog (4) */
+		num_dwords += 10;
+		watchdog_running = true;
+	}
+
+	cs = intel_ring_begin(req, num_dwords);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
+	if (watchdog_running) {
+		/* Start watchdog timer */
+		cs = req->engine->emit_start_watchdog(req, cs);
+	}
+
 	/* FIXME(BDW): Address space and security selectors. */
 	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
 		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8)) |
@@ -1349,8 +1370,13 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
 	*cs++ = lower_32_bits(offset);
 	*cs++ = upper_32_bits(offset);
 	*cs++ = MI_NOOP;
-	intel_ring_advance(req, cs);
 
+	if (watchdog_running) {
+		/* Cancel watchdog timer */
+		cs = req->engine->emit_stop_watchdog(req, cs);
+	}
+
+	intel_ring_advance(req, cs);
 	return 0;
 }
 
@@ -1512,6 +1538,54 @@ static void gen8_watchdog_irq_handler(unsigned long data)
 	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
 }
 
+static u32 *gen8_emit_start_watchdog(struct drm_i915_gem_request *req, u32 *cs)
+{
+	struct intel_engine_cs *engine = req->engine;
+	struct i915_gem_context *ctx = req->ctx;
+	struct intel_context *ce = &ctx->engine[engine->id];
+
+	/* XXX: no watchdog support in BCS engine */
+	GEM_BUG_ON(engine->id == BCS);
+
+	/*
+	 * watchdog register must never be programmed to zero. This would
+	 * cause the watchdog counter to exceed and not allow the engine to
+	 * go into IDLE state
+	 */
+	GEM_BUG_ON(ce->watchdog_threshold == 0);
+
+	/* Set counter period */
+	*cs++ = MI_LOAD_REGISTER_IMM(2);
+	*cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base));
+	*cs++ = ce->watchdog_threshold;
+	/* Start counter */
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = GEN8_WATCHDOG_ENABLE;
+	*cs++ = MI_NOOP;
+
+	return cs;
+}
+
+static u32 *gen8_emit_stop_watchdog(struct drm_i915_gem_request *req, u32 *cs)
+{
+	struct intel_engine_cs *engine = req->engine;
+
+	/* XXX: no watchdog support in BCS engine */
+	GEM_BUG_ON(engine->id == BCS);
+
+	*cs++ = MI_LOAD_REGISTER_IMM(2);
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+
+	if (engine->id == RCS)
+		*cs++ = GEN8_RCS_WATCHDOG_DISABLE;
+	else
+		*cs++ = GEN8_XCS_WATCHDOG_DISABLE;
+
+	*cs++ = MI_NOOP;
+
+	return cs;
+}
+
 /*
  * Reserve space for 2 NOOPs at the end of each request to be
  * used as a workaround for not being allowed to do lite
@@ -1780,6 +1854,8 @@ int logical_render_ring_init(struct intel_engine_cs *engine)
 	engine->emit_flush = gen8_emit_flush_render;
 	engine->emit_breadcrumb = gen8_emit_breadcrumb_render;
 	engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_render_sz;
+	engine->emit_start_watchdog = gen8_emit_start_watchdog;
+	engine->emit_stop_watchdog = gen8_emit_stop_watchdog;
 
 	ret = intel_engine_create_scratch(engine, PAGE_SIZE);
 	if (ret)
@@ -1803,6 +1879,12 @@ int logical_xcs_ring_init(struct intel_engine_cs *engine)
 {
 	logical_ring_setup(engine);
 
+	/* BCS engine does not have a watchdog-expired irq */
+	if (engine->id != BCS) {
+		engine->emit_start_watchdog = gen8_emit_start_watchdog;
+		engine->emit_stop_watchdog = gen8_emit_stop_watchdog;
+	}
+
 	return logical_ring_init(engine);
 }
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index fffe69f5aed2..34ca0c5ac5b9 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -275,6 +275,10 @@ struct intel_engine_cs {
 
 	int		(*emit_flush)(struct drm_i915_gem_request *request,
 				      u32 mode);
+	u32 *		(*emit_start_watchdog)(struct drm_i915_gem_request *req,
+					       u32 *cs);
+	u32 *		(*emit_stop_watchdog)(struct drm_i915_gem_request *req,
+					      u32 *cs);
 #define EMIT_INVALIDATE	BIT(0)
 #define EMIT_FLUSH	BIT(1)
 #define EMIT_BARRIER	(EMIT_INVALIDATE | EMIT_FLUSH)
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (15 preceding siblings ...)
  2017-03-25  1:30 ` [PATCH v5 16/18] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
@ 2017-03-25  1:30 ` Michel Thierry
  2017-03-25  9:38   ` Chris Wilson
  2017-04-14 16:05   ` Daniele Ceraolo Spurio
  2017-03-25  1:30 ` [PATCH v5 18/18] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
  2017-03-25  1:50 ` ✗ Fi.CI.BAT: failure for Gen8+ engine-reset Patchwork
  18 siblings, 2 replies; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:30 UTC (permalink / raw)
  To: intel-gfx

Final enablement patch for GPU hang detection using watchdog timeout.
Using the gem_context_setparam ioctl, users can specify the desired
timeout value in microseconds, and the driver will do the conversion to
'timestamps'.

The recommended default watchdog threshold for video engines is 60000 us,
since this has been _empirically determined_ to be a good compromise
between low-latency requirements and a low rate of false positives. The
default register value is ~106000 us and the theoretical maximum value
(all 1s) is 353 seconds.

v2: Fixed get api to return values in microseconds. Threshold updated to
be per context engine. Check for u32 overflow. Capture ctx threshold
value in error state.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  1 +
 drivers/gpu/drm/i915/i915_gem_context.c | 78 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_context.h | 20 +++++++++
 drivers/gpu/drm/i915/i915_gpu_error.c   | 11 +++--
 drivers/gpu/drm/i915/intel_lrc.c        |  2 +-
 include/uapi/drm/i915_drm.h             |  1 +
 6 files changed, 108 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index b43c37a911bb..1741584d858f 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1039,6 +1039,7 @@ struct i915_gpu_state {
 			int ban_score;
 			int active;
 			int guilty;
+			int watchdog_threshold;
 		} context;
 
 		struct drm_i915_error_object {
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index edbed85a1c88..f5c126c0c681 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -422,6 +422,78 @@ i915_gem_context_create_gvt(struct drm_device *dev)
 	return ctx;
 }
 
+/* Return the timer count threshold in microseconds. */
+int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *dev_priv = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	u32 threshold_in_us[I915_NUM_ENGINES];
+
+	if (!dev_priv->engine[VCS]->emit_start_watchdog)
+		return -ENODEV;
+
+	for_each_engine(engine, dev_priv, id) {
+		struct intel_context *ce = &ctx->engine[id];
+
+		threshold_in_us[id] = watchdog_to_us(ce->watchdog_threshold);
+	}
+
+	mutex_unlock(&dev_priv->drm.struct_mutex);
+	if (__copy_to_user(u64_to_user_ptr(args->value),
+			   &threshold_in_us,
+			   sizeof(threshold_in_us))) {
+		mutex_lock(&dev_priv->drm.struct_mutex);
+		return -EFAULT;
+	}
+	mutex_lock(&dev_priv->drm.struct_mutex);
+	args->size = sizeof(threshold_in_us);
+
+	return 0;
+}
+
+/*
+ * Based on time out value in microseconds (us) calculate
+ * timer count thresholds needed based on core frequency.
+ * Watchdog can be disabled by setting it to 0.
+ */
+int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *dev_priv = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	u32 threshold_in_us[I915_NUM_ENGINES];
+
+	if (!dev_priv->engine[VCS]->emit_start_watchdog)
+		return -ENODEV;
+	else if (args->size < sizeof(threshold_in_us))
+		return -EINVAL;
+
+	mutex_unlock(&dev_priv->drm.struct_mutex);
+	if (copy_from_user(&threshold_in_us,
+			   u64_to_user_ptr(args->value),
+			   sizeof(threshold_in_us))) {
+		mutex_lock(&dev_priv->drm.struct_mutex);
+		return -EFAULT;
+	}
+	mutex_lock(&dev_priv->drm.struct_mutex);
+
+	/* not supported in blitter engine */
+	if (threshold_in_us[BCS] != 0)
+		return -EINVAL;
+
+	for_each_engine(engine, dev_priv, id) {
+		struct intel_context *ce = &ctx->engine[id];
+
+		ce->watchdog_threshold =
+			watchdog_to_clock_counts((u64)threshold_in_us[id]);
+	}
+
+	return 0;
+}
+
 int i915_gem_context_init(struct drm_i915_private *dev_priv)
 {
 	struct i915_gem_context *ctx;
@@ -1061,6 +1133,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_BANNABLE:
 		args->value = i915_gem_context_is_bannable(ctx);
 		break;
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_get_watchdog(ctx, args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -1118,6 +1193,9 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		else
 			i915_gem_context_clear_bannable(ctx);
 		break;
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_set_watchdog(ctx, args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 88700bdbb4e1..6867b1fead8b 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -250,6 +250,26 @@ static inline bool i915_gem_context_is_kernel(struct i915_gem_context *ctx)
 	return !ctx->file_priv;
 }
 
+/*
+ * Timestamp timer resolution = 0.080 uSec,
+ * or 12500000 counts per second, or ~12 counts per microsecond.
+ */
+#define TIMESTAMP_CNTS_PER_USEC 12
+static inline u32 watchdog_to_us(u32 value_in_clock_counts)
+{
+	return (value_in_clock_counts) / (TIMESTAMP_CNTS_PER_USEC);
+}
+
+static inline u32 watchdog_to_clock_counts(u64 value_in_us)
+{
+	u64 threshold = (value_in_us) * (TIMESTAMP_CNTS_PER_USEC);
+
+	if (GEM_WARN_ON(overflows_type(threshold, u32)))
+		return 0;
+
+	return threshold;
+}
+
 /* i915_gem_context.c */
 int __must_check i915_gem_context_init(struct drm_i915_private *dev_priv);
 void i915_gem_context_lost(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 5d015bcc7484..8acb83778030 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -388,9 +388,10 @@ static void error_print_context(struct drm_i915_error_state_buf *m,
 				const char *header,
 				const struct drm_i915_error_context *ctx)
 {
-	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, ban score %d guilty %d active %d\n",
+	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, ban score %d guilty %d active %d, watchdog %dus\n",
 		   header, ctx->comm, ctx->pid, ctx->handle, ctx->hw_id,
-		   ctx->ban_score, ctx->guilty, ctx->active);
+		   ctx->ban_score, ctx->guilty, ctx->active,
+		   watchdog_to_us(ctx->watchdog_threshold));
 }
 
 static void error_print_engine(struct drm_i915_error_state_buf *m,
@@ -1336,7 +1337,8 @@ static void error_record_engine_execlists(struct intel_engine_cs *engine,
 }
 
 static void record_context(struct drm_i915_error_context *e,
-			   struct i915_gem_context *ctx)
+			   struct i915_gem_context *ctx,
+			   u32 engine_id)
 {
 	if (ctx->pid) {
 		struct task_struct *task;
@@ -1355,6 +1357,7 @@ static void record_context(struct drm_i915_error_context *e,
 	e->ban_score = ctx->ban_score;
 	e->guilty = ctx->guilty_count;
 	e->active = ctx->active_count;
+	e->watchdog_threshold =	ctx->engine[engine_id].watchdog_threshold;
 }
 
 static void i915_gem_record_rings(struct drm_i915_private *dev_priv,
@@ -1389,7 +1392,7 @@ static void i915_gem_record_rings(struct drm_i915_private *dev_priv,
 			ee->vm = request->ctx->ppgtt ?
 				&request->ctx->ppgtt->base : &ggtt->base;
 
-			record_context(&ee->context, request->ctx);
+			record_context(&ee->context, request->ctx, engine->id);
 
 			/* We need to copy these to an anonymous buffer
 			 * as the simplest method to avoid being overwritten
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 2736f642dc76..3f2b57a22338 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1496,7 +1496,7 @@ static int gen8_emit_flush_render(struct drm_i915_gem_request *request,
 	return 0;
 }
 
-#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
+#define GEN8_WATCHDOG_1000US watchdog_to_clock_counts(1000)
 static void gen8_watchdog_irq_handler(unsigned long data)
 {
 	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index f083931a7809..448c9c0faa69 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1293,6 +1293,7 @@ struct drm_i915_gem_context_param {
 #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
 #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
 #define I915_CONTEXT_PARAM_BANNABLE	0x5
+#define I915_CONTEXT_PARAM_WATCHDOG	0x6
 	__u64 value;
 };
 
-- 
2.11.0


* [PATCH v5 18/18] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (16 preceding siblings ...)
  2017-03-25  1:30 ` [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
@ 2017-03-25  1:30 ` Michel Thierry
  2017-03-25  9:12   ` Chris Wilson
  2017-03-25  1:50 ` ✗ Fi.CI.BAT: failure for Gen8+ engine-reset Patchwork
  18 siblings, 1 reply; 41+ messages in thread
From: Michel Thierry @ 2017-03-25  1:30 UTC (permalink / raw)
  To: intel-gfx

From firmware v8.8, GuC provides the count of media engine resets
(watchdog timeout). This information is available in the GuC shared
context data struct, which resides in the first page of the default
(kernel) lrc context.

Since GuC-handled engine resets are transparent to the kernel and
userspace, provide a simple debugfs entry to see the number of times a
media reset has happened.

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c   | 28 ++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_guc_fwif.h | 18 ++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 8db850541e03..f40a2c84b423 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1400,6 +1400,32 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 	return 0;
 }
 
+static u32 i915_watchdog_reset_count(struct drm_i915_private *dev_priv)
+{
+	struct drm_device *dev = &dev_priv->drm;
+	struct i915_gem_context *ctx;
+	struct page *page;
+	struct guc_shared_ctx_data *guc_shared_data;
+	u32 guc_media_reset_count;
+
+	if (!i915.enable_guc_submission)
+		return 0;
+
+	if (mutex_lock_interruptible(&dev->struct_mutex))
+		return 0;
+
+	ctx = dev_priv->kernel_context;
+	page = i915_gem_object_get_dirty_page(ctx->engine[RCS].state->obj,
+					      LRC_GUCSHR_PN);
+	guc_shared_data = kmap_atomic(page);
+	guc_media_reset_count = guc_shared_data->media_reset_count;
+	kunmap_atomic(guc_shared_data);
+
+	mutex_unlock(&dev->struct_mutex);
+
+	return guc_media_reset_count;
+}
+
 static int i915_reset_info(struct seq_file *m, void *unused)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
@@ -1408,6 +1434,8 @@ static int i915_reset_info(struct seq_file *m, void *unused)
 	enum intel_engine_id id;
 
 	seq_printf(m, "full gpu reset = %u\n", i915_reset_count(error));
+	seq_printf(m, "GuC watchdog/media reset = %u\n",
+		   i915_watchdog_reset_count(dev_priv));
 
 	for_each_engine(engine, dev_priv, id) {
 		seq_printf(m, "%s = %u\n", engine->name,
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index 5db3def5f74e..d9dc844fcce0 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -503,6 +503,24 @@ union guc_log_control {
 	u32 value;
 } __packed;
 
+/* GuC Shared Context Data Struct */
+struct guc_shared_ctx_data {
+	u32 addr_of_last_preempted_data_low;
+	u32 addr_of_last_preempted_data_high;
+	u32 addr_of_last_preempted_data_high_tmp;
+	u32 padding;
+	u32 is_mapped_to_proxy;
+	u32 proxy_ctx_id;
+	u32 engine_reset_ctx_id;
+	u32 media_reset_count;
+	u32 reserved[8];
+	u32 uk_last_ctx_switch_reason;
+	u32 was_reset;
+	u32 lrca_gpu_addr;
+	u32 execlist_ctx;
+	u32 reserved1[32];
+} __packed;
+
 /* This Action will be programmed in C180 - SOFT_SCRATCH_O_REG */
 enum intel_guc_action {
 	INTEL_GUC_ACTION_DEFAULT = 0x0,
-- 
2.11.0


* ✗ Fi.CI.BAT: failure for Gen8+ engine-reset
  2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
                   ` (17 preceding siblings ...)
  2017-03-25  1:30 ` [PATCH v5 18/18] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
@ 2017-03-25  1:50 ` Patchwork
  18 siblings, 0 replies; 41+ messages in thread
From: Patchwork @ 2017-03-25  1:50 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

== Series Details ==

Series: Gen8+ engine-reset
URL   : https://patchwork.freedesktop.org/series/21868/
State : failure

== Summary ==

Series 21868v1 Gen8+ engine-reset
https://patchwork.freedesktop.org/api/1.0/series/21868/revisions/1/mbox/

Test pm_rpm:
        Subgroup basic-rte:
                pass       -> DMESG-WARN (fi-skl-6770hq)
Test pm_rps:
        Subgroup basic-api:
                pass       -> INCOMPLETE (fi-skl-6770hq)

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time: 471s
fi-bdw-gvtdvm    total:278  pass:256  dwarn:8   dfail:0   fail:0   skip:14  time: 469s
fi-bsw-n3050     total:278  pass:239  dwarn:0   dfail:0   fail:0   skip:39  time: 585s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time: 537s
fi-bxt-t5700     total:278  pass:258  dwarn:0   dfail:0   fail:0   skip:20  time: 574s
fi-byt-j1900     total:278  pass:251  dwarn:0   dfail:0   fail:0   skip:27  time: 505s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time: 439s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time: 437s
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50  time: 444s
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time: 513s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time: 491s
fi-kbl-7500u     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time: 481s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time: 486s
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17  time: 600s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time: 494s
fi-skl-6770hq    total:243  pass:232  dwarn:1   dfail:0   fail:0   skip:9   time: 0s
fi-skl-gvtdvm    total:278  pass:265  dwarn:0   dfail:0   fail:0   skip:13  time: 469s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time: 554s
fi-snb-2600      total:278  pass:249  dwarn:0   dfail:0   fail:0   skip:29  time: 424s
fi-byt-n2820 failed to collect. IGT log at Patchwork_4300/fi-byt-n2820/igt.log

102397b5dbb1f68504739adefde2c28a5aba95b9 drm-tip: 2017y-03m-25d-00h-24m-58s UTC integration manifest
6dda863 drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs
a71c802 drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
8e10876 drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
a2167c8 drm/i915: Watchdog timeout: IRQ handler for gen8+
96e2ee6 drm/i915: Watchdog timeout: Pass GuC shared data structure during param load
5a60c42 drm/i915/guc: Add support for reset engine using GuC commands
6ed1757 drm/i915/guc: Provide register list to be saved/restored during engine reset
af217a8 drm/i915/selftests: reset engine self tests
d43b3eb drm/i915: Add engine reset count in get-reset-stats ioctl
dea8dca drm/i915/tdr: Enable Engine reset and recovery support
f872bd8 drm/i915/tdr: Export per-engine reset count info to debugfs
4d66882b1 drm/i915/tdr: Add engine reset count to error state
a3d1a331 drm/i915: Skip reset request if there is one already
42fdfd1 drm/i915/tdr: Add support for per engine reset recovery
1ce9821 drm/i915/tdr: Modify error handler for per engine hang recovery
15808ca drm/i915: Update i915.reset to handle engine resets
95bad53 drm/i915: Rename gen8_(un)request_engine_reset to gen8_(un)request_reset_engine
56350ec drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4300/

* Re: [PATCH v5 02/18] drm/i915: Rename gen8_(un)request_engine_reset to gen8_(un)request_reset_engine
  2017-03-25  1:29 ` [PATCH v5 02/18] drm/i915: Rename gen8_(un)request_engine_reset to gen8_(un)request_reset_engine Michel Thierry
@ 2017-03-25  9:01   ` Chris Wilson
  2017-03-27 20:35     ` Michel Thierry
  0 siblings, 1 reply; 41+ messages in thread
From: Chris Wilson @ 2017-03-25  9:01 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Fri, Mar 24, 2017 at 06:29:54PM -0700, Michel Thierry wrote:
> As all other functions related to resetting engines are using reset_engine.

However, we should aim to clear any confusion between this and our
requests. Maybe gen8_reset_engine_start and gen8_reset_engine_cancel?
I think gen8_reset_engine_foo is clearer.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v5 04/18] drm/i915/tdr: Modify error handler for per engine hang recovery
  2017-03-25  1:29 ` [PATCH v5 04/18] drm/i915/tdr: Modify error handler for per engine hang recovery Michel Thierry
@ 2017-03-25  9:10   ` Chris Wilson
  0 siblings, 0 replies; 41+ messages in thread
From: Chris Wilson @ 2017-03-25  9:10 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Fri, Mar 24, 2017 at 06:29:56PM -0700, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@linux.intel.com>
> 
> This is a preparatory patch which modifies error handler to do per engine
> hang recovery. The actual patch which implements this sequence follows
> later in the series. The aim is to prepare existing recovery function to
> adapt to this new function where applicable (which fails at this point
> because core implementation is lacking) and continue recovery using legacy
> full gpu reset.
> 
> A helper function is also added to query the availability of engine
> reset.
> 
> The error events behaviour that are used to notify user of reset are
> adapted to engine reset such that it doesn't break users listening to these
> events. In legacy we report an error event, a reset event before resetting
> the gpu and a reset done event marking the completion of reset. The same
> behaviour is adapted but reset event is only dispatched once even when
> multiple engines are hung. Finally once reset is complete we send reset
> done event as usual.
> 
> Note that this implementation of engine reset is for i915 directly
> submitting to the ELSP, where the driver manages the hang detection,
> recovery and resubmission. With GuC submission these tasks are shared
> between driver and firmware; i915 will still responsible for detecting a
> hang, and when it does it will have to request GuC to reset that Engine and
> remind the firmware about the outstanding submissions. This will be
> added in different patch.
> 
> v2: rebase, advertise engine reset availability in platform definition,
> add note about GuC submission.
> v3: s/*engine_reset*/*reset_engine*/. (Chris)
> Handle reset as 2 level resets, by first going to engine only and fall
> backing to full/chip reset as needed, i.e. reset_engine will need the
> struct_mutex.
> v4: Pass the engine mask to i915_reset. (Chris)
> v5: Rebase, update selftests.
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Ian Lister <ian.lister@intel.com>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>

4 authors in and this patch is still trying to do reset_engine under the
mutex, requiring the handoff. Why? We should be ready now to be able to
do the first pass of resets before the mutex - we don't need to even
prepare the display for the per-engine resets and should be able to do
it much lighter. We can land the per-engine struct_mutex resets for
hangcheck as soon as it is ready as it would see immediate testing.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v5 18/18] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs
  2017-03-25  1:30 ` [PATCH v5 18/18] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
@ 2017-03-25  9:12   ` Chris Wilson
  0 siblings, 0 replies; 41+ messages in thread
From: Chris Wilson @ 2017-03-25  9:12 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Fri, Mar 24, 2017 at 06:30:10PM -0700, Michel Thierry wrote:
> From firmware v8.8, GuC provides the count of media engine resets
> (watchdog timeout). This information is available in the GuC shared
> context data struct, which resides in the first page of the default
> (kernel) lrc context.
> 
> Since GuC handled engine resets are transparent for kernel and user,
> provide a simple debugfs entry to see the number of times media reset
> has happened.
> 
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c   | 28 ++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/intel_guc_fwif.h | 18 ++++++++++++++++++
>  2 files changed, 46 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 8db850541e03..f40a2c84b423 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -1400,6 +1400,32 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
>  	return 0;
>  }
>  
> +static u32 i915_watchdog_reset_count(struct drm_i915_private *dev_priv)
> +{
> +	struct drm_device *dev = &dev_priv->drm;
> +	struct i915_gem_context *ctx;
> +	struct page *page;
> +	struct guc_shared_ctx_data *guc_shared_data;
> +	u32 guc_media_reset_count;
> +
> +	if (!i915.enable_guc_submission)
> +		return 0;
> +
> +	if (mutex_lock_interruptible(&dev->struct_mutex))
> +		return 0;

Do you need the mutex?

> +	ctx = dev_priv->kernel_context;
> +	page = i915_gem_object_get_dirty_page(ctx->engine[RCS].state->obj,
> +					      LRC_GUCSHR_PN);

Are you writing?

> +	guc_shared_data = kmap_atomic(page);

Atomic?

> +	guc_media_reset_count = guc_shared_data->media_reset_count;

This is an unserialised access, mark it so (READ_ONCE).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v5 14/18] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load
  2017-03-25  1:30 ` [PATCH v5 14/18] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
@ 2017-03-25  9:15   ` Chris Wilson
  2017-04-17 21:28   ` Daniele Ceraolo Spurio
  1 sibling, 0 replies; 41+ messages in thread
From: Chris Wilson @ 2017-03-25  9:15 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Fri, Mar 24, 2017 at 06:30:06PM -0700, Michel Thierry wrote:
> For watchdog / media reset, the firmware must know the address of the shared
> data page (the first page of the default context).
> 
> This information should be in DWORD 9 of the GUC_CTL structure.
> 
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_guc_fwif.h   | 2 +-
>  drivers/gpu/drm/i915/intel_guc_loader.c | 8 ++++++++
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
> index b627206b8f56..5db3def5f74e 100644
> --- a/drivers/gpu/drm/i915/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
> @@ -135,7 +135,7 @@
>  #define   GUC_ADS_ADDR_SHIFT		11
>  #define   GUC_ADS_ADDR_MASK		0xfffff800
>  
> -#define GUC_CTL_RSRVD			9
> +#define GUC_CTL_SHARED_DATA		9
>  
>  #define GUC_CTL_MAX_DWORDS		(SOFT_SCRATCH_COUNT - 2) /* [1..14] */
>  
> diff --git a/drivers/gpu/drm/i915/intel_guc_loader.c b/drivers/gpu/drm/i915/intel_guc_loader.c
> index 7d92321f8731..afa584864cb5 100644
> --- a/drivers/gpu/drm/i915/intel_guc_loader.c
> +++ b/drivers/gpu/drm/i915/intel_guc_loader.c
> @@ -119,6 +119,7 @@ static void guc_params_init(struct drm_i915_private *dev_priv)
>  {
>  	struct intel_guc *guc = &dev_priv->guc;
>  	u32 params[GUC_CTL_MAX_DWORDS];
> +	struct i915_gem_context *ctx;
>  	int i;
>  
>  	memset(&params, 0, sizeof(params));
> @@ -167,6 +168,13 @@ static void guc_params_init(struct drm_i915_private *dev_priv)
>  		params[GUC_CTL_FEATURE] &= ~GUC_CTL_DISABLE_SCHEDULER;
>  	}
>  
> +	/*
> +	 * For watchdog / media reset, GuC must know the address of the shared
> +	 * data page, which is the first page of the default context.
> +	 */
> +	ctx = dev_priv->kernel_context;
> +	params[GUC_CTL_SHARED_DATA] = i915_ggtt_offset(ctx->engine[RCS].state);

guc_ggtt_offset(), it's the same as i915_ggtt_offset() but with a couple
of extra guc specific checks.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v5 15/18] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-03-25  1:30 ` [PATCH v5 15/18] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
@ 2017-03-25  9:26   ` Chris Wilson
  2017-03-27 21:48     ` Michel Thierry
  0 siblings, 1 reply; 41+ messages in thread
From: Chris Wilson @ 2017-03-25  9:26 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Fri, Mar 24, 2017 at 06:30:07PM -0700, Michel Thierry wrote:
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 87e76ef589b1..d484cbc561eb 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1369,6 +1369,10 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
>  
>  	if (tasklet)
>  		tasklet_hi_schedule(&engine->irq_tasklet);
> +
> +	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
> +		tasklet_hi_schedule(&engine->watchdog_tasklet);

We don't need to set this as high, we definitely do want to process the
live engines first and so some small latency in detecting the reset is
no deal breaker. We probably don't even want to use a tasklet? (Or
actually we do!)

>  static void hangcheck_declare_hang(struct drm_i915_private *i915,
>  				   unsigned int hung,
> -				   unsigned int stuck)
> +				   unsigned int stuck,
> +				   unsigned int watchdog)

That's a very interesting question as to whether we want to use the very
heavy hangcheck and capture machine at all for the watchdog.

> +#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
> +static void gen8_watchdog_irq_handler(unsigned long data)
> +{
> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> +	struct drm_i915_private *dev_priv = engine->i915;
> +	u32 watchdog_disable, current_seqno;
> +
> +	intel_uncore_forcewake_get(dev_priv, engine->fw_domains);
> +
> +	if (engine->id == RCS)
> +		watchdog_disable = GEN8_RCS_WATCHDOG_DISABLE;
> +	else
> +		watchdog_disable = GEN8_XCS_WATCHDOG_DISABLE;
> +
> +	/* Stop the counter to prevent further timeout interrupts */
> +	I915_WRITE_FW(RING_CNTR(engine->mmio_base), watchdog_disable);
> +
> +	/* false-positive, request completed after the timer expired */

False optimism in spotting the false positive. engine_is_idle() checks
that all requests have completed, not just the interesting one. Since you
are using seqno, just reject when seqno == intel_engine_last_submit().
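A minimal userspace sketch of the rejection test being suggested (the function names are illustrative, not the i915 API): treat the watchdog fire as a false positive when the engine's breadcrumb seqno has caught up with the last seqno submitted, using the same wrap-safe signed-subtraction compare i915 uses for seqnos.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe seqno comparison: true when seq1 is at or after seq2. */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

/*
 * Illustrative stand-in for the check: a watchdog interrupt is spurious
 * when the engine has already retired everything submitted to it.
 */
static bool watchdog_false_positive(uint32_t current_seqno,
				    uint32_t last_submitted_seqno)
{
	return seqno_passed(current_seqno, last_submitted_seqno);
}
```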

> +	if (intel_engine_is_idle(engine))
> +		goto fw_put;
> +
> +	current_seqno = intel_engine_get_seqno(engine);
> +	if (engine->hangcheck.last_watchdog_seqno == current_seqno) {

Or you could just reset the engine directly, once we rid it of that
pesky mutex (which is done in all but name already). Doing that from
inside the tasklet has some advantages -- we don't need to disable the
execlists/guc tasklet.

> +		/* Make sure the active request will be marked as guilty */
> +		engine->hangcheck.stalled = true;
> +		engine->hangcheck.seqno = intel_engine_get_seqno(engine);
> +
> +		/* And try to run the hangcheck_work as soon as possible */
> +		set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
> +		queue_delayed_work(system_long_wq,
> +				   &dev_priv->gpu_error.hangcheck_work, 0);
> +	} else {
> +		engine->hangcheck.last_watchdog_seqno = current_seqno;
> +		/* Re-start the counter, if really hung, it will expire again */
> +		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
> +		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
> +	}
> +
> +fw_put:
> +	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
> +}

> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index e8faf2c34c97..fffe69f5aed2 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -128,6 +128,7 @@ struct intel_instdone {
>  struct intel_engine_hangcheck {
>  	u64 acthd;
>  	u32 seqno;
> +	u32 last_watchdog_seqno;

Just watchdog will be enough for its meaning to be clear.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-03-25  1:30 ` [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
@ 2017-03-25  9:38   ` Chris Wilson
  2017-03-28  1:03     ` Michel Thierry
  2017-04-14 16:05   ` Daniele Ceraolo Spurio
  1 sibling, 1 reply; 41+ messages in thread
From: Chris Wilson @ 2017-03-25  9:38 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Fri, Mar 24, 2017 at 06:30:09PM -0700, Michel Thierry wrote:
> Final enablement patch for GPU hang detection using watchdog timeout.
> Using the gem_context_setparam ioctl, users can specify the desired
> timeout value in microseconds, and the driver will do the conversion to
> 'timestamps'.
> 
> The recommended default watchdog threshold for video engines is 60000 us,
> since this has been _empirically determined_ to be a good compromise for
> low-latency requirements and low rate of false positives. The default
> register value is ~106000us and the theoretical max value (all 1s) is
> 353 seconds.
> 
> v2: Fixed get api to return values in microseconds. Threshold updated to
> be per context engine. Check for u32 overflow. Capture ctx threshold
> value in error state.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         |  1 +
>  drivers/gpu/drm/i915/i915_gem_context.c | 78 +++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_gem_context.h | 20 +++++++++
>  drivers/gpu/drm/i915/i915_gpu_error.c   | 11 +++--
>  drivers/gpu/drm/i915/intel_lrc.c        |  2 +-
>  include/uapi/drm/i915_drm.h             |  1 +
>  6 files changed, 108 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index b43c37a911bb..1741584d858f 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1039,6 +1039,7 @@ struct i915_gpu_state {
>  			int ban_score;
>  			int active;
>  			int guilty;
> +			int watchdog_threshold;

Shouldn't this be added in the patch adding it to the context for real?

>  		} context;
>  
>  		struct drm_i915_error_object {
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index edbed85a1c88..f5c126c0c681 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -422,6 +422,78 @@ i915_gem_context_create_gvt(struct drm_device *dev)
>  	return ctx;
>  }
>  
> +/* Return the timer count threshold in microseconds. */
> +int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
> +				  struct drm_i915_gem_context_param *args)
> +{
> +	struct drm_i915_private *dev_priv = ctx->i915;
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	u32 threshold_in_us[I915_NUM_ENGINES];

Reading is a 2 step process. First CONTEXT_GET_PARAM passes in size==0
and gets told by the kernel what size array to allocate. Second pass
supplies a buffer with that size. (Or userspace can preallocate a large
enough buffer, declare its full size and let the kernel fill as much as
it wants.)

if (args->size == 0)
	goto out;

if (args->size < sizeof(threshold_in_us))
	return -EFAULT;
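The two-step protocol above can be sketched in userspace C (the struct and error codes are illustrative stand-ins, not the real uapi): a first call with size == 0 only reports the required buffer size, and a later call with a large enough buffer gets it filled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_ENGINES 5

/* Illustrative stand-in for drm_i915_gem_context_param. */
struct param_args {
	uint32_t size;
	uint32_t value[NUM_ENGINES]; /* stands in for the user pointer */
};

/* Returns 0, or -14 standing in for -EFAULT on a too-small buffer. */
static int get_watchdog(struct param_args *args, const uint32_t *thresholds)
{
	if (args->size == 0)
		goto out; /* size query: just report the required size */

	if (args->size < sizeof(uint32_t) * NUM_ENGINES)
		return -14;

	memcpy(args->value, thresholds, sizeof(uint32_t) * NUM_ENGINES);
out:
	args->size = sizeof(uint32_t) * NUM_ENGINES;
	return 0;
}
```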

> +
> +	if (!dev_priv->engine[VCS]->emit_start_watchdog)
> +		return -ENODEV;
> +
> +	for_each_engine(engine, dev_priv, id) {
> +		struct intel_context *ce = &ctx->engine[id];
> +
> +		threshold_in_us[id] = watchdog_to_us(ce->watchdog_threshold);
> +	}
> +
> +	mutex_unlock(&dev_priv->drm.struct_mutex);

Grr. We should look at why we have this lock here in the first place.
IIRC, it was there to make TRTT easier, but we can always move the
burden of work again.

> +	if (__copy_to_user(u64_to_user_ptr(args->value),
> +			   &threshold_in_us,
> +			   sizeof(threshold_in_us))) {
> +		mutex_lock(&dev_priv->drm.struct_mutex);
> +		return -EFAULT;
> +	}
> +	mutex_lock(&dev_priv->drm.struct_mutex);

out:
> +	args->size = sizeof(threshold_in_us);
> +
> +	return 0;
> +}
> +
> +/*
> + * Based on time out value in microseconds (us) calculate
> + * timer count thresholds needed based on core frequency.
> + * Watchdog can be disabled by setting it to 0.
> + */
> +int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
> +				  struct drm_i915_gem_context_param *args)
> +{
> +	struct drm_i915_private *dev_priv = ctx->i915;
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	u32 threshold_in_us[I915_NUM_ENGINES];
> +
> +	if (!dev_priv->engine[VCS]->emit_start_watchdog)
> +		return -ENODEV;
> +	else if (args->size < sizeof(threshold_in_us))
> +		return -EINVAL;

args->size == 0 here could be used to mean one value for all? That might
be a little clumsy compared to the getter, but easy for example for
disabling the watchdog on all engines.

if (args->size == 0) do_set_all()
if (args->size < sizeof())
	return -EFAULT;

> +
> +	mutex_unlock(&dev_priv->drm.struct_mutex);
> +	if (copy_from_user(&threshold_in_us,
> +			   u64_to_user_ptr(args->value),
> +			   sizeof(threshold_in_us))) {
> +		mutex_lock(&dev_priv->drm.struct_mutex);
> +		return -EFAULT;
> +	}
> +	mutex_lock(&dev_priv->drm.struct_mutex);
> +
> +	/* not supported in blitter engine */
> +	if (threshold_in_us[BCS] != 0)
> +		return -EINVAL;
> +
> +	for_each_engine(engine, dev_priv, id) {
> +		struct intel_context *ce = &ctx->engine[id];
> +
> +		ce->watchdog_threshold =
> +			watchdog_to_clock_counts((u64)threshold_in_us[id]);

Cast is superfluous.

> +	}
> +
> +	return 0;
> +}
> +
>  int i915_gem_context_init(struct drm_i915_private *dev_priv)
>  {
>  	struct i915_gem_context *ctx;
> @@ -1061,6 +1133,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  	case I915_CONTEXT_PARAM_BANNABLE:
>  		args->value = i915_gem_context_is_bannable(ctx);
>  		break;
> +	case I915_CONTEXT_PARAM_WATCHDOG:
> +		ret = i915_gem_context_get_watchdog(ctx, args);
> +		break;
>  	default:
>  		ret = -EINVAL;
>  		break;
> @@ -1118,6 +1193,9 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>  		else
>  			i915_gem_context_clear_bannable(ctx);
>  		break;
> +	case I915_CONTEXT_PARAM_WATCHDOG:
> +		ret = i915_gem_context_set_watchdog(ctx, args);
> +		break;
>  	default:
>  		ret = -EINVAL;
>  		break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index 88700bdbb4e1..6867b1fead8b 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -250,6 +250,26 @@ static inline bool i915_gem_context_is_kernel(struct i915_gem_context *ctx)
>  	return !ctx->file_priv;
>  }
>  
> +/*
> + * Timestamp timer resolution = 0.080 uSec,
> + * or 12500000 counts per second, or ~12 counts per microsecond.
> + */
> +#define TIMESTAMP_CNTS_PER_USEC 12
> +static inline u32 watchdog_to_us(u32 value_in_clock_counts)
> +{
> +	return (value_in_clock_counts) / (TIMESTAMP_CNTS_PER_USEC);
> +}

(brackets) (for fun) (?)

> +
> +static inline u32 watchdog_to_clock_counts(u64 value_in_us)
> +{
> +	u64 threshold = (value_in_us) * (TIMESTAMP_CNTS_PER_USEC);
> +
> +	if (GEM_WARN_ON(overflows_type(threshold, u32)))

Nice idea, and yes we should do the range check. But this is a userspace
interface, so always do it and don't warn, just -EINVAL.
-Chris
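A minimal userspace sketch of the range check being asked for (names and the plain -22 return standing in for -EINVAL are illustrative, not the driver's code): do the multiply in 64 bits and reject any user value whose clock-count equivalent does not fit the 32-bit threshold register, instead of warning.

```c
#include <assert.h>
#include <stdint.h>

#define TIMESTAMP_CNTS_PER_USEC 12

/*
 * Convert a user-supplied timeout in microseconds to clock counts.
 * Returns 0 and writes *out on success, -22 (-EINVAL) on overflow.
 */
static int watchdog_us_to_clock_counts(uint64_t value_in_us, uint32_t *out)
{
	uint64_t threshold = value_in_us * TIMESTAMP_CNTS_PER_USEC;

	if (threshold > UINT32_MAX)
		return -22;

	*out = (uint32_t)threshold;
	return 0;
}
```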

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v5 16/18] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2017-03-25  1:30 ` [PATCH v5 16/18] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
@ 2017-03-25  9:46   ` Chris Wilson
  0 siblings, 0 replies; 41+ messages in thread
From: Chris Wilson @ 2017-03-25  9:46 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Fri, Mar 24, 2017 at 06:30:08PM -0700, Michel Thierry wrote:
> Emit the required commands into the ring buffer for starting and
> stopping the watchdog timer before/after batch buffer start during
> batch buffer submission.
> 
> v2: Support watchdog threshold per context engine, merge lri commands,
> and move watchdog commands emission to emit_bb_start. Request space of
> combined start_watchdog, bb_start and stop_watchdog to avoid any error
> after emitting bb_start.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Ian Lister <ian.lister@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_context.h |  4 ++
>  drivers/gpu/drm/i915/intel_lrc.c        | 86 ++++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  4 ++
>  3 files changed, 92 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index 4af2ab94558b..88700bdbb4e1 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -150,6 +150,10 @@ struct i915_gem_context {
>  		u32 *lrc_reg_state;
>  		u64 lrc_desc;
>  		int pin_count;
> +		/** watchdog_threshold: hw watchdog threshold value,
> +		 * in clock counts
> +		 */
> +		u32 watchdog_threshold;
>  		bool initialised;
>  	} engine[I915_NUM_ENGINES];
>  
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 73f8fbdcf1fb..2736f642dc76 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1319,6 +1319,8 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
>  			      const unsigned int flags)
>  {
>  	u32 *cs;
> +	u32 num_dwords;
> +	bool watchdog_running = false;
>  	int ret;
>  
>  	/* Don't rely in hw updating PDPs, specially in lite-restore.
> @@ -1338,10 +1340,29 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
>  		req->ctx->ppgtt->pd_dirty_rings &= ~intel_engine_flag(req->engine);
>  	}
>  
> -	cs = intel_ring_begin(req, 4);
> +	/* bb_start only */
> +	num_dwords = 4;
> +
> +	/* check if watchdog will be required */
> +	if (req->ctx->engine[req->engine->id].watchdog_threshold != 0) {
> +		if (!req->engine->emit_start_watchdog ||
> +		    !req->engine->emit_stop_watchdog)
> +			return -EINVAL;

A bit late! This will already be checked before we set the threshold.

> +		/* start_watchdog (6) + bb_start (4) + stop_watchdog (4) */

Don't mention bb_start since it isn't in the additional 10 and you just
confused me! :)

> +		num_dwords += 10;
> +		watchdog_running = true;
> +	}
> +
> +	cs = intel_ring_begin(req, num_dwords);
>  	if (IS_ERR(cs))
>  		return PTR_ERR(cs);
>  
> +	if (watchdog_running) {
> +		/* Start watchdog timer */
> +		cs = req->engine->emit_start_watchdog(req, cs);

Are they going to be specialised vfuncs in the near future? i.e.
something other than LRI. If the only advantage is for checking feature
compatibility, we could just use a bit instead?

> +	}
> +
>  	/* FIXME(BDW): Address space and security selectors. */
>  	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
>  		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8)) |
> @@ -1349,8 +1370,13 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
>  	*cs++ = lower_32_bits(offset);
>  	*cs++ = upper_32_bits(offset);
>  	*cs++ = MI_NOOP;
> -	intel_ring_advance(req, cs);
>  
> +	if (watchdog_running) {
> +		/* Cancel watchdog timer */
> +		cs = req->engine->emit_stop_watchdog(req, cs);
> +	}
> +
> +	intel_ring_advance(req, cs);
>  	return 0;
>  }
>  
> @@ -1512,6 +1538,54 @@ static void gen8_watchdog_irq_handler(unsigned long data)
>  	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
>  }
>  
> +static u32 *gen8_emit_start_watchdog(struct drm_i915_gem_request *req, u32 *cs)
> +{
> +	struct intel_engine_cs *engine = req->engine;
> +	struct i915_gem_context *ctx = req->ctx;
> +	struct intel_context *ce = &ctx->engine[engine->id];
> +
> +	/* XXX: no watchdog support in BCS engine */
> +	GEM_BUG_ON(engine->id == BCS);
> +
> +	/*
> +	 * watchdog register must never be programmed to zero. This would
> +	 * cause the watchdog counter to exceed and not allow the engine to
> +	 * go into IDLE state
> +	 */
> +	GEM_BUG_ON(ce->watchdog_threshold == 0);
> +
> +	/* Set counter period */
> +	*cs++ = MI_LOAD_REGISTER_IMM(2);
> +	*cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base));
> +	*cs++ = ce->watchdog_threshold;
> +	/* Start counter */
> +	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
> +	*cs++ = GEN8_WATCHDOG_ENABLE;
> +	*cs++ = MI_NOOP;
> +
> +	return cs;
> +}
> +
> +static u32 *gen8_emit_stop_watchdog(struct drm_i915_gem_request *req, u32 *cs)
> +{
> +	struct intel_engine_cs *engine = req->engine;
> +
> +	/* XXX: no watchdog support in BCS engine */
> +	GEM_BUG_ON(engine->id == BCS);
> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(2);
> +	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
> +
> +	if (engine->id == RCS)
> +		*cs++ = GEN8_RCS_WATCHDOG_DISABLE;
> +	else
> +		*cs++ = GEN8_XCS_WATCHDOG_DISABLE;
> +
> +	*cs++ = MI_NOOP;

2 spare NOOPS, cancel out.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v5 11/18] drm/i915/selftests: reset engine self tests
  2017-03-25  1:30 ` [PATCH v5 11/18] drm/i915/selftests: reset engine self tests Michel Thierry
@ 2017-03-25  9:50   ` Chris Wilson
  0 siblings, 0 replies; 41+ messages in thread
From: Chris Wilson @ 2017-03-25  9:50 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Fri, Mar 24, 2017 at 06:30:03PM -0700, Michel Thierry wrote:
> Check that we can reset specific engines, also check the fallback to
> full reset if something didn't work.
> 
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 144 +++++++++++++++++++++++
>  1 file changed, 144 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 76d5f78c1724..91b78e638494 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -323,6 +323,56 @@ static int igt_global_reset(void *arg)
>  	return err;
>  }
>  
> +static int igt_reset_engine(void *arg)
> +{
> +	struct drm_i915_private *i915 = arg;
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	unsigned int reset_count, reset_engine_count;
> +	int err = 0;
> +
> +	/* Check that we can issue a global GPU and engine reset */
> +
> +	if (!intel_has_gpu_reset(i915))
> +		return 0;
> +
> +	if (!intel_has_reset_engine(i915))
> +		return 0;

Where do you do the global reset in this function? So isn't the
has-engine-reset sufficient? And the caller does the global check for
you.

> +	set_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
> +	mutex_lock(&i915->drm.struct_mutex);

This will all change! ;)

> +static int igt_render_engine_reset_fallback(void *arg)
> +{
> +	struct drm_i915_private *i915 = arg;
> +	struct intel_engine_cs *engine = i915->engine[RCS];
> +	struct hang h;
> +	struct drm_i915_gem_request *rq;
> +	unsigned int reset_count, reset_engine_count;
> +	int err = 0;
> +
> +	/* Check that we can issue a global GPU and engine reset */
> +
> +	if (!intel_has_gpu_reset(i915))
> +		return 0;
> +
> +	if (!intel_has_reset_engine(i915))
> +		return 0;
> +
> +	set_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
> +	mutex_lock(&i915->drm.struct_mutex);
> +
> +	err = hang_init(&h, i915);
> +	if (err)
> +		goto unlock;
> +
> +	rq = hang_create_request(&h, engine, i915->kernel_context);
> +	if (IS_ERR(rq)) {
> +		err = PTR_ERR(rq);
> +		goto fini;
> +	}
> +
> +	i915_gem_request_get(rq);
> +	__i915_add_request(rq, true);

...

> +fini:

i915_gem_request_put(rq);

> +	hang_fini(&h);
> +unlock:
> +	mutex_unlock(&i915->drm.struct_mutex);
> +	clear_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
> +
> +	if (i915_terminally_wedged(&i915->gpu_error))
> +		return -EIO;
> +
> +	return err;
> +}

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v5 02/18] drm/i915: Rename gen8_(un)request_engine_reset to gen8_(un)request_reset_engine
  2017-03-25  9:01   ` Chris Wilson
@ 2017-03-27 20:35     ` Michel Thierry
  0 siblings, 0 replies; 41+ messages in thread
From: Michel Thierry @ 2017-03-27 20:35 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 25/03/17 02:01, Chris Wilson wrote:
> On Fri, Mar 24, 2017 at 06:29:54PM -0700, Michel Thierry wrote:
>> As all other functions related to resetting engines are using reset_engine.
>
> However, we should aim to clear any confusion between this and our
> requests. Maybe gen8_reset_engine_start and gen8_reset_engine_cancel?
> I think gen8_reset_engine_foo is clearer.

gen8_reset_engine_start & gen8_reset_engine_cancel sound good to me.

Thanks

* Re: [PATCH v5 15/18] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-03-25  9:26   ` Chris Wilson
@ 2017-03-27 21:48     ` Michel Thierry
  2017-03-28  8:34       ` Chris Wilson
  0 siblings, 1 reply; 41+ messages in thread
From: Michel Thierry @ 2017-03-27 21:48 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 25/03/17 02:26, Chris Wilson wrote:
> On Fri, Mar 24, 2017 at 06:30:07PM -0700, Michel Thierry wrote:
>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>> index 87e76ef589b1..d484cbc561eb 100644
>> --- a/drivers/gpu/drm/i915/i915_irq.c
>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>> @@ -1369,6 +1369,10 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
>>
>>  	if (tasklet)
>>  		tasklet_hi_schedule(&engine->irq_tasklet);
>> +
>> +	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
>> +		tasklet_hi_schedule(&engine->watchdog_tasklet);
>
> We don't need to set this as high, we definitely do want to process the
> live engines first and so some small latency in detecting the reset is
> no deal breaker. We probably don't even want to use a tasklet? (Or
> actually we do!)
>
>>  static void hangcheck_declare_hang(struct drm_i915_private *i915,
>>  				   unsigned int hung,
>> -				   unsigned int stuck)
>> +				   unsigned int stuck,
>> +				   unsigned int watchdog)
>
> That's a very interesting question as to whether we want to use the very
> heavy hangcheck and capture machine at all for the watchdog.
>
>> +#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
>> +static void gen8_watchdog_irq_handler(unsigned long data)
>> +{
>> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
>> +	struct drm_i915_private *dev_priv = engine->i915;
>> +	u32 watchdog_disable, current_seqno;
>> +
>> +	intel_uncore_forcewake_get(dev_priv, engine->fw_domains);
>> +
>> +	if (engine->id == RCS)
>> +		watchdog_disable = GEN8_RCS_WATCHDOG_DISABLE;
>> +	else
>> +		watchdog_disable = GEN8_XCS_WATCHDOG_DISABLE;
>> +
>> +	/* Stop the counter to prevent further timeout interrupts */
>> +	I915_WRITE_FW(RING_CNTR(engine->mmio_base), watchdog_disable);
>> +
>> +	/* false-positive, request completed after the timer expired */
>
> False optimism in spotting the false positive. engine_is_idle() means
> all requests not the interesting one. Since you are using seqno, just
> reject when seqno == intel_engine_last_submit().
>
>> +	if (intel_engine_is_idle(engine))
>> +		goto fw_put;
>> +
>> +	current_seqno = intel_engine_get_seqno(engine);
>> +	if (engine->hangcheck.last_watchdog_seqno == current_seqno) {
>
> Or you could just reset the engine directly, once we rid it of that
> pesky mutex (which is done in all but name already). Doing that from
> inside the tasklet has some advantages -- we don't need to disable the
> execlists/guc tasklet.
>

True, as you said above, we probably don't need to capture the gpu state 
in this case. The error state may not even be meaningful (for example if 
the threshold was too short and the engine was not really hung).

>> +		/* Make sure the active request will be marked as guilty */
>> +		engine->hangcheck.stalled = true;
>> +		engine->hangcheck.seqno = intel_engine_get_seqno(engine);
>> +
>> +		/* And try to run the hangcheck_work as soon as possible */
>> +		set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
>> +		queue_delayed_work(system_long_wq,
>> +				   &dev_priv->gpu_error.hangcheck_work, 0);
>> +	} else {
>> +		engine->hangcheck.last_watchdog_seqno = current_seqno;
>> +		/* Re-start the counter, if really hung, it will expire again */
>> +		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
>> +		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
>> +	}
>> +
>> +fw_put:
>> +	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
>> +}
>
>> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
>> index e8faf2c34c97..fffe69f5aed2 100644
>> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
>> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
>> @@ -128,6 +128,7 @@ struct intel_instdone {
>>  struct intel_engine_hangcheck {
>>  	u64 acthd;
>>  	u32 seqno;
>> +	u32 last_watchdog_seqno;
>
> Just watchdog will be enough for its meaning to be clear.

watchdog or watchdog_seqno?

* Re: [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-03-25  9:38   ` Chris Wilson
@ 2017-03-28  1:03     ` Michel Thierry
  2017-03-28  8:31       ` Chris Wilson
  0 siblings, 1 reply; 41+ messages in thread
From: Michel Thierry @ 2017-03-28  1:03 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 25/03/17 02:38, Chris Wilson wrote:
> On Fri, Mar 24, 2017 at 06:30:09PM -0700, Michel Thierry wrote:
>> Final enablement patch for GPU hang detection using watchdog timeout.
>> Using the gem_context_setparam ioctl, users can specify the desired
>> timeout value in microseconds, and the driver will do the conversion to
>> 'timestamps'.
>>
>> The recommended default watchdog threshold for video engines is 60000 us,
>> since this has been _empirically determined_ to be a good compromise for
>> low-latency requirements and low rate of false positives. The default
>> register value is ~106000us and the theoretical max value (all 1s) is
>> 353 seconds.
>>
>> v2: Fixed get api to return values in microseconds. Threshold updated to
>> be per context engine. Check for u32 overflow. Capture ctx threshold
>> value in error state.
>>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.h         |  1 +
>>  drivers/gpu/drm/i915/i915_gem_context.c | 78 +++++++++++++++++++++++++++++++++
>>  drivers/gpu/drm/i915/i915_gem_context.h | 20 +++++++++
>>  drivers/gpu/drm/i915/i915_gpu_error.c   | 11 +++--
>>  drivers/gpu/drm/i915/intel_lrc.c        |  2 +-
>>  include/uapi/drm/i915_drm.h             |  1 +
>>  6 files changed, 108 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index b43c37a911bb..1741584d858f 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -1039,6 +1039,7 @@ struct i915_gpu_state {
>>  			int ban_score;
>>  			int active;
>>  			int guilty;
>> +			int watchdog_threshold;
>
> Shouldn't this be added in the patch adding it to the context for real?
>

I wanted to wait until I could print it in the error_state using 
watchdog_to_us (which is not added until this patch).
I can also move all the i915_gpu_error.c changes to a new patch.

>>  		} context;
>>
>>  		struct drm_i915_error_object {
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
>> index edbed85a1c88..f5c126c0c681 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -422,6 +422,78 @@ i915_gem_context_create_gvt(struct drm_device *dev)
>>  	return ctx;
>>  }
>>
>> +/* Return the timer count threshold in microseconds. */
>> +int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
>> +				  struct drm_i915_gem_context_param *args)
>> +{
>> +	struct drm_i915_private *dev_priv = ctx->i915;
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +	u32 threshold_in_us[I915_NUM_ENGINES];
>
> Reading is a 2 step process. First CONTEXT_GET_PARAM passes in size==0
> and gets told by the kernel what size array to allocate. Second pass
> supplies a buffer with that size. (Or userspace can preallocate a large
> enough buffer, declare it's full size and let the kernel fill as much as
> it wants.)
>
> if (args->size == 0)
> 	goto out;
>
> if (args->size < sizeof(threshold_in_us))
> 	return -EFAULT;
EFAULT or EINVAL?

>> +
>> +	if (!dev_priv->engine[VCS]->emit_start_watchdog)
>> +		return -ENODEV;
>> +
>> +	for_each_engine(engine, dev_priv, id) {
>> +		struct intel_context *ce = &ctx->engine[id];
>> +
>> +		threshold_in_us[id] = watchdog_to_us(ce->watchdog_threshold);
>> +	}
>> +
>> +	mutex_unlock(&dev_priv->drm.struct_mutex);
>
> Grr. We should look at why we have this lock here in the first place.
> IIRC, it was there to make TRTT easier, but we can always move the
> burden of work again.
>

It helps TRTT (that code also takes an extra reference of the ctx)... 
but it all started in i915_gem_context_lookup (499f2697da1d)

>> +	if (__copy_to_user(u64_to_user_ptr(args->value),
>> +			   &threshold_in_us,
>> +			   sizeof(threshold_in_us))) {
>> +		mutex_lock(&dev_priv->drm.struct_mutex);
>> +		return -EFAULT;
>> +	}
>> +	mutex_lock(&dev_priv->drm.struct_mutex);
>
> out:
>> +	args->size = sizeof(threshold_in_us);
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Based on time out value in microseconds (us) calculate
>> + * timer count thresholds needed based on core frequency.
>> + * Watchdog can be disabled by setting it to 0.
>> + */
>> +int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
>> +				  struct drm_i915_gem_context_param *args)
>> +{
>> +	struct drm_i915_private *dev_priv = ctx->i915;
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +	u32 threshold_in_us[I915_NUM_ENGINES];
>> +
>> +	if (!dev_priv->engine[VCS]->emit_start_watchdog)
>> +		return -ENODEV;
>> +	else if (args->size < sizeof(threshold_in_us))
>> +		return -EINVAL;
>
> args->size == 0 here could be used to mean one value for all? That might
> be a little clumsy compared to the getter, but easy for example for
> disabling the watchdog on all engines.
>
> if (args->size == 0) do_set_all()
> if (args->size < sizeof())
> 	return -EFAULT;
>
>> +
>> +	mutex_unlock(&dev_priv->drm.struct_mutex);
>> +	if (copy_from_user(&threshold_in_us,
>> +			   u64_to_user_ptr(args->value),
>> +			   sizeof(threshold_in_us))) {
>> +		mutex_lock(&dev_priv->drm.struct_mutex);
>> +		return -EFAULT;
>> +	}
>> +	mutex_lock(&dev_priv->drm.struct_mutex);
>> +
>> +	/* not supported in blitter engine */
>> +	if (threshold_in_us[BCS] != 0)
>> +		return -EINVAL;
>> +
>> +	for_each_engine(engine, dev_priv, id) {
>> +		struct intel_context *ce = &ctx->engine[id];
>> +
>> +		ce->watchdog_threshold =
>> +			watchdog_to_clock_counts((u64)threshold_in_us[id]);
>
> Cast is superfluous.
>

without it, the multiplication in watchdog_to_clock_counts was always 
cast to u32.

>> +	}
>> +
>> +	return 0;
>> +}
>> +
>>  int i915_gem_context_init(struct drm_i915_private *dev_priv)
>>  {
>>  	struct i915_gem_context *ctx;
>> @@ -1061,6 +1133,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>>  	case I915_CONTEXT_PARAM_BANNABLE:
>>  		args->value = i915_gem_context_is_bannable(ctx);
>>  		break;
>> +	case I915_CONTEXT_PARAM_WATCHDOG:
>> +		ret = i915_gem_context_get_watchdog(ctx, args);
>> +		break;
>>  	default:
>>  		ret = -EINVAL;
>>  		break;
>> @@ -1118,6 +1193,9 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>>  		else
>>  			i915_gem_context_clear_bannable(ctx);
>>  		break;
>> +	case I915_CONTEXT_PARAM_WATCHDOG:
>> +		ret = i915_gem_context_set_watchdog(ctx, args);
>> +		break;
>>  	default:
>>  		ret = -EINVAL;
>>  		break;
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
>> index 88700bdbb4e1..6867b1fead8b 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.h
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
>> @@ -250,6 +250,26 @@ static inline bool i915_gem_context_is_kernel(struct i915_gem_context *ctx)
>>  	return !ctx->file_priv;
>>  }
>>
>> +/*
>> + * Timestamp timer resolution = 0.080 uSec,
>> + * or 12500000 counts per second, or ~12 counts per microsecond.
>> + */
>> +#define TIMESTAMP_CNTS_PER_USEC 12
>> +static inline u32 watchdog_to_us(u32 value_in_clock_counts)
>> +{
>> +	return (value_in_clock_counts) / (TIMESTAMP_CNTS_PER_USEC);
>> +}
>
> (brackets) (for fun) (?)
>
>> +
>> +static inline u32 watchdog_to_clock_counts(u64 value_in_us)
>> +{
>> +	u64 threshold = (value_in_us) * (TIMESTAMP_CNTS_PER_USEC);
>> +
>> +	if (GEM_WARN_ON(overflows_type(threshold, u32)))
>
> Nice idea, and yes we should do the range check. But this is a userspace
> interface, so always do it and don't warn, just -EINVAL.

What if it fails/overflows after some engines' thresholds have been set? 
Should it set them back to 0s or leave them enabled?

All other fixes added.

Thanks
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-03-28  1:03     ` Michel Thierry
@ 2017-03-28  8:31       ` Chris Wilson
  2017-03-28  8:39         ` Chris Wilson
  0 siblings, 1 reply; 41+ messages in thread
From: Chris Wilson @ 2017-03-28  8:31 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Mon, Mar 27, 2017 at 06:03:37PM -0700, Michel Thierry wrote:
> On 25/03/17 02:38, Chris Wilson wrote:
> >On Fri, Mar 24, 2017 at 06:30:09PM -0700, Michel Thierry wrote:
> >>diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> >>index b43c37a911bb..1741584d858f 100644
> >>--- a/drivers/gpu/drm/i915/i915_drv.h
> >>+++ b/drivers/gpu/drm/i915/i915_drv.h
> >>@@ -1039,6 +1039,7 @@ struct i915_gpu_state {
> >> 			int ban_score;
> >> 			int active;
> >> 			int guilty;
> >>+			int watchdog_threshold;
> >
> >Shouldn't this be added in the patch adding it to the context for real?
> >
> 
> I wanted to wait until I could print it in the error_state using
> watchdog_to_us (which isn't added until this patch).
> I can also move all the i915_gpu_error.c changes to a new patch.

Probably the best compromise.

> >>+/* Return the timer count threshold in microseconds. */
> >>+int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
> >>+				  struct drm_i915_gem_context_param *args)
> >>+{
> >>+	struct drm_i915_private *dev_priv = ctx->i915;
> >>+	struct intel_engine_cs *engine;
> >>+	enum intel_engine_id id;
> >>+	u32 threshold_in_us[I915_NUM_ENGINES];
> >
> >Reading is a 2 step process. First CONTEXT_GET_PARAM passes in size==0
> >and gets told by the kernel what size array to allocate. Second pass
> >supplies a buffer with that size. (Or userspace can preallocate a large
> >enough buffer, declare it's full size and let the kernel fill as much as
> >it wants.)
> >
> >if (args->size == 0)
> >	goto out;
> >
> >if (args->size < sizeof(threshold_in_us))
> >	return -EFAULT;
> EFAULT or EINVAL?

EFAULT. The user buffer is not an absolute fixed requirement, so too
small a buffer is an overrun -> fault.
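Spelled out, that size negotiation looks something like this toy model (names are illustrative and the user-copy plumbing is elided; this is not the actual i915 entry point):

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define NUM_ENGINES 5

/* Toy model of the 2-step read: size == 0 reports the required
 * buffer size; a supplied-but-too-small buffer is an overrun and
 * faults; otherwise the thresholds are copied out in full. */
static int get_watchdog(uint32_t *size, uint32_t *buf,
			const uint32_t thresholds[NUM_ENGINES])
{
	const uint32_t needed = NUM_ENGINES * sizeof(uint32_t);

	if (*size == 0) {		/* step 1: ask for the size */
		*size = needed;
		return 0;
	}

	if (*size < needed)
		return -EFAULT;		/* overrun -> fault */

	memcpy(buf, thresholds, needed);	/* step 2: fill */
	*size = needed;
	return 0;
}
```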
 
> >>+
> >>+	if (!dev_priv->engine[VCS]->emit_start_watchdog)
> >>+		return -ENODEV;
> >>+
> >>+	for_each_engine(engine, dev_priv, id) {
> >>+		struct intel_context *ce = &ctx->engine[id];
> >>+
> >>+		threshold_in_us[id] = watchdog_to_us(ce->watchdog_threshold);
> >>+	}
> >>+
> >>+	mutex_unlock(&dev_priv->drm.struct_mutex);
> >
> >Grr. We should look at why we have this lock here in the first place.
> >IIRC, it was there to make TRTT easier, but we can always move the
> >burden of work again.
> >
> 
> It helps TRTT (that code also takes an extra reference of the
> ctx)... but it all started in i915_gem_context_lookup (499f2697da1d)

It is just laziness that we kept it around the switch, and part of the
reason for that laziness to remain was trtt. Note, we don't need the
mutex around the context lookup (or unref) with a bit of jiggery-pokery.

> >>+	mutex_unlock(&dev_priv->drm.struct_mutex);
> >>+	if (copy_from_user(&threshold_in_us,
> >>+			   u64_to_user_ptr(args->value),
> >>+			   sizeof(threshold_in_us))) {
> >>+		mutex_lock(&dev_priv->drm.struct_mutex);
> >>+		return -EFAULT;
> >>+	}
> >>+	mutex_lock(&dev_priv->drm.struct_mutex);
> >>+
> >>+	/* not supported in blitter engine */
> >>+	if (threshold_in_us[BCS] != 0)
> >>+		return -EINVAL;
> >>+
> >>+	for_each_engine(engine, dev_priv, id) {
> >>+		struct intel_context *ce = &ctx->engine[id];
> >>+
> >>+		ce->watchdog_threshold =
> >>+			watchdog_to_clock_counts((u64)threshold_in_us[id]);
> >
> >Cast is superfluous.
> >
> 
> without it, the operation in watchdog_to_clock_counts was always
> cast to u32.

static inline u32 watchdog_to_clock_counts(u64 value_in_us) ? 

> >>+static inline u32 watchdog_to_clock_counts(u64 value_in_us)
> >>+{
> >>+	u64 threshold = (value_in_us) * (TIMESTAMP_CNTS_PER_USEC);
> >>+
> >>+	if (GEM_WARN_ON(overflows_type(threshold, u32)))
> >
> >Nice idea, and yes we should do the range check. But this is a userspace
> >interface, so always do it and don't warn, just -EINVAL.
> 
> What if it fails/overflows after some engines' thresholds have been
> set? Should it set them back to 0s or leave them enabled?

Yes. Validate userspace first, then apply, so the set of changes is
atomic and the ret is either success or EINVAL.
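i.e. roughly this shape, with nothing written back until every engine's value has been range-checked (illustrative only, not the final code; the BCS != 0 check would sit in the validate pass too):

```c
#include <errno.h>
#include <stdint.h>

#define NUM_ENGINES 5
#define TIMESTAMP_CNTS_PER_USEC 12

/* Validate-then-apply: either every threshold is applied, or the
 * context is left untouched and -EINVAL is returned. */
static int set_watchdog(uint32_t ctx_thresholds[NUM_ENGINES],
			const uint32_t us[NUM_ENGINES])
{
	uint64_t counts[NUM_ENGINES];
	int i;

	for (i = 0; i < NUM_ENGINES; i++) {	/* pass 1: validate */
		counts[i] = (uint64_t)us[i] * TIMESTAMP_CNTS_PER_USEC;
		if (counts[i] > UINT32_MAX)
			return -EINVAL;
	}

	for (i = 0; i < NUM_ENGINES; i++)	/* pass 2: apply */
		ctx_thresholds[i] = (uint32_t)counts[i];

	return 0;
}
```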
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v5 15/18] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-03-27 21:48     ` Michel Thierry
@ 2017-03-28  8:34       ` Chris Wilson
  0 siblings, 0 replies; 41+ messages in thread
From: Chris Wilson @ 2017-03-28  8:34 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Mon, Mar 27, 2017 at 02:48:42PM -0700, Michel Thierry wrote:
> 
> 
> On 25/03/17 02:26, Chris Wilson wrote:
> >On Fri, Mar 24, 2017 at 06:30:07PM -0700, Michel Thierry wrote:
> >>diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> >>index e8faf2c34c97..fffe69f5aed2 100644
> >>--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> >>+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> >>@@ -128,6 +128,7 @@ struct intel_instdone {
> >> struct intel_engine_hangcheck {
> >> 	u64 acthd;
> >> 	u32 seqno;
> >>+	u32 last_watchdog_seqno;
> >
> >Just watchdog will be enough for its meaning to be clear.
> 
> watchdog or watchdog_seqno?

Here, intel_engine_hangcheck.watchdog is unique enough for it not to be
confusing. If we grow more interesting bits for the watchdog, we break
it out into its own substruct. Maybe we should task Mika with filling in
some kerneldoc for struct intel_engine_hangcheck?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-03-28  8:31       ` Chris Wilson
@ 2017-03-28  8:39         ` Chris Wilson
  0 siblings, 0 replies; 41+ messages in thread
From: Chris Wilson @ 2017-03-28  8:39 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx

On Tue, Mar 28, 2017 at 09:31:16AM +0100, Chris Wilson wrote:
> On Mon, Mar 27, 2017 at 06:03:37PM -0700, Michel Thierry wrote:
> > What if it fails/overflows after some engines' thresholds have been
> > set? Should it set them back to 0s or leave them enabled?
> 
> Yes. Validate userspace first, then apply, so the set of changes is
> atomic and the ret is either success or EINVAL.

Oh, it is much worse than that. We can't use engine->id as the UABI
index. We need to fix the exec_id indexing first and rename it to uabi_id.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v5 12/18] drm/i915/guc: Provide register list to be saved/restored during engine reset
  2017-03-25  1:30 ` [PATCH v5 12/18] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
@ 2017-04-14  0:16   ` Daniele Ceraolo Spurio
  2017-04-14  0:30     ` Michel Thierry
  0 siblings, 1 reply; 41+ messages in thread
From: Daniele Ceraolo Spurio @ 2017-04-14  0:16 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx, antonio.argenziano



On 24/03/17 18:30, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>
> GuC expects a list of registers from the driver which are saved/restored
> during engine reset. The type of value to be saved is controlled by
> flags. We provide a minimal set of registers that we want GuC to save and
> restore. This is not an issue in case of engine reset as driver initializes
> most of them following an engine reset, but in case of media reset (aka
> watchdog reset) which is completely internal to GuC (including resubmission
> of hung workload), it is necessary to provide this list, otherwise GuC won't
> be able to schedule further workloads after a reset. This is the minimal
> set of registers identified for things to work as expected but if we see
> any new issues, this register list can be expanded.
>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_guc_submission.c | 50 +++++++++++++++++++++++++++++-
>  1 file changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
> index 991e76e10f82..2445af96aa71 100644
> --- a/drivers/gpu/drm/i915/i915_guc_submission.c
> +++ b/drivers/gpu/drm/i915/i915_guc_submission.c
> @@ -1015,6 +1015,22 @@ static void guc_policies_init(struct guc_policies *policies)
>  	policies->is_valid = 1;
>  }
>
> +/*
> + * In this macro it is highly unlikely to exceed max value but even if we did
> + * it is not an error so just throw a warning and continue. Only side effect
> + * in continuing further means some registers won't be added to save/restore
> + * list.
> + */
> +#define GUC_ADD_MMIO_REG_ADS(node, reg_addr, _flags)			\
> +	do {								\
> +		u32 __count = node->number_of_registers;		\
> +		if (WARN_ON(__count >= GUC_REGSET_MAX_REGISTERS))	\
> +			continue;					\
> +		node->registers[__count].offset = reg_addr.reg;		\
> +		node->registers[__count].flags = (_flags);		\
> +		node->number_of_registers++;				\
> +	} while (0)
> +
>  static int guc_ads_create(struct intel_guc *guc)
>  {
>  	struct drm_i915_private *dev_priv = guc_to_i915(guc);
> @@ -1047,10 +1063,42 @@ static int guc_ads_create(struct intel_guc *guc)
>
>  	/* MMIO reg state */
>  	for_each_engine(engine, dev_priv, id) {
> +		u32 flags;
> +		struct guc_mmio_regset *eng_reg =
> +			&blob->reg_state.engine_reg[engine->guc_id];
> +
> +		/*
> +		 * Provide a list of registers to be saved/restored during gpu
> +		 * reset. This is mainly required for Media reset (aka watchdog
> +		 * timeout) which is completely under the control of GuC
> +		 * (resubmission of hung workload is handled inside GuC).
> +		 */
> +		flags = GUC_REGSET_POWERCYCLE |

This flag doesn't really do anything inside GuC; I guess it is probably 
a remnant of some functionality that has been removed.

> +			GUC_REGSET_ENGINERESET |
> +			GUC_REGSET_SAVE_CURRENT_VALUE;
> +
> +		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_HEAD(engine->mmio_base),
> +				     flags);
> +
> +		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_TAIL(engine->mmio_base),
> +				     flags);

Aren't head & tail saved/restored with the context? There should be no 
need to have GuC restore them. Also, won't restoring them manually 
generate activity on the engine?

> +
> +		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_HWS_PGA(engine->mmio_base),
> +				     flags);
> +
> +		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_MODE_GEN7(engine),
> +				     (flags | GUC_REGSET_MASKED));

GUC_REGSET_MASKED doesn't do anything either (surprise!). Since this is 
actually something that is required, I've chatted with the GuC guys and 
they said they're going to look into re-adding proper functionality. In 
the meantime the suggestion is to use GUC_REGSET_SAVE_DEFAULT_VALUE, in 
which case GuC after the reset will write whatever we set in 
node->registers[__count].value, and we can put an already-masked value 
in there.
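Concretely, the pre-masked value would be built with the usual i915 masked-write idiom, e.g. (sketch only; which bits RING_MODE actually needs after reset is a separate question):

```c
#include <stdint.h>

/* i915's masked registers take the write-enable mask in the upper
 * 16 bits; precomputing (mask << 16) | value lets GuC restore the
 * register verbatim via GUC_REGSET_SAVE_DEFAULT_VALUE, sidestepping
 * the non-functional GUC_REGSET_MASKED flag. */
#define _MASKED_BIT_ENABLE(a)	((uint32_t)(((a) << 16) | (a)))

static uint32_t premasked_default(uint32_t bits)
{
	return _MASKED_BIT_ENABLE(bits);
}
```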

Thanks,
Daniele

> +
> +		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_IMR(engine->mmio_base),
> +				     flags);
> +
> +		DRM_DEBUG_DRIVER("%s register save/restore count: %u\n",
> +				 engine->name, eng_reg->number_of_registers);
> +
> +		 /* Nothing to be white listed for now. */
>  		blob->reg_state.white_list[engine->guc_id].mmio_start =
>  			engine->mmio_base + GUC_MMIO_WHITE_LIST_START;
>
> -		/* Nothing to be saved or restored for now. */
>  		blob->reg_state.white_list[engine->guc_id].count = 0;
>  	}
>
>

* Re: [PATCH v5 12/18] drm/i915/guc: Provide register list to be saved/restored during engine reset
  2017-04-14  0:16   ` Daniele Ceraolo Spurio
@ 2017-04-14  0:30     ` Michel Thierry
  0 siblings, 0 replies; 41+ messages in thread
From: Michel Thierry @ 2017-04-14  0:30 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio, intel-gfx, antonio.argenziano


On 13/04/17 17:16, Daniele Ceraolo Spurio wrote:
>
>
> On 24/03/17 18:30, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>
>> GuC expects a list of registers from the driver which are saved/restored
>> during engine reset. The type of value to be saved is controlled by
>> flags. We provide a minimal set of registers that we want GuC to save and
>> restore. This is not an issue in case of engine reset as driver
>> initializes
>> most of them following an engine reset, but in case of media reset (aka
>> watchdog reset) which is completely internal to GuC (including
>> resubmission
>> of hung workload), it is necessary to provide this list, otherwise GuC
>> won't
>> be able to schedule further workloads after a reset. This is the minimal
>> set of registers identified for things to work as expected but if we see
>> any new issues, this register list can be expanded.
>>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_guc_submission.c | 50
>> +++++++++++++++++++++++++++++-
>>  1 file changed, 49 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c
>> b/drivers/gpu/drm/i915/i915_guc_submission.c
>> index 991e76e10f82..2445af96aa71 100644
>> --- a/drivers/gpu/drm/i915/i915_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/i915_guc_submission.c
>> @@ -1015,6 +1015,22 @@ static void guc_policies_init(struct
>> guc_policies *policies)
>>      policies->is_valid = 1;
>>  }
>>
>> +/*
>> + * In this macro it is highly unlikely to exceed max value but even
>> if we did
>> + * it is not an error so just throw a warning and continue. Only side
>> effect
>> + * in continuing further means some registers won't be added to
>> save/restore
>> + * list.
>> + */
>> +#define GUC_ADD_MMIO_REG_ADS(node, reg_addr, _flags)            \
>> +    do {                                \
>> +        u32 __count = node->number_of_registers;        \
>> +        if (WARN_ON(__count >= GUC_REGSET_MAX_REGISTERS))    \
>> +            continue;                    \
>> +        node->registers[__count].offset = reg_addr.reg;        \
>> +        node->registers[__count].flags = (_flags);        \
>> +        node->number_of_registers++;                \
>> +    } while (0)
>> +
>>  static int guc_ads_create(struct intel_guc *guc)
>>  {
>>      struct drm_i915_private *dev_priv = guc_to_i915(guc);
>> @@ -1047,10 +1063,42 @@ static int guc_ads_create(struct intel_guc *guc)
>>
>>      /* MMIO reg state */
>>      for_each_engine(engine, dev_priv, id) {
>> +        u32 flags;
>> +        struct guc_mmio_regset *eng_reg =
>> +            &blob->reg_state.engine_reg[engine->guc_id];
>> +
>> +        /*
>> +         * Provide a list of registers to be saved/restored during gpu
>> +         * reset. This is mainly required for Media reset (aka watchdog
>> +         * timeout) which is completely under the control of GuC
>> +         * (resubmission of hung workload is handled inside GuC).
>> +         */
>> +        flags = GUC_REGSET_POWERCYCLE |
>
> This flag doesn't really do anything inside GuC, I guess it is probably
> a remnant of some functionality that has been removed.
>
>> +            GUC_REGSET_ENGINERESET |
>> +            GUC_REGSET_SAVE_CURRENT_VALUE;
>> +
>> +        GUC_ADD_MMIO_REG_ADS(eng_reg, RING_HEAD(engine->mmio_base),
>> +                     flags);
>> +
>> +        GUC_ADD_MMIO_REG_ADS(eng_reg, RING_TAIL(engine->mmio_base),
>> +                     flags);
>
> Aren't head & tail context save/restored? there should be no need to
> have GuC restore them. Also, won't restoring them manually generate
> activity on the engine?
>
>> +
>> +        GUC_ADD_MMIO_REG_ADS(eng_reg, RING_HWS_PGA(engine->mmio_base),
>> +                     flags);
>> +
>> +        GUC_ADD_MMIO_REG_ADS(eng_reg, RING_MODE_GEN7(engine),
>> +                     (flags | GUC_REGSET_MASKED));
>
> GUC_REGSET_MASKED doesn't do anything either (surprise!). Since this is

s/doesn't do anything/guc is full of bugs/  :smirk:

> actually something that is required, I've chatted with the GuC guys and
> they said they're going to look into re-adding proper functionality. In
> the meantime the suggestion is to use GUC_REGSET_SAVE_DEFAULT_VALUE, in
> which case the guc after the reset will write whatever we set in
> node->registers[__count].value and we can put an already masked value in
> there.
>
> Thanks,
> Daniele
>

Thanks, I'll add it to the changes in the next version.

>> +
>> +        GUC_ADD_MMIO_REG_ADS(eng_reg, RING_IMR(engine->mmio_base),
>> +                     flags);
>> +
>> +        DRM_DEBUG_DRIVER("%s register save/restore count: %u\n",
>> +                 engine->name, eng_reg->number_of_registers);
>> +
>> +         /* Nothing to be white listed for now. */
>>          blob->reg_state.white_list[engine->guc_id].mmio_start =
>>              engine->mmio_base + GUC_MMIO_WHITE_LIST_START;
>>
>> -        /* Nothing to be saved or restored for now. */
>>          blob->reg_state.white_list[engine->guc_id].count = 0;
>>      }
>>
>>

* Re: [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-03-25  1:30 ` [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
  2017-03-25  9:38   ` Chris Wilson
@ 2017-04-14 16:05   ` Daniele Ceraolo Spurio
  2017-04-14 16:16     ` Chris Wilson
  2017-04-14 16:47     ` Michel Thierry
  1 sibling, 2 replies; 41+ messages in thread
From: Daniele Ceraolo Spurio @ 2017-04-14 16:05 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx



On 24/03/17 18:30, Michel Thierry wrote:
> Final enablement patch for GPU hang detection using watchdog timeout.
> Using the gem_context_setparam ioctl, users can specify the desired
> timeout value in microseconds, and the driver will do the conversion to
> 'timestamps'.
>
> The recommended default watchdog threshold for video engines is 60000 us,
> since this has been _empirically determined_ to be a good compromise for
> low-latency requirements and low rate of false positives. The default
> register value is ~106000us and the theoretical max value (all 1s) is
> 353 seconds.
>
> v2: Fixed get api to return values in microseconds. Threshold updated to
> be per context engine. Check for u32 overflow. Capture ctx threshold
> value in error state.
>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h         |  1 +
>  drivers/gpu/drm/i915/i915_gem_context.c | 78 +++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_gem_context.h | 20 +++++++++
>  drivers/gpu/drm/i915/i915_gpu_error.c   | 11 +++--
>  drivers/gpu/drm/i915/intel_lrc.c        |  2 +-
>  include/uapi/drm/i915_drm.h             |  1 +
>  6 files changed, 108 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index b43c37a911bb..1741584d858f 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1039,6 +1039,7 @@ struct i915_gpu_state {
>  			int ban_score;
>  			int active;
>  			int guilty;
> +			int watchdog_threshold;
>  		} context;
>
>  		struct drm_i915_error_object {
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index edbed85a1c88..f5c126c0c681 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -422,6 +422,78 @@ i915_gem_context_create_gvt(struct drm_device *dev)
>  	return ctx;
>  }
>
> +/* Return the timer count threshold in microseconds. */
> +int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
> +				  struct drm_i915_gem_context_param *args)
> +{
> +	struct drm_i915_private *dev_priv = ctx->i915;
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	u32 threshold_in_us[I915_NUM_ENGINES];
> +
> +	if (!dev_priv->engine[VCS]->emit_start_watchdog)
> +		return -ENODEV;
> +
> +	for_each_engine(engine, dev_priv, id) {
> +		struct intel_context *ce = &ctx->engine[id];
> +
> +		threshold_in_us[id] = watchdog_to_us(ce->watchdog_threshold);
> +	}
> +
> +	mutex_unlock(&dev_priv->drm.struct_mutex);
> +	if (__copy_to_user(u64_to_user_ptr(args->value),
> +			   &threshold_in_us,
> +			   sizeof(threshold_in_us))) {
> +		mutex_lock(&dev_priv->drm.struct_mutex);
> +		return -EFAULT;
> +	}
> +	mutex_lock(&dev_priv->drm.struct_mutex);
> +	args->size = sizeof(threshold_in_us);
> +
> +	return 0;
> +}
> +
> +/*
> + * Based on time out value in microseconds (us) calculate
> + * timer count thresholds needed based on core frequency.
> + * Watchdog can be disabled by setting it to 0.
> + */
> +int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
> +				  struct drm_i915_gem_context_param *args)
> +{
> +	struct drm_i915_private *dev_priv = ctx->i915;
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	u32 threshold_in_us[I915_NUM_ENGINES];
> +
> +	if (!dev_priv->engine[VCS]->emit_start_watchdog)
> +		return -ENODEV;
> +	else if (args->size < sizeof(threshold_in_us))
> +		return -EINVAL;

won't we break userspace with this check if we ever get more engines on 
a new platform and thus bump I915_NUM_ENGINES?

Thanks,
Daniele

> +
> +	mutex_unlock(&dev_priv->drm.struct_mutex);
> +	if (copy_from_user(&threshold_in_us,
> +			   u64_to_user_ptr(args->value),
> +			   sizeof(threshold_in_us))) {
> +		mutex_lock(&dev_priv->drm.struct_mutex);
> +		return -EFAULT;
> +	}
> +	mutex_lock(&dev_priv->drm.struct_mutex);
> +
> +	/* not supported in blitter engine */
> +	if (threshold_in_us[BCS] != 0)
> +		return -EINVAL;
> +
> +	for_each_engine(engine, dev_priv, id) {
> +		struct intel_context *ce = &ctx->engine[id];
> +
> +		ce->watchdog_threshold =
> +			watchdog_to_clock_counts((u64)threshold_in_us[id]);
> +	}
> +
> +	return 0;
> +}
> +
>  int i915_gem_context_init(struct drm_i915_private *dev_priv)
>  {
>  	struct i915_gem_context *ctx;
> @@ -1061,6 +1133,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  	case I915_CONTEXT_PARAM_BANNABLE:
>  		args->value = i915_gem_context_is_bannable(ctx);
>  		break;
> +	case I915_CONTEXT_PARAM_WATCHDOG:
> +		ret = i915_gem_context_get_watchdog(ctx, args);
> +		break;
>  	default:
>  		ret = -EINVAL;
>  		break;
> @@ -1118,6 +1193,9 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>  		else
>  			i915_gem_context_clear_bannable(ctx);
>  		break;
> +	case I915_CONTEXT_PARAM_WATCHDOG:
> +		ret = i915_gem_context_set_watchdog(ctx, args);
> +		break;
>  	default:
>  		ret = -EINVAL;
>  		break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index 88700bdbb4e1..6867b1fead8b 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -250,6 +250,26 @@ static inline bool i915_gem_context_is_kernel(struct i915_gem_context *ctx)
>  	return !ctx->file_priv;
>  }
>
> +/*
> + * Timestamp timer resolution = 0.080 uSec,
> + * or 12500000 counts per second, or ~12 counts per microsecond.
> + */
> +#define TIMESTAMP_CNTS_PER_USEC 12
> +static inline u32 watchdog_to_us(u32 value_in_clock_counts)
> +{
> +	return (value_in_clock_counts) / (TIMESTAMP_CNTS_PER_USEC);
> +}
> +
> +static inline u32 watchdog_to_clock_counts(u64 value_in_us)
> +{
> +	u64 threshold = (value_in_us) * (TIMESTAMP_CNTS_PER_USEC);
> +
> +	if (GEM_WARN_ON(overflows_type(threshold, u32)))
> +		return 0;
> +
> +	return threshold;
> +}
> +
>  /* i915_gem_context.c */
>  int __must_check i915_gem_context_init(struct drm_i915_private *dev_priv);
>  void i915_gem_context_lost(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 5d015bcc7484..8acb83778030 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -388,9 +388,10 @@ static void error_print_context(struct drm_i915_error_state_buf *m,
>  				const char *header,
>  				const struct drm_i915_error_context *ctx)
>  {
> -	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, ban score %d guilty %d active %d\n",
> +	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, ban score %d guilty %d active %d, watchdog %dus\n",
>  		   header, ctx->comm, ctx->pid, ctx->handle, ctx->hw_id,
> -		   ctx->ban_score, ctx->guilty, ctx->active);
> +		   ctx->ban_score, ctx->guilty, ctx->active,
> +		   watchdog_to_us(ctx->watchdog_threshold));
>  }
>
>  static void error_print_engine(struct drm_i915_error_state_buf *m,
> @@ -1336,7 +1337,8 @@ static void error_record_engine_execlists(struct intel_engine_cs *engine,
>  }
>
>  static void record_context(struct drm_i915_error_context *e,
> -			   struct i915_gem_context *ctx)
> +			   struct i915_gem_context *ctx,
> +			   u32 engine_id)
>  {
>  	if (ctx->pid) {
>  		struct task_struct *task;
> @@ -1355,6 +1357,7 @@ static void record_context(struct drm_i915_error_context *e,
>  	e->ban_score = ctx->ban_score;
>  	e->guilty = ctx->guilty_count;
>  	e->active = ctx->active_count;
> +	e->watchdog_threshold =	ctx->engine[engine_id].watchdog_threshold;
>  }
>
>  static void i915_gem_record_rings(struct drm_i915_private *dev_priv,
> @@ -1389,7 +1392,7 @@ static void i915_gem_record_rings(struct drm_i915_private *dev_priv,
>  			ee->vm = request->ctx->ppgtt ?
>  				&request->ctx->ppgtt->base : &ggtt->base;
>
> -			record_context(&ee->context, request->ctx);
> +			record_context(&ee->context, request->ctx, engine->id);
>
>  			/* We need to copy these to an anonymous buffer
>  			 * as the simplest method to avoid being overwritten
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 2736f642dc76..3f2b57a22338 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1496,7 +1496,7 @@ static int gen8_emit_flush_render(struct drm_i915_gem_request *request,
>  	return 0;
>  }
>
> -#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
> +#define GEN8_WATCHDOG_1000US watchdog_to_clock_counts(1000)
>  static void gen8_watchdog_irq_handler(unsigned long data)
>  {
>  	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index f083931a7809..448c9c0faa69 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1293,6 +1293,7 @@ struct drm_i915_gem_context_param {
>  #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
>  #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
>  #define I915_CONTEXT_PARAM_BANNABLE	0x5
> +#define I915_CONTEXT_PARAM_WATCHDOG	0x6
>  	__u64 value;
>  };
>
>

* Re: [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-04-14 16:05   ` Daniele Ceraolo Spurio
@ 2017-04-14 16:16     ` Chris Wilson
  2017-04-14 16:47     ` Michel Thierry
  1 sibling, 0 replies; 41+ messages in thread
From: Chris Wilson @ 2017-04-14 16:16 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio; +Cc: intel-gfx

On Fri, Apr 14, 2017 at 09:05:06AM -0700, Daniele Ceraolo Spurio wrote:
> On 24/03/17 18:30, Michel Thierry wrote:
> >+/*
> >+ * Based on time out value in microseconds (us) calculate
> >+ * timer count thresholds needed based on core frequency.
> >+ * Watchdog can be disabled by setting it to 0.
> >+ */
> >+int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
> >+				  struct drm_i915_gem_context_param *args)
> >+{
> >+	struct drm_i915_private *dev_priv = ctx->i915;
> >+	struct intel_engine_cs *engine;
> >+	enum intel_engine_id id;
> >+	u32 threshold_in_us[I915_NUM_ENGINES];
> >+
> >+	if (!dev_priv->engine[VCS]->emit_start_watchdog)
> >+		return -ENODEV;
> >+	else if (args->size < sizeof(threshold_in_us))
> >+		return -EINVAL;
> 
> won't we break userspace with this check if we ever get more engines
> on a new platform and thus bump I915_NUM_ENGINES?

Not really. Userspace uses a 2 step process to first determine the array
length it needs to pass to the kernel. It will just fill those rings it
doesn't know about with 0.

The alternative to using a fixed length array is using an array of
(engine-id, threshold) pairs. Which is probably going to be more
convenient to userspace.
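For what it's worth, the pair-based variant could look like this (hypothetical UAPI layout, names made up), which keeps I915_NUM_ENGINES out of the ABI entirely:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical (engine-id, threshold) pair: userspace passes only
 * the engines it cares about, with the element count derivable from
 * args->size, so new engines on future platforms don't change the
 * expected array length. */
struct watchdog_pair {
	uint32_t engine_id;	/* uabi engine id, not engine->id */
	uint32_t threshold_us;	/* 0 disables the watchdog */
};
```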
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-04-14 16:05   ` Daniele Ceraolo Spurio
  2017-04-14 16:16     ` Chris Wilson
@ 2017-04-14 16:47     ` Michel Thierry
  1 sibling, 0 replies; 41+ messages in thread
From: Michel Thierry @ 2017-04-14 16:47 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio, intel-gfx



On 14/04/17 09:05, Daniele Ceraolo Spurio wrote:
>
>
> On 24/03/17 18:30, Michel Thierry wrote:
>> Final enablement patch for GPU hang detection using watchdog timeout.
>> Using the gem_context_setparam ioctl, users can specify the desired
>> timeout value in microseconds, and the driver will do the conversion to
>> 'timestamps'.
>>
>> The recommended default watchdog threshold for video engines is 60000 us,
>> since this has been _empirically determined_ to be a good compromise for
>> low-latency requirements and low rate of false positives. The default
>> register value is ~106000us and the theoretical max value (all 1s) is
>> 353 seconds.
>>
>> v2: Fixed get api to return values in microseconds. Threshold updated to
>> be per context engine. Check for u32 overflow. Capture ctx threshold
>> value in error state.
>>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.h         |  1 +
>>  drivers/gpu/drm/i915/i915_gem_context.c | 78
>> +++++++++++++++++++++++++++++++++
>>  drivers/gpu/drm/i915/i915_gem_context.h | 20 +++++++++
>>  drivers/gpu/drm/i915/i915_gpu_error.c   | 11 +++--
>>  drivers/gpu/drm/i915/intel_lrc.c        |  2 +-
>>  include/uapi/drm/i915_drm.h             |  1 +
>>  6 files changed, 108 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h
>> b/drivers/gpu/drm/i915/i915_drv.h
>> index b43c37a911bb..1741584d858f 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -1039,6 +1039,7 @@ struct i915_gpu_state {
>>              int ban_score;
>>              int active;
>>              int guilty;
>> +            int watchdog_threshold;
>>          } context;
>>
>>          struct drm_i915_error_object {
>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
>> b/drivers/gpu/drm/i915/i915_gem_context.c
>> index edbed85a1c88..f5c126c0c681 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>> @@ -422,6 +422,78 @@ i915_gem_context_create_gvt(struct drm_device *dev)
>>      return ctx;
>>  }
>>
>> +/* Return the timer count threshold in microseconds. */
>> +int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
>> +                  struct drm_i915_gem_context_param *args)
>> +{
>> +    struct drm_i915_private *dev_priv = ctx->i915;
>> +    struct intel_engine_cs *engine;
>> +    enum intel_engine_id id;
>> +    u32 threshold_in_us[I915_NUM_ENGINES];
>> +
>> +    if (!dev_priv->engine[VCS]->emit_start_watchdog)
>> +        return -ENODEV;
>> +
>> +    for_each_engine(engine, dev_priv, id) {
>> +        struct intel_context *ce = &ctx->engine[id];
>> +
>> +        threshold_in_us[id] = watchdog_to_us(ce->watchdog_threshold);
>> +    }
>> +
>> +    mutex_unlock(&dev_priv->drm.struct_mutex);
>> +    if (__copy_to_user(u64_to_user_ptr(args->value),
>> +               &threshold_in_us,
>> +               sizeof(threshold_in_us))) {
>> +        mutex_lock(&dev_priv->drm.struct_mutex);
>> +        return -EFAULT;
>> +    }
>> +    mutex_lock(&dev_priv->drm.struct_mutex);
>> +    args->size = sizeof(threshold_in_us);
>> +
>> +    return 0;
>> +}
>> +
>> +/*
>> + * Based on time out value in microseconds (us) calculate
>> + * timer count thresholds needed based on core frequency.
>> + * Watchdog can be disabled by setting it to 0.
>> + */
>> +int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
>> +                  struct drm_i915_gem_context_param *args)
>> +{
>> +    struct drm_i915_private *dev_priv = ctx->i915;
>> +    struct intel_engine_cs *engine;
>> +    enum intel_engine_id id;
>> +    u32 threshold_in_us[I915_NUM_ENGINES];
>> +
>> +    if (!dev_priv->engine[VCS]->emit_start_watchdog)
>> +        return -ENODEV;
>> +    else if (args->size < sizeof(threshold_in_us))
>> +        return -EINVAL;
>
> won't we break userspace with this check if we ever get more engines on
> a new platform and thus bump I915_NUM_ENGINES?
>
> Thanks,
> Daniele
>

There's a v3 of this patch incorporating Chris' feedback:
https://patchwork.freedesktop.org/patch/148805/
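For reference, the microseconds-to-'timestamps' conversion the commit message describes can be sketched as below. The 80 ns tick period is an assumed illustrative value only (the real driver derives the conversion factor from the platform's command-streamer timestamp frequency), and the helper names merely mirror the watchdog_to_us() naming used in the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed timestamp tick period for illustration: 80 ns (12.5 MHz).
 * The actual value is platform-dependent. */
#define TS_PERIOD_NS 80ULL

/* Microseconds -> timer counts, i.e. what the driver would program. */
static uint32_t us_to_watchdog(uint32_t us)
{
	return (uint32_t)((us * 1000ULL) / TS_PERIOD_NS);
}

/* Timer counts -> microseconds, i.e. what getparam would report back. */
static uint32_t watchdog_to_us(uint32_t counts)
{
	return (uint32_t)((counts * TS_PERIOD_NS) / 1000ULL);
}
```

Under these assumptions the recommended 60000 us video-engine default round-trips exactly through the pair of helpers, so the get API can report back precisely what the user set.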

* Re: [PATCH v5 14/18] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load
  2017-03-25  1:30 ` [PATCH v5 14/18] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
  2017-03-25  9:15   ` Chris Wilson
@ 2017-04-17 21:28   ` Daniele Ceraolo Spurio
  2017-04-17 22:31     ` Michel Thierry
  1 sibling, 1 reply; 41+ messages in thread
From: Daniele Ceraolo Spurio @ 2017-04-17 21:28 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx



On 24/03/17 18:30, Michel Thierry wrote:
> For watchdog / media reset, the firmware must know the address of the shared
> data page (the first page of the default context).
>
> This information should be in DWORD 9 of the GUC_CTL structure.
>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_guc_fwif.h   | 2 +-
>  drivers/gpu/drm/i915/intel_guc_loader.c | 8 ++++++++
>  2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
> index b627206b8f56..5db3def5f74e 100644
> --- a/drivers/gpu/drm/i915/intel_guc_fwif.h
> +++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
> @@ -135,7 +135,7 @@
>  #define   GUC_ADS_ADDR_SHIFT		11
>  #define   GUC_ADS_ADDR_MASK		0xfffff800
>
> -#define GUC_CTL_RSRVD			9
> +#define GUC_CTL_SHARED_DATA		9
>
>  #define GUC_CTL_MAX_DWORDS		(SOFT_SCRATCH_COUNT - 2) /* [1..14] */
>
> diff --git a/drivers/gpu/drm/i915/intel_guc_loader.c b/drivers/gpu/drm/i915/intel_guc_loader.c
> index 7d92321f8731..afa584864cb5 100644
> --- a/drivers/gpu/drm/i915/intel_guc_loader.c
> +++ b/drivers/gpu/drm/i915/intel_guc_loader.c
> @@ -119,6 +119,7 @@ static void guc_params_init(struct drm_i915_private *dev_priv)
>  {
>  	struct intel_guc *guc = &dev_priv->guc;
>  	u32 params[GUC_CTL_MAX_DWORDS];
> +	struct i915_gem_context *ctx;
>  	int i;
>
>  	memset(&params, 0, sizeof(params));
> @@ -167,6 +168,13 @@ static void guc_params_init(struct drm_i915_private *dev_priv)
>  		params[GUC_CTL_FEATURE] &= ~GUC_CTL_DISABLE_SCHEDULER;
>  	}
>
> +	/*
> +	 * For watchdog / media reset, GuC must know the address of the shared
> +	 * data page, which is the first page of the default context.
> +	 */
> +	ctx = dev_priv->kernel_context;
> +	params[GUC_CTL_SHARED_DATA] = i915_ggtt_offset(ctx->engine[RCS].state);

Since we use this page in several places (here, in the engine reset h2g 
and if I'm not mistaken also in the preemption h2g), is it worth saving 
its ggtt offset inside the guc struct to make things cleaner?
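A sketch of that suggestion, with mocked-up stand-ins for the driver types (the shared_data_offset field name is hypothetical, not something from this series):

```c
#include <assert.h>
#include <stdint.h>

/* Mocked-up stand-ins for the driver structures, for illustration only. */
struct intel_guc {
	uint32_t shared_data_offset;	/* hypothetical cached GGTT offset */
};

struct drm_i915_private {
	struct intel_guc guc;
};

/* Record the offset once, e.g. when the kernel context is pinned... */
static void guc_cache_shared_data(struct drm_i915_private *i915,
				  uint32_t kernel_ctx_ggtt_offset)
{
	i915->guc.shared_data_offset = kernel_ctx_ggtt_offset;
}

/* ...so every user (params init, reset h2g, preemption h2g) reads the
 * cached value instead of re-deriving it from the kernel context. */
static uint32_t guc_shared_data(const struct drm_i915_private *i915)
{
	return i915->guc.shared_data_offset;
}
```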

Regards,
Daniele

> +
>  	I915_WRITE(SOFT_SCRATCH(0), 0);
>
>  	for (i = 0; i < GUC_CTL_MAX_DWORDS; i++)
>

* Re: [PATCH v5 14/18] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load
  2017-04-17 21:28   ` Daniele Ceraolo Spurio
@ 2017-04-17 22:31     ` Michel Thierry
  0 siblings, 0 replies; 41+ messages in thread
From: Michel Thierry @ 2017-04-17 22:31 UTC (permalink / raw)
  To: Daniele Ceraolo Spurio, intel-gfx


On 17/04/17 14:28, Daniele Ceraolo Spurio wrote:
>
>
> On 24/03/17 18:30, Michel Thierry wrote:
>> For watchdog / media reset, the firmware must know the address of the
>> shared
>> data page (the first page of the default context).
>>
>> This information should be in DWORD 9 of the GUC_CTL structure.
>>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/intel_guc_fwif.h   | 2 +-
>>  drivers/gpu/drm/i915/intel_guc_loader.c | 8 ++++++++
>>  2 files changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h
>> b/drivers/gpu/drm/i915/intel_guc_fwif.h
>> index b627206b8f56..5db3def5f74e 100644
>> --- a/drivers/gpu/drm/i915/intel_guc_fwif.h
>> +++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
>> @@ -135,7 +135,7 @@
>>  #define   GUC_ADS_ADDR_SHIFT        11
>>  #define   GUC_ADS_ADDR_MASK        0xfffff800
>>
>> -#define GUC_CTL_RSRVD            9
>> +#define GUC_CTL_SHARED_DATA        9
>>
>>  #define GUC_CTL_MAX_DWORDS        (SOFT_SCRATCH_COUNT - 2) /* [1..14] */
>>
>> diff --git a/drivers/gpu/drm/i915/intel_guc_loader.c
>> b/drivers/gpu/drm/i915/intel_guc_loader.c
>> index 7d92321f8731..afa584864cb5 100644
>> --- a/drivers/gpu/drm/i915/intel_guc_loader.c
>> +++ b/drivers/gpu/drm/i915/intel_guc_loader.c
>> @@ -119,6 +119,7 @@ static void guc_params_init(struct
>> drm_i915_private *dev_priv)
>>  {
>>      struct intel_guc *guc = &dev_priv->guc;
>>      u32 params[GUC_CTL_MAX_DWORDS];
>> +    struct i915_gem_context *ctx;
>>      int i;
>>
>>      memset(&params, 0, sizeof(params));
>> @@ -167,6 +168,13 @@ static void guc_params_init(struct
>> drm_i915_private *dev_priv)
>>          params[GUC_CTL_FEATURE] &= ~GUC_CTL_DISABLE_SCHEDULER;
>>      }
>>
>> +    /*
>> +     * For watchdog / media reset, GuC must know the address of the
>> shared
>> +     * data page, which is the first page of the default context.
>> +     */
>> +    ctx = dev_priv->kernel_context;
>> +    params[GUC_CTL_SHARED_DATA] =
>> i915_ggtt_offset(ctx->engine[RCS].state);
>
> Since we use this page in several places (here, in the engine reset h2g
> and if I'm not mistaken also in the preemption h2g), is it worth saving
> its ggtt offset inside the guc struct to make things cleaner?
>

In suspend/resume too.

>
>> +
>>      I915_WRITE(SOFT_SCRATCH(0), 0);
>>
>>      for (i = 0; i < GUC_CTL_MAX_DWORDS; i++)
>>

end of thread, other threads:[~2017-04-17 22:31 UTC | newest]

Thread overview: 41+ messages
2017-03-25  1:29 [PATCH v5 00/18] Gen8+ engine-reset Michel Thierry
2017-03-25  1:29 ` [PATCH v5 01/18] drm/i915: Fix stale comment about I915_RESET_IN_PROGRESS flag Michel Thierry
2017-03-25  1:29 ` [PATCH v5 02/18] drm/i915: Rename gen8_(un)request_engine_reset to gen8_(un)request_reset_engine Michel Thierry
2017-03-25  9:01   ` Chris Wilson
2017-03-27 20:35     ` Michel Thierry
2017-03-25  1:29 ` [PATCH v5 03/18] drm/i915: Update i915.reset to handle engine resets Michel Thierry
2017-03-25  1:29 ` [PATCH v5 04/18] drm/i915/tdr: Modify error handler for per engine hang recovery Michel Thierry
2017-03-25  9:10   ` Chris Wilson
2017-03-25  1:29 ` [PATCH v5 05/18] drm/i915/tdr: Add support for per engine reset recovery Michel Thierry
2017-03-25  1:29 ` [PATCH v5 06/18] drm/i915: Skip reset request if there is one already Michel Thierry
2017-03-25  1:29 ` [PATCH v5 07/18] drm/i915/tdr: Add engine reset count to error state Michel Thierry
2017-03-25  1:30 ` [PATCH v5 08/18] drm/i915/tdr: Export per-engine reset count info to debugfs Michel Thierry
2017-03-25  1:30 ` [PATCH v5 09/18] drm/i915/tdr: Enable Engine reset and recovery support Michel Thierry
2017-03-25  1:30 ` [PATCH v5 10/18] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
2017-03-25  1:30 ` [PATCH v5 11/18] drm/i915/selftests: reset engine self tests Michel Thierry
2017-03-25  9:50   ` Chris Wilson
2017-03-25  1:30 ` [PATCH v5 12/18] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
2017-04-14  0:16   ` Daniele Ceraolo Spurio
2017-04-14  0:30     ` Michel Thierry
2017-03-25  1:30 ` [PATCH v5 13/18] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
2017-03-25  1:30 ` [PATCH v5 14/18] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
2017-03-25  9:15   ` Chris Wilson
2017-04-17 21:28   ` Daniele Ceraolo Spurio
2017-04-17 22:31     ` Michel Thierry
2017-03-25  1:30 ` [PATCH v5 15/18] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
2017-03-25  9:26   ` Chris Wilson
2017-03-27 21:48     ` Michel Thierry
2017-03-28  8:34       ` Chris Wilson
2017-03-25  1:30 ` [PATCH v5 16/18] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
2017-03-25  9:46   ` Chris Wilson
2017-03-25  1:30 ` [PATCH v5 17/18] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
2017-03-25  9:38   ` Chris Wilson
2017-03-28  1:03     ` Michel Thierry
2017-03-28  8:31       ` Chris Wilson
2017-03-28  8:39         ` Chris Wilson
2017-04-14 16:05   ` Daniele Ceraolo Spurio
2017-04-14 16:16     ` Chris Wilson
2017-04-14 16:47     ` Michel Thierry
2017-03-25  1:30 ` [PATCH v5 18/18] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
2017-03-25  9:12   ` Chris Wilson
2017-03-25  1:50 ` ✗ Fi.CI.BAT: failure for Gen8+ engine-reset Patchwork
