All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v8 00/20] Gen8+ engine-reset
@ 2017-05-22 17:46 Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 01/20] drm/i915: Look for active requests earlier in the reset path Michel Thierry
                   ` (21 more replies)
  0 siblings, 22 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

These patches add the reset-engine feature from Gen8. This is also
referred to as Timeout detection and recovery (TDR). This complements to
the full gpu reset feature available in i915 but it only allows to reset a
particular engine instead of all engines thus providing a light weight
engine reset and recovery mechanism.

Thanks to recent changes merged, this implementation is now not only for
execlists, but for GuC based submission too; it is still limited from
Gen8 onwards. I have also included the changes for watchdog timeout
detection. The GuC related patches are functional, but can be seen as RFC.

Timeout detection relies on the existing hangcheck, which remains the same;
main changes are to the recovery mechanism. Once we detect a hang on a
particular engine we identify the request that caused the hang, skip the
request and adjust head pointers to allow the execution to proceed
normally. After some cleanup, submissions are restarted to process
remaining work queued to that engine.

If engine reset fails to recover engine correctly then we fallback to full
gpu reset.

We can argue about the effectiveness of reset-engine vs full reset when
more than one ring is hung, but the benefits of just resetting one engine
are reduced when the driver has to do it multiple times.

v2: ELSP queue request tracking and reset path changes to handle incomplete
requests during reset. Thanks to Chris Wilson for providing these patches.

v3: Let the waiter keep handling the full gpu reset if it already has the
lock; point out that GuC submission needs a different method to restart
workloads after the engine reset completes.

v4: Handle reset as 2 level resets, by first going to engine only and fall
backing to full/chip reset as needed, i.e. reset_engine will need the
struct_mutex.

v5: Rebased after reset flag split in 2, add GuC support, include watchdog
detection patches, addressing comments from prev RFC.

v6: Mutex-less reset engine. Updates in watchdog abi and guc whitelist &
register-restore fixes (including an old patch from Daniele).

v7: Removed leftovers from v5; review comments; ability to cancel the reset
if there's no active request.

v8: Moved patch looking for active request at the beginning of these series;
drop patch to request the engine's reset readinnes earlier since this is
done during the reset flow too; warn if hw fails to acknowledge reset
_readiness_.

Daniele Ceraolo Spurio (1):
  drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder

Michel Thierry (19):
  drm/i915: Look for active requests earlier in the reset path
  drm/i915: Update i915.reset to handle engine resets
  drm/i915: Modify error handler for per engine hang recovery
  drm/i915: Add support for per engine reset recovery
  drm/i915: Add engine reset count to error state
  drm/i915: Export per-engine reset count info to debugfs
  drm/i915: Carry on with reset even if hw engine is not ready
  drm/i915: Enable Engine reset and recovery support
  drm/i915: Add engine reset count in get-reset-stats ioctl
  drm/i915/selftests: reset engine self tests
  drm/i915/guc: Provide register list to be saved/restored during engine
    reset
  drm/i915/guc: Rename the function that resets the GuC
  drm/i915/guc: Add support for reset engine using GuC commands
  drm/i915: Watchdog timeout: Pass GuC shared data structure during
    param load
  drm/i915: Watchdog timeout: IRQ handler for gen8+
  drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  drm/i915: Watchdog timeout: Include threshold value in error state
  drm/i915: Watchdog timeout: Export media reset count from GuC to
    debugfs

 drivers/gpu/drm/i915/i915_debugfs.c              |  43 +++++++
 drivers/gpu/drm/i915/i915_drv.c                  |  74 +++++++++++
 drivers/gpu/drm/i915/i915_drv.h                  |  90 ++++++++++++-
 drivers/gpu/drm/i915/i915_gem.c                  | 104 +++++++++------
 drivers/gpu/drm/i915/i915_gem_context.c          | 109 +++++++++++++++-
 drivers/gpu/drm/i915/i915_gem_context.h          |   4 +
 drivers/gpu/drm/i915/i915_gpu_error.c            |  15 ++-
 drivers/gpu/drm/i915/i915_guc_submission.c       | 138 ++++++++++++++++++--
 drivers/gpu/drm/i915/i915_irq.c                  |  36 +++++-
 drivers/gpu/drm/i915/i915_params.c               |   6 +-
 drivers/gpu/drm/i915/i915_params.h               |   2 +-
 drivers/gpu/drm/i915/i915_pci.c                  |   5 +-
 drivers/gpu/drm/i915/i915_reg.h                  |   6 +
 drivers/gpu/drm/i915/intel_engine_cs.c           |  65 +++++++---
 drivers/gpu/drm/i915/intel_guc_fwif.h            |  27 +++-
 drivers/gpu/drm/i915/intel_guc_loader.c          |  11 ++
 drivers/gpu/drm/i915/intel_hangcheck.c           |  13 +-
 drivers/gpu/drm/i915/intel_lrc.c                 | 155 ++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h          |   9 ++
 drivers/gpu/drm/i915/intel_uc.c                  |   4 +-
 drivers/gpu/drm/i915/intel_uc.h                  |   3 +
 drivers/gpu/drm/i915/intel_uncore.c              |  16 ++-
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 149 ++++++++++++++++++++++
 include/uapi/drm/i915_drm.h                      |   7 +-
 24 files changed, 986 insertions(+), 105 deletions(-)

-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v8 01/20] drm/i915: Look for active requests earlier in the reset path
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 02/20] drm/i915: Update i915.reset to handle engine resets Michel Thierry
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

And store the active request so that we only search for it once.

v2: Check for request completion inside _prepare_engine, don't use
ECANCELED, remove unnecessary null checks (Chris).

v3: Capture active requests during reset_prepare and store it the
engine hangcheck obj.

v4: Rename commit, change i915_gem_reset_request to just confirm the
active_request is still incomplete, instead of engine_stalled (Chris).

v5: With style; pass the active request to gem_reset_engine, keep single
return in reset_prepare_engine (Chris).

v6: Moved before reset-engine code appears (Chris)

Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> (v5)
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c         | 14 +++++++-------
 drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index a637cc05cc4a..8207476b0c35 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2801,7 +2801,7 @@ int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
 
 	/* Ensure irq handler finishes, and not run again. */
 	for_each_engine(engine, dev_priv, id) {
-		struct drm_i915_gem_request *request;
+		struct drm_i915_gem_request *request = NULL;
 
 		/* Prevent the signaler thread from updating the request
 		 * state (by calling dma_fence_signal) as we are processing
@@ -2833,6 +2833,8 @@ int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
 			if (request && request->fence.error == -EIO)
 				err = -EIO; /* Previous reset failed! */
 		}
+
+		engine->hangcheck.active_request = request;
 	}
 
 	i915_gem_revoke_fences(dev_priv);
@@ -2886,7 +2888,7 @@ static void engine_skip_context(struct drm_i915_gem_request *request)
 static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 {
 	/* Read once and return the resolution */
-	const bool guilty = engine_stalled(request->engine);
+	const bool guilty = !i915_gem_request_completed(request);
 
 	/* The guilty request will get skipped on a hung engine.
 	 *
@@ -2920,11 +2922,9 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 	return guilty;
 }
 
-static void i915_gem_reset_engine(struct intel_engine_cs *engine)
+static void i915_gem_reset_engine(struct intel_engine_cs *engine,
+				  struct drm_i915_gem_request *request)
 {
-	struct drm_i915_gem_request *request;
-
-	request = i915_gem_find_active_request(engine);
 	if (request && i915_gem_reset_request(request)) {
 		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
 				 engine->name, request->global_seqno);
@@ -2950,7 +2950,7 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
 	for_each_engine(engine, dev_priv, id) {
 		struct i915_gem_context *ctx;
 
-		i915_gem_reset_engine(engine);
+		i915_gem_reset_engine(engine, engine->hangcheck.active_request);
 		ctx = fetch_and_zero(&engine->last_retired_context);
 		if (ctx)
 			engine->context_unpin(engine, ctx);
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 6aa20ac8cde3..d33c93444c0d 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -121,6 +121,7 @@ struct intel_engine_hangcheck {
 	unsigned long action_timestamp;
 	int deadlock;
 	struct intel_instdone instdone;
+	struct drm_i915_gem_request *active_request;
 	bool stalled;
 };
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 02/20] drm/i915: Update i915.reset to handle engine resets
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 01/20] drm/i915: Look for active requests earlier in the reset path Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery Michel Thierry
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

In preparation for engine reset work update this parameter to handle more
than one type of reset. Default at the moment is still full gpu reset.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 6 +++---
 drivers/gpu/drm/i915/i915_params.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index b6a7e363d076..045cadb77285 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -46,7 +46,7 @@ struct i915_params i915 __read_mostly = {
 	.prefault_disable = 0,
 	.load_detect_test = 0,
 	.force_reset_modeset_test = 0,
-	.reset = true,
+	.reset = 1,
 	.error_capture = true,
 	.invert_brightness = 0,
 	.disable_display = 0,
@@ -115,8 +115,8 @@ MODULE_PARM_DESC(vbt_sdvo_panel_type,
 	"Override/Ignore selection of SDVO panel mode in the VBT "
 	"(-2=ignore, -1=auto [default], index in VBT BIOS table)");
 
-module_param_named_unsafe(reset, i915.reset, bool, 0600);
-MODULE_PARM_DESC(reset, "Attempt GPU resets (default: true)");
+module_param_named_unsafe(reset, i915.reset, int, 0600);
+MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset [default], 2=engine reset)");
 
 #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
 module_param_named(error_capture, i915.error_capture, bool, 0600);
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index 34148cc8637c..febbfdbd30bd 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -51,6 +51,7 @@
 	func(int, use_mmio_flip); \
 	func(int, mmio_debug); \
 	func(int, edp_vswing); \
+	func(int, reset); \
 	func(unsigned int, inject_load_failure); \
 	/* leave bools at the end to not create holes */ \
 	func(bool, alpha_support); \
@@ -60,7 +61,6 @@
 	func(bool, prefault_disable); \
 	func(bool, load_detect_test); \
 	func(bool, force_reset_modeset_test); \
-	func(bool, reset); \
 	func(bool, error_capture); \
 	func(bool, disable_display); \
 	func(bool, verbose_state_checks); \
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 01/20] drm/i915: Look for active requests earlier in the reset path Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 02/20] drm/i915: Update i915.reset to handle engine resets Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-06-02 20:16   ` Chris Wilson
  2017-06-06  0:45   ` [PATCH v9] " Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 04/20] drm/i915: Add support for per engine reset recovery Michel Thierry
                   ` (18 subsequent siblings)
  21 siblings, 2 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

This is a preparatory patch which modifies error handler to do per engine
hang recovery. The actual patch which implements this sequence follows
later in the series. The aim is to prepare existing recovery function to
adapt to this new function where applicable (which fails at this point
because core implementation is lacking) and continue recovery using legacy
full gpu reset.

A helper function is also added to query the availability of engine
reset.

The error events behaviour that are used to notify user of reset are
adapted to engine reset such that it doesn't break users listening to these
events. In legacy we report an error event, a reset event before resetting
the gpu and a reset done event marking the completion of reset. The same
behaviour is adapted but reset event is only dispatched once even when
multiple engines are hung. Finally once reset is complete we send reset
done event as usual.

Note that this implementation of engine reset is for i915 directly
submitting to the ELSP, where the driver manages the hang detection,
recovery and resubmission. With GuC submission these tasks are shared
between driver and firmware; i915 will still responsible for detecting a
hang, and when it does it will have to request GuC to reset that Engine and
remind the firmware about the outstanding submissions. This will be
added in different patch.

v2: rebase, advertise engine reset availability in platform definition,
add note about GuC submission.
v3: s/*engine_reset*/*reset_engine*/. (Chris)
Handle reset as 2 level resets, by first going to engine only and fall
backing to full/chip reset as needed, i.e. reset_engine will need the
struct_mutex.
v4: Pass the engine mask to i915_reset. (Chris)
v5: Rebase, update selftests.
v6: Rebase, prepare for mutex-less reset engine.
v7: Pass reset_engine mask as a function parameter, and iterate over the
engine mask for reset_engine. (Chris)
v8: Use i915.reset >=2 in has_reset_engine; remove redundant reset
logging; add a reset-engine-in-progress flag to prevent concurrent
resets, and avoid dual purposing of reset-backoff. (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c     | 13 +++++++++++++
 drivers/gpu/drm/i915/i915_drv.h     |  8 ++++++++
 drivers/gpu/drm/i915/i915_irq.c     | 24 ++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_pci.c     |  5 ++++-
 drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++++
 5 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index d703897786e9..f49f9d9273f2 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1873,6 +1873,19 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	goto finish;
 }
 
+/**
+ * i915_reset_engine - reset GPU engine to recover from a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected.
+ * Returns zero on successful reset or otherwise an error code.
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+	/* FIXME: replace me with engine reset sequence */
+	return -ENODEV;
+}
+
 static int i915_pm_suspend(struct device *kdev)
 {
 	struct pci_dev *pdev = to_pci_dev(kdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index f2a01bc04168..6848ed75611d 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -753,6 +753,7 @@ struct intel_csr {
 	func(has_ddi); \
 	func(has_decoupled_mmio); \
 	func(has_dp_mst); \
+	func(has_reset_engine); \
 	func(has_fbc); \
 	func(has_fpga_dbg); \
 	func(has_full_ppgtt); \
@@ -1546,6 +1547,10 @@ struct i915_gpu_error {
 	 * inspect the bit and do the reset directly, otherwise the worker
 	 * waits for the struct_mutex.
 	 *
+	 * #I915_RESET_ENGINE_IN_PROGRESS - Since the driver doesn't need to
+	 * acquire the struct_mutex to reset an engine, we need an explicit
+	 * flag to prevent two concurrent reset-engine attempts.
+	 *
 	 * #I915_WEDGED - If reset fails and we can no longer use the GPU,
 	 * we set the #I915_WEDGED bit. Prior to command submission, e.g.
 	 * i915_gem_request_alloc(), this bit is checked and the sequence
@@ -1554,6 +1559,7 @@ struct i915_gpu_error {
 	unsigned long flags;
 #define I915_RESET_BACKOFF	0
 #define I915_RESET_HANDOFF	1
+#define I915_RESET_ENGINE_IN_PROGRESS	2
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
 	/**
@@ -3038,6 +3044,8 @@ extern void i915_driver_unload(struct drm_device *dev);
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
+extern int i915_reset_engine(struct intel_engine_cs *engine);
+extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 7b7f55a28eec..da4f49447fb0 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2743,6 +2743,30 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	if (!engine_mask)
 		goto out;
 
+	/* try engine reset first, and continue if fails */
+	if (intel_has_reset_engine(dev_priv)) {
+		struct intel_engine_cs *engine;
+		unsigned int tmp;
+
+		/* protect against concurrent reset attempts */
+		if (test_and_set_bit(I915_RESET_ENGINE_IN_PROGRESS,
+				     &dev_priv->gpu_error.flags))
+			goto out;
+
+		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
+			if (i915_reset_engine(engine) == 0)
+				engine_mask &= ~intel_engine_flag(engine);
+		}
+
+		/* clear unconditionally, full reset won't care about it */
+		clear_bit(I915_RESET_ENGINE_IN_PROGRESS,
+			  &dev_priv->gpu_error.flags);
+
+		if (!engine_mask)
+			goto out;
+	}
+
+	/* full reset needs the mutex, stop any other user trying to do so */
 	if (test_and_set_bit(I915_RESET_BACKOFF,
 			     &dev_priv->gpu_error.flags))
 		goto out;
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index f80db2ccd92f..4dfb400aef85 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -310,7 +310,8 @@ static const struct intel_device_info intel_haswell_info = {
 	BDW_COLORS, \
 	.has_logical_ring_contexts = 1, \
 	.has_full_48bit_ppgtt = 1, \
-	.has_64bit_reloc = 1
+	.has_64bit_reloc = 1, \
+	.has_reset_engine = 1
 
 static const struct intel_device_info intel_broadwell_info = {
 	BDW_FEATURES,
@@ -341,6 +342,7 @@ static const struct intel_device_info intel_cherryview_info = {
 	.has_gmch_display = 1,
 	.has_aliasing_ppgtt = 1,
 	.has_full_ppgtt = 1,
+	.has_reset_engine = 1,
 	.display_mmio_offset = VLV_DISPLAY_BASE,
 	GEN_CHV_PIPEOFFSETS,
 	CURSOR_OFFSETS,
@@ -389,6 +391,7 @@ static const struct intel_device_info intel_skylake_gt3_info = {
 	.has_aliasing_ppgtt = 1, \
 	.has_full_ppgtt = 1, \
 	.has_full_48bit_ppgtt = 1, \
+	.has_reset_engine = 1, \
 	GEN_DEFAULT_PIPEOFFSETS, \
 	IVB_CURSOR_OFFSETS, \
 	BDW_COLORS
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 9269cae5713a..fdfd8c66c956 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1836,6 +1836,17 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
+/*
+ * When GuC submission is enabled, GuC manages ELSP and can initiate the
+ * engine reset too. For now, fall back to full GPU reset if it is enabled.
+ */
+bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
+{
+	return (dev_priv->info.has_reset_engine &&
+		!dev_priv->guc.execbuf_client &&
+		i915.reset >= 2);
+}
+
 int intel_guc_reset(struct drm_i915_private *dev_priv)
 {
 	int ret;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 04/20] drm/i915: Add support for per engine reset recovery
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (2 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 05/20] drm/i915: Add engine reset count to error state Michel Thierry
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

This change implements support for per-engine reset as an initial, less
intrusive hang recovery option to be attempted before falling back to the
legacy full GPU reset recovery mode if necessary. This is only supported
from Gen8 onwards.

Hangchecker determines which engines are hung and invokes error handler to
recover from it. Error handler schedules recovery for each of those engines
that are hung. The recovery procedure is as follows,
 - identifies the request that caused the hang and it is dropped
 - force engine to idle: this is done by issuing a reset request
 - reset the engine
 - re-init the engine to resume submissions.

If engine reset fails then we fall back to heavy weight full gpu reset
which resets all engines and reinitiazes complete state of HW and SW.

v2: Rebase.
v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
calling i915_gem_reset_engine (Chris).
v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
reuse the function for reset_engine.
v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
v6: Clean up reset_engine function to not require mutex, i.e. no need to call
revoke/restore_fences and _retire_requests (Chris).
v7: Remove leftovers from v5, i.e. no need to disable irq, hold
forcewake or wakeup the handoff bit (Chris).
v8: engine_retire_requests should be (and it was) static; explain that
we have to re-init the engine after reset, which is why the init_hw call
is needed; check reset-in-progress flag (Chris).
v9: Rebase, include code to pass the active request to gem_reset_engine
(as it is already done in full reset). Remove unnecessary
intel_reset_engine_start/cancel, these are executed as part of the
reset.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c | 48 ++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_drv.h |  5 +++
 drivers/gpu/drm/i915/i915_gem.c | 96 +++++++++++++++++++++++++----------------
 3 files changed, 109 insertions(+), 40 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index f49f9d9273f2..7925e103c8bd 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1879,11 +1879,55 @@ void i915_reset(struct drm_i915_private *dev_priv)
  *
  * Reset a specific GPU engine. Useful if a hang is detected.
  * Returns zero on successful reset or otherwise an error code.
+ *
+ * Procedure is:
+ *  - identifies the request that caused the hang and it is dropped
+ *  - reset engine (which will force the engine to idle)
+ *  - re-init/configure engine
  */
 int i915_reset_engine(struct intel_engine_cs *engine)
 {
-	/* FIXME: replace me with engine reset sequence */
-	return -ENODEV;
+	int ret;
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct drm_i915_gem_request *active_request;
+
+	GEM_BUG_ON(!test_bit(I915_RESET_ENGINE_IN_PROGRESS, &error->flags));
+
+	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
+
+	active_request = i915_gem_reset_prepare_engine(engine);
+	if (IS_ERR(active_request)) {
+		DRM_DEBUG_DRIVER("Previous reset failed, promote to full reset\n");
+		ret = PTR_ERR(active_request);
+		goto out;
+	}
+
+	/*
+	 * the request that caused the hang is stuck on elsp, we know the
+	 * active request and can drop it, adjust head to skip the offending
+	 * request to resume executing remaining requests in the queue.
+	 */
+	i915_gem_reset_engine(engine, active_request);
+
+	/* finally, reset engine */
+	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+	if (ret) {
+		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
+		goto out;
+	}
+
+	i915_gem_reset_finish_engine(engine);
+
+	/*
+	 * The engine and its registers (and workarounds in case of render)
+	 * have been reset to their default values. Follow the init_ring
+	 * process to program RING_MODE, HWSP and re-enable submission.
+	 */
+	ret = engine->init_hw(engine);
+
+out:
+	return ret;
 }
 
 static int i915_pm_suspend(struct device *kdev)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 6848ed75611d..f9fd849dab8b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3417,11 +3417,16 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
 void i915_gem_reset(struct drm_i915_private *dev_priv);
+void i915_gem_reset_finish_engine(struct intel_engine_cs *engine);
 void i915_gem_reset_finish(struct drm_i915_private *dev_priv);
 void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
 bool i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
+void i915_gem_reset_engine(struct intel_engine_cs *engine,
+			   struct drm_i915_gem_request *request);
 
 void i915_gem_init_mmio(struct drm_i915_private *i915);
 int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 8207476b0c35..d0fd22d4fe26 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2793,45 +2793,61 @@ static bool engine_stalled(struct intel_engine_cs *engine)
 	return true;
 }
 
+/*
+ * Ensure irq handler finishes, and not run again.
+ * Also return the active request so that we only search for it once.
+ */
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
+{
+	struct drm_i915_gem_request *request = NULL;
+
+	/* Prevent the signaler thread from updating the request
+	 * state (by calling dma_fence_signal) as we are processing
+	 * the reset. The write from the GPU of the seqno is
+	 * asynchronous and the signaler thread may see a different
+	 * value to us and declare the request complete, even though
+	 * the reset routine have picked that request as the active
+	 * (incomplete) request. This conflict is not handled
+	 * gracefully!
+	 */
+	kthread_park(engine->breadcrumbs.signaler);
+
+	/* Prevent request submission to the hardware until we have
+	 * completed the reset in i915_gem_reset_finish(). If a request
+	 * is completed by one engine, it may then queue a request
+	 * to a second via its engine->irq_tasklet *just* as we are
+	 * calling engine->init_hw() and also writing the ELSP.
+	 * Turning off the engine->irq_tasklet until the reset is over
+	 * prevents the race.
+	 */
+	tasklet_kill(&engine->irq_tasklet);
+	tasklet_disable(&engine->irq_tasklet);
+
+	if (engine->irq_seqno_barrier)
+		engine->irq_seqno_barrier(engine);
+
+	if (engine_stalled(engine)) {
+		request = i915_gem_find_active_request(engine);
+		if (request && request->fence.error == -EIO)
+			request = ERR_PTR(-EIO); /* Previous reset failed! */
+	}
+
+	return request;
+}
+
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
 {
 	struct intel_engine_cs *engine;
+	struct drm_i915_gem_request *request;
 	enum intel_engine_id id;
 	int err = 0;
 
-	/* Ensure irq handler finishes, and not run again. */
 	for_each_engine(engine, dev_priv, id) {
-		struct drm_i915_gem_request *request = NULL;
-
-		/* Prevent the signaler thread from updating the request
-		 * state (by calling dma_fence_signal) as we are processing
-		 * the reset. The write from the GPU of the seqno is
-		 * asynchronous and the signaler thread may see a different
-		 * value to us and declare the request complete, even though
-		 * the reset routine have picked that request as the active
-		 * (incomplete) request. This conflict is not handled
-		 * gracefully!
-		 */
-		kthread_park(engine->breadcrumbs.signaler);
-
-		/* Prevent request submission to the hardware until we have
-		 * completed the reset in i915_gem_reset_finish(). If a request
-		 * is completed by one engine, it may then queue a request
-		 * to a second via its engine->irq_tasklet *just* as we are
-		 * calling engine->init_hw() and also writing the ELSP.
-		 * Turning off the engine->irq_tasklet until the reset is over
-		 * prevents the race.
-		 */
-		tasklet_kill(&engine->irq_tasklet);
-		tasklet_disable(&engine->irq_tasklet);
-
-		if (engine->irq_seqno_barrier)
-			engine->irq_seqno_barrier(engine);
-
-		if (engine_stalled(engine)) {
-			request = i915_gem_find_active_request(engine);
-			if (request && request->fence.error == -EIO)
-				err = -EIO; /* Previous reset failed! */
+		request = i915_gem_reset_prepare_engine(engine);
+		if (IS_ERR(request)) {
+			err = PTR_ERR(request);
+			break;
 		}
 
 		engine->hangcheck.active_request = request;
@@ -2922,8 +2938,8 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 	return guilty;
 }
 
-static void i915_gem_reset_engine(struct intel_engine_cs *engine,
-				  struct drm_i915_gem_request *request)
+void i915_gem_reset_engine(struct intel_engine_cs *engine,
+			   struct drm_i915_gem_request *request)
 {
 	if (request && i915_gem_reset_request(request)) {
 		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
@@ -2966,6 +2982,12 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
 	}
 }
 
+void i915_gem_reset_finish_engine(struct intel_engine_cs *engine)
+{
+	tasklet_enable(&engine->irq_tasklet);
+	kthread_unpark(engine->breadcrumbs.signaler);
+}
+
 void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
 {
 	struct intel_engine_cs *engine;
@@ -2973,10 +2995,8 @@ void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
 
 	lockdep_assert_held(&dev_priv->drm.struct_mutex);
 
-	for_each_engine(engine, dev_priv, id) {
-		tasklet_enable(&engine->irq_tasklet);
-		kthread_unpark(engine->breadcrumbs.signaler);
-	}
+	for_each_engine(engine, dev_priv, id)
+		i915_gem_reset_finish_engine(engine);
 }
 
 static void nop_submit_request(struct drm_i915_gem_request *request)
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 05/20] drm/i915: Add engine reset count to error state
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (3 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 04/20] drm/i915: Add support for per engine reset recovery Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 06/20] drm/i915: Export per-engine reset count info to debugfs Michel Thierry
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

Driver maintains count of how many times a given engine is reset, useful to
capture this in error state also. It gives an idea of how engine is coping
up with the workloads it is executing before this error state.

A follow-up patch will provide this information in debugfs.

v2: s/engine_reset/reset_engine/ (Chris)
    Define count as unsigned int (Tvrtko)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c       |  3 +++
 drivers/gpu/drm/i915/i915_drv.h       | 10 ++++++++++
 drivers/gpu/drm/i915/i915_gpu_error.c |  3 +++
 3 files changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 7925e103c8bd..51663abda5db 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1925,7 +1925,10 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	 * process to program RING_MODE, HWSP and re-enable submission.
 	 */
 	ret = engine->init_hw(engine);
+	if (ret)
+		goto out;
 
+	error->reset_engine_count[engine->id]++;
 out:
 	return ret;
 }
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index f9fd849dab8b..ada9e2cc969c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -920,6 +920,7 @@ struct i915_gpu_state {
 		enum intel_engine_hangcheck_action hangcheck_action;
 		struct i915_address_space *vm;
 		int num_requests;
+		u32 reset_count;
 
 		/* position of active request inside the ring */
 		u32 rq_head, rq_post, rq_tail;
@@ -1562,6 +1563,9 @@ struct i915_gpu_error {
 #define I915_RESET_ENGINE_IN_PROGRESS	2
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
+	/** Number of times an engine has been reset */
+	u32 reset_engine_count[I915_NUM_ENGINES];
+
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
 	 * to release the struct_mutex for the reset to procede.
@@ -3417,6 +3421,12 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
+static inline u32 i915_reset_engine_count(struct i915_gpu_error *error,
+					  struct intel_engine_cs *engine)
+{
+	return READ_ONCE(error->reset_engine_count[engine->id]);
+}
+
 struct drm_i915_gem_request *
 i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 762685ba86da..ebd5501ec134 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -463,6 +463,7 @@ static void error_print_engine(struct drm_i915_error_state_buf *m,
 	err_printf(m, "  hangcheck action timestamp: %lu, %u ms ago\n",
 		   ee->hangcheck_timestamp,
 		   jiffies_to_msecs(jiffies - ee->hangcheck_timestamp));
+	err_printf(m, "  engine reset count: %u\n", ee->reset_count);
 
 	error_print_request(m, "  ELSP[0]: ", &ee->execlist[0]);
 	error_print_request(m, "  ELSP[1]: ", &ee->execlist[1]);
@@ -1244,6 +1245,8 @@ static void error_record_engine_registers(struct i915_gpu_state *error,
 	ee->hangcheck_timestamp = engine->hangcheck.action_timestamp;
 	ee->hangcheck_action = engine->hangcheck.action;
 	ee->hangcheck_stalled = engine->hangcheck.stalled;
+	ee->reset_count = i915_reset_engine_count(&dev_priv->gpu_error,
+						  engine);
 
 	if (USES_PPGTT(dev_priv)) {
 		int i;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 06/20] drm/i915: Export per-engine reset count info to debugfs
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (4 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 05/20] drm/i915: Add engine reset count to error state Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 07/20] drm/i915: Carry on with reset even if hw engine is not ready Michel Thierry
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

A new variable is added to export the reset counts to debugfs, this
includes full gpu reset and engine reset count. This is useful for tests
where they are expected to trigger reset; these counts are checked before
and after the test to ensure the same.

v2: Include reset engine count in i915_engine_info too (Chris).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index c08a6d8a4e07..01d6758d33d7 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1403,6 +1403,23 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 	return 0;
 }
 
+static int i915_reset_info(struct seq_file *m, void *unused)
+{
+	struct drm_i915_private *dev_priv = node_to_i915(m->private);
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	seq_printf(m, "full gpu reset = %u\n", i915_reset_count(error));
+
+	for_each_engine(engine, dev_priv, id) {
+		seq_printf(m, "%s = %u\n", engine->name,
+			   i915_reset_engine_count(error, engine));
+	}
+
+	return 0;
+}
+
 static int ironlake_drpc_info(struct seq_file *m)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
@@ -3259,6 +3276,7 @@ static int i915_display_info(struct seq_file *m, void *unused)
 static int i915_engine_info(struct seq_file *m, void *unused)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 
@@ -3282,6 +3300,8 @@ static int i915_engine_info(struct seq_file *m, void *unused)
 			   engine->hangcheck.seqno,
 			   jiffies_to_msecs(jiffies - engine->hangcheck.action_timestamp),
 			   engine->timeline->inflight_seqnos);
+		seq_printf(m, "\tReset count: %d\n",
+			   i915_reset_engine_count(error, engine));
 
 		rcu_read_lock();
 
@@ -4795,6 +4815,7 @@ static const struct drm_info_list i915_debugfs_list[] = {
 	{"i915_huc_load_status", i915_huc_load_status_info, 0},
 	{"i915_frequency_info", i915_frequency_info, 0},
 	{"i915_hangcheck_info", i915_hangcheck_info, 0},
+	{"i915_reset_info", i915_reset_info, 0},
 	{"i915_drpc_info", i915_drpc_info, 0},
 	{"i915_emon_status", i915_emon_status, 0},
 	{"i915_ring_freq_table", i915_ring_freq_table, 0},
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 07/20] drm/i915: Carry on with reset even if hw engine is not ready
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (5 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 06/20] drm/i915: Export per-engine reset count info to debugfs Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 08/20] drm/i915: Enable Engine reset and recovery support Michel Thierry
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

We try to get the engines ready/idle before triggering the reset, but it
has been seen that sometimes the hw never acknowledges this.

If we miss the acknowledgment, carry on with the reset instead of
leaving the GPU in a wedged state.

The frequency of missed acknowledgment from hw is low, but it has been
seen at least once in CI.

References: https://intel-gfx-ci.01.org/CI/Trybot_831/
Reported-by: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index fdfd8c66c956..a89738655460 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1747,8 +1747,12 @@ static int gen8_reset_engine_start(struct intel_engine_cs *engine)
 					 RESET_CTL_READY_TO_RESET,
 					 RESET_CTL_READY_TO_RESET,
 					 700);
-	if (ret)
-		DRM_ERROR("%s: reset request timeout\n", engine->name);
+	if (GEM_WARN_ON(ret)) {
+		/* hw did not ack ready-to-reset, reset anyway */
+		DRM_DEBUG_DRIVER("%s: reset request timeout, continue\n",
+				 engine->name);
+		ret = 0;
+	}
 
 	return ret;
 }
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 08/20] drm/i915: Enable Engine reset and recovery support
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (6 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 07/20] drm/i915: Carry on with reset even if hw engine is not ready Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 09/20] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

This feature is made available only from Gen8, for previous gen devices
driver uses legacy full gpu reset.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 045cadb77285..14e2c2e57f96 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -46,7 +46,7 @@ struct i915_params i915 __read_mostly = {
 	.prefault_disable = 0,
 	.load_detect_test = 0,
 	.force_reset_modeset_test = 0,
-	.reset = 1,
+	.reset = 2,
 	.error_capture = true,
 	.invert_brightness = 0,
 	.disable_display = 0,
@@ -116,7 +116,7 @@ MODULE_PARM_DESC(vbt_sdvo_panel_type,
 	"(-2=ignore, -1=auto [default], index in VBT BIOS table)");
 
 module_param_named_unsafe(reset, i915.reset, int, 0600);
-MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset [default], 2=engine reset)");
+MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset, 2=engine reset [default])");
 
 #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
 module_param_named(error_capture, i915.error_capture, bool, 0600);
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 09/20] drm/i915: Add engine reset count in get-reset-stats ioctl
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (7 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 08/20] drm/i915: Enable Engine reset and recovery support Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 10/20] drm/i915/selftests: reset engine self tests Michel Thierry
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

Users/tests relying on the total reset count will start seeing a smaller
number since most of the hangs can be handled by engine reset.
Note that if reset engine x, context a running on engine y will be unaware
and unaffected.

To start the discussion, include just a total engine reset count. If it
is deemed useful, it can be extended to report each engine separately.

v2: s/engine_reset/reset_engine/, use union in uapi to not break compatibility.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 14 +++++++++++---
 include/uapi/drm/i915_drm.h             |  6 +++++-
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index c5d1666d7071..210d6638e97b 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -1029,9 +1029,11 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_reset_stats *args = data;
 	struct i915_gem_context *ctx;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
 	int ret;
 
-	if (args->flags || args->pad)
+	if (args->flags)
 		return -EINVAL;
 
 	if (args->ctx_id == DEFAULT_CONTEXT_HANDLE && !capable(CAP_SYS_ADMIN))
@@ -1047,10 +1049,16 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 		return PTR_ERR(ctx);
 	}
 
-	if (capable(CAP_SYS_ADMIN))
+	if (capable(CAP_SYS_ADMIN)) {
 		args->reset_count = i915_reset_count(&dev_priv->gpu_error);
-	else
+		for_each_engine(engine, dev_priv, id)
+			args->reset_engine_count +=
+				i915_reset_engine_count(&dev_priv->gpu_error,
+							engine);
+	} else {
 		args->reset_count = 0;
+		args->reset_engine_count = 0;
+	}
 
 	args->batch_active = ctx->guilty_count;
 	args->batch_pending = ctx->active_count;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index f24a80d2d42e..fadedefba6db 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1278,7 +1278,11 @@ struct drm_i915_reset_stats {
 	/* Number of batches lost pending for execution, for this context */
 	__u32 batch_pending;
 
-	__u32 pad;
+	union {
+		__u32 pad;
+		/* Engine resets since boot/module reload, for all contexts */
+		__u32 reset_engine_count;
+	};
 };
 
 struct drm_i915_gem_userptr {
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 10/20] drm/i915/selftests: reset engine self tests
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (8 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 09/20] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 11/20] drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder Michel Thierry
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

Check that we can reset specific engines, also check the fallback to
full reset if something didn't work.

v2: rebase.
v3: use RESET_ENGINE_IN_PROGRESS flag.

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 149 +++++++++++++++++++++++
 1 file changed, 149 insertions(+)

diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index aa31d6c0cdfb..8a3edb8bd440 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -322,6 +322,56 @@ static int igt_global_reset(void *arg)
 	return err;
 }
 
+static int igt_reset_engine(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	unsigned int reset_count, reset_engine_count;
+	int err = 0;
+
+	/* Check that we can issue a global GPU and engine reset */
+
+	if (!intel_has_gpu_reset(i915))
+		return 0;
+
+	if (!intel_has_reset_engine(i915))
+		return 0;
+
+	set_bit(I915_RESET_ENGINE_IN_PROGRESS, &i915->gpu_error.flags);
+
+	for_each_engine(engine, i915, id) {
+		reset_count = i915_reset_count(&i915->gpu_error);
+		reset_engine_count = i915_reset_engine_count(&i915->gpu_error,
+							     engine);
+
+		err = i915_reset_engine(engine);
+		if (err) {
+			pr_err("i915_reset_engine failed\n");
+			break;
+		}
+
+		if (i915_reset_count(&i915->gpu_error) != reset_count) {
+			pr_err("Full GPU reset recorded! (engine reset expected)\n");
+			err = -EINVAL;
+			break;
+		}
+
+		if (i915_reset_engine_count(&i915->gpu_error, engine) ==
+		    reset_engine_count) {
+			pr_err("No %s engine reset recorded!\n", engine->name);
+			err = -EINVAL;
+			break;
+		}
+	}
+
+	clear_bit(I915_RESET_ENGINE_IN_PROGRESS, &i915->gpu_error.flags);
+	if (i915_terminally_wedged(&i915->gpu_error))
+		err = -EIO;
+
+	return err;
+}
+
 static u32 fake_hangcheck(struct drm_i915_gem_request *rq)
 {
 	u32 reset_count;
@@ -526,13 +576,112 @@ static int igt_reset_queue(void *arg)
 	return err;
 }
 
+static int igt_render_engine_reset_fallback(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct intel_engine_cs *engine = i915->engine[RCS];
+	struct hang h;
+	struct drm_i915_gem_request *rq;
+	unsigned int reset_count, reset_engine_count;
+	int err = 0;
+
+	/* Check that we can issue a global GPU and engine reset */
+
+	if (!intel_has_gpu_reset(i915))
+		return 0;
+
+	if (!intel_has_reset_engine(i915))
+		return 0;
+
+	set_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+	mutex_lock(&i915->drm.struct_mutex);
+
+	err = hang_init(&h, i915);
+	if (err)
+		goto unlock;
+
+	rq = hang_create_request(&h, engine, i915->kernel_context);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto fini;
+	}
+
+	i915_gem_request_get(rq);
+	__i915_add_request(rq, true);
+
+	/* make reset engine fail */
+	rq->fence.error = -EIO;
+
+	if (!wait_for_hang(&h, rq)) {
+		pr_err("Failed to start request %x\n", rq->fence.seqno);
+		err = -EIO;
+		goto fini;
+	}
+
+	reset_engine_count = i915_reset_engine_count(&i915->gpu_error, engine);
+	reset_count = fake_hangcheck(rq);
+
+	/* unlock since we'll call handle_error */
+	mutex_unlock(&i915->drm.struct_mutex);
+	clear_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+
+	i915_handle_error(i915, intel_engine_flag(engine), "live test");
+
+	if (i915_reset_engine_count(&i915->gpu_error, engine) !=
+	    reset_engine_count) {
+		pr_err("render engine reset recorded! (full reset expected)\n");
+		err = -EINVAL;
+		goto fini;
+	}
+
+	if (i915_reset_count(&i915->gpu_error) == reset_count) {
+		pr_err("No full GPU reset recorded!\n");
+		err = -EINVAL;
+		goto fini;
+	}
+
+	/*
+	 * by using fence.error = -EIO, full reset sets the wedged flag, do one
+	 * more full reset to re-enable the hw.
+	 */
+	if (i915_terminally_wedged(&i915->gpu_error)) {
+		set_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+		mutex_lock(&i915->drm.struct_mutex);
+		rq->fence.error = 0;
+
+		set_bit(I915_RESET_HANDOFF, &i915->gpu_error.flags);
+		i915_reset(i915);
+		GEM_BUG_ON(test_bit(I915_RESET_HANDOFF,
+				    &i915->gpu_error.flags));
+
+		if (i915_reset_count(&i915->gpu_error) == reset_count) {
+			pr_err("No full GPU reset recorded!\n");
+			err = -EINVAL;
+			goto fini;
+		}
+	}
+
+fini:
+	hang_fini(&h);
+unlock:
+	mutex_unlock(&i915->drm.struct_mutex);
+	clear_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+
+	if (i915_terminally_wedged(&i915->gpu_error))
+		return -EIO;
+
+	return err;
+}
+
 int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
 {
 	static const struct i915_subtest tests[] = {
 		SUBTEST(igt_hang_sanitycheck),
 		SUBTEST(igt_global_reset),
+		SUBTEST(igt_reset_engine),
 		SUBTEST(igt_wait_reset),
 		SUBTEST(igt_reset_queue),
+		SUBTEST(igt_render_engine_reset_fallback),
 	};
 
 	if (!intel_has_gpu_reset(i915))
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 11/20] drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (9 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 10/20] drm/i915/selftests: reset engine self tests Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 12/20] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

From: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

The mmio_start offset for the whitelist is the first FORCE_TO_NONPRIV
register the GuC can use to restore the provided whitelist when an
engine reset via GuC (which we still don't support) is triggered.

We're currently adding the mmio_base of the engine to the absolute
address of the RCS version of the register, which results in the wrong
offset. Fix it by using the definition we already have instead of
re-defining it in the GuC FW header.

Also add a comment to avoid future issues with FORCE_TO_NONPRIV
registers, which are also used by the workaround framework.

v2: improve comment (Michal), move comment about save/restore because it
    is not related to the mmio_white_list field.

v3: rebase/resurrect.

Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Cc: Oscar Mateo <oscar.mateo@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com> (v2)
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_guc_submission.c | 11 +++++++++--
 drivers/gpu/drm/i915/intel_guc_fwif.h      |  1 -
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 777f48e9bb33..50d3e5aaf566 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1050,10 +1050,17 @@ static int guc_ads_create(struct intel_guc *guc)
 	/* MMIO reg state */
 	for_each_engine(engine, dev_priv, id) {
 		blob->reg_state.white_list[engine->guc_id].mmio_start =
-			engine->mmio_base + GUC_MMIO_WHITE_LIST_START;
+			i915_mmio_reg_offset(RING_FORCE_TO_NONPRIV(engine->mmio_base, 0));
 
-		/* Nothing to be saved or restored for now. */
+		/*
+		 * Note: if the GuC whitelist management is enabled, the values
+		 * should be filled using the workaround framework to avoid
+		 * inconsistencies with the handling of FORCE_TO_NONPRIV
+		 * registers.
+		 */
 		blob->reg_state.white_list[engine->guc_id].count = 0;
+
+		/* Nothing to be saved or restored for now. */
 	}
 
 	/*
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index 6156845641a3..e6f8079df94a 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -394,7 +394,6 @@ struct guc_policies {
 #define GUC_REGSET_SAVE_CURRENT_VALUE	0x10
 
 #define GUC_REGSET_MAX_REGISTERS	25
-#define GUC_MMIO_WHITE_LIST_START	0x24d0
 #define GUC_MMIO_WHITE_LIST_MAX		12
 #define GUC_S3_SAVE_SPACE_PAGES		10
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 12/20] drm/i915/guc: Provide register list to be saved/restored during engine reset
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (10 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 11/20] drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 13/20] drm/i915/guc: Rename the function that resets the GuC Michel Thierry
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

GuC expects a list of registers from the driver which are saved/restored
during engine reset. The type of value to be saved is controlled by
flags. We provide a minimal set of registers that we want GuC to save and
restore. This is not an issue in case of engine reset as driver initializes
most of them following an engine reset, but in case of media reset (aka
watchdog reset) which is completely internal to GuC (including resubmission
of hung workload), it is necessary to provide this list, otherwise GuC won't
be able to schedule further workloads after a reset. This is the minimal
set of registers identified for things to work as expected but if we see
any new issues, this register list can be expanded.

In order to not loose any existing workarounds, we have to let GuC know
the registers and its values. These will be reapplied after the reset.
Note that we can't just read the current value because most of these
registers are masked (so we have a workaround for a workaround for a
workaround).

v2: REGSET_MASKED is too difficult for GuC, use REGSET_SAVE_DEFAULT_VALUE
and current value from RING_MODE reg instead; no need to preserve
head/tail either, be extra paranoid and save whitelisted registers (Daniele).

v3: Workarounds added only once during _init_workarounds also have to
been restored, or we risk loosing them after internal GuC reset
(Daniele).

v4: Rename macro used to keep track the workaround registers we will
have to restore after reset (s/I915_GUC_REG_WRITE/WA_REG_WR_GUC_RESTORE).

v5: Only ask guc to reapply workarounds in case of render reset (Daniele).

Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |  3 ++
 drivers/gpu/drm/i915/i915_guc_submission.c | 70 +++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_engine_cs.c     | 65 ++++++++++++++++++---------
 3 files changed, 116 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ada9e2cc969c..28f48678c91f 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1869,7 +1869,10 @@ struct i915_wa_reg {
 
 struct i915_workarounds {
 	struct i915_wa_reg reg[I915_MAX_WA_REGS];
+	/* list of registers (and their values) that GuC will have to restore */
+	struct i915_wa_reg guc_reg[GUC_REGSET_MAX_REGISTERS];
 	u32 count;
+	u32 guc_count;
 	u32 hw_whitelist_count[I915_NUM_ENGINES];
 };
 
diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 50d3e5aaf566..8650b6ec2f2d 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1017,6 +1017,24 @@ static void guc_policies_init(struct guc_policies *policies)
 	policies->is_valid = 1;
 }
 
+/*
+ * In this macro it is highly unlikely to exceed max value but even if we did
+ * it is not an error so just throw a warning and continue. Only side effect
+ * in continuing further means some registers won't be added to save/restore
+ * list.
+ */
+#define GUC_ADD_MMIO_REG_ADS(node, reg_addr, _flags, defvalue)		\
+	do {								\
+		u32 __count = node->number_of_registers;		\
+		if (WARN_ON(__count >= GUC_REGSET_MAX_REGISTERS))	\
+			continue;					\
+		node->registers[__count].offset = reg_addr.reg;		\
+		node->registers[__count].flags = (_flags);		\
+		if (defvalue)						\
+			node->registers[__count].value = (defvalue);	\
+		node->number_of_registers++;				\
+	} while (0)
+
 static int guc_ads_create(struct intel_guc *guc)
 {
 	struct drm_i915_private *dev_priv = guc_to_i915(guc);
@@ -1030,6 +1048,7 @@ static int guc_ads_create(struct intel_guc *guc)
 		u8 reg_state_buffer[GUC_S3_SAVE_SPACE_PAGES * PAGE_SIZE];
 	} __packed *blob;
 	struct intel_engine_cs *engine;
+	struct i915_workarounds *workarounds = &dev_priv->workarounds;
 	enum intel_engine_id id;
 	u32 base;
 
@@ -1049,6 +1068,49 @@ static int guc_ads_create(struct intel_guc *guc)
 
 	/* MMIO reg state */
 	for_each_engine(engine, dev_priv, id) {
+		u32 i;
+		struct guc_mmio_regset *eng_reg =
+			&blob->reg_state.engine_reg[engine->guc_id];
+
+		/*
+		 * Provide a list of registers to be saved/restored during gpu
+		 * reset. This is mainly required for Media reset (aka watchdog
+		 * timeout) which is completely under the control of GuC
+		 * (resubmission of hung workload is handled inside GuC).
+		 */
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_HWS_PGA(engine->mmio_base),
+				     GUC_REGSET_ENGINERESET |
+				     GUC_REGSET_SAVE_CURRENT_VALUE, 0);
+
+		/*
+		 * Workaround the guc issue with masked registers, note that
+		 * at this point guc submission is still disabled and the mode
+		 * register doesnt have the irq_steering bit set, which we
+		 * need to fwd irqs to GuC.
+		 */
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_MODE_GEN7(engine),
+				     GUC_REGSET_ENGINERESET |
+				     GUC_REGSET_SAVE_DEFAULT_VALUE,
+				     I915_READ(RING_MODE_GEN7(engine)) |
+				     GFX_INTERRUPT_STEERING | (0xFFFF<<16));
+
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_IMR(engine->mmio_base),
+				     GUC_REGSET_ENGINERESET |
+				     GUC_REGSET_SAVE_CURRENT_VALUE, 0);
+
+		/* ask guc to re-apply workarounds set in *_init_workarounds */
+		if (engine->id == RCS) {
+			for (i = 0; i < workarounds->guc_count; i++)
+				GUC_ADD_MMIO_REG_ADS(eng_reg,
+						     workarounds->guc_reg[i].addr,
+						     GUC_REGSET_ENGINERESET |
+						     GUC_REGSET_SAVE_DEFAULT_VALUE,
+						     workarounds->guc_reg[i].value);
+		}
+
+		DRM_DEBUG_DRIVER("%s register save/restore count: %u\n",
+				 engine->name, eng_reg->number_of_registers);
+
 		blob->reg_state.white_list[engine->guc_id].mmio_start =
 			i915_mmio_reg_offset(RING_FORCE_TO_NONPRIV(engine->mmio_base, 0));
 
@@ -1058,9 +1120,13 @@ static int guc_ads_create(struct intel_guc *guc)
 		 * inconsistencies with the handling of FORCE_TO_NONPRIV
 		 * registers.
 		 */
-		blob->reg_state.white_list[engine->guc_id].count = 0;
+		blob->reg_state.white_list[engine->guc_id].count =
+					workarounds->hw_whitelist_count[id];
 
-		/* Nothing to be saved or restored for now. */
+		for (i = 0; i < workarounds->hw_whitelist_count[id]; i++) {
+			blob->reg_state.white_list[engine->guc_id].offsets[i] =
+				I915_READ(RING_FORCE_TO_NONPRIV(engine->mmio_base, i));
+		}
 	}
 
 	/*
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 413bfd8d4bf4..ce005164d71c 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -708,6 +708,29 @@ static int wa_ring_whitelist_reg(struct intel_engine_cs *engine,
 	return 0;
 }
 
+static int guc_wa_add(struct drm_i915_private *dev_priv,
+		      i915_reg_t addr, const u32 val)
+{
+	const u32 idx = dev_priv->workarounds.guc_count;
+
+	I915_WRITE(addr, val);
+	if (WARN_ON(idx >= GUC_REGSET_MAX_REGISTERS))
+		return -ENOSPC;
+
+	dev_priv->workarounds.guc_reg[idx].addr = addr;
+	/* GuC can't handle masked regs, so we store the value|mask together */
+	dev_priv->workarounds.guc_reg[idx].value = val;
+	dev_priv->workarounds.guc_count++;
+
+	return 0;
+}
+
+#define WA_REG_WR_GUC_RESTORE(addr, val) do { \
+		const int r = guc_wa_add(dev_priv, (addr), (val)); \
+		if (r) \
+			return r; \
+	} while (0)
+
 static int gen8_init_workarounds(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
@@ -815,15 +838,16 @@ static int gen9_init_workarounds(struct intel_engine_cs *engine)
 	int ret;
 
 	/* WaConextSwitchWithConcurrentTLBInvalidate:skl,bxt,kbl,glk */
-	I915_WRITE(GEN9_CSFE_CHICKEN1_RCS, _MASKED_BIT_ENABLE(GEN9_PREEMPT_GPGPU_SYNC_SWITCH_DISABLE));
+	WA_REG_WR_GUC_RESTORE(GEN9_CSFE_CHICKEN1_RCS,
+			      _MASKED_BIT_ENABLE(GEN9_PREEMPT_GPGPU_SYNC_SWITCH_DISABLE));
 
 	/* WaEnableLbsSlaRetryTimerDecrement:skl,bxt,kbl,glk */
-	I915_WRITE(BDW_SCRATCH1, I915_READ(BDW_SCRATCH1) |
-		   GEN9_LBS_SLA_RETRY_TIMER_DECREMENT_ENABLE);
+	WA_REG_WR_GUC_RESTORE(BDW_SCRATCH1, I915_READ(BDW_SCRATCH1) |
+			      GEN9_LBS_SLA_RETRY_TIMER_DECREMENT_ENABLE);
 
 	/* WaDisableKillLogic:bxt,skl,kbl */
-	I915_WRITE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) |
-		   ECOCHK_DIS_TLB);
+	WA_REG_WR_GUC_RESTORE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) |
+			      ECOCHK_DIS_TLB);
 
 	/* WaClearFlowControlGpgpuContextSave:skl,bxt,kbl,glk */
 	/* WaDisablePartialInstShootdown:skl,bxt,kbl,glk */
@@ -894,8 +918,8 @@ static int gen9_init_workarounds(struct intel_engine_cs *engine)
 			  HDC_FORCE_NON_COHERENT);
 
 	/* WaDisableHDCInvalidation:skl,bxt,kbl */
-	I915_WRITE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) |
-		   BDW_DISABLE_HDC_INVALIDATION);
+	WA_REG_WR_GUC_RESTORE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) |
+			      BDW_DISABLE_HDC_INVALIDATION);
 
 	/* WaDisableSamplerPowerBypassForSOPingPong:skl,bxt,kbl */
 	if (IS_SKYLAKE(dev_priv) ||
@@ -908,8 +932,8 @@ static int gen9_init_workarounds(struct intel_engine_cs *engine)
 	WA_SET_BIT_MASKED(HALF_SLICE_CHICKEN2, GEN8_ST_PO_DISABLE);
 
 	/* WaOCLCoherentLineFlush:skl,bxt,kbl */
-	I915_WRITE(GEN8_L3SQCREG4, (I915_READ(GEN8_L3SQCREG4) |
-				    GEN8_LQSC_FLUSH_COHERENT_LINES));
+	WA_REG_WR_GUC_RESTORE(GEN8_L3SQCREG4, (I915_READ(GEN8_L3SQCREG4) |
+					       GEN8_LQSC_FLUSH_COHERENT_LINES));
 
 	/* WaVFEStateAfterPipeControlwithMediaStateClear:skl,bxt,glk */
 	ret = wa_ring_whitelist_reg(engine, GEN9_CTX_PREEMPT_REG);
@@ -984,12 +1008,12 @@ static int skl_init_workarounds(struct intel_engine_cs *engine)
 	 * until D0 which is the default case so this is equivalent to
 	 * !WaDisablePerCtxtPreemptionGranularityControl:skl
 	 */
-	I915_WRITE(GEN7_FF_SLICE_CS_CHICKEN1,
-		   _MASKED_BIT_ENABLE(GEN9_FFSC_PERCTX_PREEMPT_CTRL));
+	WA_REG_WR_GUC_RESTORE(GEN7_FF_SLICE_CS_CHICKEN1,
+			      _MASKED_BIT_ENABLE(GEN9_FFSC_PERCTX_PREEMPT_CTRL));
 
 	/* WaEnableGapsTsvCreditFix:skl */
-	I915_WRITE(GEN8_GARBCNTL, (I915_READ(GEN8_GARBCNTL) |
-				   GEN9_GAPS_TSV_CREDIT_DISABLE));
+	WA_REG_WR_GUC_RESTORE(GEN8_GARBCNTL, (I915_READ(GEN8_GARBCNTL) |
+					      GEN9_GAPS_TSV_CREDIT_DISABLE));
 
 	/* WaDisableGafsUnitClkGating:skl */
 	WA_SET_BIT(GEN7_UCGCTL4, GEN8_EU_GAUNIT_CLOCK_GATE_DISABLE);
@@ -1019,12 +1043,12 @@ static int bxt_init_workarounds(struct intel_engine_cs *engine)
 	/* WaStoreMultiplePTEenable:bxt */
 	/* This is a requirement according to Hardware specification */
 	if (IS_BXT_REVID(dev_priv, 0, BXT_REVID_A1))
-		I915_WRITE(TILECTL, I915_READ(TILECTL) | TILECTL_TLBPF);
+		WA_REG_WR_GUC_RESTORE(TILECTL, I915_READ(TILECTL) | TILECTL_TLBPF);
 
 	/* WaSetClckGatingDisableMedia:bxt */
 	if (IS_BXT_REVID(dev_priv, 0, BXT_REVID_A1)) {
-		I915_WRITE(GEN7_MISCCPCTL, (I915_READ(GEN7_MISCCPCTL) &
-					    ~GEN8_DOP_CLOCK_GATE_MEDIA_ENABLE));
+		WA_REG_WR_GUC_RESTORE(GEN7_MISCCPCTL, (I915_READ(GEN7_MISCCPCTL) &
+						       ~GEN8_DOP_CLOCK_GATE_MEDIA_ENABLE));
 	}
 
 	/* WaDisableThreadStallDopClockGating:bxt */
@@ -1060,8 +1084,8 @@ static int bxt_init_workarounds(struct intel_engine_cs *engine)
 
 	/* WaProgramL3SqcReg1DefaultForPerf:bxt */
 	if (IS_BXT_REVID(dev_priv, BXT_REVID_B0, REVID_FOREVER))
-		I915_WRITE(GEN8_L3SQCREG1, L3_GENERAL_PRIO_CREDITS(62) |
-					   L3_HIGH_PRIO_CREDITS(2));
+		WA_REG_WR_GUC_RESTORE(GEN8_L3SQCREG1, L3_GENERAL_PRIO_CREDITS(62) |
+						      L3_HIGH_PRIO_CREDITS(2));
 
 	/* WaToEnableHwFixForPushConstHWBug:bxt */
 	if (IS_BXT_REVID(dev_priv, BXT_REVID_C0, REVID_FOREVER))
@@ -1086,8 +1110,8 @@ static int kbl_init_workarounds(struct intel_engine_cs *engine)
 		return ret;
 
 	/* WaEnableGapsTsvCreditFix:kbl */
-	I915_WRITE(GEN8_GARBCNTL, (I915_READ(GEN8_GARBCNTL) |
-				   GEN9_GAPS_TSV_CREDIT_DISABLE));
+	WA_REG_WR_GUC_RESTORE(GEN8_GARBCNTL, (I915_READ(GEN8_GARBCNTL) |
+					      GEN9_GAPS_TSV_CREDIT_DISABLE));
 
 	/* WaDisableDynamicCreditSharing:kbl */
 	if (IS_KBL_REVID(dev_priv, 0, KBL_REVID_B0))
@@ -1148,6 +1172,7 @@ int init_workarounds_ring(struct intel_engine_cs *engine)
 	WARN_ON(engine->id != RCS);
 
 	dev_priv->workarounds.count = 0;
+	dev_priv->workarounds.guc_count = 0;
 	dev_priv->workarounds.hw_whitelist_count[engine->id] = 0;
 
 	if (IS_BROADWELL(dev_priv))
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 13/20] drm/i915/guc: Rename the function that resets the GuC
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (11 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 12/20] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 14/20] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

intel_guc_reset sounds more like the microcontroller is the one performing
a reset, while in this case is the opposite. intel_reset_guc not only
makes it clearer, it follows the other intel_reset functions available.

v2: Print error message in English.

Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     | 2 +-
 drivers/gpu/drm/i915/intel_uc.c     | 4 ++--
 drivers/gpu/drm/i915/intel_uncore.c | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 28f48678c91f..d2709dd361b4 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3053,7 +3053,7 @@ extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
 extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
-extern int intel_guc_reset(struct drm_i915_private *dev_priv);
+extern int intel_reset_guc(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
 extern unsigned long i915_chipset_val(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/intel_uc.c b/drivers/gpu/drm/i915/intel_uc.c
index d27b5277901c..735ace31ca8f 100644
--- a/drivers/gpu/drm/i915/intel_uc.c
+++ b/drivers/gpu/drm/i915/intel_uc.c
@@ -46,9 +46,9 @@ static int __intel_uc_reset_hw(struct drm_i915_private *dev_priv)
 	int ret;
 	u32 guc_status;
 
-	ret = intel_guc_reset(dev_priv);
+	ret = intel_reset_guc(dev_priv);
 	if (ret) {
-		DRM_ERROR("GuC reset failed, ret = %d\n", ret);
+		DRM_ERROR("Failed to reset GuC, ret = %d\n", ret);
 		return ret;
 	}
 
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index a89738655460..56fd3e9d3fbf 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1851,7 +1851,7 @@ bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
 		i915.reset >= 2);
 }
 
-int intel_guc_reset(struct drm_i915_private *dev_priv)
+int intel_reset_guc(struct drm_i915_private *dev_priv)
 {
 	int ret;
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 14/20] drm/i915/guc: Add support for reset engine using GuC commands
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (12 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 13/20] drm/i915/guc: Rename the function that resets the GuC Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

This patch adds per engine reset and recovery (TDR) support when GuC is
used to submit workloads to GPU.

In the case of i915 directly submission to ELSP, driver manages hang
detection, recovery and resubmission. With GuC submission these tasks
are shared between driver and GuC. i915 is still responsible for detecting
a hang, and when it does it only requests GuC to reset that Engine. GuC
internally manages acquiring forcewake and idling the engine before actually
resetting it.

Once the reset is successful, i915 takes over again and handles resubmission.
The scheduler in i915 knows which requests are pending so after resetting
a engine, pending workloads/requests are resubmitted again.

v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the
non-guc funtion names.

v3: Removed debug message about engine restarting from which request,
since the new baseline do it regardless of submission mode. (Chris)

Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            | 24 +++++++++++----
 drivers/gpu/drm/i915/i915_drv.h            |  1 +
 drivers/gpu/drm/i915/i915_guc_submission.c | 48 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_guc_fwif.h      |  6 ++++
 drivers/gpu/drm/i915/intel_uc.h            |  1 +
 drivers/gpu/drm/i915/intel_uncore.c        |  5 ----
 6 files changed, 75 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 51663abda5db..50c6d4cdecee 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1910,11 +1910,21 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	 */
 	i915_gem_reset_engine(engine, active_request);
 
-	/* finally, reset engine */
-	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
-	if (ret) {
-		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
-		goto out;
+	if (!dev_priv->guc.execbuf_client) {
+		/* finally, reset engine */
+		ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+		if (ret) {
+			DRM_ERROR("Failed to reset %s, ret=%d\n",
+				  engine->name, ret);
+			goto out;
+		}
+	} else {
+		ret = i915_guc_reset_engine(engine);
+		if (ret) {
+			DRM_ERROR("GuC failed to reset %s, ret=%d\n",
+				  engine->name, ret);
+			goto out;
+		}
 	}
 
 	i915_gem_reset_finish_engine(engine);
@@ -1928,6 +1938,10 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	if (ret)
 		goto out;
 
+	/* for guc too */
+	if (dev_priv->guc.execbuf_client)
+		i915_guc_submission_reenable_engine(engine);
+
 	error->reset_engine_count[engine->id]++;
 out:
 	return ret;
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d2709dd361b4..7fba5cf4db1b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3054,6 +3054,7 @@ extern void i915_reset(struct drm_i915_private *dev_priv);
 extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_reset_guc(struct drm_i915_private *dev_priv);
+extern int i915_guc_reset_engine(struct intel_engine_cs *engine);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
 extern unsigned long i915_chipset_val(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 8650b6ec2f2d..66bf521e5007 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1361,6 +1361,25 @@ void i915_guc_submission_disable(struct drm_i915_private *dev_priv)
 	guc->execbuf_client = NULL;
 }
 
+void i915_guc_submission_reenable_engine(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct intel_guc *guc = &dev_priv->guc;
+	struct i915_guc_client *client = guc->execbuf_client;
+	const int wqi_size = sizeof(struct guc_wq_item);
+	struct drm_i915_gem_request *rq;
+
+	GEM_BUG_ON(!client);
+	intel_guc_sample_forcewake(guc);
+
+	spin_lock_irq(&engine->timeline->lock);
+	list_for_each_entry(rq, &engine->timeline->requests, link) {
+		guc_client_update_wq_rsvd(client, wqi_size);
+		__i915_guc_submit(rq);
+	}
+	spin_unlock_irq(&engine->timeline->lock);
+}
+
 /**
  * intel_guc_suspend() - notify GuC entering suspend state
  * @dev_priv:	i915 device private
@@ -1412,3 +1431,32 @@ int intel_guc_resume(struct drm_i915_private *dev_priv)
 
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
+
+int i915_guc_reset_engine(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct intel_guc *guc = &dev_priv->guc;
+	struct i915_gem_context *ctx;
+	u32 data[7];
+
+	if (!i915.enable_guc_submission)
+		return 0;
+
+	ctx = dev_priv->kernel_context;
+
+	/*
+	 * The affected context report is populated by GuC and is provided
+	 * to the driver using the shared page. We request for it but don't
+	 * use it as scheduler has all of these details.
+	 */
+	data[0] = INTEL_GUC_ACTION_REQUEST_ENGINE_RESET;
+	data[1] = engine->guc_id;
+	data[2] = INTEL_GUC_RESET_OPTION_REPORT_AFFECTED_CONTEXTS;
+	data[3] = 0;
+	data[4] = 0;
+	data[5] = guc->execbuf_client->stage_id;
+	/* first page is shared data with GuC */
+	data[6] = guc_ggtt_offset(ctx->engine[RCS].state);
+
+	return intel_guc_send(guc, data, ARRAY_SIZE(data));
+}
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index e6f8079df94a..081f2cf614e6 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -505,6 +505,7 @@ union guc_log_control {
 /* This Action will be programmed in C180 - SOFT_SCRATCH_O_REG */
 enum intel_guc_action {
 	INTEL_GUC_ACTION_DEFAULT = 0x0,
+	INTEL_GUC_ACTION_REQUEST_ENGINE_RESET = 0x3,
 	INTEL_GUC_ACTION_SAMPLE_FORCEWAKE = 0x6,
 	INTEL_GUC_ACTION_ALLOCATE_DOORBELL = 0x10,
 	INTEL_GUC_ACTION_DEALLOCATE_DOORBELL = 0x20,
@@ -518,6 +519,11 @@ enum intel_guc_action {
 	INTEL_GUC_ACTION_LIMIT
 };
 
+/* Reset engine options */
+enum action_engine_reset_options {
+	INTEL_GUC_RESET_OPTION_REPORT_AFFECTED_CONTEXTS = 0x10,
+};
+
 /*
  * The GuC sends its response to a command by overwriting the
  * command in SS0. The response is distinguishable from a command
diff --git a/drivers/gpu/drm/i915/intel_uc.h b/drivers/gpu/drm/i915/intel_uc.h
index 930f2e17b863..6323d002664d 100644
--- a/drivers/gpu/drm/i915/intel_uc.h
+++ b/drivers/gpu/drm/i915/intel_uc.h
@@ -246,6 +246,7 @@ int i915_guc_wq_reserve(struct drm_i915_gem_request *rq);
 void i915_guc_wq_unreserve(struct drm_i915_gem_request *request);
 void i915_guc_submission_disable(struct drm_i915_private *dev_priv);
 void i915_guc_submission_fini(struct drm_i915_private *dev_priv);
+void i915_guc_submission_reenable_engine(struct intel_engine_cs *engine);
 struct i915_vma *intel_guc_allocate_vma(struct intel_guc *guc, u32 size);
 
 /* intel_guc_log.c */
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 56fd3e9d3fbf..cfa7f717a570 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1840,14 +1840,9 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
-/*
- * When GuC submission is enabled, GuC manages ELSP and can initiate the
- * engine reset too. For now, fall back to full GPU reset if it is enabled.
- */
 bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
 {
 	return (dev_priv->info.has_reset_engine &&
-		!dev_priv->guc.execbuf_client &&
 		i915.reset >= 2);
 }
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (13 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 14/20] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

For watchdog / media reset, the firmware must know the address of the shared
data page (the first page of the default context).

This information should be in DWORD 9 of the GUC_CTL structure.

v2: Use guc_ggtt_offset (Chris).
Store the ggtt offset of the default ctx as we needed for
suspend/resume/reset (Daniele).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_guc_submission.c | 21 ++++++---------------
 drivers/gpu/drm/i915/intel_guc_fwif.h      |  2 +-
 drivers/gpu/drm/i915/intel_guc_loader.c    | 11 +++++++++++
 drivers/gpu/drm/i915/intel_uc.h            |  2 ++
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 66bf521e5007..80da01a907cf 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1387,7 +1387,6 @@ void i915_guc_submission_reenable_engine(struct intel_engine_cs *engine)
 int intel_guc_suspend(struct drm_i915_private *dev_priv)
 {
 	struct intel_guc *guc = &dev_priv->guc;
-	struct i915_gem_context *ctx;
 	u32 data[3];
 
 	if (guc->fw.load_status != INTEL_UC_FIRMWARE_SUCCESS)
@@ -1395,13 +1394,11 @@ int intel_guc_suspend(struct drm_i915_private *dev_priv)
 
 	gen9_disable_guc_interrupts(dev_priv);
 
-	ctx = dev_priv->kernel_context;
-
 	data[0] = INTEL_GUC_ACTION_ENTER_S_STATE;
 	/* any value greater than GUC_POWER_D0 */
 	data[1] = GUC_POWER_D1;
-	/* first page is shared data with GuC */
-	data[2] = guc_ggtt_offset(ctx->engine[RCS].state);
+	/* first page of default ctx is shared data with GuC */
+	data[2] = guc->shared_data_offset;
 
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
@@ -1413,7 +1410,6 @@ int intel_guc_suspend(struct drm_i915_private *dev_priv)
 int intel_guc_resume(struct drm_i915_private *dev_priv)
 {
 	struct intel_guc *guc = &dev_priv->guc;
-	struct i915_gem_context *ctx;
 	u32 data[3];
 
 	if (guc->fw.load_status != INTEL_UC_FIRMWARE_SUCCESS)
@@ -1422,12 +1418,10 @@ int intel_guc_resume(struct drm_i915_private *dev_priv)
 	if (i915.guc_log_level >= 0)
 		gen9_enable_guc_interrupts(dev_priv);
 
-	ctx = dev_priv->kernel_context;
-
 	data[0] = INTEL_GUC_ACTION_EXIT_S_STATE;
 	data[1] = GUC_POWER_D0;
-	/* first page is shared data with GuC */
-	data[2] = guc_ggtt_offset(ctx->engine[RCS].state);
+	/* first page of default ctx is shared data with GuC */
+	data[2] = guc->shared_data_offset;
 
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
@@ -1436,14 +1430,11 @@ int i915_guc_reset_engine(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 	struct intel_guc *guc = &dev_priv->guc;
-	struct i915_gem_context *ctx;
 	u32 data[7];
 
 	if (!i915.enable_guc_submission)
 		return 0;
 
-	ctx = dev_priv->kernel_context;
-
 	/*
 	 * The affected context report is populated by GuC and is provided
 	 * to the driver using the shared page. We request for it but don't
@@ -1455,8 +1446,8 @@ int i915_guc_reset_engine(struct intel_engine_cs *engine)
 	data[3] = 0;
 	data[4] = 0;
 	data[5] = guc->execbuf_client->stage_id;
-	/* first page is shared data with GuC */
-	data[6] = guc_ggtt_offset(ctx->engine[RCS].state);
+	/* first page of default ctx is shared data with GuC */
+	data[6] = guc->shared_data_offset;
 
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index 081f2cf614e6..a2d0cba2f8b9 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -135,7 +135,7 @@
 #define   GUC_ADS_ADDR_SHIFT		11
 #define   GUC_ADS_ADDR_MASK		0xfffff800
 
-#define GUC_CTL_RSRVD			9
+#define GUC_CTL_SHARED_DATA		9
 
 #define GUC_CTL_MAX_DWORDS		(SOFT_SCRATCH_COUNT - 2) /* [1..14] */
 
diff --git a/drivers/gpu/drm/i915/intel_guc_loader.c b/drivers/gpu/drm/i915/intel_guc_loader.c
index d9045b6e897b..8cd5c2bf9510 100644
--- a/drivers/gpu/drm/i915/intel_guc_loader.c
+++ b/drivers/gpu/drm/i915/intel_guc_loader.c
@@ -108,6 +108,7 @@ static void guc_params_init(struct drm_i915_private *dev_priv)
 {
 	struct intel_guc *guc = &dev_priv->guc;
 	u32 params[GUC_CTL_MAX_DWORDS];
+	struct i915_gem_context *ctx;
 	int i;
 
 	memset(&params, 0, sizeof(params));
@@ -156,6 +157,16 @@ static void guc_params_init(struct drm_i915_private *dev_priv)
 		params[GUC_CTL_FEATURE] &= ~GUC_CTL_DISABLE_SCHEDULER;
 	}
 
+	/*
+	 * For watchdog / media reset, GuC must know the address of the shared
+	 * data page, which is the first page of the default context.
+	 * We will also use this page in several places (suspend/resume/reset),
+	 * so save the ggtt offset.
+	 */
+	ctx = dev_priv->kernel_context;
+	guc->shared_data_offset = guc_ggtt_offset(ctx->engine[RCS].state);
+	params[GUC_CTL_SHARED_DATA] = guc->shared_data_offset;
+
 	I915_WRITE(SOFT_SCRATCH(0), 0);
 
 	for (i = 0; i < GUC_CTL_MAX_DWORDS; i++)
diff --git a/drivers/gpu/drm/i915/intel_uc.h b/drivers/gpu/drm/i915/intel_uc.h
index 6323d002664d..e945e2231bbe 100644
--- a/drivers/gpu/drm/i915/intel_uc.h
+++ b/drivers/gpu/drm/i915/intel_uc.h
@@ -187,6 +187,8 @@ struct intel_guc {
 	DECLARE_BITMAP(doorbell_bitmap, GUC_NUM_DOORBELLS);
 	uint32_t db_cacheline;		/* Cyclic counter mod pagesize	*/
 
+	uint32_t shared_data_offset;	/* First page of default ctx */
+
 	/* GuC's FW specific registers used in MMIO send */
 	struct {
 		u32 base;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (14 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

*** General ***

Watchdog timeout (or "media engine reset") is a feature that allows
userland applications to enable hang detection on individual batch buffers.
The detection mechanism itself is mostly bound to the hardware and the only
thing that the driver needs to do to support this form of hang detection
is to implement the interrupt handling support as well as watchdog command
emission before and after the emitted batch buffer start instruction in the
ring buffer.

The principle of the hang detection mechanism is as follows:

1. Once the decision has been made to enable watchdog timeout for a
particular batch buffer and the driver is in the process of emitting the
batch buffer start instruction into the ring buffer it also emits a
watchdog timer start instruction before and a watchdog timer cancellation
instruction after the batch buffer start instruction in the ring buffer.

2. Once the GPU execution reaches the watchdog timer start instruction
the hardware watchdog counter is started by the hardware. The counter
keeps counting until either reaching a previously configured threshold
value or the timer cancellation instruction is executed.

2a. If the counter reaches the threshold value the hardware fires a
watchdog interrupt that is picked up by the watchdog interrupt handler.
This means that a hang has been detected and the driver needs to deal with
it the same way it would deal with a engine hang detected by the periodic
hang checker. The only difference between the two is that we already blamed
the active request (to ensure an engine reset).

2b. If the batch buffer completes and the execution reaches the watchdog
cancellation instruction before the watchdog counter reaches its
threshold value the watchdog is cancelled and nothing more comes of it.
No hang is detected.

Note about future interaction with preemption: Preemption could happen
in a command sequence prior to watchdog counter getting disabled,
resulting in watchdog being triggered following preemption (e.g. when
watchdog had been enabled in the low priority batch). The driver will
need to explicitly disable the watchdog counter as part of the
preemption sequence.

*** This patch introduces: ***

1. IRQ handler code for watchdog timeout allowing direct hang recovery
based on hardware-driven hang detection, which then integrates directly
with the hang recovery path. This is independent of having per-engine reset
or just full gpu reset.

2. Watchdog specific register information.

Currently the render engine and all available media engines support
watchdog timeout (VECS is only supported in GEN9). The specifications elude
to the BCS engine being supported but that is currently not supported by
this commit.

Note that the value to stop the counter is different between render and
non-render engines in GEN8; GEN9 onwards it's the same.

v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
against false-positives.

v3: Don't use high priority tasklet, use engine_last_submit while
checking for false-positives. From GEN9 onwards, the stop counter bit is
the same for all engines.

v4: Remove unnecessary brackets, use current_seqno to mark the request
as guilty in the hangcheck/capture code.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  4 ++
 drivers/gpu/drm/i915/i915_irq.c         | 12 +++++-
 drivers/gpu/drm/i915/i915_reg.h         |  6 +++
 drivers/gpu/drm/i915/intel_hangcheck.c  | 13 +++++--
 drivers/gpu/drm/i915/intel_lrc.c        | 69 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |  4 ++
 6 files changed, 103 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 7fba5cf4db1b..d7141d65c3a9 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1552,6 +1552,9 @@ struct i915_gpu_error {
 	 * acquire the struct_mutex to reset an engine, we need an explicit
 	 * flag to prevent two concurrent reset-engine attempts.
 	 *
+	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
+	 * I915_RESET_WATCHDOG to report the hang detection cause accurately.
+	 *
 	 * #I915_WEDGED - If reset fails and we can no longer use the GPU,
 	 * we set the #I915_WEDGED bit. Prior to command submission, e.g.
 	 * i915_gem_request_alloc(), this bit is checked and the sequence
@@ -1561,6 +1564,7 @@ struct i915_gpu_error {
 #define I915_RESET_BACKOFF	0
 #define I915_RESET_HANDOFF	1
 #define I915_RESET_ENGINE_IN_PROGRESS	2
+#define I915_RESET_WATCHDOG	3
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
 	/** Number of times an engine has been reset */
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index da4f49447fb0..f195c8fcd2d7 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1336,6 +1336,9 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
 
 	if (tasklet)
 		tasklet_hi_schedule(&engine->irq_tasklet);
+
+	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift))
+		tasklet_schedule(&engine->watchdog_tasklet);
 }
 
 static irqreturn_t gen8_gt_irq_ack(struct drm_i915_private *dev_priv,
@@ -3415,12 +3418,15 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	uint32_t gt_interrupts[] = {
 		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT,
 		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
-			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
+			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
 		0,
 		GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT
@@ -3429,6 +3435,10 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	if (HAS_L3_DPF(dev_priv))
 		gt_interrupts[0] |= GT_RENDER_L3_PARITY_ERROR_INTERRUPT;
 
+	/* VECS watchdog is only available in skl+ */
+	if (INTEL_GEN(dev_priv) >= 9)
+		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
+
 	dev_priv->pm_ier = 0x0;
 	dev_priv->pm_imr = ~dev_priv->pm_ier;
 	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 89888adb9af1..df4ec31cddb1 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1908,6 +1908,11 @@ enum skl_disp_power_wells {
 #define RING_START(base)	_MMIO((base)+0x38)
 #define RING_CTL(base)		_MMIO((base)+0x3c)
 #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
+#define RING_CNTR(base)        _MMIO((base) + 0x178)
+#define   GEN8_WATCHDOG_ENABLE		0
+#define   GEN8_WATCHDOG_DISABLE	1
+#define   GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 & non-render only */
+#define RING_THRESH(base)      _MMIO((base) + 0x17C)
 #define RING_SYNC_0(base)	_MMIO((base)+0x40)
 #define RING_SYNC_1(base)	_MMIO((base)+0x44)
 #define RING_SYNC_2(base)	_MMIO((base)+0x48)
@@ -2386,6 +2391,7 @@ enum skl_disp_power_wells {
 #define GT_BSD_USER_INTERRUPT			(1 << 12)
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
 #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
+#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
 #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
 #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index 9b0ece427bdc..254155ebab45 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -388,7 +388,8 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
 
 static void hangcheck_declare_hang(struct drm_i915_private *i915,
 				   unsigned int hung,
-				   unsigned int stuck)
+				   unsigned int stuck,
+				   unsigned int watchdog)
 {
 	struct intel_engine_cs *engine;
 	char msg[80];
@@ -401,7 +402,8 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
 	if (stuck != hung)
 		hung &= ~stuck;
 	len = scnprintf(msg, sizeof(msg),
-			"%s on ", stuck == hung ? "No progress" : "Hang");
+			"%s on ", watchdog ? "Watchdog timeout" :
+				  stuck == hung ? "No progress" : "Hang");
 	for_each_engine_masked(engine, i915, hung, tmp)
 		len += scnprintf(msg + len, sizeof(msg) - len,
 				 "%s, ", engine->name);
@@ -425,7 +427,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			     gpu_error.hangcheck_work.work);
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
-	unsigned int hung = 0, stuck = 0;
+	unsigned int hung = 0, stuck = 0, watchdog = 0;
 	int busy_count = 0;
 
 	if (!i915.enable_hangcheck)
@@ -437,6 +439,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	if (i915_terminally_wedged(&dev_priv->gpu_error))
 		return;
 
+	if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags))
+		watchdog = 1;
+
 	/* As enabling the GPU requires fairly extensive mmio access,
 	 * periodically arm the mmio checker to see if we are triggering
 	 * any invalid access.
@@ -463,7 +468,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	}
 
 	if (hung)
-		hangcheck_declare_hang(dev_priv, hung, stuck);
+		hangcheck_declare_hang(dev_priv, hung, stuck, watchdog);
 
 	/* Reset timer in case GPU hangs without another request being added */
 	if (busy_count)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 014b30ace8a0..144bb9abd400 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1538,6 +1538,53 @@ static int gen8_emit_flush_render(struct drm_i915_gem_request *request,
 	return 0;
 }
 
+/* From GEN9 onwards, all engines use the same RING_CNTR format */
+static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
+{
+	if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9)
+		return GEN8_WATCHDOG_DISABLE;
+	else
+		return GEN8_XCS_WATCHDOG_DISABLE;
+}
+
+#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
+static void gen8_watchdog_irq_handler(unsigned long data)
+{
+	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
+	struct drm_i915_private *dev_priv = engine->i915;
+	u32 current_seqno;
+
+	intel_uncore_forcewake_get(dev_priv, engine->fw_domains);
+
+	/* Stop the counter to prevent further timeout interrupts */
+	I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));
+
+	current_seqno = intel_engine_get_seqno(engine);
+
+	/* did the request complete after the timer expired? */
+	if (intel_engine_last_submit(engine) == current_seqno)
+		goto fw_put;
+
+	if (engine->hangcheck.watchdog == current_seqno) {
+		/* Make sure the active request will be marked as guilty */
+		engine->hangcheck.stalled = true;
+		engine->hangcheck.seqno = current_seqno;
+
+		/* And try to run the hangcheck_work as soon as possible */
+		set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
+		queue_delayed_work(system_long_wq,
+				   &dev_priv->gpu_error.hangcheck_work, 0);
+	} else {
+		engine->hangcheck.watchdog = current_seqno;
+		/* Re-start the counter, if really hung, it will expire again */
+		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
+		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
+	}
+
+fw_put:
+	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
+}
+
 /*
  * Reserve space for 2 NOOPs at the end of each request to be
  * used as a workaround for not being allowed to do lite
@@ -1631,6 +1678,9 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
 	if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->irq_tasklet.state)))
 		tasklet_kill(&engine->irq_tasklet);
 
+	if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->watchdog_tasklet.state)))
+		tasklet_kill(&engine->watchdog_tasklet);
+
 	dev_priv = engine->i915;
 
 	if (engine->buffer) {
@@ -1689,6 +1739,22 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
 	unsigned shift = engine->irq_shift;
 	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
 	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
+
+	switch (engine->id) {
+	default:
+		/* BCS engine does not support hw watchdog */
+		break;
+	case RCS:
+	case VCS:
+	case VCS2:
+		engine->irq_keep_mask |= (GT_GEN8_WATCHDOG_INTERRUPT << shift);
+		break;
+	case VECS:
+		if (INTEL_GEN(engine->i915) >= 9)
+			engine->irq_keep_mask |=
+				(GT_GEN8_WATCHDOG_INTERRUPT << shift);
+		break;
+	}
 }
 
 static int
@@ -1737,6 +1803,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
 	tasklet_init(&engine->irq_tasklet,
 		     intel_lrc_irq_handler, (unsigned long)engine);
 
+	tasklet_init(&engine->watchdog_tasklet,
+		     gen8_watchdog_irq_handler, (unsigned long)engine);
+
 	logical_ring_default_vfuncs(engine);
 	logical_ring_default_irqs(engine);
 }
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index d33c93444c0d..3906b33f5e74 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -117,6 +117,7 @@ struct intel_instdone {
 struct intel_engine_hangcheck {
 	u64 acthd;
 	u32 seqno;
+	u32 watchdog;
 	enum intel_engine_hangcheck_action action;
 	unsigned long action_timestamp;
 	int deadlock;
@@ -416,6 +417,9 @@ struct intel_engine_cs {
 
 	struct intel_engine_hangcheck hangcheck;
 
+	/* watchdog_tasklet: stop counter and re-schedule hangcheck_work asap */
+	struct tasklet_struct watchdog_tasklet;
+
 	bool needs_cmd_parser;
 
 	/*
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (15 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

Emit the required commands into the ring buffer for starting and
stopping the watchdog timer before/after batch buffer start during
batch buffer submission.

v2: Support watchdog threshold per context engine, merge lri commands,
and move watchdog commands emission to emit_bb_start. Request space of
combined start_watchdog, bb_start and stop_watchdog to avoid any error
after emitting bb_start.

v3: There were too many req->engine in emit_bb_start.
Use GEM_BUG_ON instead of returning a very late EINVAL in the remote
case of watchdog misprogramming; set correct LRI cmd size in
emit_stop_watchdog. (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.h |  4 ++
 drivers/gpu/drm/i915/intel_lrc.c        | 85 +++++++++++++++++++++++++++++++--
 drivers/gpu/drm/i915/intel_ringbuffer.h |  4 ++
 3 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 4af2ab94558b..88700bdbb4e1 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -150,6 +150,10 @@ struct i915_gem_context {
 		u32 *lrc_reg_state;
 		u64 lrc_desc;
 		int pin_count;
+		/** watchdog_threshold: hw watchdog threshold value,
+		 * in clock counts
+		 */
+		u32 watchdog_threshold;
 		bool initialised;
 	} engine[I915_NUM_ENGINES];
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 144bb9abd400..d7a6ba48cd18 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1386,7 +1386,10 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
 			      u64 offset, u32 len,
 			      const unsigned int flags)
 {
+	struct intel_engine_cs *engine = req->engine;
 	u32 *cs;
+	u32 num_dwords;
+	bool watchdog_running = false;
 	int ret;
 
 	/* Don't rely in hw updating PDPs, specially in lite-restore.
@@ -1396,20 +1399,38 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
 	 * not idle). PML4 is allocated during ppgtt init so this is
 	 * not needed in 48-bit.*/
 	if (req->ctx->ppgtt &&
-	    (intel_engine_flag(req->engine) & req->ctx->ppgtt->pd_dirty_rings) &&
+	    (intel_engine_flag(engine) & req->ctx->ppgtt->pd_dirty_rings) &&
 	    !i915_vm_is_48bit(&req->ctx->ppgtt->base) &&
 	    !intel_vgpu_active(req->i915)) {
 		ret = intel_logical_ring_emit_pdps(req);
 		if (ret)
 			return ret;
 
-		req->ctx->ppgtt->pd_dirty_rings &= ~intel_engine_flag(req->engine);
+		req->ctx->ppgtt->pd_dirty_rings &= ~intel_engine_flag(engine);
+	}
+
+	/* bb_start only */
+	num_dwords = 4;
+
+	/* check if watchdog will be required */
+	if (req->ctx->engine[engine->id].watchdog_threshold != 0) {
+		GEM_BUG_ON(!engine->emit_start_watchdog ||
+			   !engine->emit_stop_watchdog);
+
+		/* + start_watchdog (6) + stop_watchdog (4) */
+		num_dwords += 10;
+		watchdog_running = true;
 	}
 
-	cs = intel_ring_begin(req, 4);
+	cs = intel_ring_begin(req, num_dwords);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
+	if (watchdog_running) {
+		/* Start watchdog timer */
+		cs = engine->emit_start_watchdog(req, cs);
+	}
+
 	/* FIXME(BDW): Address space and security selectors. */
 	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
 		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8)) |
@@ -1417,8 +1438,13 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
 	*cs++ = lower_32_bits(offset);
 	*cs++ = upper_32_bits(offset);
 	*cs++ = MI_NOOP;
-	intel_ring_advance(req, cs);
 
+	if (watchdog_running) {
+		/* Cancel watchdog timer */
+		cs = engine->emit_stop_watchdog(req, cs);
+	}
+
+	intel_ring_advance(req, cs);
 	return 0;
 }
 
@@ -1585,6 +1611,49 @@ static void gen8_watchdog_irq_handler(unsigned long data)
 	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
 }
 
+static u32 *gen8_emit_start_watchdog(struct drm_i915_gem_request *req, u32 *cs)
+{
+	struct intel_engine_cs *engine = req->engine;
+	struct i915_gem_context *ctx = req->ctx;
+	struct intel_context *ce = &ctx->engine[engine->id];
+
+	/* XXX: no watchdog support in BCS engine */
+	GEM_BUG_ON(engine->id == BCS);
+
+	/*
+	 * watchdog register must never be programmed to zero. This would
+	 * cause the watchdog counter to exceed and not allow the engine to
+	 * go into IDLE state
+	 */
+	GEM_BUG_ON(ce->watchdog_threshold == 0);
+
+	/* Set counter period */
+	*cs++ = MI_LOAD_REGISTER_IMM(2);
+	*cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base));
+	*cs++ = ce->watchdog_threshold;
+	/* Start counter */
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = GEN8_WATCHDOG_ENABLE;
+	*cs++ = MI_NOOP;
+
+	return cs;
+}
+
+static u32 *gen8_emit_stop_watchdog(struct drm_i915_gem_request *req, u32 *cs)
+{
+	struct intel_engine_cs *engine = req->engine;
+
+	/* XXX: no watchdog support in BCS engine */
+	GEM_BUG_ON(engine->id == BCS);
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = get_watchdog_disable(engine);
+	*cs++ = MI_NOOP;
+
+	return cs;
+}
+
 /*
  * Reserve space for 2 NOOPs at the end of each request to be
  * used as a workaround for not being allowed to do lite
@@ -1853,6 +1922,8 @@ int logical_render_ring_init(struct intel_engine_cs *engine)
 	engine->emit_flush = gen8_emit_flush_render;
 	engine->emit_breadcrumb = gen8_emit_breadcrumb_render;
 	engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_render_sz;
+	engine->emit_start_watchdog = gen8_emit_start_watchdog;
+	engine->emit_stop_watchdog = gen8_emit_stop_watchdog;
 
 	ret = intel_engine_create_scratch(engine, PAGE_SIZE);
 	if (ret)
@@ -1876,6 +1947,12 @@ int logical_xcs_ring_init(struct intel_engine_cs *engine)
 {
 	logical_ring_setup(engine);
 
+	/* BCS engine does not have a watchdog-expired irq */
+	if (engine->id != BCS) {
+		engine->emit_start_watchdog = gen8_emit_start_watchdog;
+		engine->emit_stop_watchdog = gen8_emit_stop_watchdog;
+	}
+
 	return logical_ring_init(engine);
 }
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 3906b33f5e74..13a2bae8462a 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -277,6 +277,10 @@ struct intel_engine_cs {
 
 	int		(*emit_flush)(struct drm_i915_gem_request *request,
 				      u32 mode);
+	u32 *		(*emit_start_watchdog)(struct drm_i915_gem_request *req,
+					       u32 *cs);
+	u32 *		(*emit_stop_watchdog)(struct drm_i915_gem_request *req,
+					      u32 *cs);
 #define EMIT_INVALIDATE	BIT(0)
 #define EMIT_FLUSH	BIT(1)
 #define EMIT_BARRIER	(EMIT_INVALIDATE | EMIT_FLUSH)
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (16 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 19/20] drm/i915: Watchdog timeout: Include threshold value in error state Michel Thierry
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

Final enablement patch for GPU hang detection using watchdog timeout.
Using the gem_context_setparam ioctl, users can specify the desired
timeout value in microseconds, and the driver will do the conversion to
'timestamps'.

The recommended default watchdog threshold for video engines is 60000 us,
since this has been _empirically determined_ to be a good compromise for
low-latency requirements and low rate of false positives. The default
register value is ~106000us and the theoretical max value (all 1s) is
353 seconds.

Note, UABI engine ids and i915 engine ids are different, and this patch
uses the i915 ones. Some kind of mapping table [1] is required if we
decide to use the UABI engine ids.

[1] http://patchwork.freedesktop.org/patch/msgid/20170329135831.30254-2-chris@chris-wilson.co.uk

v2: Fixed get api to return values in microseconds. Threshold updated to
be per context engine. Check for u32 overflow. Capture ctx threshold
value in error state.

v3: Add a way to get array size, short-cut to disable all thresholds,
return EFAULT / EINVAL as needed. Move the capture of the threshold
value in the error state into a new patch. BXT has a different
timestamp base (because why not?).

v4: Checking if watchdog is available should be the first thing to
do, instead of giving false hopes to abi users; remove unnecessary & in
set_watchdog; ignore args->size in getparam.

v5: GEN9-LP platforms have a different crystal clock frequency, use the
right timestamp base for them (magic 8-ball predicts this will change
again later on, so future-proof it). (Daniele)

Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         | 56 +++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_context.c | 95 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.c        |  5 +-
 include/uapi/drm/i915_drm.h             |  1 +
 4 files changed, 155 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index d7141d65c3a9..3a757ef7dec0 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2191,6 +2191,9 @@ struct drm_i915_private {
 	struct drm_i915_fence_reg fence_regs[I915_MAX_NUM_FENCES]; /* assume 965 */
 	int num_fence_regs; /* 8 on pre-965, 16 otherwise */
 
+	/* Command stream timestamp base - helps define watchdog threshold */
+	u32 cs_timestamp_base;
+
 	unsigned int fsb_freq, mem_freq, is_ddr3;
 	unsigned int skl_preferred_vco_freq;
 	unsigned int max_cdclk_freq;
@@ -3554,6 +3557,59 @@ i915_gem_context_lookup_timeline(struct i915_gem_context *ctx,
 	return &vm->timeline.engine[engine->id];
 }
 
+/*
+ * BDW, CHV & SKL+ Timestamp timer resolution = 0.080 uSec,
+ * or 12500000 counts per second, or ~12 counts per microsecond.
+ *
+ * But BXT/GLK Timestamp timer resolution is different, 0.052 uSec,
+ * or 19200000 counts per second, or ~19 counts per microsecond.
+ *
+ * Future-proofing, some day it won't be as simple as just GEN & IS_LP.
+ */
+#define GEN8_TIMESTAMP_CNTS_PER_USEC 12
+#define GEN9_LP_TIMESTAMP_CNTS_PER_USEC 19
+static inline u32 cs_timestamp_in_us(struct drm_i915_private *dev_priv)
+{
+	u32 cs_timestamp_base = dev_priv->cs_timestamp_base;
+
+	if (cs_timestamp_base)
+		return cs_timestamp_base;
+
+	switch (INTEL_GEN(dev_priv)) {
+	default:
+		MISSING_CASE(INTEL_GEN(dev_priv));
+		/* fall through */
+	case 9:
+		cs_timestamp_base = IS_GEN9_LP(dev_priv) ?
+					GEN9_LP_TIMESTAMP_CNTS_PER_USEC :
+					GEN8_TIMESTAMP_CNTS_PER_USEC;
+		break;
+	case 8:
+		cs_timestamp_base = GEN8_TIMESTAMP_CNTS_PER_USEC;
+		break;
+	}
+
+	dev_priv->cs_timestamp_base = cs_timestamp_base;
+	return cs_timestamp_base;
+}
+
+static inline u32
+watchdog_to_us(struct drm_i915_private *dev_priv, u32 value_in_clock_counts)
+{
+	return value_in_clock_counts / cs_timestamp_in_us(dev_priv);
+}
+
+static inline u32
+watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us)
+{
+	u64 threshold = value_in_us * cs_timestamp_in_us(dev_priv);
+
+	if (overflows_type(threshold, u32))
+		return -EINVAL;
+
+	return threshold;
+}
+
 int i915_perf_open_ioctl(struct drm_device *dev, void *data,
 			 struct drm_file *file);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 210d6638e97b..a56e79430082 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -337,6 +337,95 @@ i915_gem_context_create_gvt(struct drm_device *dev)
 	return ctx;
 }
 
+/* Return the timer count threshold in microseconds. */
+int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *dev_priv = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	u32 threshold_in_us[I915_NUM_ENGINES];
+
+	if (!dev_priv->engine[VCS]->emit_start_watchdog)
+		return -ENODEV;
+
+	for_each_engine(engine, dev_priv, id) {
+		struct intel_context *ce = &ctx->engine[id];
+
+		threshold_in_us[id] = watchdog_to_us(dev_priv,
+						     ce->watchdog_threshold);
+	}
+
+	mutex_unlock(&dev_priv->drm.struct_mutex);
+	if (__copy_to_user(u64_to_user_ptr(args->value),
+			   &threshold_in_us,
+			   sizeof(threshold_in_us))) {
+		mutex_lock(&dev_priv->drm.struct_mutex);
+		return -EFAULT;
+	}
+	mutex_lock(&dev_priv->drm.struct_mutex);
+
+	args->size = sizeof(threshold_in_us);
+
+	return 0;
+}
+
+/*
+ * Based on time out value in microseconds (us) calculate
+ * timer count thresholds needed based on core frequency.
+ * Watchdog can be disabled by setting it to 0.
+ */
+int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *dev_priv = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	u32 threshold[I915_NUM_ENGINES];
+
+	if (!dev_priv->engine[VCS]->emit_start_watchdog)
+		return -ENODEV;
+
+	memset(threshold, 0, sizeof(threshold));
+
+	/* shortcut to disable in all engines */
+	if (args->size == 0)
+		goto set_watchdog;
+
+	if (args->size < sizeof(threshold))
+		return -EFAULT;
+
+	mutex_unlock(&dev_priv->drm.struct_mutex);
+	if (copy_from_user(threshold,
+			   u64_to_user_ptr(args->value),
+			   sizeof(threshold))) {
+		mutex_lock(&dev_priv->drm.struct_mutex);
+		return -EFAULT;
+	}
+	mutex_lock(&dev_priv->drm.struct_mutex);
+
+	/* not supported in blitter engine */
+	if (threshold[BCS] != 0)
+		return -EINVAL;
+
+	for_each_engine(engine, dev_priv, id) {
+		threshold[id] = watchdog_to_clock_counts(dev_priv,
+							 threshold[id]);
+
+		if (threshold[id] == -EINVAL)
+			return -EINVAL;
+	}
+
+set_watchdog:
+	for_each_engine(engine, dev_priv, id) {
+		struct intel_context *ce = &ctx->engine[id];
+
+		ce->watchdog_threshold = threshold[id];
+	}
+
+	return 0;
+}
+
 int i915_gem_context_init(struct drm_i915_private *dev_priv)
 {
 	struct i915_gem_context *ctx;
@@ -957,6 +1046,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_BANNABLE:
 		args->value = i915_gem_context_is_bannable(ctx);
 		break;
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_get_watchdog(ctx, args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -1014,6 +1106,9 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		else
 			i915_gem_context_clear_bannable(ctx);
 		break;
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_set_watchdog(ctx, args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index d7a6ba48cd18..5765b6d7da26 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1573,7 +1573,7 @@ static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
 		return GEN8_XCS_WATCHDOG_DISABLE;
 }
 
-#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
+#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)
 static void gen8_watchdog_irq_handler(unsigned long data)
 {
 	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
@@ -1603,7 +1603,8 @@ static void gen8_watchdog_irq_handler(unsigned long data)
 	} else {
 		engine->hangcheck.watchdog = current_seqno;
 		/* Re-start the counter, if really hung, it will expire again */
-		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
+		I915_WRITE_FW(RING_THRESH(engine->mmio_base),
+			      GEN8_WATCHDOG_1000US(dev_priv));
 		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
 	}
 
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index fadedefba6db..18bc0ec618dd 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1308,6 +1308,7 @@ struct drm_i915_gem_context_param {
 #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
 #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
 #define I915_CONTEXT_PARAM_BANNABLE	0x5
+#define I915_CONTEXT_PARAM_WATCHDOG	0x6
 	__u64 value;
 };
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 19/20] drm/i915: Watchdog timeout: Include threshold value in error state
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (17 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 17:46 ` [PATCH v8 20/20] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

Save the watchdog threshold (in us) as part of the engine state.

v2: Only do it for gen8+ (and prevent a missing-case warn).

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h       |  1 +
 drivers/gpu/drm/i915/i915_gpu_error.c | 12 ++++++++----
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 3a757ef7dec0..677c59dd8ae1 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -960,6 +960,7 @@ struct i915_gpu_state {
 			int ban_score;
 			int active;
 			int guilty;
+			int watchdog_threshold;
 		} context;
 
 		struct drm_i915_error_object {
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index ebd5501ec134..515fedb61751 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -388,9 +388,11 @@ static void error_print_context(struct drm_i915_error_state_buf *m,
 				const char *header,
 				const struct drm_i915_error_context *ctx)
 {
-	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, ban score %d guilty %d active %d\n",
+	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, ban score %d guilty %d active %d, watchdog %dus\n",
 		   header, ctx->comm, ctx->pid, ctx->handle, ctx->hw_id,
-		   ctx->ban_score, ctx->guilty, ctx->active);
+		   ctx->ban_score, ctx->guilty, ctx->active,
+		   INTEL_GEN(m->i915) >= 8 ?
+			watchdog_to_us(m->i915, ctx->watchdog_threshold) : 0);
 }
 
 static void error_print_engine(struct drm_i915_error_state_buf *m,
@@ -1349,7 +1351,8 @@ static void error_record_engine_execlists(struct intel_engine_cs *engine,
 }
 
 static void record_context(struct drm_i915_error_context *e,
-			   struct i915_gem_context *ctx)
+			   struct i915_gem_context *ctx,
+			   u32 engine_id)
 {
 	if (ctx->pid) {
 		struct task_struct *task;
@@ -1368,6 +1371,7 @@ static void record_context(struct drm_i915_error_context *e,
 	e->ban_score = ctx->ban_score;
 	e->guilty = ctx->guilty_count;
 	e->active = ctx->active_count;
+	e->watchdog_threshold =	ctx->engine[engine_id].watchdog_threshold;
 }
 
 static void request_record_user_bo(struct drm_i915_gem_request *request,
@@ -1431,7 +1435,7 @@ static void i915_gem_record_rings(struct drm_i915_private *dev_priv,
 			ee->vm = request->ctx->ppgtt ?
 				&request->ctx->ppgtt->base : &ggtt->base;
 
-			record_context(&ee->context, request->ctx);
+			record_context(&ee->context, request->ctx, engine->id);
 
 			/* We need to copy these to an anonymous buffer
 			 * as the simplest method to avoid being overwritten
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v8 20/20] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (18 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 19/20] drm/i915: Watchdog timeout: Include threshold value in error state Michel Thierry
@ 2017-05-22 17:46 ` Michel Thierry
  2017-05-22 18:04 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev11) Patchwork
  2017-06-06  1:02 ` ✗ Fi.CI.BAT: failure for Gen8+ engine-reset (rev12) Patchwork
  21 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-05-22 17:46 UTC (permalink / raw)
  To: intel-gfx

From firmware v8.8, GuC provides the count of media engine resets
(watchdog timeout). This information is available in the GuC shared
context data struct, which resides in the first page of the default
(kernel) lrc context.

Since GuC handled engine resets are transparent for kernel and user,
provide a simple debugfs entry to see the number of times media reset
has happened.

v2: Remove unnecessary struct_mutex, _get_dirty_page and kmap_atomic;
use READ_ONCE. (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c   | 22 ++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_guc_fwif.h | 18 ++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 01d6758d33d7..b38e8512fb95 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1403,6 +1403,26 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 	return 0;
 }
 
+static u32 i915_watchdog_reset_count(struct drm_i915_private *dev_priv)
+{
+	struct i915_gem_context *ctx;
+	struct page *page;
+	struct guc_shared_ctx_data *guc_shared_data;
+	u32 guc_media_reset_count;
+
+	if (!i915.enable_guc_submission)
+		return 0;
+
+	ctx = dev_priv->kernel_context;
+	page = i915_gem_object_get_page(ctx->engine[RCS].state->obj,
+					LRC_GUCSHR_PN);
+	guc_shared_data = kmap(page);
+	guc_media_reset_count = READ_ONCE(guc_shared_data->media_reset_count);
+	kunmap(page);
+
+	return guc_media_reset_count;
+}
+
 static int i915_reset_info(struct seq_file *m, void *unused)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
@@ -1411,6 +1431,8 @@ static int i915_reset_info(struct seq_file *m, void *unused)
 	enum intel_engine_id id;
 
 	seq_printf(m, "full gpu reset = %u\n", i915_reset_count(error));
+	seq_printf(m, "GuC watchdog/media reset = %u\n",
+		   i915_watchdog_reset_count(dev_priv));
 
 	for_each_engine(engine, dev_priv, id) {
 		seq_printf(m, "%s = %u\n", engine->name,
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index a2d0cba2f8b9..e45987f7aa50 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -502,6 +502,24 @@ union guc_log_control {
 	u32 value;
 } __packed;
 
+/* GuC Shared Context Data Struct */
+struct guc_shared_ctx_data {
+	u32 addr_of_last_preempted_data_low;
+	u32 addr_of_last_preempted_data_high;
+	u32 addr_of_last_preempted_data_high_tmp;
+	u32 padding;
+	u32 is_mapped_to_proxy;
+	u32 proxy_ctx_id;
+	u32 engine_reset_ctx_id;
+	u32 media_reset_count;
+	u32 reserved[8];
+	u32 uk_last_ctx_switch_reason;
+	u32 was_reset;
+	u32 lrca_gpu_addr;
+	u32 execlist_ctx;
+	u32 reserved1[32];
+} __packed;
+
 /* This Action will be programmed in C180 - SOFT_SCRATCH_O_REG */
 enum intel_guc_action {
 	INTEL_GUC_ACTION_DEFAULT = 0x0,
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev11)
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (19 preceding siblings ...)
  2017-05-22 17:46 ` [PATCH v8 20/20] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
@ 2017-05-22 18:04 ` Patchwork
  2017-06-06  1:02 ` ✗ Fi.CI.BAT: failure for Gen8+ engine-reset (rev12) Patchwork
  21 siblings, 0 replies; 34+ messages in thread
From: Patchwork @ 2017-05-22 18:04 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

== Series Details ==

Series: Gen8+ engine-reset (rev11)
URL   : https://patchwork.freedesktop.org/series/21868/
State : success

== Summary ==

Series 21868v11 Gen8+ engine-reset
https://patchwork.freedesktop.org/api/1.0/series/21868/revisions/11/mbox/

Test kms_flip:
        Subgroup basic-flip-vs-dpms:
                pass       -> DMESG-WARN (fi-skl-6700hq) fdo#101144
        Subgroup basic-flip-vs-wf_vblank:
                fail       -> PASS       (fi-skl-6770hq) fdo#99739

fdo#101144 https://bugs.freedesktop.org/show_bug.cgi?id=101144
fdo#99739 https://bugs.freedesktop.org/show_bug.cgi?id=99739

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time:441s
fi-bdw-gvtdvm    total:278  pass:256  dwarn:8   dfail:0   fail:0   skip:14  time:438s
fi-bsw-n3050     total:278  pass:242  dwarn:0   dfail:0   fail:0   skip:36  time:590s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time:516s
fi-byt-j1900     total:278  pass:254  dwarn:0   dfail:0   fail:0   skip:24  time:493s
fi-byt-n2820     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:485s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:417s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:413s
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50  time:420s
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:495s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:472s
fi-kbl-7500u     total:278  pass:255  dwarn:5   dfail:0   fail:0   skip:18  time:462s
fi-kbl-7560u     total:278  pass:263  dwarn:5   dfail:0   fail:0   skip:10  time:566s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:475s
fi-skl-6700hq    total:278  pass:260  dwarn:1   dfail:0   fail:0   skip:17  time:574s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time:464s
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:497s
fi-skl-gvtdvm    total:278  pass:265  dwarn:0   dfail:0   fail:0   skip:13  time:436s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:533s
fi-snb-2600      total:278  pass:249  dwarn:0   dfail:0   fail:0   skip:29  time:407s

5c5ce2ab8183e4bd85a603169e260cc938837e2c drm-tip: 2017y-05m-22d-13h-42m-03s UTC integration manifest
1b35e7e drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs
55b2a8a drm/i915: Watchdog timeout: Include threshold value in error state
6b3b3a9 drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
7c8e3b2 drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
2d0df1b drm/i915: Watchdog timeout: IRQ handler for gen8+
057fcc9 drm/i915: Watchdog timeout: Pass GuC shared data structure during param load
c992439 drm/i915/guc: Add support for reset engine using GuC commands
dd0958f drm/i915/guc: Rename the function that resets the GuC
d6fee33 drm/i915/guc: Provide register list to be saved/restored during engine reset
072b5a0 drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder
be5c46e drm/i915/selftests: reset engine self tests
6b31b7c drm/i915: Add engine reset count in get-reset-stats ioctl
8d68656 drm/i915: Enable Engine reset and recovery support
d2e1dcc drm/i915: Carry on with reset even if hw engine is not ready
6b5864d drm/i915: Export per-engine reset count info to debugfs
b7021f1 drm/i915: Add engine reset count to error state
4edfcaf drm/i915: Add support for per engine reset recovery
1317a93 drm/i915: Modify error handler for per engine hang recovery
7ac0078 drm/i915: Update i915.reset to handle engine resets
88e47d5 drm/i915: Look for active requests earlier in the reset path

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4773/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery
  2017-05-22 17:46 ` [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery Michel Thierry
@ 2017-06-02 20:16   ` Chris Wilson
  2017-06-02 20:38     ` Michel Thierry
  2017-06-04 11:47     ` Chris Wilson
  2017-06-06  0:45   ` [PATCH v9] " Michel Thierry
  1 sibling, 2 replies; 34+ messages in thread
From: Chris Wilson @ 2017-06-02 20:16 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx

Quoting Michel Thierry (2017-05-22 18:46:24)
> +       /* try engine reset first, and continue if fails */

/* Please use sentences when convenient. It looks much neater that way. */

> +       if (intel_has_reset_engine(dev_priv)) {
> +               struct intel_engine_cs *engine;
> +               unsigned int tmp;
> +
> +               /* protect against concurrent reset attempts */
> +               if (test_and_set_bit(I915_RESET_ENGINE_IN_PROGRESS,
> +                                    &dev_priv->gpu_error.flags))
> +                       goto out;
> +
> +               for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
> +                       if (i915_reset_engine(engine) == 0)
> +                               engine_mask &= ~intel_engine_flag(engine);
> +               }
> +
> +               /* clear unconditionally, full reset won't care about it */
> +               clear_bit(I915_RESET_ENGINE_IN_PROGRESS,
> +                         &dev_priv->gpu_error.flags);
> +
> +               if (!engine_mask)
> +                       goto out;
> +       }
> +
> +       /* full reset needs the mutex, stop any other user trying to do so */
>         if (test_and_set_bit(I915_RESET_BACKOFF,
>                              &dev_priv->gpu_error.flags))

We have a big gaping hole here in coordination between a global reset
and per-engine resets.

I think you want to do something like:

if (intel_has_engine_reset()) {
	for_each_engine_mask() {
		if (test_and_set_bit(I915_RESET_ENGINE + engine->id))
			continue;

		if (i915_reset_engine() == 0)
			engine_mask &= ~engine->mask;

		clear_bit(I915_RESET_ENGINE + engine->id);
		wake_up_bit(I915_RESET_ENGINE + engine->id);
	}
}

if (!engine_mask)
	goto out;

if (test_and_set_bit(I915_RESET_BACKOFF))
	goto out;

for_each_engine() {
	wait_on_bit(I915_RESET_ENGINE + engine->id);
	set_bit(I915_RESET_ENGINE);
}

...do global reset...

for_each_engine() {
	clear_bit(I915_RESET_ENGINE);
}

The idea is that any per-engine reset that comes in whilst global is in
progress can skip, and that the global waits for any per-engine reset to
finish before clobbering state. The window in which global reset
completes and a new hang is detected before we clear the bits, I am
judiciously ignoring. Also this should allow us to perform parallel
resets between engines. (Now that's a selling point!)
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery
  2017-06-02 20:16   ` Chris Wilson
@ 2017-06-02 20:38     ` Michel Thierry
  2017-06-02 21:29       ` Chris Wilson
  2017-06-04 11:47     ` Chris Wilson
  1 sibling, 1 reply; 34+ messages in thread
From: Michel Thierry @ 2017-06-02 20:38 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 6/2/2017 1:16 PM, Chris Wilson wrote:
> Quoting Michel Thierry (2017-05-22 18:46:24)
>> +       /* try engine reset first, and continue if fails */
> 
> /* Please use sentences when convenient. It looks much neater that way. */
> 

_less_ broken English:

/*
  * Try engine reset when available. We fall back to full reset if
  * single reset fails.
  */

>> +       if (intel_has_reset_engine(dev_priv)) {
>> +               struct intel_engine_cs *engine;
>> +               unsigned int tmp;
>> +
>> +               /* protect against concurrent reset attempts */
>> +               if (test_and_set_bit(I915_RESET_ENGINE_IN_PROGRESS,
>> +                                    &dev_priv->gpu_error.flags))
>> +                       goto out;
>> +
>> +               for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
>> +                       if (i915_reset_engine(engine) == 0)
>> +                               engine_mask &= ~intel_engine_flag(engine);
>> +               }
>> +
>> +               /* clear unconditionally, full reset won't care about it */
>> +               clear_bit(I915_RESET_ENGINE_IN_PROGRESS,
>> +                         &dev_priv->gpu_error.flags);
>> +
>> +               if (!engine_mask)
>> +                       goto out;
>> +       }
>> +
>> +       /* full reset needs the mutex, stop any other user trying to do so */
>>          if (test_and_set_bit(I915_RESET_BACKOFF,
>>                               &dev_priv->gpu_error.flags))
> 
> We have a big gaping hole here in coordination between a global reset
> and per-engine resets.
> 
> I think you want to do something like:
> 
> if (intel_has_engine_reset()) {
> 	for_each_engine_mask() {
> 		if (test_and_set_bit(I915_RESET_ENGINE + engine->id))
> 			continue;
> 
> 		if (i915_reset_engine() == 0)
> 			engine_mask &= ~engine->mask;
> 
> 		clear_bit(I915_RESET_ENGINE + engine->id);
> 		wake_up_bit(I915_RESET_ENGINE + engine->id);
> 	}
> }
> 
> if (!engine_mask)
> 	goto out;
> 
> if (test_and_set_bit(I915_RESET_BACKOFF))
> 	goto out;
> 
> for_each_engine() {
> 	wait_on_bit(I915_RESET_ENGINE + engine->id);
> 	set_bit(I915_RESET_ENGINE);
> }
> 
> ...do global reset...
> 
> for_each_engine() {
> 	clear_bit(I915_RESET_ENGINE);
> }
> 
> The idea is that any per-engine reset that comes in whilst global is in
> progress can skip, and that the global waits for any per-engine reset to
> finish before clobbering state. The window in which global reset
> completes and a new hang is detected before we clear the bits, I am
> judiciously ignoring. Also this should allow us to perform parallel
> resets between engines. (Now that's a selling point!)

Fair enough, I can change it to support parallel resets.

One thing I forgot to ask, what should I do with the error/reset 
uevents? As it is, we will only tell userspace in case of global reset.

Thanks
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery
  2017-06-02 20:38     ` Michel Thierry
@ 2017-06-02 21:29       ` Chris Wilson
  0 siblings, 0 replies; 34+ messages in thread
From: Chris Wilson @ 2017-06-02 21:29 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx

Quoting Michel Thierry (2017-06-02 21:38:56)
> One thing I forgot to ask, what should I do with the error/reset 
> uevents? As it is, we will only tell userspace in case of global reset.

My first thought is that we don't care to send the uevent for anything
less than a global reset. The original use case was for abrtd style
automatic bug reporters, but afaik no one is harvesting those error
reports any more. I think we could drop it and no one would be the
wiser. (It's not an ABI break if no one notices.) On the other hand, we
do use the uevent to catch bugs (unexpected hangs) in igt. For igt, I
could just disable per-engine resets for the duration of the those tests.
Those tests are not about testing the reset and so missing coverage of
per-engine reset is not an issue.

Since this is an i915-specific feature, I think the actual listeners to
this uevent are going to be rare, tending to 0. I hope.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery
  2017-06-02 20:16   ` Chris Wilson
  2017-06-02 20:38     ` Michel Thierry
@ 2017-06-04 11:47     ` Chris Wilson
  2017-06-04 12:06       ` Chris Wilson
  1 sibling, 1 reply; 34+ messages in thread
From: Chris Wilson @ 2017-06-04 11:47 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx

Quoting Chris Wilson (2017-06-02 21:16:42)
> I think you want to do something like:
> 
> if (intel_has_engine_reset()) {
>         for_each_engine_mask() {
>                 if (test_and_set_bit(I915_RESET_ENGINE + engine->id))
>                         continue;
> 
>                 if (i915_reset_engine() == 0)
>                         engine_mask &= ~engine->mask;
> 
>                 clear_bit(I915_RESET_ENGINE + engine->id);
>                 wake_up_bit(I915_RESET_ENGINE + engine->id);
>         }
> }
> 
> if (!engine_mask)
>         goto out;
> 
> if (test_and_set_bit(I915_RESET_BACKOFF))
>         goto out;
> 
> for_each_engine() {
>         wait_on_bit(I915_RESET_ENGINE + engine->id);
>         set_bit(I915_RESET_ENGINE);
> }

This needs to be in a loop for that sweet layer of paranoia.

while (test_and_set_bit(I915_RESET_ENGINE + engine->id))
	wait_on_bit(I915_RESET_ENGINE + engine->id);
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery
  2017-06-04 11:47     ` Chris Wilson
@ 2017-06-04 12:06       ` Chris Wilson
  2017-06-06  0:40         ` Michel Thierry
  0 siblings, 1 reply; 34+ messages in thread
From: Chris Wilson @ 2017-06-04 12:06 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx

And whilst I'm here, we need to extend I915_PARAM_HAS_GPU_RESET to
indicate having per-engine resets for the complimentary set of igt.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery
  2017-06-04 12:06       ` Chris Wilson
@ 2017-06-06  0:40         ` Michel Thierry
  2017-06-06 10:16           ` Chris Wilson
  0 siblings, 1 reply; 34+ messages in thread
From: Michel Thierry @ 2017-06-06  0:40 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 6/4/2017 5:06 AM, Chris Wilson wrote:
> And whilst I'm here, we need to extend I915_PARAM_HAS_GPU_RESET to
> indicate having per-engine resets for the complimentary set of igt.
> -Chris
> 

Something like this?

         case I915_PARAM_HAS_GPU_RESET:
-               value = i915.enable_hangcheck && 
intel_has_gpu_reset(dev_priv);
+               value = i915.enable_hangcheck;
+               if (value)
+                       value = intel_has_reset_engine(dev_priv) ? 2 :
+                               intel_has_gpu_reset(dev_priv) ? 1 : 0;
                 break;

(you'll probably think of a nicer way to do it)

I'm also sending the patch with the changes to support parallel resets, 
some other patches need rebase after this change, but I'm holding on those.

Thanks,
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v9] drm/i915: Modify error handler for per engine hang recovery
  2017-05-22 17:46 ` [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery Michel Thierry
  2017-06-02 20:16   ` Chris Wilson
@ 2017-06-06  0:45   ` Michel Thierry
  2017-06-06 13:03     ` kbuild test robot
  2017-06-06 13:05     ` kbuild test robot
  1 sibling, 2 replies; 34+ messages in thread
From: Michel Thierry @ 2017-06-06  0:45 UTC (permalink / raw)
  To: intel-gfx

This is a preparatory patch which modifies error handler to do per engine
hang recovery. The actual patch which implements this sequence follows
later in the series. The aim is to prepare existing recovery function to
adapt to this new function where applicable (which fails at this point
because core implementation is lacking) and continue recovery using legacy
full gpu reset.

A helper function is also added to query the availability of engine
reset.

Note that this implementation of engine reset is for i915 directly
submitting to the ELSP, where the driver manages the hang detection,
recovery and resubmission. With GuC submission these tasks are shared
between driver and firmware; i915 will still responsible for detecting a
hang, and when it does it will have to request GuC to reset that Engine and
remind the firmware about the outstanding submissions. This will be
added in different patch.

v2: rebase, advertise engine reset availability in platform definition,
add note about GuC submission.
v3: s/*engine_reset*/*reset_engine*/. (Chris)
Handle reset as 2 level resets, by first going to engine only and fall
backing to full/chip reset as needed, i.e. reset_engine will need the
struct_mutex.
v4: Pass the engine mask to i915_reset. (Chris)
v5: Rebase, update selftests.
v6: Rebase, prepare for mutex-less reset engine.
v7: Pass reset_engine mask as a function parameter, and iterate over the
engine mask for reset_engine. (Chris)
v8: Use i915.reset >=2 in has_reset_engine; remove redundant reset
logging; add a reset-engine-in-progress flag to prevent concurrent
resets, and avoid dual purposing of reset-backoff. (Chris)
v9: Support reset of different engines in parallel (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c     | 13 +++++++++++
 drivers/gpu/drm/i915/i915_drv.h     | 10 +++++++++
 drivers/gpu/drm/i915/i915_irq.c     | 45 +++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_pci.c     |  5 ++++-
 drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++
 5 files changed, 83 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 90b646c51759..feaef8b28277 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1884,6 +1884,19 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	goto finish;
 }
 
+/**
+ * i915_reset_engine - reset GPU engine to recover from a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected.
+ * Returns zero on successful reset or otherwise an error code.
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+	/* FIXME: replace me with engine reset sequence */
+	return -ENODEV;
+}
+
 static int i915_pm_suspend(struct device *kdev)
 {
 	struct pci_dev *pdev = to_pci_dev(kdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 7574d918cd37..5abe6a8f3e04 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -753,6 +753,7 @@ struct intel_csr {
 	func(has_csr); \
 	func(has_ddi); \
 	func(has_dp_mst); \
+	func(has_reset_engine); \
 	func(has_fbc); \
 	func(has_fpga_dbg); \
 	func(has_full_ppgtt); \
@@ -1548,6 +1549,12 @@ struct i915_gpu_error {
 	 * inspect the bit and do the reset directly, otherwise the worker
 	 * waits for the struct_mutex.
 	 *
+	 * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't need to
+	 * acquire the struct_mutex to reset an engine, we need an explicit
+	 * flag to prevent two concurrent reset attempts in the same engine.
+	 * As the number of engines continues to grow, allocate the flags from
+	 * the most significant bits.
+	 *
 	 * #I915_WEDGED - If reset fails and we can no longer use the GPU,
 	 * we set the #I915_WEDGED bit. Prior to command submission, e.g.
 	 * i915_gem_request_alloc(), this bit is checked and the sequence
@@ -1557,6 +1564,7 @@ struct i915_gpu_error {
 #define I915_RESET_BACKOFF	0
 #define I915_RESET_HANDOFF	1
 #define I915_WEDGED		(BITS_PER_LONG - 1)
+#define I915_RESET_ENGINE	(I915_WEDGED - I915_NUM_ENGINES)
 
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
@@ -3056,6 +3064,8 @@ extern void i915_driver_unload(struct drm_device *dev);
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
+extern int i915_reset_engine(struct intel_engine_cs *engine);
+extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 4cd9ee1ba332..4782b14dbcb6 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2722,6 +2722,8 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 		       u32 engine_mask,
 		       const char *fmt, ...)
 {
+	struct intel_engine_cs *engine;
+	unsigned int tmp;
 	va_list args;
 	char error_msg[80];
 
@@ -2744,12 +2746,55 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	if (!engine_mask)
 		goto out;
 
+	/*
+	 * Try engine reset when available. We fall back to full reset if
+	 * single reset fails.
+	 */
+	if (intel_has_reset_engine(dev_priv)) {
+		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
+			BUILD_BUG_ON(I915_RESET_HANDOFF >= I915_RESET_ENGINE);
+			if (test_and_set_bit(I915_RESET_ENGINE + engine->id,
+					     &dev_priv->gpu_error.flags))
+				continue;
+
+			if (i915_reset_engine(engine) == 0)
+				engine_mask &= ~engine_mask;
+
+			/*
+			 * Clear unconditionally because full reset won't
+			 * care about reset-engine flags.
+			 */
+			clear_bit(I915_RESET_ENGINE + engine->id,
+				  &dev_priv->gpu_error.flags);
+			wake_up_bit(&dev_priv->gpu_error.flags,
+				    I915_RESET_ENGINE + engine->id);
+		}
+
+		if (!engine_mask)
+			goto out;
+	}
+
+	/* full reset needs the mutex, stop any other user trying to do so */
 	if (test_and_set_bit(I915_RESET_BACKOFF,
 			     &dev_priv->gpu_error.flags))
 		goto out;
 
+	/* prevent any other reset-engine attempt */
+	for_each_engine(engine, dev_priv, tmp) {
+		while (test_and_set_bit(I915_RESET_ENGINE + engine->id,
+					&dev_priv->gpu_error.flags))
+			wait_on_bit(&dev_priv->gpu_error.flags,
+				    I915_RESET_ENGINE + engine->id,
+				    TASK_UNINTERRUPTIBLE);
+	}
+
 	i915_reset_and_wakeup(dev_priv);
 
+	for_each_engine(engine, dev_priv, tmp) {
+		clear_bit(I915_RESET_ENGINE + engine->id,
+			  &dev_priv->gpu_error.flags);
+	}
+
 out:
 	intel_runtime_pm_put(dev_priv);
 }
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index cf43dc1d539f..537c1d56ecef 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -310,7 +310,8 @@ static const struct intel_device_info intel_haswell_info = {
 	BDW_COLORS, \
 	.has_logical_ring_contexts = 1, \
 	.has_full_48bit_ppgtt = 1, \
-	.has_64bit_reloc = 1
+	.has_64bit_reloc = 1, \
+	.has_reset_engine = 1
 
 static const struct intel_device_info intel_broadwell_info = {
 	BDW_FEATURES,
@@ -341,6 +342,7 @@ static const struct intel_device_info intel_cherryview_info = {
 	.has_gmch_display = 1,
 	.has_aliasing_ppgtt = 1,
 	.has_full_ppgtt = 1,
+	.has_reset_engine = 1,
 	.display_mmio_offset = VLV_DISPLAY_BASE,
 	GEN_CHV_PIPEOFFSETS,
 	CURSOR_OFFSETS,
@@ -388,6 +390,7 @@ static const struct intel_device_info intel_skylake_gt3_info = {
 	.has_aliasing_ppgtt = 1, \
 	.has_full_ppgtt = 1, \
 	.has_full_48bit_ppgtt = 1, \
+	.has_reset_engine = 1, \
 	GEN_DEFAULT_PIPEOFFSETS, \
 	IVB_CURSOR_OFFSETS, \
 	BDW_COLORS
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 9882724bc2b6..1ed3dd8df850 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1719,6 +1719,17 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
+/*
+ * When GuC submission is enabled, GuC manages ELSP and can initiate the
+ * engine reset too. For now, fall back to full GPU reset if it is enabled.
+ */
+bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
+{
+	return (dev_priv->info.has_reset_engine &&
+		!dev_priv->guc.execbuf_client &&
+		i915.reset >= 2);
+}
+
 int intel_guc_reset(struct drm_i915_private *dev_priv)
 {
 	int ret;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* ✗ Fi.CI.BAT: failure for Gen8+ engine-reset (rev12)
  2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
                   ` (20 preceding siblings ...)
  2017-05-22 18:04 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev11) Patchwork
@ 2017-06-06  1:02 ` Patchwork
  21 siblings, 0 replies; 34+ messages in thread
From: Patchwork @ 2017-06-06  1:02 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

== Series Details ==

Series: Gen8+ engine-reset (rev12)
URL   : https://patchwork.freedesktop.org/series/21868/
State : failure

== Summary ==

  CHK     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
  CHK     include/generated/bounds.h
  CHK     include/generated/timeconst.h
  CHK     include/generated/asm-offsets.h
  CALL    scripts/checksyscalls.sh
  CHK     scripts/mod/devicetable-offsets.h
  CHK     include/generated/compile.h
  CHK     kernel/config_data.h
  CC [M]  drivers/gpu/drm/i915/i915_drv.o
In file included from ./include/linux/ioport.h:12:0,
                 from ./include/linux/acpi.h:25,
                 from drivers/gpu/drm/i915/i915_drv.c:30:
drivers/gpu/drm/i915/i915_drv.c: In function ‘i915_reset_engine’:
drivers/gpu/drm/i915/i915_drv.c:1906:23: error: ‘I915_RESET_ENGINE_IN_PROGRESS’ undeclared (first use in this function)
  GEM_BUG_ON(!test_bit(I915_RESET_ENGINE_IN_PROGRESS, &error->flags));
                       ^
./include/linux/compiler.h:179:42: note: in definition of macro ‘unlikely’
 # define unlikely(x) __builtin_expect(!!(x), 0)
                                          ^
drivers/gpu/drm/i915/i915_gem.h:31:26: note: in expansion of macro ‘BUG_ON’
 #define GEM_BUG_ON(expr) BUG_ON(expr)
                          ^
drivers/gpu/drm/i915/i915_drv.c:1906:2: note: in expansion of macro ‘GEM_BUG_ON’
  GEM_BUG_ON(!test_bit(I915_RESET_ENGINE_IN_PROGRESS, &error->flags));
  ^
drivers/gpu/drm/i915/i915_drv.c:1906:14: note: in expansion of macro ‘test_bit’
  GEM_BUG_ON(!test_bit(I915_RESET_ENGINE_IN_PROGRESS, &error->flags));
              ^
drivers/gpu/drm/i915/i915_drv.c:1906:23: note: each undeclared identifier is reported only once for each function it appears in
  GEM_BUG_ON(!test_bit(I915_RESET_ENGINE_IN_PROGRESS, &error->flags));
                       ^
./include/linux/compiler.h:179:42: note: in definition of macro ‘unlikely’
 # define unlikely(x) __builtin_expect(!!(x), 0)
                                          ^
drivers/gpu/drm/i915/i915_gem.h:31:26: note: in expansion of macro ‘BUG_ON’
 #define GEM_BUG_ON(expr) BUG_ON(expr)
                          ^
drivers/gpu/drm/i915/i915_drv.c:1906:2: note: in expansion of macro ‘GEM_BUG_ON’
  GEM_BUG_ON(!test_bit(I915_RESET_ENGINE_IN_PROGRESS, &error->flags));
  ^
drivers/gpu/drm/i915/i915_drv.c:1906:14: note: in expansion of macro ‘test_bit’
  GEM_BUG_ON(!test_bit(I915_RESET_ENGINE_IN_PROGRESS, &error->flags));
              ^
scripts/Makefile.build:302: recipe for target 'drivers/gpu/drm/i915/i915_drv.o' failed
make[4]: *** [drivers/gpu/drm/i915/i915_drv.o] Error 1
scripts/Makefile.build:561: recipe for target 'drivers/gpu/drm/i915' failed
make[3]: *** [drivers/gpu/drm/i915] Error 2
scripts/Makefile.build:561: recipe for target 'drivers/gpu/drm' failed
make[2]: *** [drivers/gpu/drm] Error 2
scripts/Makefile.build:561: recipe for target 'drivers/gpu' failed
make[1]: *** [drivers/gpu] Error 2
Makefile:1016: recipe for target 'drivers' failed
make: *** [drivers] Error 2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery
  2017-06-06  0:40         ` Michel Thierry
@ 2017-06-06 10:16           ` Chris Wilson
  2017-06-06 18:51             ` Michel Thierry
  0 siblings, 1 reply; 34+ messages in thread
From: Chris Wilson @ 2017-06-06 10:16 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx

Quoting Michel Thierry (2017-06-06 01:40:31)
> On 6/4/2017 5:06 AM, Chris Wilson wrote:
> > And whilst I'm here, we need to extend I915_PARAM_HAS_GPU_RESET to
> > indicate having per-engine resets for the complimentary set of igt.
> > -Chris
> > 
> 
> Something like this?
> 
>          case I915_PARAM_HAS_GPU_RESET:
> -               value = i915.enable_hangcheck && 
> intel_has_gpu_reset(dev_priv);
> +               value = i915.enable_hangcheck;
> +               if (value)
> +                       value = intel_has_reset_engine(dev_priv) ? 2 :
> +                               intel_has_gpu_reset(dev_priv) ? 1 : 0;
>                  break;
> 
> (you'll probably think of a nicer way to do it)

I didn't think it was sensible to advertise reset-engine support without
global reset (or the hangcheck to detect the error), and for the time
being we can keep thinking of this as an integer rather than a set of
flags.

So I was just thinking of

value = i915.enable_hangcheck && intel_has_gpu_reset(dev_priv);
if (value && intel_has_reset_engine(dev_priv))
	value = 2;

If you want to propose breaking it into flags

value = !!i915.enable_hangcheck; /* can't remember if this is bool */
if (intel_has_gpu_reset(dev_priv))
	value |= BIT(1);
if (intel_has_reset_engine(dev_priv))
	value |= BIT(2);

Then we need to teach igt to look at the flags.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9] drm/i915: Modify error handler for per engine hang recovery
  2017-06-06  0:45   ` [PATCH v9] " Michel Thierry
@ 2017-06-06 13:03     ` kbuild test robot
  2017-06-06 13:05     ` kbuild test robot
  1 sibling, 0 replies; 34+ messages in thread
From: kbuild test robot @ 2017-06-06 13:03 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1877 bytes --]

Hi Michel,

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on v4.12-rc4 next-20170606]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michel-Thierry/drm-i915-Modify-error-handler-for-per-engine-hang-recovery/20170606-203011
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-randconfig-x014-201723 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers/gpu/drm/i915/intel_uncore.c: In function 'intel_has_reset_engine':
>> drivers/gpu/drm/i915/intel_uncore.c:1730:14: warning: comparison of constant '2' with boolean expression is always false [-Wbool-compare]
      i915.reset >= 2);
                 ^~

vim +/2 +1730 drivers/gpu/drm/i915/intel_uncore.c

  1714		return ret;
  1715	}
  1716	
  1717	bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
  1718	{
  1719		return intel_get_gpu_reset(dev_priv) != NULL;
  1720	}
  1721	
  1722	/*
  1723	 * When GuC submission is enabled, GuC manages ELSP and can initiate the
  1724	 * engine reset too. For now, fall back to full GPU reset if it is enabled.
  1725	 */
  1726	bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
  1727	{
  1728		return (dev_priv->info.has_reset_engine &&
  1729			!dev_priv->guc.execbuf_client &&
> 1730			i915.reset >= 2);
  1731	}
  1732	
  1733	int intel_guc_reset(struct drm_i915_private *dev_priv)
  1734	{
  1735		int ret;
  1736	
  1737		if (!HAS_GUC(dev_priv))
  1738			return -EINVAL;

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 23065 bytes --]

[-- Attachment #3: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v9] drm/i915: Modify error handler for per engine hang recovery
  2017-06-06  0:45   ` [PATCH v9] " Michel Thierry
  2017-06-06 13:03     ` kbuild test robot
@ 2017-06-06 13:05     ` kbuild test robot
  1 sibling, 0 replies; 34+ messages in thread
From: kbuild test robot @ 2017-06-06 13:05 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1553 bytes --]

Hi Michel,

[auto build test ERROR on drm-intel/for-linux-next]
[also build test ERROR on v4.12-rc4 next-20170606]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Michel-Thierry/drm-i915-Modify-error-handler-for-per-engine-hang-recovery/20170606-203011
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-randconfig-x078-06041529 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/gpu/drm/i915/intel_uncore.c: In function 'intel_has_reset_engine':
>> drivers/gpu/drm/i915/intel_uncore.c:1730:14: error: comparison of constant '2' with boolean expression is always false [-Werror=bool-compare]
      i915.reset >= 2);
                 ^~
   cc1: all warnings being treated as errors

vim +/2 +1730 drivers/gpu/drm/i915/intel_uncore.c

  1724	 * engine reset too. For now, fall back to full GPU reset if it is enabled.
  1725	 */
  1726	bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
  1727	{
  1728		return (dev_priv->info.has_reset_engine &&
  1729			!dev_priv->guc.execbuf_client &&
> 1730			i915.reset >= 2);
  1731	}
  1732	
  1733	int intel_guc_reset(struct drm_i915_private *dev_priv)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28194 bytes --]

[-- Attachment #3: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery
  2017-06-06 10:16           ` Chris Wilson
@ 2017-06-06 18:51             ` Michel Thierry
  0 siblings, 0 replies; 34+ messages in thread
From: Michel Thierry @ 2017-06-06 18:51 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 6/6/2017 3:16 AM, Chris Wilson wrote:
> I didn't think it was sensible to advertise reset-engine support without
> global reset (or the hangcheck to detect the error), and for the time
> being we can keep thinking of this as an integer rather than a set of
> flags.
> 
> So I was just thinking of
> 
> value = i915.enable_hangcheck && intel_has_gpu_reset(dev_priv);
> if (value && intel_has_reset_engine(dev_priv))
> 	value = 2;

reset-engine will depend on global reset (it uses gen8_reset_engines 
with the right engine_mask), so yes, lets use this.

Thanks
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2017-06-06 18:51 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-22 17:46 [PATCH v8 00/20] Gen8+ engine-reset Michel Thierry
2017-05-22 17:46 ` [PATCH v8 01/20] drm/i915: Look for active requests earlier in the reset path Michel Thierry
2017-05-22 17:46 ` [PATCH v8 02/20] drm/i915: Update i915.reset to handle engine resets Michel Thierry
2017-05-22 17:46 ` [PATCH v8 03/20] drm/i915: Modify error handler for per engine hang recovery Michel Thierry
2017-06-02 20:16   ` Chris Wilson
2017-06-02 20:38     ` Michel Thierry
2017-06-02 21:29       ` Chris Wilson
2017-06-04 11:47     ` Chris Wilson
2017-06-04 12:06       ` Chris Wilson
2017-06-06  0:40         ` Michel Thierry
2017-06-06 10:16           ` Chris Wilson
2017-06-06 18:51             ` Michel Thierry
2017-06-06  0:45   ` [PATCH v9] " Michel Thierry
2017-06-06 13:03     ` kbuild test robot
2017-06-06 13:05     ` kbuild test robot
2017-05-22 17:46 ` [PATCH v8 04/20] drm/i915: Add support for per engine reset recovery Michel Thierry
2017-05-22 17:46 ` [PATCH v8 05/20] drm/i915: Add engine reset count to error state Michel Thierry
2017-05-22 17:46 ` [PATCH v8 06/20] drm/i915: Export per-engine reset count info to debugfs Michel Thierry
2017-05-22 17:46 ` [PATCH v8 07/20] drm/i915: Carry on with reset even if hw engine is not ready Michel Thierry
2017-05-22 17:46 ` [PATCH v8 08/20] drm/i915: Enable Engine reset and recovery support Michel Thierry
2017-05-22 17:46 ` [PATCH v8 09/20] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
2017-05-22 17:46 ` [PATCH v8 10/20] drm/i915/selftests: reset engine self tests Michel Thierry
2017-05-22 17:46 ` [PATCH v8 11/20] drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder Michel Thierry
2017-05-22 17:46 ` [PATCH v8 12/20] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
2017-05-22 17:46 ` [PATCH v8 13/20] drm/i915/guc: Rename the function that resets the GuC Michel Thierry
2017-05-22 17:46 ` [PATCH v8 14/20] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
2017-05-22 17:46 ` [PATCH v8 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
2017-05-22 17:46 ` [PATCH v8 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
2017-05-22 17:46 ` [PATCH v8 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
2017-05-22 17:46 ` [PATCH v8 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
2017-05-22 17:46 ` [PATCH v8 19/20] drm/i915: Watchdog timeout: Include threshold value in error state Michel Thierry
2017-05-22 17:46 ` [PATCH v8 20/20] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
2017-05-22 18:04 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev11) Patchwork
2017-06-06  1:02 ` ✗ Fi.CI.BAT: failure for Gen8+ engine-reset (rev12) Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.