* [PATCH v7 00/20] Gen8+ engine-reset
@ 2017-04-27 23:12 Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 01/20] drm/i915: Update i915.reset to handle engine resets Michel Thierry
                   ` (25 more replies)
  0 siblings, 26 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

These patches add the reset-engine feature from Gen8 onwards. This is also
referred to as Timeout Detection and Recovery (TDR). It complements the
full gpu reset feature available in i915, but resets only a particular
engine instead of all engines, thus providing a lightweight engine reset
and recovery mechanism.

Thanks to recently merged changes, this implementation now covers not only
execlists but GuC-based submission too; it is still limited to Gen8
onwards. I have also included the changes for watchdog timeout
detection. The GuC related patches are functional, but can be seen as RFC.

Timeout detection relies on the existing hangcheck, which remains the same;
main changes are to the recovery mechanism. Once we detect a hang on a
particular engine we identify the request that caused the hang, skip the
request and adjust head pointers to allow the execution to proceed
normally. After some cleanup, submissions are restarted to process
remaining work queued to that engine.

If engine reset fails to recover the engine correctly, we fall back to a
full gpu reset.
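
As an illustration only, the "skip the hung request and adjust the head"
step can be sketched with a toy queue. This is a minimal model under
stated assumptions: every name below (toy_engine, toy_skip_hung_request,
toy_resume) is hypothetical and none of it is i915 code; the real driver
works on requests and ring head pointers, not on an array of states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names exist in i915. */
enum toy_status { TOY_PENDING, TOY_DONE, TOY_SKIPPED };

struct toy_engine {
	enum toy_status *requests; /* submission queue, oldest first */
	size_t count;              /* number of queued requests */
	size_t head;               /* index of the request being executed */
};

/*
 * Mark the request at head as skipped and advance head past it.
 * Returns 0 on success, -1 if there is no active request.
 */
int toy_skip_hung_request(struct toy_engine *e)
{
	assert(e != NULL);

	if (e->head >= e->count)
		return -1;

	e->requests[e->head] = TOY_SKIPPED;
	e->head++;
	return 0;
}

/* Execute everything still queued; returns how many requests completed. */
size_t toy_resume(struct toy_engine *e)
{
	size_t completed = 0;

	assert(e != NULL);

	while (e->head < e->count) {
		e->requests[e->head++] = TOY_DONE;
		completed++;
	}
	return completed;
}
```

With a queue of four requests hung at index 1, toy_skip_hung_request()
marks that entry as skipped and toy_resume() then completes the two
requests behind it. The -1 return models the "no active request" case,
where there is nothing to skip.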

We can argue about the effectiveness of reset-engine vs full reset when
more than one ring is hung, but the benefits of just resetting one engine
are reduced when the driver has to do it multiple times.

v2: ELSP queue request tracking and reset path changes to handle incomplete
requests during reset. Thanks to Chris Wilson for providing these patches.

v3: Let the waiter keep handling the full gpu reset if it already has the
lock; point out that GuC submission needs a different method to restart
workloads after the engine reset completes.

v4: Handle reset as 2 level resets, by first going to engine only and
falling back to full/chip reset as needed, i.e. reset_engine will need the
struct_mutex.

v5: Rebased after reset flag split in 2, add GuC support, include watchdog
detection patches, addressing comments from prev RFC.

v6: Mutex-less reset engine. Updates in watchdog abi and guc whitelist &
register-restore fixes (including an old patch from Daniele).

v7: Removed leftovers from v5; review comments; ability to cancel the reset
if there's no active request.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Arun Siluvery (7):
  drm/i915: Update i915.reset to handle engine resets
  drm/i915: Modify error handler for per engine hang recovery
  drm/i915: Add support for per engine reset recovery
  drm/i915: Add engine reset count to error state
  drm/i915: Export per-engine reset count info to debugfs
  drm/i915: Enable Engine reset and recovery support
  drm/i915/guc: Provide register list to be saved/restored during engine
    reset

Daniele Ceraolo Spurio (1):
  drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder

Michel Thierry (11):
  drm/i915: Cancel reset-engine if we couldn't find an active request
  drm/i915: Add engine reset count in get-reset-stats ioctl
  drm/i915/selftests: reset engine self tests
  drm/i915/guc: Rename the function that resets the GuC
  drm/i915/guc: Add support for reset engine using GuC commands
  drm/i915: Watchdog timeout: Pass GuC shared data structure during
    param load
  drm/i915: Watchdog timeout: IRQ handler for gen8+
  drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  drm/i915: Watchdog timeout: Include threshold value in error state
  drm/i915: Watchdog timeout: Export media reset count from GuC to
    debugfs

Mika Kuoppala (1):
  drm/i915: Skip reset request if there is one already

 drivers/gpu/drm/i915/i915_debugfs.c              |  43 +++++++
 drivers/gpu/drm/i915/i915_drv.c                  | 109 +++++++++++++++-
 drivers/gpu/drm/i915/i915_drv.h                  |  67 +++++++++-
 drivers/gpu/drm/i915/i915_gem.c                  | 116 ++++++++++-------
 drivers/gpu/drm/i915/i915_gem_context.c          | 109 +++++++++++++++-
 drivers/gpu/drm/i915/i915_gem_context.h          |   4 +
 drivers/gpu/drm/i915/i915_gem_request.c          |   2 +-
 drivers/gpu/drm/i915/i915_gpu_error.c            |  14 +-
 drivers/gpu/drm/i915/i915_guc_submission.c       | 136 ++++++++++++++++++--
 drivers/gpu/drm/i915/i915_irq.c                  |  45 ++++++-
 drivers/gpu/drm/i915/i915_params.c               |   6 +-
 drivers/gpu/drm/i915/i915_params.h               |   2 +-
 drivers/gpu/drm/i915/i915_pci.c                  |   5 +-
 drivers/gpu/drm/i915/i915_reg.h                  |   6 +
 drivers/gpu/drm/i915/intel_engine_cs.c           |  65 +++++++---
 drivers/gpu/drm/i915/intel_guc_fwif.h            |  27 +++-
 drivers/gpu/drm/i915/intel_guc_loader.c          |  11 ++
 drivers/gpu/drm/i915/intel_hangcheck.c           |  13 +-
 drivers/gpu/drm/i915/intel_lrc.c                 | 155 ++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h          |   8 ++
 drivers/gpu/drm/i915/intel_uc.c                  |   4 +-
 drivers/gpu/drm/i915/intel_uc.h                  |   3 +
 drivers/gpu/drm/i915/intel_uncore.c              |  37 +++++-
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 147 +++++++++++++++++++++
 include/uapi/drm/i915_drm.h                      |   7 +-
 25 files changed, 1023 insertions(+), 118 deletions(-)

-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH v7 01/20] drm/i915: Update i915.reset to handle engine resets
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery Michel Thierry
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

In preparation for the engine reset work, update this parameter to handle
more than one type of reset. The default is still full gpu reset for now.
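
The three values from the new parameter description (0=disabled, 1=full
gpu reset, 2=engine reset) could be read as in the sketch below. The enum
and helper names are hypothetical and are not part of the patch. Treating
any non-zero value as "some reset is allowed" is an assumption carried
over from the old bool semantics, not something this patch states.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three documented values of i915.reset. */
enum toy_reset_mode {
	TOY_RESET_DISABLED = 0,
	TOY_RESET_FULL_GPU = 1,
	TOY_RESET_ENGINE   = 2,
};

/* Assumption: any non-zero value still permits a full gpu reset. */
bool toy_any_reset_allowed(int reset_param)
{
	return reset_param != TOY_RESET_DISABLED;
}

/* Engine reset is only requested when the parameter is exactly 2. */
bool toy_engine_reset_requested(int reset_param)
{
	return reset_param == TOY_RESET_ENGINE;
}
```

Under this reading the default of 1 keeps the existing behaviour, and 2
opts in to engine reset while still allowing a full gpu reset.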

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 6 +++---
 drivers/gpu/drm/i915/i915_params.h | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index b6a7e363d076..045cadb77285 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -46,7 +46,7 @@ struct i915_params i915 __read_mostly = {
 	.prefault_disable = 0,
 	.load_detect_test = 0,
 	.force_reset_modeset_test = 0,
-	.reset = true,
+	.reset = 1,
 	.error_capture = true,
 	.invert_brightness = 0,
 	.disable_display = 0,
@@ -115,8 +115,8 @@ MODULE_PARM_DESC(vbt_sdvo_panel_type,
 	"Override/Ignore selection of SDVO panel mode in the VBT "
 	"(-2=ignore, -1=auto [default], index in VBT BIOS table)");
 
-module_param_named_unsafe(reset, i915.reset, bool, 0600);
-MODULE_PARM_DESC(reset, "Attempt GPU resets (default: true)");
+module_param_named_unsafe(reset, i915.reset, int, 0600);
+MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset [default], 2=engine reset)");
 
 #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
 module_param_named(error_capture, i915.error_capture, bool, 0600);
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index 34148cc8637c..febbfdbd30bd 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -51,6 +51,7 @@
 	func(int, use_mmio_flip); \
 	func(int, mmio_debug); \
 	func(int, edp_vswing); \
+	func(int, reset); \
 	func(unsigned int, inject_load_failure); \
 	/* leave bools at the end to not create holes */ \
 	func(bool, alpha_support); \
@@ -60,7 +61,6 @@
 	func(bool, prefault_disable); \
 	func(bool, load_detect_test); \
 	func(bool, force_reset_modeset_test); \
-	func(bool, reset); \
 	func(bool, error_capture); \
 	func(bool, disable_display); \
 	func(bool, verbose_state_checks); \
-- 
2.11.0


* [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 01/20] drm/i915: Update i915.reset to handle engine resets Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-29 14:19   ` Chris Wilson
  2017-05-15 21:14   ` [PATCH " Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 03/20] drm/i915: Add support for per engine reset recovery Michel Thierry
                   ` (23 subsequent siblings)
  25 siblings, 2 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

This is a preparatory patch which modifies the error handler to do per-engine
hang recovery. The actual patch which implements this sequence follows
later in the series. The aim is to make the existing recovery function try
the new per-engine path where applicable (which fails at this point
because the core implementation is lacking) and then continue recovery
using the legacy full gpu reset.

A helper function is also added to query the availability of engine
reset.

The error events used to notify userspace of a reset are adapted to engine
reset such that users listening to these events are not broken. In the
legacy path we report an error event, a reset event before resetting the
gpu, and a reset done event marking the completion of the reset. The same
behaviour is kept, but the reset event is only dispatched once even when
multiple engines are hung. Finally, once the reset is complete, we send the
reset done event as usual.

Note that this implementation of engine reset is for i915 directly
submitting to the ELSP, where the driver manages the hang detection,
recovery and resubmission. With GuC submission these tasks are shared
between driver and firmware; i915 will still be responsible for detecting a
hang, and when it does it will have to request GuC to reset that engine and
remind the firmware about the outstanding submissions. This will be
added in a different patch.

v2: rebase, advertise engine reset availability in platform definition,
add note about GuC submission.
v3: s/*engine_reset*/*reset_engine*/. (Chris)
Handle reset as 2 level resets, by first going to engine only and
falling back to full/chip reset as needed, i.e. reset_engine will need the
struct_mutex.
v4: Pass the engine mask to i915_reset. (Chris)
v5: Rebase, update selftests.
v6: Rebase, prepare for mutex-less reset engine.
v7: Pass reset_engine mask as a function parameter, and iterate over the
engine mask for reset_engine. (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c     | 15 +++++++++++++++
 drivers/gpu/drm/i915/i915_drv.h     |  3 +++
 drivers/gpu/drm/i915/i915_irq.c     | 33 ++++++++++++++++++++++++++++++---
 drivers/gpu/drm/i915/i915_pci.c     |  5 ++++-
 drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++++
 5 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index c7d68e789642..48c8b69d9bde 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1800,6 +1800,8 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	if (!test_bit(I915_RESET_HANDOFF, &error->flags))
 		return;
 
+	DRM_DEBUG_DRIVER("resetting chip\n");
+
 	/* Clear any previous failed attempts at recovery. Time to try again. */
 	if (!i915_gem_unset_wedged(dev_priv))
 		goto wakeup;
@@ -1863,6 +1865,19 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	goto finish;
 }
 
+/**
+ * i915_reset_engine - reset GPU engine to recover from a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected.
+ * Returns zero on successful reset or otherwise an error code.
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+	/* FIXME: replace me with engine reset sequence */
+	return -ENODEV;
+}
+
 static int i915_pm_suspend(struct device *kdev)
 {
 	struct pci_dev *pdev = to_pci_dev(kdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e06af46f5a57..ab7e68626c49 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -814,6 +814,7 @@ struct intel_csr {
 	func(has_ddi); \
 	func(has_decoupled_mmio); \
 	func(has_dp_mst); \
+	func(has_reset_engine); \
 	func(has_fbc); \
 	func(has_fpga_dbg); \
 	func(has_full_ppgtt); \
@@ -3019,6 +3020,8 @@ extern void i915_driver_unload(struct drm_device *dev);
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
+extern int i915_reset_engine(struct intel_engine_cs *engine);
+extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index fd97fe00cd0d..3a59ef1367ec 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2635,11 +2635,13 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
 /**
  * i915_reset_and_wakeup - do process context error handling work
  * @dev_priv: i915 device private
+ * @engine_mask: engine(s) hung - for reset-engine only.
  *
  * Fire an error uevent so userspace can see that a hang or error
  * was detected.
  */
-static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
+static void
+i915_reset_and_wakeup(struct drm_i915_private *dev_priv, u32 engine_mask)
 {
 	struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj;
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
@@ -2648,9 +2650,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 
 	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
 
-	DRM_DEBUG_DRIVER("resetting chip\n");
+	/*
+	 * This event needs to be sent before performing gpu reset. When
+	 * engine resets are supported we iterate through all engines and
+	 * reset hung engines individually. To keep the event dispatch
+	 * mechanism consistent with full gpu reset, this is only sent once
+	 * even when multiple engines are hung. It is also safe to move this
+	 * here because when we are in this function, we will definitely
+	 * perform gpu reset.
+	 */
 	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
 
+	/* try engine reset first, and continue if fails; look mom, no mutex! */
+	if (intel_has_reset_engine(dev_priv)) {
+		struct intel_engine_cs *engine;
+		unsigned int tmp;
+
+		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
+			if (i915_reset_engine(engine) == 0)
+				engine_mask &= ~intel_engine_flag(engine);
+		}
+
+		if (engine_mask)
+			DRM_WARN("per-engine reset failed, promoting to full gpu reset\n");
+		else
+			goto finish;
+	}
+
 	intel_prepare_reset(dev_priv);
 
 	set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
@@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 		kobject_uevent_env(kobj,
 				   KOBJ_CHANGE, reset_done_event);
 
+finish:
 	/*
 	 * Note: The wake_up also serves as a memory barrier so that
 	 * waiters see the updated value of the dev_priv->gpu_error.
@@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 			     &dev_priv->gpu_error.flags))
 		goto out;
 
-	i915_reset_and_wakeup(dev_priv);
+	i915_reset_and_wakeup(dev_priv, engine_mask);
 
 out:
 	intel_runtime_pm_put(dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index f87b0c4e564d..d5002b55cbd8 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -313,7 +313,8 @@ static const struct intel_device_info intel_haswell_info = {
 	BDW_COLORS, \
 	.has_logical_ring_contexts = 1, \
 	.has_full_48bit_ppgtt = 1, \
-	.has_64bit_reloc = 1
+	.has_64bit_reloc = 1, \
+	.has_reset_engine = 1
 
 static const struct intel_device_info intel_broadwell_info = {
 	BDW_FEATURES,
@@ -345,6 +346,7 @@ static const struct intel_device_info intel_cherryview_info = {
 	.has_gmch_display = 1,
 	.has_aliasing_ppgtt = 1,
 	.has_full_ppgtt = 1,
+	.has_reset_engine = 1,
 	.display_mmio_offset = VLV_DISPLAY_BASE,
 	GEN_CHV_PIPEOFFSETS,
 	CURSOR_OFFSETS,
@@ -394,6 +396,7 @@ static const struct intel_device_info intel_skylake_gt3_info = {
 	.has_aliasing_ppgtt = 1, \
 	.has_full_ppgtt = 1, \
 	.has_full_48bit_ppgtt = 1, \
+	.has_reset_engine = 1, \
 	GEN_DEFAULT_PIPEOFFSETS, \
 	IVB_CURSOR_OFFSETS, \
 	BDW_COLORS
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 07a722f74fa1..ab5bdd110ac3 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1776,6 +1776,17 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
+/*
+ * When GuC submission is enabled, GuC manages ELSP and can initiate the
+ * engine reset too. For now, fall back to full GPU reset if it is enabled.
+ */
+bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
+{
+	return (dev_priv->info.has_reset_engine &&
+		!dev_priv->guc.execbuf_client &&
+		i915.reset == 2);
+}
+
 int intel_guc_reset(struct drm_i915_private *dev_priv)
 {
 	int ret;
-- 
2.11.0


* [PATCH v7 03/20] drm/i915: Add support for per engine reset recovery
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 01/20] drm/i915: Update i915.reset to handle engine resets Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:50   ` Chris Wilson
  2017-05-15 21:18   ` [PATCH " Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 04/20] drm/i915: Skip reset request if there is one already Michel Thierry
                   ` (22 subsequent siblings)
  25 siblings, 2 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

This change implements support for per-engine reset as an initial, less
intrusive hang recovery option to be attempted before falling back to the
legacy full GPU reset recovery mode if necessary. This is only supported
from Gen8 onwards.

Hangchecker determines which engines are hung and invokes the error handler
to recover from it. The error handler schedules recovery for each of the
hung engines. The recovery procedure is as follows:
 - identify the request that caused the hang and drop it
 - force the engine to idle: this is done by issuing a reset request
 - reset and re-init the engine
 - restart submissions to the engine

If engine reset fails then we fall back to the heavyweight full gpu reset,
which resets all engines and reinitializes the complete state of HW and SW.
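
The ordering of these steps, including clearing the reset request on both
the success and failure paths, can be sketched with a toy model that just
records which step ran. All names here (toy_step, toy_engine, toy_run,
toy_reset_engine) are hypothetical. The order follows the
i915_reset_engine() hunk in this patch, but the failure injection is
purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step identifiers, in the order the procedure runs them. */
enum toy_step {
	TOY_PREPARE,
	TOY_DROP_REQUEST,
	TOY_REQUEST_RESET,
	TOY_RESET,
	TOY_CANCEL_REQUEST,
	TOY_FINISH,
	TOY_RESTART,
};

struct toy_engine {
	int fail_at;           /* step that reports failure, or -1 for none */
	enum toy_step log[16]; /* steps executed so far, in order */
	size_t nlog;
};

/* Record the step and report failure if it is the injected one. */
static int toy_run(struct toy_engine *e, enum toy_step s)
{
	assert(e->nlog < 16);
	e->log[e->nlog++] = s;
	return ((int)s == e->fail_at) ? -1 : 0;
}

/* Returns 0 on success; non-zero means "promote to full gpu reset". */
int toy_reset_engine(struct toy_engine *e)
{
	int ret;

	ret = toy_run(e, TOY_PREPARE);
	if (ret)
		return ret;

	toy_run(e, TOY_DROP_REQUEST); /* cannot fail in this model */

	ret = toy_run(e, TOY_REQUEST_RESET);
	if (ret)
		return ret;

	ret = toy_run(e, TOY_RESET);
	toy_run(e, TOY_CANCEL_REQUEST); /* always clear the request bit */
	if (ret)
		return ret;

	toy_run(e, TOY_FINISH);
	return toy_run(e, TOY_RESTART);
}
```

Setting fail_at to TOY_RESET shows the failure path: the request bit is
still cleared, but the finish and restart steps are skipped and the
caller is expected to fall back to the full gpu reset.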

v2: Rebase.
v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
calling i915_gem_reset_engine (Chris).
v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
reuse the function for reset_engine.
v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
v6: Clean up reset_engine function to not require mutex, i.e. no need to call
revoke/restore_fences and _retire_requests (Chris).
v7: Remove leftovers from v5, i.e. no need to disable irq, hold
forcewake or wakeup the handoff bit (Chris).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c         | 60 ++++++++++++++++++--
 drivers/gpu/drm/i915/i915_drv.h         | 12 +++-
 drivers/gpu/drm/i915/i915_gem.c         | 97 +++++++++++++++++++--------------
 drivers/gpu/drm/i915/i915_gem_request.c |  2 +-
 drivers/gpu/drm/i915/intel_uncore.c     | 20 +++++++
 5 files changed, 142 insertions(+), 49 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 48c8b69d9bde..ae891529dedd 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1810,7 +1810,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
 
 	pr_notice("drm/i915: Resetting chip after gpu hang\n");
 	disable_irq(dev_priv->drm.irq);
-	ret = i915_gem_reset_prepare(dev_priv);
+	ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
 	if (ret) {
 		DRM_ERROR("GPU recovery failed\n");
 		intel_gpu_reset(dev_priv, ALL_ENGINES);
@@ -1852,7 +1852,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	i915_queue_hangcheck(dev_priv);
 
 finish:
-	i915_gem_reset_finish(dev_priv);
+	i915_gem_reset_finish(dev_priv, ALL_ENGINES);
 	enable_irq(dev_priv->drm.irq);
 
 wakeup:
@@ -1871,11 +1871,63 @@ void i915_reset(struct drm_i915_private *dev_priv)
  *
  * Reset a specific GPU engine. Useful if a hang is detected.
  * Returns zero on successful reset or otherwise an error code.
+ *
+ * Procedure is:
+ *  - identifies the request that caused the hang and it is dropped
+ *  - force engine to idle: this is done by issuing a reset request
+ *  - reset engine
+ *  - restart submissions to the engine
  */
 int i915_reset_engine(struct intel_engine_cs *engine)
 {
-	/* FIXME: replace me with engine reset sequence */
-	return -ENODEV;
+	int ret;
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+
+	GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
+
+	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
+
+	ret = i915_gem_reset_prepare_engine(engine);
+	if (ret) {
+		DRM_ERROR("Previous reset failed - promote to full reset\n");
+		goto out;
+	}
+
+	/*
+	 * the request that caused the hang is stuck on elsp, identify the
+	 * active request and drop it, adjust head to skip the offending
+	 * request to resume executing remaining requests in the queue.
+	 */
+	i915_gem_reset_engine(engine);
+
+	/* forcing engine to idle */
+	ret = intel_reset_engine_start(engine);
+	if (ret) {
+		DRM_ERROR("Failed to disable %s\n", engine->name);
+		goto out;
+	}
+
+	/* finally, reset engine */
+	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+	if (ret) {
+		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
+		intel_reset_engine_cancel(engine);
+		goto out;
+	}
+
+	/* be sure the request reset bit gets cleared */
+	intel_reset_engine_cancel(engine);
+
+	i915_gem_reset_finish_engine(engine);
+
+	/* replay remaining requests in the queue */
+	ret = engine->init_hw(engine);
+	if (ret)
+		goto out; //XXX: ignore this line for now
+
+out:
+	return ret;
 }
 
 static int i915_pm_suspend(struct device *kdev)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ab7e68626c49..efbf34318893 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3022,6 +3022,8 @@ extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
 extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
+extern int intel_reset_engine_start(struct intel_engine_cs *engine);
+extern void intel_reset_engine_cancel(struct intel_engine_cs *engine);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
@@ -3410,7 +3412,6 @@ int __must_check i915_gem_set_global_seqno(struct drm_device *dev, u32 seqno);
 
 struct drm_i915_gem_request *
 i915_gem_find_active_request(struct intel_engine_cs *engine);
-
 void i915_gem_retire_requests(struct drm_i915_private *dev_priv);
 
 static inline bool i915_reset_backoff(struct i915_gpu_error *error)
@@ -3438,11 +3439,16 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
-int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
+int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
+int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
+			   unsigned int engine_mask);
 void i915_gem_reset(struct drm_i915_private *dev_priv);
-void i915_gem_reset_finish(struct drm_i915_private *dev_priv);
+void i915_gem_reset_finish_engine(struct intel_engine_cs *engine);
+void i915_gem_reset_finish(struct drm_i915_private *dev_priv,
+			   unsigned int engine_mask);
 void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
 bool i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
+void i915_gem_reset_engine(struct intel_engine_cs *engine);
 
 void i915_gem_init_mmio(struct drm_i915_private *i915);
 int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 33fb11cc5acc..bce38062f94e 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2793,48 +2793,57 @@ static bool engine_stalled(struct intel_engine_cs *engine)
 	return true;
 }
 
-int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
+/* Ensure irq handler finishes, and not run again. */
+int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 {
-	struct intel_engine_cs *engine;
-	enum intel_engine_id id;
+	struct drm_i915_gem_request *request;
 	int err = 0;
 
-	/* Ensure irq handler finishes, and not run again. */
-	for_each_engine(engine, dev_priv, id) {
-		struct drm_i915_gem_request *request;
-
-		/* Prevent the signaler thread from updating the request
-		 * state (by calling dma_fence_signal) as we are processing
-		 * the reset. The write from the GPU of the seqno is
-		 * asynchronous and the signaler thread may see a different
-		 * value to us and declare the request complete, even though
-		 * the reset routine have picked that request as the active
-		 * (incomplete) request. This conflict is not handled
-		 * gracefully!
-		 */
-		kthread_park(engine->breadcrumbs.signaler);
-
-		/* Prevent request submission to the hardware until we have
-		 * completed the reset in i915_gem_reset_finish(). If a request
-		 * is completed by one engine, it may then queue a request
-		 * to a second via its engine->irq_tasklet *just* as we are
-		 * calling engine->init_hw() and also writing the ELSP.
-		 * Turning off the engine->irq_tasklet until the reset is over
-		 * prevents the race.
-		 */
-		tasklet_kill(&engine->irq_tasklet);
-		tasklet_disable(&engine->irq_tasklet);
 
-		if (engine->irq_seqno_barrier)
-			engine->irq_seqno_barrier(engine);
+	/* Prevent the signaler thread from updating the request
+	 * state (by calling dma_fence_signal) as we are processing
+	 * the reset. The write from the GPU of the seqno is
+	 * asynchronous and the signaler thread may see a different
+	 * value to us and declare the request complete, even though
+	 * the reset routine have picked that request as the active
+	 * (incomplete) request. This conflict is not handled
+	 * gracefully!
+	 */
+	kthread_park(engine->breadcrumbs.signaler);
+
+	/* Prevent request submission to the hardware until we have
+	 * completed the reset in i915_gem_reset_finish(). If a request
+	 * is completed by one engine, it may then queue a request
+	 * to a second via its engine->irq_tasklet *just* as we are
+	 * calling engine->init_hw() and also writing the ELSP.
+	 * Turning off the engine->irq_tasklet until the reset is over
+	 * prevents the race.
+	 */
+	tasklet_kill(&engine->irq_tasklet);
+	tasklet_disable(&engine->irq_tasklet);
 
-		if (engine_stalled(engine)) {
-			request = i915_gem_find_active_request(engine);
-			if (request && request->fence.error == -EIO)
-				err = -EIO; /* Previous reset failed! */
-		}
+	if (engine->irq_seqno_barrier)
+		engine->irq_seqno_barrier(engine);
+
+	if (engine_stalled(engine)) {
+		request = i915_gem_find_active_request(engine);
+		if (request && request->fence.error == -EIO)
+			err = -EIO; /* Previous reset failed! */
 	}
 
+	return err;
+}
+
+int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
+			   unsigned int engine_mask)
+{
+	struct intel_engine_cs *engine;
+	unsigned int tmp;
+	int err = 0;
+
+	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
+		err = i915_gem_reset_prepare_engine(engine);
+
 	i915_gem_revoke_fences(dev_priv);
 
 	return err;
@@ -2920,7 +2929,7 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 	return guilty;
 }
 
-static void i915_gem_reset_engine(struct intel_engine_cs *engine)
+void i915_gem_reset_engine(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *request;
 
@@ -2966,16 +2975,22 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
 	}
 }
 
-void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
+void i915_gem_reset_finish_engine(struct intel_engine_cs *engine)
+{
+	tasklet_enable(&engine->irq_tasklet);
+	kthread_unpark(engine->breadcrumbs.signaler);
+}
+
+void i915_gem_reset_finish(struct drm_i915_private *dev_priv,
+			   unsigned int engine_mask)
 {
 	struct intel_engine_cs *engine;
-	enum intel_engine_id id;
+	unsigned int tmp;
 
 	lockdep_assert_held(&dev_priv->drm.struct_mutex);
 
-	for_each_engine(engine, dev_priv, id) {
-		tasklet_enable(&engine->irq_tasklet);
-		kthread_unpark(engine->breadcrumbs.signaler);
+	for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
+		i915_gem_reset_finish_engine(engine);
 	}
 }
 
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index 6198f6997d05..f69a8c535d5f 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -1216,7 +1216,7 @@ long i915_wait_request(struct drm_i915_gem_request *req,
 	return timeout;
 }
 
-static void engine_retire_requests(struct intel_engine_cs *engine)
+void engine_retire_requests(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *request, *next;
 	u32 seqno = intel_engine_get_seqno(engine);
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index ab5bdd110ac3..3ebba6b2dd74 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1801,6 +1801,26 @@ int intel_guc_reset(struct drm_i915_private *dev_priv)
 	return ret;
 }
 
+/*
+ * On gen8+ a reset request has to be issued via the reset control register
+ * before a GPU engine can be reset in order to stop the command streamer
+ * and idle the engine. This replaces the legacy way of stopping an engine
+ * by writing to the stop ring bit in the MI_MODE register.
+ */
+int intel_reset_engine_start(struct intel_engine_cs *engine)
+{
+	return gen8_reset_engine_start(engine);
+}
+
+/*
+ * It is possible to back off from a previously issued reset request by simply
+ * clearing the reset request bit in the reset control register.
+ */
+void intel_reset_engine_cancel(struct intel_engine_cs *engine)
+{
+	gen8_reset_engine_cancel(engine);
+}
+
 bool intel_uncore_unclaimed_mmio(struct drm_i915_private *dev_priv)
 {
 	return check_for_unclaimed_mmio(dev_priv);
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v7 04/20] drm/i915: Skip reset request if there is one already
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (2 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 03/20] drm/i915: Add support for per engine reset recovery Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-29 14:21   ` Chris Wilson
  2017-04-27 23:12 ` [PATCH v7 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request Michel Thierry
                   ` (21 subsequent siblings)
  25 siblings, 1 reply; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

From: Mika Kuoppala <mika.kuoppala@linux.intel.com>

To perform an engine reset we first disable the engine so that its state
can be captured. This is done by issuing a reset request. Because we reuse
the existing infrastructure, the reset function checks the engine mask and
issues the reset request a second time when the engine is actually reset,
which is unnecessary. To avoid this, check whether the engine has already
been prepared and, if so, return early.
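
For illustration only, the shortcut can be sketched in plain C against a
simulated register. The struct, the function name and the bit values below
are hypothetical stand-ins, not the i915 definitions; the only part taken
from the patch is the "both bits already set, so return early" check.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values are illustrative assumptions, not copied from i915_reg.h. */
#define REQUEST_RESET	(1u << 0)
#define READY_TO_RESET	(1u << 1)

/* Hypothetical stand-in for an engine and its reset control register. */
struct fake_engine {
	uint32_t reset_ctl;	/* simulated register contents */
	unsigned int requests;	/* number of reset requests written */
};

/*
 * Issue a reset request unless the engine is already prepared, i.e. both
 * the request bit and the ready bit are set. Always returns 0 here because
 * the simulated hardware acknowledges immediately.
 */
static int fake_reset_engine_start(struct fake_engine *e)
{
	const uint32_t ready = REQUEST_RESET | READY_TO_RESET;

	if ((e->reset_ctl & ready) == ready)
		return 0;	/* already prepared, nothing to write */

	e->reset_ctl |= REQUEST_RESET;
	e->requests++;

	/* Simulated acknowledgement from the hardware. */
	e->reset_ctl |= READY_TO_RESET;
	return 0;
}
```

Calling the function twice on the same engine writes the request only once,
which is the behaviour the patch is after.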

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 3ebba6b2dd74..120fb440bb8b 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1686,10 +1686,15 @@ int intel_wait_for_register(struct drm_i915_private *dev_priv,
 static int gen8_reset_engine_start(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
+	const i915_reg_t reset_ctrl = RING_RESET_CTL(engine->mmio_base);
+	const u32 ready = RESET_CTL_REQUEST_RESET | RESET_CTL_READY_TO_RESET;
 	int ret;
 
-	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
-		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
+	/* If engine has been already prepared, we can shortcut here */
+	if ((I915_READ_FW(reset_ctrl) & ready) == ready)
+		return 0;
+
+	I915_WRITE_FW(reset_ctrl, _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
 
 	ret = intel_wait_for_register_fw(dev_priv,
 					 RING_RESET_CTL(engine->mmio_base),
-- 
2.11.0


* [PATCH v7 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (3 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 04/20] drm/i915: Skip reset request if there is one already Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-29 14:26   ` Chris Wilson
                     ` (4 more replies)
  2017-04-27 23:12 ` [PATCH v7 06/20] drm/i915: Add engine reset count to error state Michel Thierry
                   ` (20 subsequent siblings)
  25 siblings, 5 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

Before resetting an engine, check that there is an active request and
that the _hung_ request has not completed. If there is no active request,
or the hung request has already completed, the seqno has moved after the
hang declaration and we can skip the reset.

Also store the active request so that we only search for it once.
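
As a rough userspace sketch of the resulting three-way return convention
(NULL, error pointer, or valid request): the ERR_PTR helpers below only
approximate the kernel ones, and struct request, prepare_engine() and
classify() are hypothetical names used for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace approximations of the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, heavily reduced request. */
struct request {
	int fence_error;	/* -EIO if a previous reset failed */
	int completed;		/* non-zero once the request has finished */
};

enum outcome { DO_RESET, SKIP_RESET, FULL_RESET };

/* Mirrors the return values of the prepare step described above. */
static struct request *prepare_engine(int stalled, struct request *active)
{
	if (!stalled)
		return NULL;			/* seqno moved, nothing hung */
	if (!active)
		return ERR_PTR(-ECANCELED);	/* no request found */
	if (active->fence_error == -EIO)
		return ERR_PTR(-EIO);		/* previous reset failed */
	return active;
}

/* Decide what the caller should do with the prepare result. */
static enum outcome classify(struct request *rq)
{
	if (!rq)
		return SKIP_RESET;
	if (IS_ERR(rq))
		return PTR_ERR(rq) == -ECANCELED ? SKIP_RESET : FULL_RESET;
	if (rq->completed)
		return SKIP_RESET;
	return DO_RESET;
}
```

Only a stalled engine with a live, incomplete request and no earlier
failure ends up in DO_RESET; -EIO is the one case that is promoted to a
full reset.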

Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c | 37 +++++++++++++++++++++++++++++--------
 drivers/gpu/drm/i915/i915_drv.h |  6 ++++--
 drivers/gpu/drm/i915/i915_gem.c | 37 ++++++++++++++++++++++++-------------
 3 files changed, 57 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index ae891529dedd..a64e9b63cdbc 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1811,7 +1811,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	pr_notice("drm/i915: Resetting chip after gpu hang\n");
 	disable_irq(dev_priv->drm.irq);
 	ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
-	if (ret) {
+	if (ret == -EIO) {
 		DRM_ERROR("GPU recovery failed\n");
 		intel_gpu_reset(dev_priv, ALL_ENGINES);
 		goto error;
@@ -1883,23 +1883,40 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	int ret;
 	struct drm_i915_private *dev_priv = engine->i915;
 	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct drm_i915_gem_request *active_request;
 
 	GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
 
 	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
 
-	ret = i915_gem_reset_prepare_engine(engine);
-	if (ret) {
-		DRM_ERROR("Previous reset failed - promote to full reset\n");
-		goto out;
+	active_request = i915_gem_reset_prepare_engine(engine);
+	if (!active_request) {
+		DRM_DEBUG_DRIVER("seqno moved after hang declaration, pardoned\n");
+		goto canceled;
+	}
+	if (IS_ERR(active_request)) {
+		ret = PTR_ERR(active_request);
+		if (ret == -ECANCELED) {
+			DRM_DEBUG_DRIVER("no active request found, skip reset\n");
+			goto canceled;
+		} else if (ret) {
+			DRM_DEBUG_DRIVER("Previous reset failed, promote to full reset\n");
+			goto out;
+		}
 	}
 
+	if (__i915_gem_request_completed(active_request, engine->hangcheck.seqno)) {
+		DRM_DEBUG_DRIVER("request completed, skip the reset\n");
+		goto canceled;
+	}
+
+
 	/*
-	 * the request that caused the hang is stuck on elsp, identify the
-	 * active request and drop it, adjust head to skip the offending
+	 * the request that caused the hang is stuck on elsp, we know the
+	 * active request and can drop it, adjust head to skip the offending
 	 * request to resume executing remaining requests in the queue.
 	 */
-	i915_gem_reset_engine(engine);
+	i915_gem_reset_engine(engine, active_request);
 
 	/* forcing engine to idle */
 	ret = intel_reset_engine_start(engine);
@@ -1928,6 +1945,10 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 
 out:
 	return ret;
+
+canceled:
+	i915_gem_reset_finish_engine(engine);
+	return 0;
 }
 
 static int i915_pm_suspend(struct device *kdev)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index efbf34318893..8e93189c2104 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3439,7 +3439,8 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
 			   unsigned int engine_mask);
 void i915_gem_reset(struct drm_i915_private *dev_priv);
@@ -3448,7 +3449,8 @@ void i915_gem_reset_finish(struct drm_i915_private *dev_priv,
 			   unsigned int engine_mask);
 void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
 bool i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
-void i915_gem_reset_engine(struct intel_engine_cs *engine);
+void i915_gem_reset_engine(struct intel_engine_cs *engine,
+			   struct drm_i915_gem_request *request);
 
 void i915_gem_init_mmio(struct drm_i915_private *i915);
 int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index bce38062f94e..4e357d333cc2 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2793,12 +2793,15 @@ static bool engine_stalled(struct intel_engine_cs *engine)
 	return true;
 }
 
-/* Ensure irq handler finishes, and not run again. */
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
+/*
+ * Ensure irq handler finishes, and not run again.
+ * For reset-engine we also store the active request so that we only search
+ * for it once.
+ */
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 {
-	struct drm_i915_gem_request *request;
-	int err = 0;
-
+	struct drm_i915_gem_request *request = NULL;
 
 	/* Prevent the signaler thread from updating the request
 	 * state (by calling dma_fence_signal) as we are processing
@@ -2827,22 +2830,29 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 
 	if (engine_stalled(engine)) {
 		request = i915_gem_find_active_request(engine);
+		if (!request)
+			return ERR_PTR(-ECANCELED); /* Can't find a request, abort! */
+
 		if (request && request->fence.error == -EIO)
-			err = -EIO; /* Previous reset failed! */
+			return ERR_PTR(-EIO); /* Previous reset failed! */
 	}
 
-	return err;
+	return request;
 }
 
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
 			   unsigned int engine_mask)
 {
 	struct intel_engine_cs *engine;
+	struct drm_i915_gem_request *request;
 	unsigned int tmp;
 	int err = 0;
 
-	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
-		err = i915_gem_reset_prepare_engine(engine);
+	for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
+		request = i915_gem_reset_prepare_engine(engine);
+		if (request && IS_ERR(request))
+			err = PTR_ERR(request);
+	}
 
 	i915_gem_revoke_fences(dev_priv);
 
@@ -2929,11 +2939,12 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 	return guilty;
 }
 
-void i915_gem_reset_engine(struct intel_engine_cs *engine)
+void i915_gem_reset_engine(struct intel_engine_cs *engine,
+			   struct drm_i915_gem_request *request)
 {
-	struct drm_i915_gem_request *request;
+	if (!request)
+		request = i915_gem_find_active_request(engine);
 
-	request = i915_gem_find_active_request(engine);
 	if (request && i915_gem_reset_request(request)) {
 		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
 				 engine->name, request->global_seqno);
@@ -2959,7 +2970,7 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
 	for_each_engine(engine, dev_priv, id) {
 		struct i915_gem_context *ctx;
 
-		i915_gem_reset_engine(engine);
+		i915_gem_reset_engine(engine, NULL);
 		ctx = fetch_and_zero(&engine->last_retired_context);
 		if (ctx)
 			engine->context_unpin(engine, ctx);
-- 
2.11.0


* [PATCH v7 06/20] drm/i915: Add engine reset count to error state
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (4 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 07/20] drm/i915: Export per-engine reset count info to debugfs Michel Thierry
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

The driver maintains a count of how many times a given engine has been
reset; it is useful to capture this in the error state as well. It gives an
idea of how the engine was coping with the workloads it was executing
before this error state.

A follow-up patch will provide this information in debugfs.

v2: s/engine_reset/reset_engine/ (Chris)
    Define count as unsigned int (Tvrtko)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c       |  3 ++-
 drivers/gpu/drm/i915/i915_drv.h       | 10 ++++++++++
 drivers/gpu/drm/i915/i915_gpu_error.c |  3 +++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index a64e9b63cdbc..426db8756e95 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1941,8 +1941,9 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	/* replay remaining requests in the queue */
 	ret = engine->init_hw(engine);
 	if (ret)
-		goto out; //XXX: ignore this line for now
+		goto out;
 
+	error->reset_engine_count[engine->id]++;
 out:
 	return ret;
 
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 8e93189c2104..b00ea523a634 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -982,6 +982,7 @@ struct i915_gpu_state {
 		enum intel_engine_hangcheck_action hangcheck_action;
 		struct i915_address_space *vm;
 		int num_requests;
+		u32 reset_count;
 
 		/* position of active request inside the ring */
 		u32 rq_head, rq_post, rq_tail;
@@ -1617,6 +1618,9 @@ struct i915_gpu_error {
 #define I915_RESET_HANDOFF	1
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
+	/** Number of times an engine has been reset */
+	u32 reset_engine_count[I915_NUM_ENGINES];
+
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
 	 * to release the struct_mutex for the reset to procede.
@@ -3439,6 +3443,12 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
+static inline u32 i915_reset_engine_count(struct i915_gpu_error *error,
+					  struct intel_engine_cs *engine)
+{
+	return READ_ONCE(error->reset_engine_count[engine->id]);
+}
+
 struct drm_i915_gem_request *
 i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 14e2064b7653..a2ffb1ef2cfa 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -463,6 +463,7 @@ static void error_print_engine(struct drm_i915_error_state_buf *m,
 	err_printf(m, "  hangcheck action timestamp: %lu, %u ms ago\n",
 		   ee->hangcheck_timestamp,
 		   jiffies_to_msecs(jiffies - ee->hangcheck_timestamp));
+	err_printf(m, "  engine reset count: %u\n", ee->reset_count);
 
 	error_print_request(m, "  ELSP[0]: ", &ee->execlist[0]);
 	error_print_request(m, "  ELSP[1]: ", &ee->execlist[1]);
@@ -1244,6 +1245,8 @@ static void error_record_engine_registers(struct i915_gpu_state *error,
 	ee->hangcheck_timestamp = engine->hangcheck.action_timestamp;
 	ee->hangcheck_action = engine->hangcheck.action;
 	ee->hangcheck_stalled = engine->hangcheck.stalled;
+	ee->reset_count = i915_reset_engine_count(&dev_priv->gpu_error,
+						  engine);
 
 	if (USES_PPGTT(dev_priv)) {
 		int i;
-- 
2.11.0


* [PATCH v7 07/20] drm/i915: Export per-engine reset count info to debugfs
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (5 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 06/20] drm/i915: Add engine reset count to error state Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 08/20] drm/i915: Enable Engine reset and recovery support Michel Thierry
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

Add a new debugfs file, i915_reset_info, that exports the reset counts:
the full gpu reset count and the per-engine reset counts. This is useful
for tests that are expected to trigger a reset; they can check the counts
before and after the test to confirm that the reset happened.
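
A test could read the counts with something like the sketch below. It
parses text in the "<name> = <count>" format printed by this patch;
reset_info_lookup() is a hypothetical helper, and the engine names in any
sample text are illustrative rather than guaranteed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up "<key> = <value>" in text formatted like the i915_reset_info
 * output and store the value in *value.
 * Returns 0 on success, -1 if the key is not present.
 */
static int reset_info_lookup(const char *text, const char *key,
			     unsigned int *value)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* The key must be followed directly by " = <number>". */
		if (strncmp(line, key, klen) == 0 &&
		    sscanf(line + klen, " = %u", value) == 1)
			return 0;

		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}

	return -1;
}
```

A test would call this on the file contents before and after triggering
a hang and compare the two values for the engine under test.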

v2: Include reset engine count in i915_engine_info too (Chris).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 870c470177b5..6444c1a9bd22 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1403,6 +1403,23 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 	return 0;
 }
 
+static int i915_reset_info(struct seq_file *m, void *unused)
+{
+	struct drm_i915_private *dev_priv = node_to_i915(m->private);
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	seq_printf(m, "full gpu reset = %u\n", i915_reset_count(error));
+
+	for_each_engine(engine, dev_priv, id) {
+		seq_printf(m, "%s = %u\n", engine->name,
+			   i915_reset_engine_count(error, engine));
+	}
+
+	return 0;
+}
+
 static int ironlake_drpc_info(struct seq_file *m)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
@@ -3242,6 +3259,7 @@ static int i915_display_info(struct seq_file *m, void *unused)
 static int i915_engine_info(struct seq_file *m, void *unused)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 
@@ -3265,6 +3283,8 @@ static int i915_engine_info(struct seq_file *m, void *unused)
 			   engine->hangcheck.seqno,
 			   jiffies_to_msecs(jiffies - engine->hangcheck.action_timestamp),
 			   engine->timeline->inflight_seqnos);
+		seq_printf(m, "\tReset count: %u\n",
+			   i915_reset_engine_count(error, engine));
 
 		rcu_read_lock();
 
@@ -4777,6 +4797,7 @@ static const struct drm_info_list i915_debugfs_list[] = {
 	{"i915_huc_load_status", i915_huc_load_status_info, 0},
 	{"i915_frequency_info", i915_frequency_info, 0},
 	{"i915_hangcheck_info", i915_hangcheck_info, 0},
+	{"i915_reset_info", i915_reset_info, 0},
 	{"i915_drpc_info", i915_drpc_info, 0},
 	{"i915_emon_status", i915_emon_status, 0},
 	{"i915_ring_freq_table", i915_ring_freq_table, 0},
-- 
2.11.0


* [PATCH v7 08/20] drm/i915: Enable Engine reset and recovery support
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (6 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 07/20] drm/i915: Export per-engine reset count info to debugfs Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 09/20] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

This feature is only available from Gen8 onwards; on earlier gens the
driver uses the legacy full gpu reset.
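
The intended behaviour of the parameter can be summarised by the sketch
below. This is a simplification under stated assumptions: pick_reset_mode()
and the enum are hypothetical, and the real driver check is not part of
this patch and may take other conditions into account.

```c
#include <assert.h>

enum reset_mode { RESET_NONE, RESET_FULL, RESET_ENGINE };

/*
 * gen:   hardware generation
 * param: value of the i915.reset module parameter
 *        (0=disabled, 1=full gpu reset, 2=engine reset)
 */
static enum reset_mode pick_reset_mode(int gen, int param)
{
	if (param <= 0)
		return RESET_NONE;

	/* Engine reset needs both the parameter and Gen8+ hardware. */
	if (param >= 2 && gen >= 8)
		return RESET_ENGINE;

	/* Everything else falls back to the legacy full gpu reset. */
	return RESET_FULL;
}
```

With the new default of 2, a Gen9 device would use engine reset while a
Gen7 device keeps using full gpu reset.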

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_params.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 045cadb77285..14e2c2e57f96 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -46,7 +46,7 @@ struct i915_params i915 __read_mostly = {
 	.prefault_disable = 0,
 	.load_detect_test = 0,
 	.force_reset_modeset_test = 0,
-	.reset = 1,
+	.reset = 2,
 	.error_capture = true,
 	.invert_brightness = 0,
 	.disable_display = 0,
@@ -116,7 +116,7 @@ MODULE_PARM_DESC(vbt_sdvo_panel_type,
 	"(-2=ignore, -1=auto [default], index in VBT BIOS table)");
 
 module_param_named_unsafe(reset, i915.reset, int, 0600);
-MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset [default], 2=engine reset)");
+MODULE_PARM_DESC(reset, "Attempt GPU resets (0=disabled, 1=full gpu reset, 2=engine reset [default])");
 
 #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR)
 module_param_named(error_capture, i915.error_capture, bool, 0600);
-- 
2.11.0


* [PATCH v7 09/20] drm/i915: Add engine reset count in get-reset-stats ioctl
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (7 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 08/20] drm/i915: Enable Engine reset and recovery support Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 10/20] drm/i915/selftests: reset engine self tests Michel Thierry
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

Users/tests relying on the total reset count will start seeing a smaller
number, since most hangs can be handled by an engine reset.
Note that if engine x is reset, a context running on engine y will be
unaware and unaffected.

To start the discussion, include just a total engine reset count. If it
is deemed useful, it can be extended to report each engine separately.
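
The reason a union keeps the uapi layout intact can be checked in
userspace. In this sketch the leading fields are reproduced from memory of
the uapi header (only batch_pending and pad appear in the diff below), the
struct names are hypothetical, and read_engine_count_via_pad() exists only
for the demonstration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout before the change (leading fields are an assumption). */
struct reset_stats_old {
	uint32_t ctx_id;
	uint32_t flags;
	uint32_t reset_count;
	uint32_t batch_active;
	uint32_t batch_pending;
	uint32_t pad;
};

/* Layout after the change: pad and the new counter share storage. */
struct reset_stats_new {
	uint32_t ctx_id;
	uint32_t flags;
	uint32_t reset_count;
	uint32_t batch_active;
	uint32_t batch_pending;
	union {
		uint32_t pad;
		uint32_t reset_engine_count;
	};
};

/* The size and the offset of the last field must not change. */
_Static_assert(sizeof(struct reset_stats_old) ==
	       sizeof(struct reset_stats_new), "size changed");
_Static_assert(offsetof(struct reset_stats_old, pad) ==
	       offsetof(struct reset_stats_new, reset_engine_count),
	       "offset changed");

/* Whatever an old caller wrote to pad is what the new field reads back. */
static uint32_t read_engine_count_via_pad(uint32_t pad_value)
{
	struct reset_stats_new s = { 0 };

	s.pad = pad_value;
	return s.reset_engine_count;
}
```

Old userspace that still writes and reads pad keeps compiling and sees the
same struct size; new userspace reads the counter from the same four bytes.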

v2: s/engine_reset/reset_engine/, use union in uapi to not break compatibility.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 14 +++++++++++---
 include/uapi/drm/i915_drm.h             |  6 +++++-
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index d46a69d3d390..e98d9daa3f00 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -1074,9 +1074,11 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_reset_stats *args = data;
 	struct i915_gem_context *ctx;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
 	int ret;
 
-	if (args->flags || args->pad)
+	if (args->flags)
 		return -EINVAL;
 
 	if (args->ctx_id == DEFAULT_CONTEXT_HANDLE && !capable(CAP_SYS_ADMIN))
@@ -1092,10 +1094,16 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 		return PTR_ERR(ctx);
 	}
 
-	if (capable(CAP_SYS_ADMIN))
+	if (capable(CAP_SYS_ADMIN)) {
 		args->reset_count = i915_reset_count(&dev_priv->gpu_error);
-	else
+		for_each_engine(engine, dev_priv, id)
+			args->reset_engine_count +=
+				i915_reset_engine_count(&dev_priv->gpu_error,
+							engine);
+	} else {
 		args->reset_count = 0;
+		args->reset_engine_count = 0;
+	}
 
 	args->batch_active = ctx->guilty_count;
 	args->batch_pending = ctx->active_count;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index f24a80d2d42e..fadedefba6db 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1278,7 +1278,11 @@ struct drm_i915_reset_stats {
 	/* Number of batches lost pending for execution, for this context */
 	__u32 batch_pending;
 
-	__u32 pad;
+	union {
+		__u32 pad;
+		/* Engine resets since boot/module reload, for all contexts */
+		__u32 reset_engine_count;
+	};
 };
 
 struct drm_i915_gem_userptr {
-- 
2.11.0


* [PATCH v7 10/20] drm/i915/selftests: reset engine self tests
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (8 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 09/20] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 11/20] drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder Michel Thierry
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

Check that we can reset specific engines; also check the fallback to a
full reset when the engine reset fails.
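
Both tests come down to comparing the counters before and after the reset.
A stripped-down version of those two checks, with hypothetical names and
no driver dependencies, might look like this:

```c
#include <assert.h>

/* Snapshot of the two counters the selftests look at. */
struct reset_counts {
	unsigned int full;	/* full gpu resets */
	unsigned int engine;	/* resets of the engine under test */
};

/* Engine reset expected: engine count moves, full count must not. */
static int check_engine_reset(struct reset_counts before,
			      struct reset_counts after)
{
	if (after.full != before.full)
		return -1;	/* a full gpu reset was recorded */
	if (after.engine == before.engine)
		return -1;	/* no engine reset was recorded */
	return 0;
}

/* Fallback expected: full count moves, engine count must not. */
static int check_fallback(struct reset_counts before,
			  struct reset_counts after)
{
	if (after.engine != before.engine)
		return -1;	/* an engine reset was recorded */
	if (after.full == before.full)
		return -1;	/* no full gpu reset was recorded */
	return 0;
}
```

The first helper corresponds to the expectations in igt_reset_engine, the
second to those in igt_render_engine_reset_fallback.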

v2: rebase.

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 147 +++++++++++++++++++++++
 1 file changed, 147 insertions(+)

diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index aa31d6c0cdfb..f64fa0e4bb40 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -322,6 +322,56 @@ static int igt_global_reset(void *arg)
 	return err;
 }
 
+static int igt_reset_engine(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	unsigned int reset_count, reset_engine_count;
+	int err = 0;
+
+	/* Check that we can issue a global GPU and engine reset */
+
+	if (!intel_has_gpu_reset(i915))
+		return 0;
+
+	if (!intel_has_reset_engine(i915))
+		return 0;
+
+	set_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+
+	for_each_engine(engine, i915, id) {
+		reset_count = i915_reset_count(&i915->gpu_error);
+		reset_engine_count = i915_reset_engine_count(&i915->gpu_error,
+							     engine);
+
+		err = i915_reset_engine(engine);
+		if (err) {
+			pr_err("i915_reset_engine failed\n");
+			break;
+		}
+
+		if (i915_reset_count(&i915->gpu_error) != reset_count) {
+			pr_err("Full GPU reset recorded! (engine reset expected)\n");
+			err = -EINVAL;
+			break;
+		}
+
+		if (i915_reset_engine_count(&i915->gpu_error, engine) ==
+		    reset_engine_count) {
+			pr_err("No %s engine reset recorded!\n", engine->name);
+			err = -EINVAL;
+			break;
+		}
+	}
+
+	clear_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+	if (i915_terminally_wedged(&i915->gpu_error))
+		err = -EIO;
+
+	return err;
+}
+
 static u32 fake_hangcheck(struct drm_i915_gem_request *rq)
 {
 	u32 reset_count;
@@ -526,13 +576,110 @@ static int igt_reset_queue(void *arg)
 	return err;
 }
 
+static int igt_render_engine_reset_fallback(void *arg)
+{
+	struct drm_i915_private *i915 = arg;
+	struct intel_engine_cs *engine = i915->engine[RCS];
+	struct hang h;
+	struct drm_i915_gem_request *rq;
+	unsigned int reset_count, reset_engine_count;
+	int err = 0;
+
+	/* Check that we can issue a global GPU and engine reset */
+
+	if (!intel_has_gpu_reset(i915))
+		return 0;
+
+	if (!intel_has_reset_engine(i915))
+		return 0;
+
+	set_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+	mutex_lock(&i915->drm.struct_mutex);
+
+	err = hang_init(&h, i915);
+	if (err)
+		goto unlock;
+
+	rq = hang_create_request(&h, engine, i915->kernel_context);
+	if (IS_ERR(rq)) {
+		err = PTR_ERR(rq);
+		goto fini;
+	}
+
+	i915_gem_request_get(rq);
+	__i915_add_request(rq, true);
+
+	/* make reset engine fail */
+	rq->fence.error = -EIO;
+
+	if (!wait_for_hang(&h, rq)) {
+		pr_err("Failed to start request %x\n", rq->fence.seqno);
+		err = -EIO;
+		goto fini;
+	}
+
+	reset_engine_count = i915_reset_engine_count(&i915->gpu_error, engine);
+	reset_count = fake_hangcheck(rq);
+
+	err = i915_reset_engine(engine);
+	if (err) {
+		pr_err("i915_reset_engine failed\n");
+		goto fini;
+	}
+
+	if (i915_reset_engine_count(&i915->gpu_error, engine) !=
+	    reset_engine_count) {
+		pr_err("render engine reset recorded! (full reset expected)\n");
+		err = -EINVAL;
+		goto fini;
+	}
+
+	if (i915_reset_count(&i915->gpu_error) == reset_count) {
+		pr_err("No full GPU reset recorded!\n");
+		err = -EINVAL;
+		goto fini;
+	}
+
+	/*
+	 * by using fence.error = -EIO, full reset sets the wedged flag, do one
+	 * more full reset to re-enable the hw.
+	 */
+	if (i915_terminally_wedged(&i915->gpu_error)) {
+		rq->fence.error = 0;
+
+		set_bit(I915_RESET_HANDOFF, &i915->gpu_error.flags);
+		i915_reset(i915);
+		GEM_BUG_ON(test_bit(I915_RESET_HANDOFF,
+				    &i915->gpu_error.flags));
+
+		if (i915_reset_count(&i915->gpu_error) == reset_count) {
+			pr_err("No full GPU reset recorded!\n");
+			err = -EINVAL;
+			goto fini;
+		}
+	}
+
+fini:
+	hang_fini(&h);
+unlock:
+	mutex_unlock(&i915->drm.struct_mutex);
+	clear_bit(I915_RESET_BACKOFF, &i915->gpu_error.flags);
+
+	if (i915_terminally_wedged(&i915->gpu_error))
+		return -EIO;
+
+	return err;
+}
+
 int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
 {
 	static const struct i915_subtest tests[] = {
 		SUBTEST(igt_hang_sanitycheck),
 		SUBTEST(igt_global_reset),
+		SUBTEST(igt_reset_engine),
 		SUBTEST(igt_wait_reset),
 		SUBTEST(igt_reset_queue),
+		SUBTEST(igt_render_engine_reset_fallback),
 	};
 
 	if (!intel_has_gpu_reset(i915))
-- 
2.11.0


* [PATCH v7 11/20] drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (9 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 10/20] drm/i915/selftests: reset engine self tests Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 12/20] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

From: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

The mmio_start offset for the whitelist is the first FORCE_TO_NONPRIV
register the GuC can use to restore the provided whitelist when an
engine reset via GuC (which we still don't support) is triggered.

We're currently adding the mmio_base of the engine to the absolute
address of the RCS version of the register, which results in the wrong
offset. Fix it by using the definition we already have instead of
re-defining it in the GuC FW header.
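
As a rough illustration of the offset error described above (a sketch only:
the RCS mmio base of 0x2000 and the helper names are assumptions made for
this example, not taken from the patch), adding the engine base to an
address that is already absolute counts the base twice:

```c
#include <stdint.h>

/* Assumed for illustration: RCS mmio base, and slot 0 of FORCE_TO_NONPRIV
 * expressed relative to an engine base (0x24d0 - 0x2000). */
#define RCS_MMIO_BASE			0x2000u
#define FORCE_TO_NONPRIV_REL_OFFSET	0x4d0u

/* The definition removed by this patch: an absolute RCS address. */
#define GUC_MMIO_WHITE_LIST_START	0x24d0u

/* Old computation: engine base plus an address that already includes
 * the RCS base. */
static uint32_t old_whitelist_start(uint32_t mmio_base)
{
	return mmio_base + GUC_MMIO_WHITE_LIST_START;
}

/* New computation: engine base plus the engine-relative offset, which is
 * what a RING_FORCE_TO_NONPRIV(base, 0) style macro is expected to yield. */
static uint32_t new_whitelist_start(uint32_t mmio_base)
{
	return mmio_base + FORCE_TO_NONPRIV_REL_OFFSET;
}
```

Under these assumptions the old helper is off by exactly one RCS base for
every engine, while the new one reproduces 0x24d0 for RCS.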

Also add a comment to avoid future issues with FORCE_TO_NONPRIV
registers, which are also used by the workaround framework.

v2: improve comment (Michal), move comment about save/restore because it
    is not related to the mmio_white_list field.

v3: rebase/resurrect.

Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Arkadiusz Hiler <arkadiusz.hiler@intel.com>
Cc: Oscar Mateo <oscar.mateo@intel.com>
Reviewed-by: Michał Winiarski <michal.winiarski@intel.com> (v2)
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_guc_submission.c | 11 +++++++++--
 drivers/gpu/drm/i915/intel_guc_fwif.h      |  1 -
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 4cc97bf1bdac..2cfe5d3b7795 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1034,10 +1034,17 @@ static int guc_ads_create(struct intel_guc *guc)
 	/* MMIO reg state */
 	for_each_engine(engine, dev_priv, id) {
 		blob->reg_state.white_list[engine->guc_id].mmio_start =
-			engine->mmio_base + GUC_MMIO_WHITE_LIST_START;
+			i915_mmio_reg_offset(RING_FORCE_TO_NONPRIV(engine->mmio_base, 0));
 
-		/* Nothing to be saved or restored for now. */
+		/*
+		 * Note: if the GuC whitelist management is enabled, the values
+		 * should be filled using the workaround framework to avoid
+		 * inconsistencies with the handling of FORCE_TO_NONPRIV
+		 * registers.
+		 */
 		blob->reg_state.white_list[engine->guc_id].count = 0;
+
+		/* Nothing to be saved or restored for now. */
 	}
 
 	/*
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index 6156845641a3..e6f8079df94a 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -394,7 +394,6 @@ struct guc_policies {
 #define GUC_REGSET_SAVE_CURRENT_VALUE	0x10
 
 #define GUC_REGSET_MAX_REGISTERS	25
-#define GUC_MMIO_WHITE_LIST_START	0x24d0
 #define GUC_MMIO_WHITE_LIST_MAX		12
 #define GUC_S3_SAVE_SPACE_PAGES		10
 
-- 
2.11.0


* [PATCH v7 12/20] drm/i915/guc: Provide register list to be saved/restored during engine reset
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (10 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 11/20] drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:58   ` Chris Wilson
  2017-04-27 23:12 ` [PATCH v7 13/20] drm/i915/guc: Rename the function that resets the GuC Michel Thierry
                   ` (13 subsequent siblings)
  25 siblings, 1 reply; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

From: Arun Siluvery <arun.siluvery@linux.intel.com>

GuC expects a list of registers from the driver which are saved/restored
during engine reset. The type of value to be saved is controlled by
flags. We provide a minimal set of registers that we want GuC to save and
restore. The list matters little for a driver-initiated engine reset, as
the driver reinitializes most of these registers afterwards, but a media
reset (aka watchdog reset) is completely internal to GuC (including
resubmission of the hung workload), so the list is necessary; otherwise GuC
won't be able to schedule further workloads after a reset. This is the
minimal set of registers identified for things to work as expected, but if
we see any new issues, this register list can be expanded.

In order not to lose any existing workarounds, we have to let GuC know
the registers and their values. These will be reapplied after the reset.
Note that we can't just read the current value because most of these
registers are masked (so we have a workaround for a workaround for a
workaround).
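
To illustrate why a plain readback is not enough (a sketch only: the
register model and helper names below are hypothetical, assuming the usual
masked-register convention where bits 31:16 of a write select which of bits
15:0 may change and a read returns the mask bits as zero):

```c
#include <stdint.h>

/* Same encoding as i915's _MASKED_BIT_ENABLE(): mask in the upper 16 bits,
 * value in the lower 16 bits. */
static uint32_t masked_bit_enable(uint32_t bit)
{
	return (bit << 16) | bit;
}

/* Hypothetical model of a masked register write: only bits whose mask bit
 * is set are updated. The result is what a later read would return, i.e.
 * the low 16 bits without any mask. */
static uint32_t masked_reg_write(uint32_t current, uint32_t val)
{
	uint32_t mask = val >> 16;

	return ((current & ~mask) | (val & mask)) & 0xffffu;
}
```

In this model, writing a value that was read back from the register carries
an empty mask and changes nothing, so a register that was cleared by a reset
stays cleared. Replaying the stored value|mask pair is what actually
re-applies the workaround.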

v2: REGSET_MASKED is too difficult for GuC, use REGSET_SAVE_DEFAULT_VALUE
and current value from RING_MODE reg instead; no need to preserve
head/tail either, be extra paranoid and save whitelisted registers (Daniele).

v3: Workarounds added only once during _init_workarounds also have to
be restored, or we risk losing them after an internal GuC reset
(Daniele).

v4: Rename the macro used to keep track of the workaround registers we
will have to restore after reset (s/I915_GUC_REG_WRITE/WA_REG_WR_GUC_RESTORE).

Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h            |  3 ++
 drivers/gpu/drm/i915/i915_guc_submission.c | 68 +++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_engine_cs.c     | 65 +++++++++++++++++++---------
 3 files changed, 114 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index b00ea523a634..c9ff7f726d47 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1913,7 +1913,10 @@ struct i915_wa_reg {
 
 struct i915_workarounds {
 	struct i915_wa_reg reg[I915_MAX_WA_REGS];
+	/* list of registers (and their values) that GuC will have to restore */
+	struct i915_wa_reg guc_reg[GUC_REGSET_MAX_REGISTERS];
 	u32 count;
+	u32 guc_count;
 	u32 hw_whitelist_count[I915_NUM_ENGINES];
 };
 
diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 2cfe5d3b7795..4d1784c84fd4 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1001,6 +1001,24 @@ static void guc_policies_init(struct guc_policies *policies)
 	policies->is_valid = 1;
 }
 
+/*
+ * In this macro it is highly unlikely to exceed max value but even if we did
+ * it is not an error so just throw a warning and continue. Only side effect
+ * in continuing further means some registers won't be added to save/restore
+ * list.
+ */
+#define GUC_ADD_MMIO_REG_ADS(node, reg_addr, _flags, defvalue)		\
+	do {								\
+		u32 __count = node->number_of_registers;		\
+		if (WARN_ON(__count >= GUC_REGSET_MAX_REGISTERS))	\
+			continue;					\
+		node->registers[__count].offset = reg_addr.reg;		\
+		node->registers[__count].flags = (_flags);		\
+		if (defvalue)						\
+			node->registers[__count].value = (defvalue);	\
+		node->number_of_registers++;				\
+	} while (0)
+
 static int guc_ads_create(struct intel_guc *guc)
 {
 	struct drm_i915_private *dev_priv = guc_to_i915(guc);
@@ -1014,6 +1032,7 @@ static int guc_ads_create(struct intel_guc *guc)
 		u8 reg_state_buffer[GUC_S3_SAVE_SPACE_PAGES * PAGE_SIZE];
 	} __packed *blob;
 	struct intel_engine_cs *engine;
+	struct i915_workarounds *workarounds = &dev_priv->workarounds;
 	enum intel_engine_id id;
 	u32 base;
 
@@ -1033,6 +1052,47 @@ static int guc_ads_create(struct intel_guc *guc)
 
 	/* MMIO reg state */
 	for_each_engine(engine, dev_priv, id) {
+		u32 i;
+		struct guc_mmio_regset *eng_reg =
+			&blob->reg_state.engine_reg[engine->guc_id];
+
+		/*
+		 * Provide a list of registers to be saved/restored during gpu
+		 * reset. This is mainly required for Media reset (aka watchdog
+		 * timeout) which is completely under the control of GuC
+		 * (resubmission of hung workload is handled inside GuC).
+		 */
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_HWS_PGA(engine->mmio_base),
+				     GUC_REGSET_ENGINERESET |
+				     GUC_REGSET_SAVE_CURRENT_VALUE, 0);
+
+		/*
+		 * Workaround the guc issue with masked registers, note that
+		 * at this point guc submission is still disabled and the mode
+		 * register doesnt have the irq_steering bit set, which we
+		 * need to fwd irqs to GuC.
+		 */
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_MODE_GEN7(engine),
+				     GUC_REGSET_ENGINERESET |
+				     GUC_REGSET_SAVE_DEFAULT_VALUE,
+				     I915_READ(RING_MODE_GEN7(engine)) |
+				     GFX_INTERRUPT_STEERING | (0xFFFF<<16));
+
+		GUC_ADD_MMIO_REG_ADS(eng_reg, RING_IMR(engine->mmio_base),
+				     GUC_REGSET_ENGINERESET |
+				     GUC_REGSET_SAVE_CURRENT_VALUE, 0);
+
+		/* ask guc to re-apply workarounds set in *_init_workarounds */
+		for (i = 0; i < workarounds->guc_count; i++)
+			GUC_ADD_MMIO_REG_ADS(eng_reg,
+					     workarounds->guc_reg[i].addr,
+					     GUC_REGSET_ENGINERESET |
+					     GUC_REGSET_SAVE_DEFAULT_VALUE,
+					     workarounds->guc_reg[i].value);
+
+		DRM_DEBUG_DRIVER("%s register save/restore count: %u\n",
+				 engine->name, eng_reg->number_of_registers);
+
 		blob->reg_state.white_list[engine->guc_id].mmio_start =
 			i915_mmio_reg_offset(RING_FORCE_TO_NONPRIV(engine->mmio_base, 0));
 
@@ -1042,9 +1102,13 @@ static int guc_ads_create(struct intel_guc *guc)
 		 * inconsistencies with the handling of FORCE_TO_NONPRIV
 		 * registers.
 		 */
-		blob->reg_state.white_list[engine->guc_id].count = 0;
+		blob->reg_state.white_list[engine->guc_id].count =
+					workarounds->hw_whitelist_count[id];
 
-		/* Nothing to be saved or restored for now. */
+		for (i = 0; i < workarounds->hw_whitelist_count[id]; i++) {
+			blob->reg_state.white_list[engine->guc_id].offsets[i] =
+				I915_READ(RING_FORCE_TO_NONPRIV(engine->mmio_base, i));
+		}
 	}
 
 	/*
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 82a274b336c5..cdc0ccf63e47 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -623,6 +623,29 @@ static int wa_ring_whitelist_reg(struct intel_engine_cs *engine,
 	return 0;
 }
 
+static int guc_wa_add(struct drm_i915_private *dev_priv,
+		      i915_reg_t addr, const u32 val)
+{
+	const u32 idx = dev_priv->workarounds.guc_count;
+
+	I915_WRITE(addr, val);
+	if (WARN_ON(idx >= GUC_REGSET_MAX_REGISTERS))
+		return -ENOSPC;
+
+	dev_priv->workarounds.guc_reg[idx].addr = addr;
+	/* GuC can't handle masked regs, so we store the value|mask together */
+	dev_priv->workarounds.guc_reg[idx].value = val;
+	dev_priv->workarounds.guc_count++;
+
+	return 0;
+}
+
+#define WA_REG_WR_GUC_RESTORE(addr, val) do { \
+		const int r = guc_wa_add(dev_priv, (addr), (val)); \
+		if (r) \
+			return r; \
+	} while (0)
+
 static int gen8_init_workarounds(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
@@ -730,15 +753,16 @@ static int gen9_init_workarounds(struct intel_engine_cs *engine)
 	int ret;
 
 	/* WaConextSwitchWithConcurrentTLBInvalidate:skl,bxt,kbl,glk */
-	I915_WRITE(GEN9_CSFE_CHICKEN1_RCS, _MASKED_BIT_ENABLE(GEN9_PREEMPT_GPGPU_SYNC_SWITCH_DISABLE));
+	WA_REG_WR_GUC_RESTORE(GEN9_CSFE_CHICKEN1_RCS,
+			      _MASKED_BIT_ENABLE(GEN9_PREEMPT_GPGPU_SYNC_SWITCH_DISABLE));
 
 	/* WaEnableLbsSlaRetryTimerDecrement:skl,bxt,kbl,glk */
-	I915_WRITE(BDW_SCRATCH1, I915_READ(BDW_SCRATCH1) |
-		   GEN9_LBS_SLA_RETRY_TIMER_DECREMENT_ENABLE);
+	WA_REG_WR_GUC_RESTORE(BDW_SCRATCH1, I915_READ(BDW_SCRATCH1) |
+			      GEN9_LBS_SLA_RETRY_TIMER_DECREMENT_ENABLE);
 
 	/* WaDisableKillLogic:bxt,skl,kbl */
-	I915_WRITE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) |
-		   ECOCHK_DIS_TLB);
+	WA_REG_WR_GUC_RESTORE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) |
+			      ECOCHK_DIS_TLB);
 
 	/* WaClearFlowControlGpgpuContextSave:skl,bxt,kbl,glk */
 	/* WaDisablePartialInstShootdown:skl,bxt,kbl,glk */
@@ -807,8 +831,8 @@ static int gen9_init_workarounds(struct intel_engine_cs *engine)
 			  HDC_FORCE_NON_COHERENT);
 
 	/* WaDisableHDCInvalidation:skl,bxt,kbl */
-	I915_WRITE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) |
-		   BDW_DISABLE_HDC_INVALIDATION);
+	WA_REG_WR_GUC_RESTORE(GAM_ECOCHK, I915_READ(GAM_ECOCHK) |
+			      BDW_DISABLE_HDC_INVALIDATION);
 
 	/* WaDisableSamplerPowerBypassForSOPingPong:skl,bxt,kbl */
 	if (IS_SKYLAKE(dev_priv) ||
@@ -821,8 +845,8 @@ static int gen9_init_workarounds(struct intel_engine_cs *engine)
 	WA_SET_BIT_MASKED(HALF_SLICE_CHICKEN2, GEN8_ST_PO_DISABLE);
 
 	/* WaOCLCoherentLineFlush:skl,bxt,kbl */
-	I915_WRITE(GEN8_L3SQCREG4, (I915_READ(GEN8_L3SQCREG4) |
-				    GEN8_LQSC_FLUSH_COHERENT_LINES));
+	WA_REG_WR_GUC_RESTORE(GEN8_L3SQCREG4, (I915_READ(GEN8_L3SQCREG4) |
+					       GEN8_LQSC_FLUSH_COHERENT_LINES));
 
 	/* WaVFEStateAfterPipeControlwithMediaStateClear:skl,bxt,glk */
 	ret = wa_ring_whitelist_reg(engine, GEN9_CTX_PREEMPT_REG);
@@ -897,12 +921,12 @@ static int skl_init_workarounds(struct intel_engine_cs *engine)
 	 * until D0 which is the default case so this is equivalent to
 	 * !WaDisablePerCtxtPreemptionGranularityControl:skl
 	 */
-	I915_WRITE(GEN7_FF_SLICE_CS_CHICKEN1,
-		   _MASKED_BIT_ENABLE(GEN9_FFSC_PERCTX_PREEMPT_CTRL));
+	WA_REG_WR_GUC_RESTORE(GEN7_FF_SLICE_CS_CHICKEN1,
+			      _MASKED_BIT_ENABLE(GEN9_FFSC_PERCTX_PREEMPT_CTRL));
 
 	/* WaEnableGapsTsvCreditFix:skl */
-	I915_WRITE(GEN8_GARBCNTL, (I915_READ(GEN8_GARBCNTL) |
-				   GEN9_GAPS_TSV_CREDIT_DISABLE));
+	WA_REG_WR_GUC_RESTORE(GEN8_GARBCNTL, (I915_READ(GEN8_GARBCNTL) |
+					      GEN9_GAPS_TSV_CREDIT_DISABLE));
 
 	/* WaDisableGafsUnitClkGating:skl */
 	WA_SET_BIT(GEN7_UCGCTL4, GEN8_EU_GAUNIT_CLOCK_GATE_DISABLE);
@@ -932,12 +956,12 @@ static int bxt_init_workarounds(struct intel_engine_cs *engine)
 	/* WaStoreMultiplePTEenable:bxt */
 	/* This is a requirement according to Hardware specification */
 	if (IS_BXT_REVID(dev_priv, 0, BXT_REVID_A1))
-		I915_WRITE(TILECTL, I915_READ(TILECTL) | TILECTL_TLBPF);
+		WA_REG_WR_GUC_RESTORE(TILECTL, I915_READ(TILECTL) | TILECTL_TLBPF);
 
 	/* WaSetClckGatingDisableMedia:bxt */
 	if (IS_BXT_REVID(dev_priv, 0, BXT_REVID_A1)) {
-		I915_WRITE(GEN7_MISCCPCTL, (I915_READ(GEN7_MISCCPCTL) &
-					    ~GEN8_DOP_CLOCK_GATE_MEDIA_ENABLE));
+		WA_REG_WR_GUC_RESTORE(GEN7_MISCCPCTL, (I915_READ(GEN7_MISCCPCTL) &
+						       ~GEN8_DOP_CLOCK_GATE_MEDIA_ENABLE));
 	}
 
 	/* WaDisableThreadStallDopClockGating:bxt */
@@ -973,8 +997,8 @@ static int bxt_init_workarounds(struct intel_engine_cs *engine)
 
 	/* WaProgramL3SqcReg1DefaultForPerf:bxt */
 	if (IS_BXT_REVID(dev_priv, BXT_REVID_B0, REVID_FOREVER))
-		I915_WRITE(GEN8_L3SQCREG1, L3_GENERAL_PRIO_CREDITS(62) |
-					   L3_HIGH_PRIO_CREDITS(2));
+		WA_REG_WR_GUC_RESTORE(GEN8_L3SQCREG1, L3_GENERAL_PRIO_CREDITS(62) |
+						      L3_HIGH_PRIO_CREDITS(2));
 
 	/* WaToEnableHwFixForPushConstHWBug:bxt */
 	if (IS_BXT_REVID(dev_priv, BXT_REVID_C0, REVID_FOREVER))
@@ -999,8 +1023,8 @@ static int kbl_init_workarounds(struct intel_engine_cs *engine)
 		return ret;
 
 	/* WaEnableGapsTsvCreditFix:kbl */
-	I915_WRITE(GEN8_GARBCNTL, (I915_READ(GEN8_GARBCNTL) |
-				   GEN9_GAPS_TSV_CREDIT_DISABLE));
+	WA_REG_WR_GUC_RESTORE(GEN8_GARBCNTL, (I915_READ(GEN8_GARBCNTL) |
+					      GEN9_GAPS_TSV_CREDIT_DISABLE));
 
 	/* WaDisableDynamicCreditSharing:kbl */
 	if (IS_KBL_REVID(dev_priv, 0, KBL_REVID_B0))
@@ -1061,6 +1085,7 @@ int init_workarounds_ring(struct intel_engine_cs *engine)
 	WARN_ON(engine->id != RCS);
 
 	dev_priv->workarounds.count = 0;
+	dev_priv->workarounds.guc_count = 0;
 	dev_priv->workarounds.hw_whitelist_count[engine->id] = 0;
 
 	if (IS_BROADWELL(dev_priv))
-- 
2.11.0


* [PATCH v7 13/20] drm/i915/guc: Rename the function that resets the GuC
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (11 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 12/20] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-28  7:40   ` Tvrtko Ursulin
  2017-04-27 23:12 ` [PATCH v7 14/20] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
                   ` (12 subsequent siblings)
  25 siblings, 1 reply; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

intel_guc_reset sounds more like the microcontroller is the one performing
a reset, while in this case it is the opposite. intel_reset_guc not only
makes it clearer, it also follows the other intel_reset functions available.

Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h     | 2 +-
 drivers/gpu/drm/i915/intel_uc.c     | 4 ++--
 drivers/gpu/drm/i915/intel_uncore.c | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index c9ff7f726d47..e9e04c92a376 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3031,7 +3031,7 @@ extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_reset_engine_start(struct intel_engine_cs *engine);
 extern void intel_reset_engine_cancel(struct intel_engine_cs *engine);
-extern int intel_guc_reset(struct drm_i915_private *dev_priv);
+extern int intel_reset_guc(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
 extern unsigned long i915_chipset_val(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/intel_uc.c b/drivers/gpu/drm/i915/intel_uc.c
index 900e3767a899..bad282b6c886 100644
--- a/drivers/gpu/drm/i915/intel_uc.c
+++ b/drivers/gpu/drm/i915/intel_uc.c
@@ -46,9 +46,9 @@ static int __intel_uc_reset_hw(struct drm_i915_private *dev_priv)
 	int ret;
 	u32 guc_status;
 
-	ret = intel_guc_reset(dev_priv);
+	ret = intel_reset_guc(dev_priv);
 	if (ret) {
-		DRM_ERROR("GuC reset failed, ret = %d\n", ret);
+		DRM_ERROR("Reset GuC failed, ret = %d\n", ret);
 		return ret;
 	}
 
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 120fb440bb8b..00251d83e7bd 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1792,7 +1792,7 @@ bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
 		i915.reset == 2);
 }
 
-int intel_guc_reset(struct drm_i915_private *dev_priv)
+int intel_reset_guc(struct drm_i915_private *dev_priv)
 {
 	int ret;
 
-- 
2.11.0


* [PATCH v7 14/20] drm/i915/guc: Add support for reset engine using GuC commands
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (12 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 13/20] drm/i915/guc: Rename the function that resets the GuC Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

This patch adds per engine reset and recovery (TDR) support when GuC is
used to submit workloads to GPU.

In the case of direct i915 submission to ELSP, the driver manages hang
detection, recovery and resubmission. With GuC submission these tasks
are shared between the driver and GuC. i915 is still responsible for
detecting a hang, and when it does it only requests GuC to reset that
engine. GuC internally manages acquiring forcewake and idling the engine
before actually resetting it.

Once the reset is successful, i915 takes over again and handles resubmission.
The scheduler in i915 knows which requests are pending, so after resetting
an engine, pending workloads/requests are resubmitted.
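
The split of responsibilities described above can be sketched roughly as
follows (illustrative only: the step names, the trace structure and
reset_engine_sketch() are hypothetical and merely model the ordering, they
are not the driver's functions):

```c
#include <stdbool.h>
#include <stddef.h>

enum reset_step {
	STEP_IDLE_ENGINE,	/* driver forces the engine to idle */
	STEP_HW_RESET,		/* driver resets the engine itself */
	STEP_GUC_REQUEST,	/* driver asks GuC to idle and reset the engine */
	STEP_RESUBMIT,		/* driver replays the pending requests */
};

struct reset_trace {
	enum reset_step steps[4];
	size_t count;
};

static void trace_step(struct reset_trace *t, enum reset_step step)
{
	if (t->count < 4)
		t->steps[t->count++] = step;
}

/* Returns 0 on success, -1 if the reset failed; in the failure case the
 * caller is expected to fall back to a full GPU reset. */
static int reset_engine_sketch(bool guc_submission, bool reset_ok,
			       struct reset_trace *t)
{
	if (guc_submission) {
		trace_step(t, STEP_GUC_REQUEST);
	} else {
		trace_step(t, STEP_IDLE_ENGINE);
		trace_step(t, STEP_HW_RESET);
	}

	if (!reset_ok)
		return -1;

	/* In both modes i915 handles resubmission after a good reset. */
	trace_step(t, STEP_RESUBMIT);
	return 0;
}
```

In both submission modes the hang is detected by i915 and resubmission is
done by i915; only the idle-and-reset step in the middle is delegated to GuC.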

v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the
non-guc function names.

v3: Removed debug message about engine restarting from which request,
since the new baseline does it regardless of submission mode. (Chris)

Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c            | 42 +++++++++++++++++---------
 drivers/gpu/drm/i915/i915_drv.h            |  1 +
 drivers/gpu/drm/i915/i915_guc_submission.c | 48 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_guc_fwif.h      |  6 ++++
 drivers/gpu/drm/i915/intel_uc.h            |  1 +
 drivers/gpu/drm/i915/intel_uncore.c        |  5 ----
 6 files changed, 84 insertions(+), 19 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 426db8756e95..df8e2e8e3b7f 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1918,24 +1918,34 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	 */
 	i915_gem_reset_engine(engine, active_request);
 
-	/* forcing engine to idle */
-	ret = intel_reset_engine_start(engine);
-	if (ret) {
-		DRM_ERROR("Failed to disable %s\n", engine->name);
-		goto out;
-	}
+	if (!dev_priv->guc.execbuf_client) {
+		/* forcing engine to idle */
+		ret = intel_reset_engine_start(engine);
+		if (ret) {
+			DRM_ERROR("Failed to disable %s\n", engine->name);
+			goto out;
+		}
 
-	/* finally, reset engine */
-	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
-	if (ret) {
-		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
+		/* finally, reset engine */
+		ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+		if (ret) {
+			DRM_ERROR("Failed to reset %s, ret=%d\n",
+				  engine->name, ret);
+			intel_reset_engine_cancel(engine);
+			goto out;
+		}
+
+		/* be sure the request reset bit gets cleared */
 		intel_reset_engine_cancel(engine);
-		goto out;
+	} else {
+		ret = i915_guc_reset_engine(engine);
+		if (ret) {
+			DRM_ERROR("GuC failed to reset %s, ret=%d\n",
+				  engine->name, ret);
+			goto out;
+		}
 	}
 
-	/* be sure the request reset bit gets cleared */
-	intel_reset_engine_cancel(engine);
-
 	i915_gem_reset_finish_engine(engine);
 
 	/* replay remaining requests in the queue */
@@ -1943,6 +1953,10 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	if (ret)
 		goto out;
 
+	/* for guc too */
+	if (dev_priv->guc.execbuf_client)
+		i915_guc_submission_reenable_engine(engine);
+
 	error->reset_engine_count[engine->id]++;
 out:
 	return ret;
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e9e04c92a376..cbefcd4b2507 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3032,6 +3032,7 @@ extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_reset_engine_start(struct intel_engine_cs *engine);
 extern void intel_reset_engine_cancel(struct intel_engine_cs *engine);
 extern int intel_reset_guc(struct drm_i915_private *dev_priv);
+extern int i915_guc_reset_engine(struct intel_engine_cs *engine);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
 extern unsigned long i915_chipset_val(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 4d1784c84fd4..57815edfc4df 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1344,6 +1344,25 @@ void i915_guc_submission_disable(struct drm_i915_private *dev_priv)
 	guc->execbuf_client = NULL;
 }
 
+void i915_guc_submission_reenable_engine(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct intel_guc *guc = &dev_priv->guc;
+	struct i915_guc_client *client = guc->execbuf_client;
+	const int wqi_size = sizeof(struct guc_wq_item);
+	struct drm_i915_gem_request *rq;
+
+	GEM_BUG_ON(!client);
+	intel_guc_sample_forcewake(guc);
+
+	spin_lock_irq(&engine->timeline->lock);
+	list_for_each_entry(rq, &engine->timeline->requests, link) {
+		guc_client_update_wq_rsvd(client, wqi_size);
+		__i915_guc_submit(rq);
+	}
+	spin_unlock_irq(&engine->timeline->lock);
+}
+
 /**
  * intel_guc_suspend() - notify GuC entering suspend state
  * @dev_priv:	i915 device private
@@ -1395,3 +1414,32 @@ int intel_guc_resume(struct drm_i915_private *dev_priv)
 
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
+
+int i915_guc_reset_engine(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct intel_guc *guc = &dev_priv->guc;
+	struct i915_gem_context *ctx;
+	u32 data[7];
+
+	if (!i915.enable_guc_submission)
+		return 0;
+
+	ctx = dev_priv->kernel_context;
+
+	/*
+	 * The affected context report is populated by GuC and is provided
+	 * to the driver using the shared page. We request for it but don't
+	 * use it as scheduler has all of these details.
+	 */
+	data[0] = INTEL_GUC_ACTION_REQUEST_ENGINE_RESET;
+	data[1] = engine->guc_id;
+	data[2] = INTEL_GUC_RESET_OPTION_REPORT_AFFECTED_CONTEXTS;
+	data[3] = 0;
+	data[4] = 0;
+	data[5] = guc->execbuf_client->stage_id;
+	/* first page is shared data with GuC */
+	data[6] = guc_ggtt_offset(ctx->engine[RCS].state);
+
+	return intel_guc_send(guc, data, ARRAY_SIZE(data));
+}
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index e6f8079df94a..081f2cf614e6 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -505,6 +505,7 @@ union guc_log_control {
 /* This Action will be programmed in C180 - SOFT_SCRATCH_O_REG */
 enum intel_guc_action {
 	INTEL_GUC_ACTION_DEFAULT = 0x0,
+	INTEL_GUC_ACTION_REQUEST_ENGINE_RESET = 0x3,
 	INTEL_GUC_ACTION_SAMPLE_FORCEWAKE = 0x6,
 	INTEL_GUC_ACTION_ALLOCATE_DOORBELL = 0x10,
 	INTEL_GUC_ACTION_DEALLOCATE_DOORBELL = 0x20,
@@ -518,6 +519,11 @@ enum intel_guc_action {
 	INTEL_GUC_ACTION_LIMIT
 };
 
+/* Reset engine options */
+enum action_engine_reset_options {
+	INTEL_GUC_RESET_OPTION_REPORT_AFFECTED_CONTEXTS = 0x10,
+};
+
 /*
  * The GuC sends its response to a command by overwriting the
  * command in SS0. The response is distinguishable from a command
diff --git a/drivers/gpu/drm/i915/intel_uc.h b/drivers/gpu/drm/i915/intel_uc.h
index 2f0229da20bb..a348d08ce227 100644
--- a/drivers/gpu/drm/i915/intel_uc.h
+++ b/drivers/gpu/drm/i915/intel_uc.h
@@ -247,6 +247,7 @@ int i915_guc_wq_reserve(struct drm_i915_gem_request *rq);
 void i915_guc_wq_unreserve(struct drm_i915_gem_request *request);
 void i915_guc_submission_disable(struct drm_i915_private *dev_priv);
 void i915_guc_submission_fini(struct drm_i915_private *dev_priv);
+void i915_guc_submission_reenable_engine(struct intel_engine_cs *engine);
 struct i915_vma *intel_guc_allocate_vma(struct intel_guc *guc, u32 size);
 
 /* intel_guc_log.c */
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 00251d83e7bd..cf1a74dc8595 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1781,14 +1781,9 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
-/*
- * When GuC submission is enabled, GuC manages ELSP and can initiate the
- * engine reset too. For now, fall back to full GPU reset if it is enabled.
- */
 bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
 {
 	return (dev_priv->info.has_reset_engine &&
-		!dev_priv->guc.execbuf_client &&
 		i915.reset == 2);
 }
 
-- 
2.11.0


* [PATCH v7 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (13 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 14/20] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

For watchdog / media reset, the firmware must know the address of the shared
data page (the first page of the default context).

This information should be in DWORD 9 of the GUC_CTL structure.

v2: Use guc_ggtt_offset (Chris).
Store the ggtt offset of the default ctx as we need it for
suspend/resume/reset (Daniele).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_guc_submission.c | 21 ++++++---------------
 drivers/gpu/drm/i915/intel_guc_fwif.h      |  2 +-
 drivers/gpu/drm/i915/intel_guc_loader.c    | 11 +++++++++++
 drivers/gpu/drm/i915/intel_uc.h            |  2 ++
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c
index 57815edfc4df..97392e1c04d1 100644
--- a/drivers/gpu/drm/i915/i915_guc_submission.c
+++ b/drivers/gpu/drm/i915/i915_guc_submission.c
@@ -1370,7 +1370,6 @@ void i915_guc_submission_reenable_engine(struct intel_engine_cs *engine)
 int intel_guc_suspend(struct drm_i915_private *dev_priv)
 {
 	struct intel_guc *guc = &dev_priv->guc;
-	struct i915_gem_context *ctx;
 	u32 data[3];
 
 	if (guc->fw.load_status != INTEL_UC_FIRMWARE_SUCCESS)
@@ -1378,13 +1377,11 @@ int intel_guc_suspend(struct drm_i915_private *dev_priv)
 
 	gen9_disable_guc_interrupts(dev_priv);
 
-	ctx = dev_priv->kernel_context;
-
 	data[0] = INTEL_GUC_ACTION_ENTER_S_STATE;
 	/* any value greater than GUC_POWER_D0 */
 	data[1] = GUC_POWER_D1;
-	/* first page is shared data with GuC */
-	data[2] = guc_ggtt_offset(ctx->engine[RCS].state);
+	/* first page of default ctx is shared data with GuC */
+	data[2] = guc->shared_data_offset;
 
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
@@ -1396,7 +1393,6 @@ int intel_guc_suspend(struct drm_i915_private *dev_priv)
 int intel_guc_resume(struct drm_i915_private *dev_priv)
 {
 	struct intel_guc *guc = &dev_priv->guc;
-	struct i915_gem_context *ctx;
 	u32 data[3];
 
 	if (guc->fw.load_status != INTEL_UC_FIRMWARE_SUCCESS)
@@ -1405,12 +1401,10 @@ int intel_guc_resume(struct drm_i915_private *dev_priv)
 	if (i915.guc_log_level >= 0)
 		gen9_enable_guc_interrupts(dev_priv);
 
-	ctx = dev_priv->kernel_context;
-
 	data[0] = INTEL_GUC_ACTION_EXIT_S_STATE;
 	data[1] = GUC_POWER_D0;
-	/* first page is shared data with GuC */
-	data[2] = guc_ggtt_offset(ctx->engine[RCS].state);
+	/* first page of default ctx is shared data with GuC */
+	data[2] = guc->shared_data_offset;
 
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
@@ -1419,14 +1413,11 @@ int i915_guc_reset_engine(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 	struct intel_guc *guc = &dev_priv->guc;
-	struct i915_gem_context *ctx;
 	u32 data[7];
 
 	if (!i915.enable_guc_submission)
 		return 0;
 
-	ctx = dev_priv->kernel_context;
-
 	/*
 	 * The affected context report is populated by GuC and is provided
 	 * to the driver using the shared page. We request for it but don't
@@ -1438,8 +1429,8 @@ int i915_guc_reset_engine(struct intel_engine_cs *engine)
 	data[3] = 0;
 	data[4] = 0;
 	data[5] = guc->execbuf_client->stage_id;
-	/* first page is shared data with GuC */
-	data[6] = guc_ggtt_offset(ctx->engine[RCS].state);
+	/* first page of default ctx is shared data with GuC */
+	data[6] = guc->shared_data_offset;
 
 	return intel_guc_send(guc, data, ARRAY_SIZE(data));
 }
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index 081f2cf614e6..a2d0cba2f8b9 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -135,7 +135,7 @@
 #define   GUC_ADS_ADDR_SHIFT		11
 #define   GUC_ADS_ADDR_MASK		0xfffff800
 
-#define GUC_CTL_RSRVD			9
+#define GUC_CTL_SHARED_DATA		9
 
 #define GUC_CTL_MAX_DWORDS		(SOFT_SCRATCH_COUNT - 2) /* [1..14] */
 
diff --git a/drivers/gpu/drm/i915/intel_guc_loader.c b/drivers/gpu/drm/i915/intel_guc_loader.c
index d9045b6e897b..8cd5c2bf9510 100644
--- a/drivers/gpu/drm/i915/intel_guc_loader.c
+++ b/drivers/gpu/drm/i915/intel_guc_loader.c
@@ -108,6 +108,7 @@ static void guc_params_init(struct drm_i915_private *dev_priv)
 {
 	struct intel_guc *guc = &dev_priv->guc;
 	u32 params[GUC_CTL_MAX_DWORDS];
+	struct i915_gem_context *ctx;
 	int i;
 
 	memset(&params, 0, sizeof(params));
@@ -156,6 +157,16 @@ static void guc_params_init(struct drm_i915_private *dev_priv)
 		params[GUC_CTL_FEATURE] &= ~GUC_CTL_DISABLE_SCHEDULER;
 	}
 
+	/*
+	 * For watchdog / media reset, GuC must know the address of the shared
+	 * data page, which is the first page of the default context.
+	 * We will also use this page in several places (suspend/resume/reset),
+	 * so save the ggtt offset.
+	 */
+	ctx = dev_priv->kernel_context;
+	guc->shared_data_offset = guc_ggtt_offset(ctx->engine[RCS].state);
+	params[GUC_CTL_SHARED_DATA] = guc->shared_data_offset;
+
 	I915_WRITE(SOFT_SCRATCH(0), 0);
 
 	for (i = 0; i < GUC_CTL_MAX_DWORDS; i++)
diff --git a/drivers/gpu/drm/i915/intel_uc.h b/drivers/gpu/drm/i915/intel_uc.h
index a348d08ce227..c686f20082d7 100644
--- a/drivers/gpu/drm/i915/intel_uc.h
+++ b/drivers/gpu/drm/i915/intel_uc.h
@@ -195,6 +195,8 @@ struct intel_guc {
 	DECLARE_BITMAP(doorbell_bitmap, GUC_NUM_DOORBELLS);
 	uint32_t db_cacheline;		/* Cyclic counter mod pagesize	*/
 
+	uint32_t shared_data_offset;	/* First page of default ctx */
+
 	/* Action status & statistics */
 	uint64_t action_count;		/* Total commands issued	*/
 	uint32_t action_cmd;		/* Last command word		*/
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH v7 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (14 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

*** General ***

Watchdog timeout (or "media engine reset") is a feature that allows
userland applications to enable hang detection on individual batch buffers.
The detection mechanism itself is mostly implemented in hardware; to
support it, the driver only needs to handle the watchdog interrupt and
emit watchdog commands before and after the batch buffer start instruction
in the ring buffer.

The principle of the hang detection mechanism is as follows:

1. Once the decision has been made to enable watchdog timeout for a
particular batch buffer, and the driver is emitting the batch buffer start
instruction into the ring buffer, it also emits a watchdog timer start
instruction before, and a watchdog timer cancellation instruction after,
the batch buffer start instruction.

2. Once GPU execution reaches the watchdog timer start instruction, the
hardware starts the watchdog counter. The counter keeps counting until it
either reaches a previously configured threshold value or the timer
cancellation instruction is executed.

2a. If the counter reaches the threshold value, the hardware fires a
watchdog interrupt that is picked up by the watchdog interrupt handler.
This means that a hang has been detected and the driver needs to deal with
it the same way it would deal with an engine hang detected by the periodic
hang checker. The only difference between the two is that the active
request has already been blamed (to ensure an engine reset).

2b. If the batch buffer completes and the execution reaches the watchdog
cancellation instruction before the watchdog counter reaches its
threshold value the watchdog is cancelled and nothing more comes of it.
No hang is detected.
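
As a rough illustration only (not part of this patch), the outcome of
steps 2, 2a and 2b can be modelled in plain C. The function name and the
tick-based model are hypothetical; the real counter lives in hardware, and
the model assumes a non-zero threshold.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical software model of the hardware watchdog counter.
 *
 * threshold:   configured threshold, in counter ticks (assumed non-zero)
 * batch_ticks: ticks the batch runs before the cancellation instruction
 *              would be executed
 *
 * Returns true if the counter reaches the threshold first (case 2a),
 * false if the batch completes and the timer is cancelled (case 2b).
 */
static bool watchdog_model_expires(uint32_t threshold, uint64_t batch_ticks)
{
	return batch_ticks >= threshold;
}
```

A batch that needs 200 ticks with a threshold of 100 would be reported as
hung; one that needs 50 ticks would not.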

Note about future interaction with preemption: Preemption could happen
in a command sequence before the watchdog counter is disabled, resulting
in the watchdog being triggered after preemption (e.g. when the watchdog
had been enabled in the low priority batch). The driver will need to
explicitly disable the watchdog counter as part of the preemption
sequence.

*** This patch introduces: ***

1. IRQ handler code for watchdog timeout allowing direct hang recovery
based on hardware-driven hang detection, which then integrates directly
with the hang recovery path. This is independent of having per-engine reset
or just full gpu reset.

2. Watchdog specific register information.
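
The decision taken by the IRQ handler in the diff below (ignore, re-arm
once, or declare a hang) can be sketched as a small standalone model. The
struct, enum and function names here are hypothetical stand-ins; only the
order of the comparisons is taken from gen8_watchdog_irq_handler.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-engine state the handler consults. */
struct wd_engine_model {
	uint32_t last_submit;   /* seqno of the last submitted request */
	uint32_t current_seqno; /* seqno the engine has reached so far */
	uint32_t watchdog;      /* seqno recorded at the previous expiry */
};

enum wd_action {
	WD_IGNORE,       /* request completed after the timer expired */
	WD_REARM,        /* first expiry at this seqno: restart the counter */
	WD_DECLARE_HANG, /* second expiry at the same seqno: report a hang */
};

/* Decide what to do when the watchdog interrupt fires. */
static enum wd_action wd_on_expiry(struct wd_engine_model *e)
{
	/* Nothing outstanding, so the expiry was a false positive. */
	if (e->last_submit == e->current_seqno)
		return WD_IGNORE;

	/* No progress since the previous expiry: treat it as a hang. */
	if (e->watchdog == e->current_seqno)
		return WD_DECLARE_HANG;

	/* Remember where we are and give the request one more period. */
	e->watchdog = e->current_seqno;
	return WD_REARM;
}
```

With this shape, a request that is merely slow gets one extra watchdog
period before the hangcheck work is queued.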

Currently the render engine and all available media engines support
watchdog timeout (VECS only from GEN9 onwards). The specifications allude
to the BCS engine being supported, but this commit does not support it.

Note that the value to stop the counter is different between render and
non-render engines in GEN8; from GEN9 onwards it is the same.
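
A standalone sketch of that selection, for illustration. The two values
are the ones this patch adds to i915_reg.h; the function signature and
the shortened macro names are hypothetical simplifications of
get_watchdog_disable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values written to the counter register to stop the watchdog. */
#define WATCHDOG_DISABLE     1u          /* render, and every engine on GEN9+ */
#define XCS_WATCHDOG_DISABLE 0xFFFFFFFFu /* GEN8 non-render engines only */

/* Pick the stop value for an engine, given its type and the GPU gen. */
static uint32_t watchdog_stop_value(bool is_render, int gen)
{
	if (is_render || gen >= 9)
		return WATCHDOG_DISABLE;

	return XCS_WATCHDOG_DISABLE;
}
```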

v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
against false-positives.

v3: Don't use high priority tasklet, use engine_last_submit while
checking for false-positives. From GEN9 onwards, the stop counter bit is
the same for all engines.

v4: Remove unnecessary brackets, use current_seqno to mark the request
as guilty in the hangcheck/capture code.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  4 ++
 drivers/gpu/drm/i915/i915_irq.c         | 12 +++++-
 drivers/gpu/drm/i915/i915_reg.h         |  6 +++
 drivers/gpu/drm/i915/intel_hangcheck.c  | 13 +++++--
 drivers/gpu/drm/i915/intel_lrc.c        | 69 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |  4 ++
 6 files changed, 103 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index cbefcd4b2507..2e1211e25945 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1608,6 +1608,9 @@ struct i915_gpu_error {
 	 * inspect the bit and do the reset directly, otherwise the worker
 	 * waits for the struct_mutex.
 	 *
+	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
+	 * I915_RESET_WATCHDOG to report the hang detection cause accurately.
+	 *
 	 * #I915_WEDGED - If reset fails and we can no longer use the GPU,
 	 * we set the #I915_WEDGED bit. Prior to command submission, e.g.
 	 * i915_gem_request_alloc(), this bit is checked and the sequence
@@ -1616,6 +1619,7 @@ struct i915_gpu_error {
 	unsigned long flags;
 #define I915_RESET_BACKOFF	0
 #define I915_RESET_HANDOFF	1
+#define I915_RESET_WATCHDOG	2
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
 	/** Number of times an engine has been reset */
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 3a59ef1367ec..662cc3d93a18 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1370,6 +1370,9 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
 
 	if (tasklet)
 		tasklet_hi_schedule(&engine->irq_tasklet);
+
+	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift))
+		tasklet_schedule(&engine->watchdog_tasklet);
 }
 
 static irqreturn_t gen8_gt_irq_ack(struct drm_i915_private *dev_priv,
@@ -3455,12 +3458,15 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	uint32_t gt_interrupts[] = {
 		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT,
 		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
-			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
+			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
 		0,
 		GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT
@@ -3469,6 +3475,10 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	if (HAS_L3_DPF(dev_priv))
 		gt_interrupts[0] |= GT_RENDER_L3_PARITY_ERROR_INTERRUPT;
 
+	/* VECS watchdog is only available in skl+ */
+	if (INTEL_GEN(dev_priv) >= 9)
+		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
+
 	dev_priv->pm_ier = 0x0;
 	dev_priv->pm_imr = ~dev_priv->pm_ier;
 	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 4c72adae368b..c6a47648f99c 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1908,6 +1908,11 @@ enum skl_disp_power_wells {
 #define RING_START(base)	_MMIO((base)+0x38)
 #define RING_CTL(base)		_MMIO((base)+0x3c)
 #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
+#define RING_CNTR(base)        _MMIO((base) + 0x178)
+#define   GEN8_WATCHDOG_ENABLE		0
+#define   GEN8_WATCHDOG_DISABLE	1
+#define   GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 & non-render only */
+#define RING_THRESH(base)      _MMIO((base) + 0x17C)
 #define RING_SYNC_0(base)	_MMIO((base)+0x40)
 #define RING_SYNC_1(base)	_MMIO((base)+0x44)
 #define RING_SYNC_2(base)	_MMIO((base)+0x48)
@@ -2386,6 +2391,7 @@ enum skl_disp_power_wells {
 #define GT_BSD_USER_INTERRUPT			(1 << 12)
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
 #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
+#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
 #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
 #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index 9b0ece427bdc..254155ebab45 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -388,7 +388,8 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
 
 static void hangcheck_declare_hang(struct drm_i915_private *i915,
 				   unsigned int hung,
-				   unsigned int stuck)
+				   unsigned int stuck,
+				   unsigned int watchdog)
 {
 	struct intel_engine_cs *engine;
 	char msg[80];
@@ -401,7 +402,8 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
 	if (stuck != hung)
 		hung &= ~stuck;
 	len = scnprintf(msg, sizeof(msg),
-			"%s on ", stuck == hung ? "No progress" : "Hang");
+			"%s on ", watchdog ? "Watchdog timeout" :
+				  stuck == hung ? "No progress" : "Hang");
 	for_each_engine_masked(engine, i915, hung, tmp)
 		len += scnprintf(msg + len, sizeof(msg) - len,
 				 "%s, ", engine->name);
@@ -425,7 +427,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			     gpu_error.hangcheck_work.work);
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
-	unsigned int hung = 0, stuck = 0;
+	unsigned int hung = 0, stuck = 0, watchdog = 0;
 	int busy_count = 0;
 
 	if (!i915.enable_hangcheck)
@@ -437,6 +439,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	if (i915_terminally_wedged(&dev_priv->gpu_error))
 		return;
 
+	if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags))
+		watchdog = 1;
+
 	/* As enabling the GPU requires fairly extensive mmio access,
 	 * periodically arm the mmio checker to see if we are triggering
 	 * any invalid access.
@@ -463,7 +468,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	}
 
 	if (hung)
-		hangcheck_declare_hang(dev_priv, hung, stuck);
+		hangcheck_declare_hang(dev_priv, hung, stuck, watchdog);
 
 	/* Reset timer in case GPU hangs without another request being added */
 	if (busy_count)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 5ec064a56a7d..69a73440ff12 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1462,6 +1462,53 @@ static int gen8_emit_flush_render(struct drm_i915_gem_request *request,
 	return 0;
 }
 
+/* From GEN9 onwards, all engines use the same RING_CNTR format */
+static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
+{
+	if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9)
+		return GEN8_WATCHDOG_DISABLE;
+	else
+		return GEN8_XCS_WATCHDOG_DISABLE;
+}
+
+#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
+static void gen8_watchdog_irq_handler(unsigned long data)
+{
+	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
+	struct drm_i915_private *dev_priv = engine->i915;
+	u32 current_seqno;
+
+	intel_uncore_forcewake_get(dev_priv, engine->fw_domains);
+
+	/* Stop the counter to prevent further timeout interrupts */
+	I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));
+
+	current_seqno = intel_engine_get_seqno(engine);
+
+	/* did the request complete after the timer expired? */
+	if (intel_engine_last_submit(engine) == current_seqno)
+		goto fw_put;
+
+	if (engine->hangcheck.watchdog == current_seqno) {
+		/* Make sure the active request will be marked as guilty */
+		engine->hangcheck.stalled = true;
+		engine->hangcheck.seqno = current_seqno;
+
+		/* And try to run the hangcheck_work as soon as possible */
+		set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
+		queue_delayed_work(system_long_wq,
+				   &dev_priv->gpu_error.hangcheck_work, 0);
+	} else {
+		engine->hangcheck.watchdog = current_seqno;
+		/* Re-start the counter, if really hung, it will expire again */
+		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
+		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
+	}
+
+fw_put:
+	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
+}
+
 /*
  * Reserve space for 2 NOOPs at the end of each request to be
  * used as a workaround for not being allowed to do lite
@@ -1555,6 +1602,9 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
 	if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->irq_tasklet.state)))
 		tasklet_kill(&engine->irq_tasklet);
 
+	if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->watchdog_tasklet.state)))
+		tasklet_kill(&engine->watchdog_tasklet);
+
 	dev_priv = engine->i915;
 
 	if (engine->buffer) {
@@ -1613,6 +1663,22 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
 	unsigned shift = engine->irq_shift;
 	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
 	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
+
+	switch (engine->id) {
+	default:
+		/* BCS engine does not support hw watchdog */
+		break;
+	case RCS:
+	case VCS:
+	case VCS2:
+		engine->irq_keep_mask |= (GT_GEN8_WATCHDOG_INTERRUPT << shift);
+		break;
+	case VECS:
+		if (INTEL_GEN(engine->i915) >= 9)
+			engine->irq_keep_mask |=
+				(GT_GEN8_WATCHDOG_INTERRUPT << shift);
+		break;
+	}
 }
 
 static int
@@ -1661,6 +1727,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
 	tasklet_init(&engine->irq_tasklet,
 		     intel_lrc_irq_handler, (unsigned long)engine);
 
+	tasklet_init(&engine->watchdog_tasklet,
+		     gen8_watchdog_irq_handler, (unsigned long)engine);
+
 	logical_ring_default_vfuncs(engine);
 	logical_ring_default_irqs(engine);
 }
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 2506bbe26fa0..17b3194e4034 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -128,6 +128,7 @@ struct intel_instdone {
 struct intel_engine_hangcheck {
 	u64 acthd;
 	u32 seqno;
+	u32 watchdog;
 	enum intel_engine_hangcheck_action action;
 	unsigned long action_timestamp;
 	int deadlock;
@@ -410,6 +411,9 @@ struct intel_engine_cs {
 
 	struct intel_engine_hangcheck hangcheck;
 
+	/* watchdog_tasklet: stop counter and re-schedule hangcheck_work asap */
+	struct tasklet_struct watchdog_tasklet;
+
 	bool needs_cmd_parser;
 
 	/*
-- 
2.11.0


* [PATCH v7 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (15 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

Emit the required commands into the ring buffer for starting and
stopping the watchdog timer before/after batch buffer start during
batch buffer submission.
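
The ring space needed for this sequence can be sketched as follows. The
dword counts (4 for bb_start, 6 for start_watchdog, 4 for stop_watchdog)
are the ones used in the diff below; the enum and function names are
hypothetical.

```c
#include <stdint.h>

enum {
	BB_START_DWORDS       = 4, /* MI_BATCH_BUFFER_START + address + NOOP */
	START_WATCHDOG_DWORDS = 6, /* LRI(2): threshold + enable, plus NOOP */
	STOP_WATCHDOG_DWORDS  = 4, /* LRI(1): disable, plus NOOP */
};

/*
 * Number of dwords to reserve in one go, so that nothing can fail
 * after bb_start has been emitted. A threshold of 0 means the
 * watchdog is not used for this context/engine.
 */
static uint32_t bb_start_ring_dwords(uint32_t watchdog_threshold)
{
	uint32_t n = BB_START_DWORDS;

	if (watchdog_threshold != 0)
		n += START_WATCHDOG_DWORDS + STOP_WATCHDOG_DWORDS;

	return n;
}
```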

v2: Support watchdog threshold per context engine, merge lri commands,
and move watchdog commands emission to emit_bb_start. Request space of
combined start_watchdog, bb_start and stop_watchdog to avoid any error
after emitting bb_start.

v3: There were too many req->engine in emit_bb_start.
Use GEM_BUG_ON instead of returning a very late EINVAL in the remote
case of watchdog misprogramming; set correct LRI cmd size in
emit_stop_watchdog. (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.h |  4 ++
 drivers/gpu/drm/i915/intel_lrc.c        | 85 +++++++++++++++++++++++++++++++--
 drivers/gpu/drm/i915/intel_ringbuffer.h |  4 ++
 3 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 4af2ab94558b..88700bdbb4e1 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -150,6 +150,10 @@ struct i915_gem_context {
 		u32 *lrc_reg_state;
 		u64 lrc_desc;
 		int pin_count;
+		/** watchdog_threshold: hw watchdog threshold value,
+		 * in clock counts
+		 */
+		u32 watchdog_threshold;
 		bool initialised;
 	} engine[I915_NUM_ENGINES];
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 69a73440ff12..207cf7d8721b 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1310,7 +1310,10 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
 			      u64 offset, u32 len,
 			      const unsigned int flags)
 {
+	struct intel_engine_cs *engine = req->engine;
 	u32 *cs;
+	u32 num_dwords;
+	bool watchdog_running = false;
 	int ret;
 
 	/* Don't rely in hw updating PDPs, specially in lite-restore.
@@ -1320,20 +1323,38 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
 	 * not idle). PML4 is allocated during ppgtt init so this is
 	 * not needed in 48-bit.*/
 	if (req->ctx->ppgtt &&
-	    (intel_engine_flag(req->engine) & req->ctx->ppgtt->pd_dirty_rings) &&
+	    (intel_engine_flag(engine) & req->ctx->ppgtt->pd_dirty_rings) &&
 	    !i915_vm_is_48bit(&req->ctx->ppgtt->base) &&
 	    !intel_vgpu_active(req->i915)) {
 		ret = intel_logical_ring_emit_pdps(req);
 		if (ret)
 			return ret;
 
-		req->ctx->ppgtt->pd_dirty_rings &= ~intel_engine_flag(req->engine);
+		req->ctx->ppgtt->pd_dirty_rings &= ~intel_engine_flag(engine);
+	}
+
+	/* bb_start only */
+	num_dwords = 4;
+
+	/* check if watchdog will be required */
+	if (req->ctx->engine[engine->id].watchdog_threshold != 0) {
+		GEM_BUG_ON(!engine->emit_start_watchdog ||
+			   !engine->emit_stop_watchdog);
+
+		/* + start_watchdog (6) + stop_watchdog (4) */
+		num_dwords += 10;
+		watchdog_running = true;
 	}
 
-	cs = intel_ring_begin(req, 4);
+	cs = intel_ring_begin(req, num_dwords);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
+	if (watchdog_running) {
+		/* Start watchdog timer */
+		cs = engine->emit_start_watchdog(req, cs);
+	}
+
 	/* FIXME(BDW): Address space and security selectors. */
 	*cs++ = MI_BATCH_BUFFER_START_GEN8 |
 		(flags & I915_DISPATCH_SECURE ? 0 : BIT(8)) |
@@ -1341,8 +1362,13 @@ static int gen8_emit_bb_start(struct drm_i915_gem_request *req,
 	*cs++ = lower_32_bits(offset);
 	*cs++ = upper_32_bits(offset);
 	*cs++ = MI_NOOP;
-	intel_ring_advance(req, cs);
 
+	if (watchdog_running) {
+		/* Cancel watchdog timer */
+		cs = engine->emit_stop_watchdog(req, cs);
+	}
+
+	intel_ring_advance(req, cs);
 	return 0;
 }
 
@@ -1509,6 +1535,49 @@ static void gen8_watchdog_irq_handler(unsigned long data)
 	intel_uncore_forcewake_put(dev_priv, engine->fw_domains);
 }
 
+static u32 *gen8_emit_start_watchdog(struct drm_i915_gem_request *req, u32 *cs)
+{
+	struct intel_engine_cs *engine = req->engine;
+	struct i915_gem_context *ctx = req->ctx;
+	struct intel_context *ce = &ctx->engine[engine->id];
+
+	/* XXX: no watchdog support in BCS engine */
+	GEM_BUG_ON(engine->id == BCS);
+
+	/*
+	 * watchdog register must never be programmed to zero. This would
+	 * cause the watchdog counter to exceed and not allow the engine to
+	 * go into IDLE state
+	 */
+	GEM_BUG_ON(ce->watchdog_threshold == 0);
+
+	/* Set counter period */
+	*cs++ = MI_LOAD_REGISTER_IMM(2);
+	*cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base));
+	*cs++ = ce->watchdog_threshold;
+	/* Start counter */
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = GEN8_WATCHDOG_ENABLE;
+	*cs++ = MI_NOOP;
+
+	return cs;
+}
+
+static u32 *gen8_emit_stop_watchdog(struct drm_i915_gem_request *req, u32 *cs)
+{
+	struct intel_engine_cs *engine = req->engine;
+
+	/* XXX: no watchdog support in BCS engine */
+	GEM_BUG_ON(engine->id == BCS);
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = get_watchdog_disable(engine);
+	*cs++ = MI_NOOP;
+
+	return cs;
+}
+
 /*
  * Reserve space for 2 NOOPs at the end of each request to be
  * used as a workaround for not being allowed to do lite
@@ -1777,6 +1846,8 @@ int logical_render_ring_init(struct intel_engine_cs *engine)
 	engine->emit_flush = gen8_emit_flush_render;
 	engine->emit_breadcrumb = gen8_emit_breadcrumb_render;
 	engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_render_sz;
+	engine->emit_start_watchdog = gen8_emit_start_watchdog;
+	engine->emit_stop_watchdog = gen8_emit_stop_watchdog;
 
 	ret = intel_engine_create_scratch(engine, PAGE_SIZE);
 	if (ret)
@@ -1800,6 +1871,12 @@ int logical_xcs_ring_init(struct intel_engine_cs *engine)
 {
 	logical_ring_setup(engine);
 
+	/* BCS engine does not have a watchdog-expired irq */
+	if (engine->id != BCS) {
+		engine->emit_start_watchdog = gen8_emit_start_watchdog;
+		engine->emit_stop_watchdog = gen8_emit_stop_watchdog;
+	}
+
 	return logical_ring_init(engine);
 }
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 17b3194e4034..fee80a1c5d95 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -280,6 +280,10 @@ struct intel_engine_cs {
 
 	int		(*emit_flush)(struct drm_i915_gem_request *request,
 				      u32 mode);
+	u32 *		(*emit_start_watchdog)(struct drm_i915_gem_request *req,
+					       u32 *cs);
+	u32 *		(*emit_stop_watchdog)(struct drm_i915_gem_request *req,
+					      u32 *cs);
 #define EMIT_INVALIDATE	BIT(0)
 #define EMIT_FLUSH	BIT(1)
 #define EMIT_BARRIER	(EMIT_INVALIDATE | EMIT_FLUSH)
-- 
2.11.0


* [PATCH v7 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (16 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:12 ` [PATCH v7 19/20] drm/i915: Watchdog timeout: Include threshold value in error state Michel Thierry
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

Final enablement patch for GPU hang detection using watchdog timeout.
Using the gem_context_setparam ioctl, users can specify the desired
timeout value in microseconds, and the driver will do the conversion to
'timestamps'.

The recommended default watchdog threshold for video engines is 60000 us,
since this has been _empirically determined_ to be a good compromise
between low-latency requirements and a low rate of false positives. The
default register value is ~106000us and the theoretical max value (all 1s)
is 353 seconds.
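
For illustration, the microsecond-to-count conversion can be sketched as
a standalone function. The counts-per-microsecond constants are the ones
defined in the diff below; the function name and the separate error
return are hypothetical and differ from the helper in this patch, which
returns the count directly.

```c
#include <stdint.h>

#define SKL_TIMESTAMP_CNTS_PER_USEC 12u /* BDW & SKL+ */
#define BXT_TIMESTAMP_CNTS_PER_USEC 19u /* Broxton */

/*
 * Convert a timeout in microseconds to timestamp counts.
 * Returns 0 and stores the result in *counts on success, or -1 if the
 * result would not fit in the 32-bit threshold register.
 */
static int watchdog_us_to_counts(uint64_t us, uint32_t cnts_per_usec,
				 uint32_t *counts)
{
	if (us > UINT32_MAX / cnts_per_usec)
		return -1;

	*counts = (uint32_t)(us * cnts_per_usec);
	return 0;
}
```

With 12 counts per microsecond, the recommended 60000 us becomes 720000
counts, and 1000 us becomes 12000 (0x2ee0), the temporary constant used
by the IRQ handler patch.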

Note, UABI engine ids and i915 engine ids are different, and this patch
uses the i915 ones. Some kind of mapping table [1] is required if we
decide to use the UABI engine ids.

[1] http://patchwork.freedesktop.org/patch/msgid/20170329135831.30254-2-chris@chris-wilson.co.uk

v2: Fixed get api to return values in microseconds. Threshold updated to
be per context engine. Check for u32 overflow. Capture ctx threshold
value in error state.

v3: Add a way to get array size, short-cut to disable all thresholds,
return EFAULT / EINVAL as needed. Move the capture of the threshold
value in the error state into a new patch. BXT has a different
timestamp base (because why not?).

v4: Checking if watchdog is available should be the first thing to
do, instead of giving false hopes to abi users; remove unnecessary & in
set_watchdog; ignore args->size in getparam.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         | 29 ++++++++++
 drivers/gpu/drm/i915/i915_gem_context.c | 95 +++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.c        |  5 +-
 include/uapi/drm/i915_drm.h             |  1 +
 4 files changed, 128 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 2e1211e25945..7a64f67974cb 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3578,6 +3578,35 @@ i915_gem_context_lookup_timeline(struct i915_gem_context *ctx,
 	return &vm->timeline.engine[engine->id];
 }
 
+/*
+ * BDW & SKL+ Timestamp timer resolution = 0.080 uSec,
+ * or 12500000 counts per second, or ~12 counts per microsecond.
+ *
+ * But Broxton Timestamp timer resolution is different, 0.052 uSec,
+ * or 19200000 counts per second, or ~19 counts per microsecond.
+ */
+#define SKL_TIMESTAMP_CNTS_PER_USEC 12
+#define BXT_TIMESTAMP_CNTS_PER_USEC 19
+#define TIMESTAMP_CNTS_PER_USEC(dev_priv) (IS_BROXTON(dev_priv) ? \
+					   BXT_TIMESTAMP_CNTS_PER_USEC : \
+					   SKL_TIMESTAMP_CNTS_PER_USEC)
+static inline u32
+watchdog_to_us(struct drm_i915_private *dev_priv, u32 value_in_clock_counts)
+{
+	return value_in_clock_counts / TIMESTAMP_CNTS_PER_USEC(dev_priv);
+}
+
+static inline u32
+watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us)
+{
+	u64 threshold = value_in_us * TIMESTAMP_CNTS_PER_USEC(dev_priv);
+
+	if (overflows_type(threshold, u32))
+		return -EINVAL;
+
+	return threshold;
+}
+
 int i915_perf_open_ioctl(struct drm_device *dev, void *data,
 			 struct drm_file *file);
 
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index e98d9daa3f00..574df077cf34 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -363,6 +363,95 @@ i915_gem_context_create_gvt(struct drm_device *dev)
 	return ctx;
 }
 
+/* Return the timer count threshold in microseconds. */
+int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *dev_priv = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	u32 threshold_in_us[I915_NUM_ENGINES];
+
+	if (!dev_priv->engine[VCS]->emit_start_watchdog)
+		return -ENODEV;
+
+	for_each_engine(engine, dev_priv, id) {
+		struct intel_context *ce = &ctx->engine[id];
+
+		threshold_in_us[id] = watchdog_to_us(dev_priv,
+						     ce->watchdog_threshold);
+	}
+
+	mutex_unlock(&dev_priv->drm.struct_mutex);
+	if (__copy_to_user(u64_to_user_ptr(args->value),
+			   &threshold_in_us,
+			   sizeof(threshold_in_us))) {
+		mutex_lock(&dev_priv->drm.struct_mutex);
+		return -EFAULT;
+	}
+	mutex_lock(&dev_priv->drm.struct_mutex);
+
+	args->size = sizeof(threshold_in_us);
+
+	return 0;
+}
+
+/*
+ * Based on time out value in microseconds (us) calculate
+ * timer count thresholds needed based on core frequency.
+ * Watchdog can be disabled by setting it to 0.
+ */
+int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *dev_priv = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	u32 threshold[I915_NUM_ENGINES];
+
+	if (!dev_priv->engine[VCS]->emit_start_watchdog)
+		return -ENODEV;
+
+	memset(threshold, 0, sizeof(threshold));
+
+	/* shortcut to disable in all engines */
+	if (args->size == 0)
+		goto set_watchdog;
+
+	if (args->size < sizeof(threshold))
+		return -EFAULT;
+
+	mutex_unlock(&dev_priv->drm.struct_mutex);
+	if (copy_from_user(threshold,
+			   u64_to_user_ptr(args->value),
+			   sizeof(threshold))) {
+		mutex_lock(&dev_priv->drm.struct_mutex);
+		return -EFAULT;
+	}
+	mutex_lock(&dev_priv->drm.struct_mutex);
+
+	/* not supported in blitter engine */
+	if (threshold[BCS] != 0)
+		return -EINVAL;
+
+	for_each_engine(engine, dev_priv, id) {
+		threshold[id] = watchdog_to_clock_counts(dev_priv,
+							 threshold[id]);
+
+		if (threshold[id] == -EINVAL)
+			return -EINVAL;
+	}
+
+set_watchdog:
+	for_each_engine(engine, dev_priv, id) {
+		struct intel_context *ce = &ctx->engine[id];
+
+		ce->watchdog_threshold = threshold[id];
+	}
+
+	return 0;
+}
+
 int i915_gem_context_init(struct drm_i915_private *dev_priv)
 {
 	struct i915_gem_context *ctx;
@@ -1002,6 +1091,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_BANNABLE:
 		args->value = i915_gem_context_is_bannable(ctx);
 		break;
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_get_watchdog(ctx, args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -1059,6 +1151,9 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		else
 			i915_gem_context_clear_bannable(ctx);
 		break;
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_set_watchdog(ctx, args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 207cf7d8721b..76ed994a8bbf 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1497,7 +1497,7 @@ static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
 		return GEN8_XCS_WATCHDOG_DISABLE;
 }
 
-#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
+#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)
 static void gen8_watchdog_irq_handler(unsigned long data)
 {
 	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
@@ -1527,7 +1527,8 @@ static void gen8_watchdog_irq_handler(unsigned long data)
 	} else {
 		engine->hangcheck.watchdog = current_seqno;
 		/* Re-start the counter, if really hung, it will expire again */
-		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
+		I915_WRITE_FW(RING_THRESH(engine->mmio_base),
+			      GEN8_WATCHDOG_1000US(dev_priv));
 		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
 	}
 
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index fadedefba6db..18bc0ec618dd 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1308,6 +1308,7 @@ struct drm_i915_gem_context_param {
 #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
 #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
 #define I915_CONTEXT_PARAM_BANNABLE	0x5
+#define I915_CONTEXT_PARAM_WATCHDOG	0x6
 	__u64 value;
 };
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH v7 19/20] drm/i915: Watchdog timeout: Include threshold value in error state
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (17 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
@ 2017-04-27 23:12 ` Michel Thierry
  2017-04-27 23:13 ` [PATCH v7 20/20] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:12 UTC (permalink / raw)
  To: intel-gfx

Save the watchdog threshold (in us) as part of the engine state.

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h       |  1 +
 drivers/gpu/drm/i915/i915_gpu_error.c | 11 +++++++----
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 7a64f67974cb..aaa7d3d96bda 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1022,6 +1022,7 @@ struct i915_gpu_state {
 			int ban_score;
 			int active;
 			int guilty;
+			int watchdog_threshold;
 		} context;
 
 		struct drm_i915_error_object {
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index a2ffb1ef2cfa..1b1a49bc0c3c 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -388,9 +388,10 @@ static void error_print_context(struct drm_i915_error_state_buf *m,
 				const char *header,
 				const struct drm_i915_error_context *ctx)
 {
-	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, ban score %d guilty %d active %d\n",
+	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, ban score %d guilty %d active %d, watchdog %dus\n",
 		   header, ctx->comm, ctx->pid, ctx->handle, ctx->hw_id,
-		   ctx->ban_score, ctx->guilty, ctx->active);
+		   ctx->ban_score, ctx->guilty, ctx->active,
+		   watchdog_to_us(m->i915, ctx->watchdog_threshold));
 }
 
 static void error_print_engine(struct drm_i915_error_state_buf *m,
@@ -1344,7 +1345,8 @@ static void error_record_engine_execlists(struct intel_engine_cs *engine,
 }
 
 static void record_context(struct drm_i915_error_context *e,
-			   struct i915_gem_context *ctx)
+			   struct i915_gem_context *ctx,
+			   u32 engine_id)
 {
 	if (ctx->pid) {
 		struct task_struct *task;
@@ -1363,6 +1365,7 @@ static void record_context(struct drm_i915_error_context *e,
 	e->ban_score = ctx->ban_score;
 	e->guilty = ctx->guilty_count;
 	e->active = ctx->active_count;
+	e->watchdog_threshold =	ctx->engine[engine_id].watchdog_threshold;
 }
 
 static void request_record_user_bo(struct drm_i915_gem_request *request,
@@ -1426,7 +1429,7 @@ static void i915_gem_record_rings(struct drm_i915_private *dev_priv,
 			ee->vm = request->ctx->ppgtt ?
 				&request->ctx->ppgtt->base : &ggtt->base;
 
-			record_context(&ee->context, request->ctx);
+			record_context(&ee->context, request->ctx, engine->id);
 
 			/* We need to copy these to an anonymous buffer
 			 * as the simplest method to avoid being overwritten
-- 
2.11.0


* [PATCH v7 20/20] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (18 preceding siblings ...)
  2017-04-27 23:12 ` [PATCH v7 19/20] drm/i915: Watchdog timeout: Include threshold value in error state Michel Thierry
@ 2017-04-27 23:13 ` Michel Thierry
  2017-04-27 23:30 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev4) Patchwork
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-27 23:13 UTC (permalink / raw)
  To: intel-gfx

From firmware v8.8, GuC provides the count of media engine resets
(watchdog timeout). This information is available in the GuC shared
context data struct, which resides in the first page of the default
(kernel) lrc context.

Since GuC-handled engine resets are transparent to both the kernel and
userspace, provide a simple debugfs entry showing how many times a media
reset has happened.
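
As a rough illustration of where the counter sits, the sketch below mirrors
the struct added by this patch using fixed-width types and reads the field
from a raw copy of the shared page. The names shared_ctx_data and
read_media_reset_count() are hypothetical, and the packed attribute assumes
GCC or Clang; the driver itself goes through kmap() and READ_ONCE() as shown
in the diff.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Userspace mirror of the guc_shared_ctx_data layout from this patch.
 * Field order is copied from the diff; the struct name is hypothetical.
 */
struct shared_ctx_data {
	uint32_t addr_of_last_preempted_data_low;
	uint32_t addr_of_last_preempted_data_high;
	uint32_t addr_of_last_preempted_data_high_tmp;
	uint32_t padding;
	uint32_t is_mapped_to_proxy;
	uint32_t proxy_ctx_id;
	uint32_t engine_reset_ctx_id;
	uint32_t media_reset_count;
	uint32_t reserved[8];
	uint32_t uk_last_ctx_switch_reason;
	uint32_t was_reset;
	uint32_t lrca_gpu_addr;
	uint32_t execlist_ctx;
	uint32_t reserved1[32];
} __attribute__((packed));

/* Read the reset counter out of a raw copy of the shared page. */
static uint32_t read_media_reset_count(const void *page)
{
	uint32_t count;

	/* memcpy avoids alignment assumptions about the source buffer. */
	memcpy(&count,
	       (const uint8_t *)page +
	       offsetof(struct shared_ctx_data, media_reset_count),
	       sizeof(count));

	return count;
}
```

With this layout the counter is the eighth 32-bit word of the page, so any
change to the fields ahead of it in the firmware interface would shift it.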

v2: Remove unnecessary struct_mutex, _get_dirty_page and kmap_atomic;
use READ_ONCE. (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c   | 22 ++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_guc_fwif.h | 18 ++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 6444c1a9bd22..35ce771c8b8f 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1403,6 +1403,26 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 	return 0;
 }
 
+static u32 i915_watchdog_reset_count(struct drm_i915_private *dev_priv)
+{
+	struct i915_gem_context *ctx;
+	struct page *page;
+	struct guc_shared_ctx_data *guc_shared_data;
+	u32 guc_media_reset_count;
+
+	if (!i915.enable_guc_submission)
+		return 0;
+
+	ctx = dev_priv->kernel_context;
+	page = i915_gem_object_get_page(ctx->engine[RCS].state->obj,
+					LRC_GUCSHR_PN);
+	guc_shared_data = kmap(page);
+	guc_media_reset_count = READ_ONCE(guc_shared_data->media_reset_count);
+	kunmap(page);
+
+	return guc_media_reset_count;
+}
+
 static int i915_reset_info(struct seq_file *m, void *unused)
 {
 	struct drm_i915_private *dev_priv = node_to_i915(m->private);
@@ -1411,6 +1431,8 @@ static int i915_reset_info(struct seq_file *m, void *unused)
 	enum intel_engine_id id;
 
 	seq_printf(m, "full gpu reset = %u\n", i915_reset_count(error));
+	seq_printf(m, "GuC watchdog/media reset = %u\n",
+		   i915_watchdog_reset_count(dev_priv));
 
 	for_each_engine(engine, dev_priv, id) {
 		seq_printf(m, "%s = %u\n", engine->name,
diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h
index a2d0cba2f8b9..e45987f7aa50 100644
--- a/drivers/gpu/drm/i915/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/intel_guc_fwif.h
@@ -502,6 +502,24 @@ union guc_log_control {
 	u32 value;
 } __packed;
 
+/* GuC Shared Context Data Struct */
+struct guc_shared_ctx_data {
+	u32 addr_of_last_preempted_data_low;
+	u32 addr_of_last_preempted_data_high;
+	u32 addr_of_last_preempted_data_high_tmp;
+	u32 padding;
+	u32 is_mapped_to_proxy;
+	u32 proxy_ctx_id;
+	u32 engine_reset_ctx_id;
+	u32 media_reset_count;
+	u32 reserved[8];
+	u32 uk_last_ctx_switch_reason;
+	u32 was_reset;
+	u32 lrca_gpu_addr;
+	u32 execlist_ctx;
+	u32 reserved1[32];
+} __packed;
+
 /* This Action will be programmed in C180 - SOFT_SCRATCH_O_REG */
 enum intel_guc_action {
 	INTEL_GUC_ACTION_DEFAULT = 0x0,
-- 
2.11.0


* ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev4)
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (19 preceding siblings ...)
  2017-04-27 23:13 ` [PATCH v7 20/20] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
@ 2017-04-27 23:30 ` Patchwork
  2017-05-15 21:32 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev5) Patchwork
                   ` (4 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Patchwork @ 2017-04-27 23:30 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

== Series Details ==

Series: Gen8+ engine-reset (rev4)
URL   : https://patchwork.freedesktop.org/series/21868/
State : success

== Summary ==

Series 21868v4 Gen8+ engine-reset
https://patchwork.freedesktop.org/api/1.0/series/21868/revisions/4/mbox/

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time:431s
fi-bdw-gvtdvm    total:278  pass:256  dwarn:8   dfail:0   fail:0   skip:14  time:424s
fi-bsw-n3050     total:278  pass:242  dwarn:0   dfail:0   fail:0   skip:36  time:575s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time:510s
fi-bxt-t5700     total:278  pass:258  dwarn:0   dfail:0   fail:0   skip:20  time:541s
fi-byt-j1900     total:278  pass:254  dwarn:0   dfail:0   fail:0   skip:24  time:485s
fi-byt-n2820     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:485s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:410s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:407s
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50  time:425s
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:496s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:460s
fi-kbl-7500u     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:457s
fi-kbl-7560u     total:278  pass:267  dwarn:1   dfail:0   fail:0   skip:10  time:566s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:456s
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17  time:571s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time:463s
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:490s
fi-skl-gvtdvm    total:278  pass:265  dwarn:0   dfail:0   fail:0   skip:13  time:434s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:533s
fi-snb-2600      total:278  pass:249  dwarn:0   dfail:0   fail:0   skip:29  time:401s

d73272105f5bcd79f25c57bd5b28b72d41153cdb drm-tip: 2017y-04m-27d-21h-26m-48s UTC integration manifest
966284c drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs
5b29010 drm/i915: Watchdog timeout: Include threshold value in error state
06fad5f drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
02813be drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
9856155 drm/i915: Watchdog timeout: IRQ handler for gen8+
00b9bb5 drm/i915: Watchdog timeout: Pass GuC shared data structure during param load
5a9c8f85 drm/i915/guc: Add support for reset engine using GuC commands
27455f1 drm/i915/guc: Rename the function that resets the GuC
1baa734 drm/i915/guc: Provide register list to be saved/restored during engine reset
d4776dd drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder
2d8aef8 drm/i915/selftests: reset engine self tests
506713ad drm/i915: Add engine reset count in get-reset-stats ioctl
6da8513 drm/i915: Enable Engine reset and recovery support
929be53 drm/i915: Export per-engine reset count info to debugfs
7e340b7 drm/i915: Add engine reset count to error state
e08d4b6 drm/i915: Cancel reset-engine if we couldn't find an active request
ed3259b drm/i915: Skip reset request if there is one already
4a68c8b drm/i915: Add support for per engine reset recovery
4c53f6a drm/i915: Modify error handler for per engine hang recovery
d30dbab drm/i915: Update i915.reset to handle engine resets

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4571/

* Re: [PATCH v7 03/20] drm/i915: Add support for per engine reset recovery
  2017-04-27 23:12 ` [PATCH v7 03/20] drm/i915: Add support for per engine reset recovery Michel Thierry
@ 2017-04-27 23:50   ` Chris Wilson
  2017-04-28 21:59     ` Michel Thierry
  2017-05-04  0:26     ` Michel Thierry
  2017-05-15 21:18   ` [PATCH " Michel Thierry
  1 sibling, 2 replies; 62+ messages in thread
From: Chris Wilson @ 2017-04-27 23:50 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Thu, Apr 27, 2017 at 04:12:43PM -0700, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@linux.intel.com>
> 
> This change implements support for per-engine reset as an initial, less
> intrusive hang recovery option to be attempted before falling back to the
> legacy full GPU reset recovery mode if necessary. This is only supported
> from Gen8 onwards.
> 
> Hangcheck determines which engines are hung and invokes the error handler
> to recover from the hang. The error handler schedules recovery for each
> hung engine. The recovery procedure is as follows:
>  - identify the request that caused the hang and drop it
>  - force engine to idle: this is done by issuing a reset request
>  - reset and re-init engine
>  - restart submissions to the engine
> 
> If engine reset fails then we fall back to the heavyweight full gpu reset,
> which resets all engines and reinitializes the complete state of HW and SW.
> 
> v2: Rebase.
> v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
> calling i915_gem_reset_engine (Chris).
> v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
> reuse the function for reset_engine.
> v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
> v6: Clean up reset_engine function to not require mutex, i.e. no need to call
> revoke/restore_fences and _retire_requests (Chris).
> v7: Remove leftovers from v5, i.e. no need to disable irq, hold
> forcewake or wakeup the handoff bit (Chris).
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c         | 60 ++++++++++++++++++--
>  drivers/gpu/drm/i915/i915_drv.h         | 12 +++-
>  drivers/gpu/drm/i915/i915_gem.c         | 97 +++++++++++++++++++--------------
>  drivers/gpu/drm/i915/i915_gem_request.c |  2 +-
>  drivers/gpu/drm/i915/intel_uncore.c     | 20 +++++++
>  5 files changed, 142 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 48c8b69d9bde..ae891529dedd 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1810,7 +1810,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  
>  	pr_notice("drm/i915: Resetting chip after gpu hang\n");
>  	disable_irq(dev_priv->drm.irq);
> -	ret = i915_gem_reset_prepare(dev_priv);
> +	ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
>  	if (ret) {
>  		DRM_ERROR("GPU recovery failed\n");
>  		intel_gpu_reset(dev_priv, ALL_ENGINES);
> @@ -1852,7 +1852,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  	i915_queue_hangcheck(dev_priv);
>  
>  finish:
> -	i915_gem_reset_finish(dev_priv);
> +	i915_gem_reset_finish(dev_priv, ALL_ENGINES);
>  	enable_irq(dev_priv->drm.irq);
>  
>  wakeup:
> @@ -1871,11 +1871,63 @@ void i915_reset(struct drm_i915_private *dev_priv)
>   *
>   * Reset a specific GPU engine. Useful if a hang is detected.
>   * Returns zero on successful reset or otherwise an error code.
> + *
> + * Procedure is:
> + *  - identifies the request that caused the hang and it is dropped
> + *  - force engine to idle: this is done by issuing a reset request
> + *  - reset engine
> + *  - restart submissions to the engine

Why does the prospective caller need to know this?

>   */
>  int i915_reset_engine(struct intel_engine_cs *engine)
>  {
> -	/* FIXME: replace me with engine reset sequence */
> -	return -ENODEV;
> +	int ret;
> +	struct drm_i915_private *dev_priv = engine->i915;
> +	struct i915_gpu_error *error = &dev_priv->gpu_error;
> +
> +	GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
> +
> +	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
> +
> +	ret = i915_gem_reset_prepare_engine(engine);
> +	if (ret) {
> +		DRM_ERROR("Previous reset failed - promote to full reset\n");
> +		goto out;
> +	}
> +
> +	/*
> +	 * the request that caused the hang is stuck on elsp, identify the
> +	 * active request and drop it, adjust head to skip the offending
> +	 * request to resume executing remaining requests in the queue.
> +	 */
> +	i915_gem_reset_engine(engine);
> +
> +	/* forcing engine to idle */
> +	ret = intel_reset_engine_start(engine);
> +	if (ret) {
> +		DRM_ERROR("Failed to disable %s\n", engine->name);
> +		goto out;
> +	}
> +
> +	/* finally, reset engine */
> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
> +	if (ret) {
> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
> +		intel_reset_engine_cancel(engine);
> +		goto out;
> +	}
> +
> +	/* be sure the request reset bit gets cleared */
> +	intel_reset_engine_cancel(engine);
> +
> +	i915_gem_reset_finish_engine(engine);
> +
> +	/* replay remaining requests in the queue */
> +	ret = engine->init_hw(engine);
> +	if (ret)
> +		goto out; //XXX: ignore this line for now

Please give the comments here some tlc. Focus on the why, you are
telling me what the code does.

> +
> +out:
> +	return ret;
>  }
>  
>  static int i915_pm_suspend(struct device *kdev)
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index ab7e68626c49..efbf34318893 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -3022,6 +3022,8 @@ extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
>  extern void i915_reset(struct drm_i915_private *dev_priv);
>  extern int i915_reset_engine(struct intel_engine_cs *engine);
>  extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
> +extern int intel_reset_engine_start(struct intel_engine_cs *engine);
> +extern void intel_reset_engine_cancel(struct intel_engine_cs *engine);
>  extern int intel_guc_reset(struct drm_i915_private *dev_priv);
>  extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
>  extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
> @@ -3410,7 +3412,6 @@ int __must_check i915_gem_set_global_seqno(struct drm_device *dev, u32 seqno);
>  
>  struct drm_i915_gem_request *
>  i915_gem_find_active_request(struct intel_engine_cs *engine);
> -

Nope. (find_active_request is not in the same group of operations as
retire_requests.)

>  void i915_gem_retire_requests(struct drm_i915_private *dev_priv);
>  
>  static inline bool i915_reset_backoff(struct i915_gpu_error *error)
> @@ -3438,11 +3439,16 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
>  	return READ_ONCE(error->reset_count);
>  }
>  
> -int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
> +int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
> +int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
> +			   unsigned int engine_mask);
>  void i915_gem_reset(struct drm_i915_private *dev_priv);
> -void i915_gem_reset_finish(struct drm_i915_private *dev_priv);
> +void i915_gem_reset_finish_engine(struct intel_engine_cs *engine);
> +void i915_gem_reset_finish(struct drm_i915_private *dev_priv,
> +			   unsigned int engine_mask);
>  void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
>  bool i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
> +void i915_gem_reset_engine(struct intel_engine_cs *engine);
>  
>  void i915_gem_init_mmio(struct drm_i915_private *i915);
>  int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 33fb11cc5acc..bce38062f94e 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2793,48 +2793,57 @@ static bool engine_stalled(struct intel_engine_cs *engine)
>  	return true;
>  }
>  
> -int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
> +/* Ensure irq handler finishes, and not run again. */
> +int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>  {
> -	struct intel_engine_cs *engine;
> -	enum intel_engine_id id;
> +	struct drm_i915_gem_request *request;
>  	int err = 0;
>  
> -	/* Ensure irq handler finishes, and not run again. */
> -	for_each_engine(engine, dev_priv, id) {
> -		struct drm_i915_gem_request *request;
> -
> -		/* Prevent the signaler thread from updating the request
> -		 * state (by calling dma_fence_signal) as we are processing
> -		 * the reset. The write from the GPU of the seqno is
> -		 * asynchronous and the signaler thread may see a different
> -		 * value to us and declare the request complete, even though
> -		 * the reset routine have picked that request as the active
> -		 * (incomplete) request. This conflict is not handled
> -		 * gracefully!
> -		 */
> -		kthread_park(engine->breadcrumbs.signaler);
> -
> -		/* Prevent request submission to the hardware until we have
> -		 * completed the reset in i915_gem_reset_finish(). If a request
> -		 * is completed by one engine, it may then queue a request
> -		 * to a second via its engine->irq_tasklet *just* as we are
> -		 * calling engine->init_hw() and also writing the ELSP.
> -		 * Turning off the engine->irq_tasklet until the reset is over
> -		 * prevents the race.
> -		 */
> -		tasklet_kill(&engine->irq_tasklet);
> -		tasklet_disable(&engine->irq_tasklet);
>  
> -		if (engine->irq_seqno_barrier)
> -			engine->irq_seqno_barrier(engine);
> +	/* Prevent the signaler thread from updating the request
> +	 * state (by calling dma_fence_signal) as we are processing
> +	 * the reset. The write from the GPU of the seqno is
> +	 * asynchronous and the signaler thread may see a different
> +	 * value to us and declare the request complete, even though
> +	 * the reset routine have picked that request as the active
> +	 * (incomplete) request. This conflict is not handled
> +	 * gracefully!
> +	 */
> +	kthread_park(engine->breadcrumbs.signaler);
> +
> +	/* Prevent request submission to the hardware until we have
> +	 * completed the reset in i915_gem_reset_finish(). If a request
> +	 * is completed by one engine, it may then queue a request
> +	 * to a second via its engine->irq_tasklet *just* as we are
> +	 * calling engine->init_hw() and also writing the ELSP.
> +	 * Turning off the engine->irq_tasklet until the reset is over
> +	 * prevents the race.
> +	 */
> +	tasklet_kill(&engine->irq_tasklet);
> +	tasklet_disable(&engine->irq_tasklet);
>  
> -		if (engine_stalled(engine)) {
> -			request = i915_gem_find_active_request(engine);
> -			if (request && request->fence.error == -EIO)
> -				err = -EIO; /* Previous reset failed! */
> -		}
> +	if (engine->irq_seqno_barrier)
> +		engine->irq_seqno_barrier(engine);
> +
> +	if (engine_stalled(engine)) {
> +		request = i915_gem_find_active_request(engine);
> +		if (request && request->fence.error == -EIO)
> +			err = -EIO; /* Previous reset failed! */
>  	}
>  
> +	return err;
> +}
> +
> +int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
> +			   unsigned int engine_mask)
> +{
> +	struct intel_engine_cs *engine;
> +	unsigned int tmp;
> +	int err = 0;
> +
> +	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
> +		err = i915_gem_reset_prepare_engine(engine);

You are losing any earlier err.
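
A minimal sketch of one way to keep the first failure while still preparing
every engine. The names below (fake_engine, prepare_engine, prepare_all) are
hypothetical stand-ins for illustration, not the i915 functions:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an engine; not the i915 struct. */
struct fake_engine {
	int prepare_result;	/* value that preparing this engine returns */
	int prepared;		/* set once prepare has run on this engine */
};

static int prepare_engine(struct fake_engine *engine)
{
	engine->prepared = 1;
	return engine->prepare_result;
}

/*
 * Prepare every engine, but report the first non-zero error rather
 * than whatever the last engine in the loop happened to return.
 */
static int prepare_all(struct fake_engine *engines, size_t count)
{
	int err = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		int ret = prepare_engine(&engines[i]);

		if (ret && !err)
			err = ret;
	}

	return err;
}
```

The loop deliberately does not bail out early, so every engine still gets
parked and a later finish step can stay symmetric.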

> +
>  	i915_gem_revoke_fences(dev_priv);
>  
>  	return err;
> @@ -2920,7 +2929,7 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
>  	return guilty;
>  }
>  
> -static void i915_gem_reset_engine(struct intel_engine_cs *engine)
> +void i915_gem_reset_engine(struct intel_engine_cs *engine)
>  {
>  	struct drm_i915_gem_request *request;
>  
> @@ -2966,16 +2975,22 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
>  	}
>  }
>  
> -void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
> +void i915_gem_reset_finish_engine(struct intel_engine_cs *engine)
> +{
> +	tasklet_enable(&engine->irq_tasklet);
> +	kthread_unpark(engine->breadcrumbs.signaler);
> +}
> +
> +void i915_gem_reset_finish(struct drm_i915_private *dev_priv,
> +			   unsigned int engine_mask)
>  {
>  	struct intel_engine_cs *engine;
> -	enum intel_engine_id id;
> +	unsigned int tmp;
>  
>  	lockdep_assert_held(&dev_priv->drm.struct_mutex);
>  
> -	for_each_engine(engine, dev_priv, id) {
> -		tasklet_enable(&engine->irq_tasklet);
> -		kthread_unpark(engine->breadcrumbs.signaler);
> +	for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
> +		i915_gem_reset_finish_engine(engine);
>  	}
>  }
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
> index 6198f6997d05..f69a8c535d5f 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -1216,7 +1216,7 @@ long i915_wait_request(struct drm_i915_gem_request *req,
>  	return timeout;
>  }
>  
> -static void engine_retire_requests(struct intel_engine_cs *engine)
> +void engine_retire_requests(struct intel_engine_cs *engine)

Fortunately stray chunk. I was about to scream.

>  {
>  	struct drm_i915_gem_request *request, *next;
>  	u32 seqno = intel_engine_get_seqno(engine);
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index ab5bdd110ac3..3ebba6b2dd74 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1801,6 +1801,26 @@ int intel_guc_reset(struct drm_i915_private *dev_priv)
>  	return ret;
>  }
>  
> +/*
> + * On gen8+ a reset request has to be issued via the reset control register
> + * before a GPU engine can be reset in order to stop the command streamer
> + * and idle the engine. This replaces the legacy way of stopping an engine
> + * by writing to the stop ring bit in the MI_MODE register.
> + */
> +int intel_reset_engine_start(struct intel_engine_cs *engine)
> +{
> +	return gen8_reset_engine_start(engine);
> +}
> +
> +/*
> + * It is possible to back off from a previously issued reset request by simply
> + * clearing the reset request bit in the reset control register.
> + */
> +void intel_reset_engine_cancel(struct intel_engine_cs *engine)
> +{
> +	gen8_reset_engine_cancel(engine);
> +}
> +
>  bool intel_uncore_unclaimed_mmio(struct drm_i915_private *dev_priv)
>  {
>  	return check_for_unclaimed_mmio(dev_priv);
> -- 
> 2.11.0
> 

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v7 12/20] drm/i915/guc: Provide register list to be saved/restored during engine reset
  2017-04-27 23:12 ` [PATCH v7 12/20] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
@ 2017-04-27 23:58   ` Chris Wilson
  2017-04-28 15:36     ` Michel Thierry
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Wilson @ 2017-04-27 23:58 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Thu, Apr 27, 2017 at 04:12:52PM -0700, Michel Thierry wrote:
> +#define WA_REG_WR_GUC_RESTORE(addr, val) do { \
> +		const int r = guc_wa_add(dev_priv, (addr), (val)); \
> +		if (r) \
> +			return r; \
> +	} while (0)

Try to avoid burying returns inside macros. Does this macro help code
readability? Would perhaps just a table of registers + values be easier?
-Chris
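
A rough sketch of what the table approach could look like, so that the
return stays visible in the caller. Everything here is hypothetical: wa_reg,
wa_list, wa_add() (standing in for guc_wa_add()) and wa_add_table() are
illustration only, and a real table would hold the actual register offsets
and values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register/value pair for the save/restore list. */
struct wa_reg {
	uint32_t addr;
	uint32_t val;
};

#define WA_LIST_MAX 4

/* Hypothetical fixed-size list that the entries get appended to. */
struct wa_list {
	struct wa_reg regs[WA_LIST_MAX];
	size_t count;
};

/* Stand-in for guc_wa_add(): fails once the list is full. */
static int wa_add(struct wa_list *list, uint32_t addr, uint32_t val)
{
	if (list->count >= WA_LIST_MAX)
		return -1;

	list->regs[list->count].addr = addr;
	list->regs[list->count].val = val;
	list->count++;

	return 0;
}

/* Walk a table of entries; the error path is an ordinary return. */
static int wa_add_table(struct wa_list *list,
			const struct wa_reg *table, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int ret = wa_add(list, table[i].addr, table[i].val);

		if (ret)
			return ret;
	}

	return 0;
}
```

Per-platform differences would then be a matter of picking which table to
walk, rather than a sequence of macro invocations.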

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH v7 13/20] drm/i915/guc: Rename the function that resets the GuC
  2017-04-27 23:12 ` [PATCH v7 13/20] drm/i915/guc: Rename the function that resets the GuC Michel Thierry
@ 2017-04-28  7:40   ` Tvrtko Ursulin
  2017-05-01 20:09     ` Michel Thierry
  0 siblings, 1 reply; 62+ messages in thread
From: Tvrtko Ursulin @ 2017-04-28  7:40 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx


On 28/04/2017 00:12, Michel Thierry wrote:
> intel_guc_reset sounds more like the microcontroller is the one performing
> a reset, while in this case it is the opposite. intel_reset_guc not only
> makes it clearer, it also follows the other intel_reset functions available.
>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h     | 2 +-
>  drivers/gpu/drm/i915/intel_uc.c     | 4 ++--
>  drivers/gpu/drm/i915/intel_uncore.c | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index c9ff7f726d47..e9e04c92a376 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -3031,7 +3031,7 @@ extern int i915_reset_engine(struct intel_engine_cs *engine);
>  extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
>  extern int intel_reset_engine_start(struct intel_engine_cs *engine);
>  extern void intel_reset_engine_cancel(struct intel_engine_cs *engine);
> -extern int intel_guc_reset(struct drm_i915_private *dev_priv);
> +extern int intel_reset_guc(struct drm_i915_private *dev_priv);
>  extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
>  extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
>  extern unsigned long i915_chipset_val(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/intel_uc.c b/drivers/gpu/drm/i915/intel_uc.c
> index 900e3767a899..bad282b6c886 100644
> --- a/drivers/gpu/drm/i915/intel_uc.c
> +++ b/drivers/gpu/drm/i915/intel_uc.c
> @@ -46,9 +46,9 @@ static int __intel_uc_reset_hw(struct drm_i915_private *dev_priv)
>  	int ret;
>  	u32 guc_status;
>
> -	ret = intel_guc_reset(dev_priv);
> +	ret = intel_reset_guc(dev_priv);
>  	if (ret) {
> -		DRM_ERROR("GuC reset failed, ret = %d\n", ret);
> +		DRM_ERROR("Reset GuC failed, ret = %d\n", ret);

As a non-native speaker I might be wrong, but was thinking something 
like "Failed to reset GuC", "Resetting GuC failed", "Reset of GuC 
failed" would be clearer? I leave it for someone more competent to decide.

>  		return ret;
>  	}
>
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 120fb440bb8b..00251d83e7bd 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1792,7 +1792,7 @@ bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
>  		i915.reset == 2);
>  }
>
> -int intel_guc_reset(struct drm_i915_private *dev_priv)
> +int intel_reset_guc(struct drm_i915_private *dev_priv)
>  {
>  	int ret;
>
>

Potential message bikeshed aside:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 12/20] drm/i915/guc: Provide register list to be saved/restored during engine reset
  2017-04-27 23:58   ` Chris Wilson
@ 2017-04-28 15:36     ` Michel Thierry
  0 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-28 15:36 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 4/27/2017 4:58 PM, Chris Wilson wrote:
> On Thu, Apr 27, 2017 at 04:12:52PM -0700, Michel Thierry wrote:
>> +#define WA_REG_WR_GUC_RESTORE(addr, val) do { \
>> +		const int r = guc_wa_add(dev_priv, (addr), (val)); \
>> +		if (r) \
>> +			return r; \
>> +	} while (0)
>
> Try to avoid burying returns inside macros. Does this macro help code
> readability? Would perhaps just a table of registers + values be easier?
> -Chris
>

Sure, I can change it to something else.
I only replicated what the other WA_* macros (the ones we need in ctx 
switch) were doing.
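
For illustration only, a table-driven variant of the kind Chris suggests might look roughly like the sketch below. It is a standalone userspace example, not i915 code: struct wa_reg, struct wa_list, wa_add(), wa_list_init(), WA_LIST_MAX and the offsets in wa_table are all hypothetical stand-ins (wa_add() plays the role of guc_wa_add(), whose real implementation is not shown in this thread). The only point is that the early return moves out of the macro and into an ordinary loop.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define WA_LIST_MAX 8 /* hypothetical capacity of the save/restore list */

struct wa_reg {
	uint32_t addr;
	uint32_t val;
};

struct wa_list {
	struct wa_reg regs[WA_LIST_MAX];
	size_t count;
};

/* Stand-in for guc_wa_add(): append one entry, fail when the list is full. */
int wa_add(struct wa_list *list, uint32_t addr, uint32_t val)
{
	if (list->count >= WA_LIST_MAX)
		return -ENOSPC;

	list->regs[list->count].addr = addr;
	list->regs[list->count].val = val;
	list->count++;
	return 0;
}

/* Placeholder offsets and values, not real hardware registers. */
static const struct wa_reg wa_table[] = {
	{ 0x1000, 0x1 },
	{ 0x1004, 0x80 },
	{ 0x1008, 0xffff },
};

/* The error return is visible in the caller rather than hidden in a macro. */
int wa_list_init(struct wa_list *list)
{
	size_t i;

	list->count = 0;
	for (i = 0; i < sizeof(wa_table) / sizeof(wa_table[0]); i++) {
		int ret = wa_add(list, wa_table[i].addr, wa_table[i].val);

		if (ret)
			return ret;
	}
	return 0;
}
```

With a table, adding a register becomes a one-line data change, and the control flow of the init function can be read without expanding any macro.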

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 03/20] drm/i915: Add support for per engine reset recovery
  2017-04-27 23:50   ` Chris Wilson
@ 2017-04-28 21:59     ` Michel Thierry
  2017-05-04  0:26     ` Michel Thierry
  1 sibling, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-04-28 21:59 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 4/27/2017 4:50 PM, Chris Wilson wrote:
>> -static void engine_retire_requests(struct intel_engine_cs *engine)
>> +void engine_retire_requests(struct intel_engine_cs *engine)
> Fortunately a stray chunk. I was about to scream.
>

This chunk has been there for quite a long time, at least since v4... 
thanks for spotting it (I'm the one that should be screaming).

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery
  2017-04-27 23:12 ` [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery Michel Thierry
@ 2017-04-29 14:19   ` Chris Wilson
  2017-05-08 18:31     ` Michel Thierry
  2017-05-15 21:14   ` [PATCH " Michel Thierry
  1 sibling, 1 reply; 62+ messages in thread
From: Chris Wilson @ 2017-04-29 14:19 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@linux.intel.com>
> 
> This is a preparatory patch which modifies error handler to do per engine
> hang recovery. The actual patch which implements this sequence follows
> later in the series. The aim is to prepare existing recovery function to
> adapt to this new function where applicable (which fails at this point
> because core implementation is lacking) and continue recovery using legacy
> full gpu reset.
> 
> A helper function is also added to query the availability of engine
> reset.
> 
> The error events used to notify userspace of a reset are adapted to
> engine reset so that users listening to these events are not broken. In
> the legacy path we report an error event, a reset event before resetting
> the gpu, and a reset done event marking the completion of the reset. The
> same behaviour is kept, but the reset event is dispatched only once even
> when multiple engines are hung. Finally, once the reset is complete we
> send the reset done event as usual.
> 
> Note that this implementation of engine reset is for i915 directly
> submitting to the ELSP, where the driver manages the hang detection,
> recovery and resubmission. With GuC submission these tasks are shared
> between driver and firmware; i915 will still be responsible for detecting a
> hang, and when it does it will have to request GuC to reset that Engine and
> remind the firmware about the outstanding submissions. This will be
> added in a different patch.
> 
> v2: rebase, advertise engine reset availability in platform definition,
> add note about GuC submission.
> v3: s/*engine_reset*/*reset_engine*/. (Chris)
> Handle reset as 2 level resets, by first going to engine only and falling
> back to full/chip reset as needed, i.e. reset_engine will need the
> struct_mutex.
> v4: Pass the engine mask to i915_reset. (Chris)
> v5: Rebase, update selftests.
> v6: Rebase, prepare for mutex-less reset engine.
> v7: Pass reset_engine mask as a function parameter, and iterate over the
> engine mask for reset_engine. (Chris)
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Ian Lister <ian.lister@intel.com>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c     | 15 +++++++++++++++
>  drivers/gpu/drm/i915/i915_drv.h     |  3 +++
>  drivers/gpu/drm/i915/i915_irq.c     | 33 ++++++++++++++++++++++++++++++---
>  drivers/gpu/drm/i915/i915_pci.c     |  5 ++++-
>  drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++++
>  5 files changed, 63 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index c7d68e789642..48c8b69d9bde 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1800,6 +1800,8 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  	if (!test_bit(I915_RESET_HANDOFF, &error->flags))
>  		return;
>  
> +	DRM_DEBUG_DRIVER("resetting chip\n");

This is redundant since we have a "Resetting chip" already here. Just
kill it.

> +
>  	/* Clear any previous failed attempts at recovery. Time to try again. */
>  	if (!i915_gem_unset_wedged(dev_priv))
>  		goto wakeup;
> @@ -1863,6 +1865,19 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  	goto finish;
>  }
>  
> +/**
> + * i915_reset_engine - reset GPU engine to recover from a hang
> + * @engine: engine to reset
> + *
> + * Reset a specific GPU engine. Useful if a hang is detected.
> + * Returns zero on successful reset or otherwise an error code.
> + */
> +int i915_reset_engine(struct intel_engine_cs *engine)
> +{
> +	/* FIXME: replace me with engine reset sequence */
> +	return -ENODEV;
> +}
> +
>  static int i915_pm_suspend(struct device *kdev)
>  {
>  	struct pci_dev *pdev = to_pci_dev(kdev);
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index e06af46f5a57..ab7e68626c49 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -814,6 +814,7 @@ struct intel_csr {
>  	func(has_ddi); \
>  	func(has_decoupled_mmio); \
>  	func(has_dp_mst); \
> +	func(has_reset_engine); \
>  	func(has_fbc); \
>  	func(has_fpga_dbg); \
>  	func(has_full_ppgtt); \
> @@ -3019,6 +3020,8 @@ extern void i915_driver_unload(struct drm_device *dev);
>  extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
>  extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
>  extern void i915_reset(struct drm_i915_private *dev_priv);
> +extern int i915_reset_engine(struct intel_engine_cs *engine);
> +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
>  extern int intel_guc_reset(struct drm_i915_private *dev_priv);
>  extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
>  extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index fd97fe00cd0d..3a59ef1367ec 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2635,11 +2635,13 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
>  /**
>   * i915_reset_and_wakeup - do process context error handling work
>   * @dev_priv: i915 device private
> + * @engine_mask: engine(s) hung - for reset-engine only.
>   *
>   * Fire an error uevent so userspace can see that a hang or error
>   * was detected.
>   */
> -static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
> +static void
> +i915_reset_and_wakeup(struct drm_i915_private *dev_priv, u32 engine_mask)
>  {
>  	struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj;
>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
> @@ -2648,9 +2650,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>  
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>  
> -	DRM_DEBUG_DRIVER("resetting chip\n");
> +	/*
> +	 * This event needs to be sent before performing gpu reset. When
> +	 * engine resets are supported we iterate through all engines and
> +	 * reset hung engines individually. To keep the event dispatch
> +	 * mechanism consistent with full gpu reset, this is only sent once
> +	 * even when multiple engines are hung. It is also safe to move this
> +	 * here because when we are in this function, we will definitely
> +	 * perform gpu reset.
> +	 */
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>  
> +	/* try engine reset first, and continue if fails; look mom, no mutex! */
> +	if (intel_has_reset_engine(dev_priv)) {
> +		struct intel_engine_cs *engine;
> +		unsigned int tmp;
> +
> +		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
> +			if (i915_reset_engine(engine) == 0)
> +				engine_mask &= ~intel_engine_flag(engine);
> +		}
> +
> +		if (engine_mask)
> +			DRM_WARN("per-engine reset failed, promoting to full gpu reset\n");
> +		else
> +			goto finish;

This will look nicer if we did just try per-engine reset and then
quietly promote (it's not that quiet as we do get logging) to global.

for_each_engine_masked() {}
if (!engine_mask)


> +	}
> +
>  	intel_prepare_reset(dev_priv);
>  
>  	set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>  		kobject_uevent_env(kobj,
>  				   KOBJ_CHANGE, reset_done_event);
>  
> +finish:
>  	/*
>  	 * Note: The wake_up also serves as a memory barrier so that
>  	 * waiters see the updated value of the dev_priv->gpu_error.
> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
>  			     &dev_priv->gpu_error.flags))
>  		goto out;
>  
> -	i915_reset_and_wakeup(dev_priv);
> +	i915_reset_and_wakeup(dev_priv, engine_mask);

? You don't need to wakeup the struct_mutex so we don't need this after
per-engine resets. Time to split up i915_reset_and_wakeup(), because we
certainly shouldn't be calling intel_finish_reset() without first calling
intel_prepare_reset(). Which is right here in my tree...

> +/*
> + * When GuC submission is enabled, GuC manages ELSP and can initiate the
> + * engine reset too. For now, fall back to full GPU reset if it is enabled.
> + */
> +bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
> +{
> +	return (dev_priv->info.has_reset_engine &&
> +		!dev_priv->guc.execbuf_client &&
> +		i915.reset == 2);

i915.reset >= 2
-Chris
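
As a rough illustration of the flow discussed above (try each hung engine, clear its bit on success, and promote to the global reset only if something is left in the mask), here is a standalone userspace sketch. It is not the driver code: mock_reset_engine(), reset_hung_engines(), has_reset_engine(), NUM_ENGINES and the fail_mask parameter are hypothetical stand-ins used to make the example self-contained, and the GuC condition is reduced to a plain boolean.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_ENGINES 4 /* hypothetical engine count for the sketch */

/* Stand-in for i915_reset_engine(): engines set in fail_mask refuse to reset. */
int mock_reset_engine(unsigned int id, uint32_t fail_mask)
{
	return (fail_mask & (1u << id)) ? -1 : 0;
}

/*
 * Try each hung engine in turn, clearing its bit on success. Whatever is
 * left in the returned mask still needs the full GPU reset.
 */
uint32_t reset_hung_engines(uint32_t engine_mask, uint32_t fail_mask)
{
	unsigned int id;

	for (id = 0; id < NUM_ENGINES; id++) {
		if (!(engine_mask & (1u << id)))
			continue;
		if (mock_reset_engine(id, fail_mask) == 0)
			engine_mask &= ~(1u << id);
	}
	return engine_mask;
}

/*
 * Gating check using ">= 2" rather than "== 2", so that any higher value
 * of the (assumed) module parameter still enables engine reset.
 */
bool has_reset_engine(bool hw_support, bool guc_submission, int reset_param)
{
	return hw_support && !guc_submission && reset_param >= 2;
}
```

A caller would run reset_hung_engines() first and go on to the prepare/handoff path for the global reset only when the returned mask is non-zero.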

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 04/20] drm/i915: Skip reset request if there is one already
  2017-04-27 23:12 ` [PATCH v7 04/20] drm/i915: Skip reset request if there is one already Michel Thierry
@ 2017-04-29 14:21   ` Chris Wilson
  2017-05-01 21:15     ` Michel Thierry
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Wilson @ 2017-04-29 14:21 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Thu, Apr 27, 2017 at 04:12:44PM -0700, Michel Thierry wrote:
> From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> 
> To perform engine reset we first disable engine to capture its state. This
> is done by issuing a reset request. Because we are reusing existing
> infrastructure, again when we actually reset an engine, reset function
> checks engine mask and issues reset request again which is unnecessary. To
> avoid this we check if the engine is already prepared, if so we just exit
> from that point.

Do we still need this? I am a bit dubious because it implies we have no
idea what we are doing, recursively calling resets.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-04-27 23:12 ` [PATCH v7 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request Michel Thierry
@ 2017-04-29 14:26   ` Chris Wilson
  2017-05-15 21:20   ` [PATCH " Michel Thierry
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 62+ messages in thread
From: Chris Wilson @ 2017-04-29 14:26 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Thu, Apr 27, 2017 at 04:12:45PM -0700, Michel Thierry wrote:
> Before resetting an engine, check if there is an active request, and if
> the _hung_ request has completed. In these two cases, the seqno has moved
> after hang declaration and we can skip the reset.
> 
> Also store the active request so that we only search for it once.
> 
> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c | 37 +++++++++++++++++++++++++++++--------
>  drivers/gpu/drm/i915/i915_drv.h |  6 ++++--
>  drivers/gpu/drm/i915/i915_gem.c | 37 ++++++++++++++++++++++++-------------
>  3 files changed, 57 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index ae891529dedd..a64e9b63cdbc 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1811,7 +1811,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>  	pr_notice("drm/i915: Resetting chip after gpu hang\n");
>  	disable_irq(dev_priv->drm.irq);
>  	ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
> -	if (ret) {
> +	if (ret == -EIO) {
>  		DRM_ERROR("GPU recovery failed\n");
>  		intel_gpu_reset(dev_priv, ALL_ENGINES);
>  		goto error;
> @@ -1883,23 +1883,40 @@ int i915_reset_engine(struct intel_engine_cs *engine)
>  	int ret;
>  	struct drm_i915_private *dev_priv = engine->i915;
>  	struct i915_gpu_error *error = &dev_priv->gpu_error;
> +	struct drm_i915_gem_request *active_request;
>  
>  	GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
>  
>  	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
>  
> -	ret = i915_gem_reset_prepare_engine(engine);
> -	if (ret) {
> -		DRM_ERROR("Previous reset failed - promote to full reset\n");
> -		goto out;
> +	active_request = i915_gem_reset_prepare_engine(engine);
> +	if (!active_request) {
> +		DRM_DEBUG_DRIVER("seqno moved after hang declaration, pardoned\n");
> +		goto canceled;
> +	}
> +	if (IS_ERR(active_request)) {
> +		ret = PTR_ERR(active_request);
> +		if (ret == -ECANCELED) {
> +			DRM_DEBUG_DRIVER("no active request found, skip reset\n");
> +			goto canceled;

-ECANCELED is just NULL. Make it so.

> +		} else if (ret) {
> +			DRM_DEBUG_DRIVER("Previous reset failed, promote to full reset\n");
> +			goto out;
> +		}
>  	}
>  
> +	if (__i915_gem_request_completed(active_request, engine->hangcheck.seqno)) {

Hmm, this is very incorrect. (Do it correctly as part of
prepare_engine.)

> +		DRM_DEBUG_DRIVER("request completed, skip the reset\n");
> +		goto canceled;
> +	}

This is part of the pardon check above. My comment was if we store the
active_request under engine->hangcheck for the global reset case, then
we would need to double check it wasn't completed after the delays of
setting up the global reset. Here, the check that the seqno is still
active is sufficient.

> +
> +
>  	/*
> -	 * the request that caused the hang is stuck on elsp, identify the
> -	 * active request and drop it, adjust head to skip the offending
> +	 * the request that caused the hang is stuck on elsp, we know the
> +	 * active request and can drop it, adjust head to skip the offending
>  	 * request to resume executing remaining requests in the queue.
>  	 */
> -	i915_gem_reset_engine(engine);
> +	i915_gem_reset_engine(engine, active_request);
>  
>  	/* forcing engine to idle */
>  	ret = intel_reset_engine_start(engine);
> @@ -1928,6 +1945,10 @@ int i915_reset_engine(struct intel_engine_cs *engine)
>  
>  out:
>  	return ret;
> +
> +canceled:
> +	i915_gem_reset_finish_engine(engine);
> +	return 0;
>  }
>  
>  static int i915_pm_suspend(struct device *kdev)
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index efbf34318893..8e93189c2104 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -3439,7 +3439,8 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
>  	return READ_ONCE(error->reset_count);
>  }
>  
> -int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
> +struct drm_i915_gem_request *
> +i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
>  int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
>  			   unsigned int engine_mask);
>  void i915_gem_reset(struct drm_i915_private *dev_priv);
> @@ -3448,7 +3449,8 @@ void i915_gem_reset_finish(struct drm_i915_private *dev_priv,
>  			   unsigned int engine_mask);
>  void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
>  bool i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
> -void i915_gem_reset_engine(struct intel_engine_cs *engine);
> +void i915_gem_reset_engine(struct intel_engine_cs *engine,
> +			   struct drm_i915_gem_request *request);
>  
>  void i915_gem_init_mmio(struct drm_i915_private *i915);
>  int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index bce38062f94e..4e357d333cc2 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2793,12 +2793,15 @@ static bool engine_stalled(struct intel_engine_cs *engine)
>  	return true;
>  }
>  
> -/* Ensure irq handler finishes, and not run again. */
> -int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
> +/*
> + * Ensure irq handler finishes, and not run again.
> + * For reset-engine we also store the active request so that we only search
> + * for it once.
> + */
> +struct drm_i915_gem_request *
> +i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>  {
> -	struct drm_i915_gem_request *request;
> -	int err = 0;
> -
> +	struct drm_i915_gem_request *request = NULL;
>  
>  	/* Prevent the signaler thread from updating the request
>  	 * state (by calling dma_fence_signal) as we are processing
> @@ -2827,22 +2830,29 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>  
>  	if (engine_stalled(engine)) {
>  		request = i915_gem_find_active_request(engine);
> +		if (!request)
> +			return ERR_PTR(-ECANCELED); /* Can't find a request, abort! */
> +
>  		if (request && request->fence.error == -EIO)
> -			err = -EIO; /* Previous reset failed! */
> +			return ERR_PTR(-EIO); /* Previous reset failed! */
>  	}
>  
> -	return err;
> +	return request;
>  }
>  
>  int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
>  			   unsigned int engine_mask)
>  {
>  	struct intel_engine_cs *engine;
> +	struct drm_i915_gem_request *request;
>  	unsigned int tmp;
>  	int err = 0;
>  
> -	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
> -		err = i915_gem_reset_prepare_engine(engine);
> +	for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
> +		request = i915_gem_reset_prepare_engine(engine);
> +		if (request && IS_ERR(request))

Can just be IS_ERR(). And this doesn't want to report -ECANCELED!

> +			err = PTR_ERR(request);
> +	}
>  
>  	i915_gem_revoke_fences(dev_priv);
>  
> @@ -2929,11 +2939,12 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
>  	return guilty;
>  }
>  
> -void i915_gem_reset_engine(struct intel_engine_cs *engine)
> +void i915_gem_reset_engine(struct intel_engine_cs *engine,
> +			   struct drm_i915_gem_request *request)
>  {
> -	struct drm_i915_gem_request *request;
> +	if (!request)
> +		request = i915_gem_find_active_request(engine);
>  
> -	request = i915_gem_find_active_request(engine);
>  	if (request && i915_gem_reset_request(request)) {
>  		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
>  				 engine->name, request->global_seqno);
> @@ -2959,7 +2970,7 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
>  	for_each_engine(engine, dev_priv, id) {
>  		struct i915_gem_context *ctx;
>  
> -		i915_gem_reset_engine(engine);
> +		i915_gem_reset_engine(engine, NULL);
>  		ctx = fetch_and_zero(&engine->last_retired_context);
>  		if (ctx)
>  			engine->context_unpin(engine, ctx);
> -- 
> 2.11.0
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
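
The return convention asked for in the comments above (NULL for "nothing to reset", an ERR_PTR only for a real failure, so callers need a single IS_ERR() check) can be sketched in userspace as below. This is an approximation for illustration: ERR_PTR(), PTR_ERR() and IS_ERR() are re-implemented here to mimic the kernel helpers, and struct request, prepare_engine() and reset_engine() are hypothetical stand-ins, not the functions from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace approximations of the kernel's ERR_PTR helpers. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline bool IS_ERR(const void *ptr)
{
	/* NULL is not in the error range, so IS_ERR(NULL) is false. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct request {
	int fence_error; /* stand-in for request->fence.error */
};

/*
 * Hypothetical prepare step:
 *  - NULL: engine not stalled, or no active request, so nothing to reset
 *  - ERR_PTR(-EIO): a previous reset of this request already failed
 *  - otherwise: the request to blame
 */
struct request *prepare_engine(bool stalled, struct request *active)
{
	if (!stalled || !active)
		return NULL;
	if (active->fence_error == -EIO)
		return ERR_PTR(-EIO);
	return active;
}

/* Caller side: only a real error propagates; NULL is a quiet skip. */
int reset_engine(bool stalled, struct request *active)
{
	struct request *rq = prepare_engine(stalled, active);

	if (IS_ERR(rq))
		return (int)PTR_ERR(rq);
	if (!rq)
		return 0; /* pardoned, no reset needed */

	/* ... the actual engine reset would go here ... */
	return 1; /* sketch only: 1 means "a reset was performed" */
}
```

Because IS_ERR(NULL) is false, a test of the form "request && IS_ERR(request)" reduces to IS_ERR(request), and no -ECANCELED value ever reaches the global reset path.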

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 13/20] drm/i915/guc: Rename the function that resets the GuC
  2017-04-28  7:40   ` Tvrtko Ursulin
@ 2017-05-01 20:09     ` Michel Thierry
  0 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-05-01 20:09 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

On 28/04/17 00:40, Tvrtko Ursulin wrote:
>> --- a/drivers/gpu/drm/i915/intel_uc.c
>> +++ b/drivers/gpu/drm/i915/intel_uc.c
>> @@ -46,9 +46,9 @@ static int __intel_uc_reset_hw(struct
>> drm_i915_private *dev_priv)
>>      int ret;
>>      u32 guc_status;
>>
>> -    ret = intel_guc_reset(dev_priv);
>> +    ret = intel_reset_guc(dev_priv);
>>      if (ret) {
>> -        DRM_ERROR("GuC reset failed, ret = %d\n", ret);
>> +        DRM_ERROR("Reset GuC failed, ret = %d\n", ret);
>
> As a non-native speaker I might be wrong, but was thinking something
> like "Failed to reset GuC", "Resetting GuC failed", "Reset of GuC
> failed" would be clearer? I leave it for someone more competent to decide.

"Failed to reset GuC" sounds good to me.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 04/20] drm/i915: Skip reset request if there is one already
  2017-04-29 14:21   ` Chris Wilson
@ 2017-05-01 21:15     ` Michel Thierry
  0 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-05-01 21:15 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 29/04/17 07:21, Chris Wilson wrote:
> On Thu, Apr 27, 2017 at 04:12:44PM -0700, Michel Thierry wrote:
>> From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>
>> To perform engine reset we first disable engine to capture its state. This
>> is done by issuing a reset request. Because we are reusing existing
>> infrastructure, again when we actually reset an engine, reset function
>> checks engine mask and issues reset request again which is unnecessary. To
>> avoid this we check if the engine is already prepared, if so we just exit
>> from that point.
>
> Do we still need this? I am a bit dubious because it implies we have no
> idea what we are doing, recursively calling resets.
> -Chris
>

I can drop this one. It isn't really needed (the 'shortcut' it refers to
exists because we already set the bit in intel_reset_engine_start).

btw here it's only setting/querying "Ready-ness for Reset", and I've 
heard rumours that the register may not clear itself sometimes (but I 
haven't seen that behaviour myself).

-Michel

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 03/20] drm/i915: Add support for per engine reset recovery
  2017-04-27 23:50   ` Chris Wilson
  2017-04-28 21:59     ` Michel Thierry
@ 2017-05-04  0:26     ` Michel Thierry
  1 sibling, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-05-04  0:26 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Mika Kuoppala

On 27/04/17 16:50, Chris Wilson wrote:
> On Thu, Apr 27, 2017 at 04:12:43PM -0700, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>
>> This change implements support for per-engine reset as an initial, less
>> intrusive hang recovery option to be attempted before falling back to the
>> legacy full GPU reset recovery mode if necessary. This is only supported
>> from Gen8 onwards.
>>
>> Hangchecker determines which engines are hung and invokes error handler to
>> recover from it. Error handler schedules recovery for each of those engines
>> that are hung. The recovery procedure is as follows,
>>  - identifies the request that caused the hang and it is dropped
>>  - force engine to idle: this is done by issuing a reset request
>>  - reset and re-init engine
>>  - restart submissions to the engine
>>
>> If engine reset fails then we fall back to heavy weight full gpu reset
>> which resets all engines and reinitializes complete state of HW and SW.
>>
>> v2: Rebase.
>> v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
>> calling i915_gem_reset_engine (Chris).
>> v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
>> reuse the function for reset_engine.
>> v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
>> v6: Clean up reset_engine function to not require mutex, i.e. no need to call
>> revoke/restore_fences and _retire_requests (Chris).
>> v7: Remove leftovers from v5, i.e. no need to disable irq, hold
>> forcewake or wakeup the handoff bit (Chris).
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.c         | 60 ++++++++++++++++++--
>>  drivers/gpu/drm/i915/i915_drv.h         | 12 +++-
>>  drivers/gpu/drm/i915/i915_gem.c         | 97 +++++++++++++++++++--------------
>>  drivers/gpu/drm/i915/i915_gem_request.c |  2 +-
>>  drivers/gpu/drm/i915/intel_uncore.c     | 20 +++++++
>>  5 files changed, 142 insertions(+), 49 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index 48c8b69d9bde..ae891529dedd 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1810,7 +1810,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>
>>  	pr_notice("drm/i915: Resetting chip after gpu hang\n");
>>  	disable_irq(dev_priv->drm.irq);
>> -	ret = i915_gem_reset_prepare(dev_priv);
>> +	ret = i915_gem_reset_prepare(dev_priv, ALL_ENGINES);
>>  	if (ret) {
>>  		DRM_ERROR("GPU recovery failed\n");
>>  		intel_gpu_reset(dev_priv, ALL_ENGINES);
>> @@ -1852,7 +1852,7 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>  	i915_queue_hangcheck(dev_priv);
>>
>>  finish:
>> -	i915_gem_reset_finish(dev_priv);
>> +	i915_gem_reset_finish(dev_priv, ALL_ENGINES);
>>  	enable_irq(dev_priv->drm.irq);
>>
>>  wakeup:
>> @@ -1871,11 +1871,63 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>   *
>>   * Reset a specific GPU engine. Useful if a hang is detected.
>>   * Returns zero on successful reset or otherwise an error code.
>> + *
>> + * Procedure is:
>> + *  - identifies the request that caused the hang and it is dropped
>> + *  - force engine to idle: this is done by issuing a reset request
>> + *  - reset engine
>> + *  - restart submissions to the engine
>
> Why does the prospective caller need to know this?
>
>>   */
>>  int i915_reset_engine(struct intel_engine_cs *engine)
>>  {
>> -	/* FIXME: replace me with engine reset sequence */
>> -	return -ENODEV;
>> +	int ret;
>> +	struct drm_i915_private *dev_priv = engine->i915;
>> +	struct i915_gpu_error *error = &dev_priv->gpu_error;
>> +
>> +	GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
>> +
>> +	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
>> +
>> +	ret = i915_gem_reset_prepare_engine(engine);
>> +	if (ret) {
>> +		DRM_ERROR("Previous reset failed - promote to full reset\n");
>> +		goto out;
>> +	}
>> +
>> +	/*
>> +	 * the request that caused the hang is stuck on elsp, identify the
>> +	 * active request and drop it, adjust head to skip the offending
>> +	 * request to resume executing remaining requests in the queue.
>> +	 */
>> +	i915_gem_reset_engine(engine);
>> +
>> +	/* forcing engine to idle */
>> +	ret = intel_reset_engine_start(engine);
>> +	if (ret) {
>> +		DRM_ERROR("Failed to disable %s\n", engine->name);
>> +		goto out;
>> +	}
>> +
>> +	/* finally, reset engine */
>> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
>> +	if (ret) {
>> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
>> +		intel_reset_engine_cancel(engine);
>> +		goto out;
>> +	}
>> +
>> +	/* be sure the request reset bit gets cleared */
>> +	intel_reset_engine_cancel(engine);
>> +
>> +	i915_gem_reset_finish_engine(engine);
>> +
>> +	/* replay remaining requests in the queue */
>> +	ret = engine->init_hw(engine);
>> +	if (ret)
>> +		goto out; //XXX: ignore this line for now
>
> Please give the comments here some tlc. Focus on the why, you are
> telling me what the code does.
>
Hi, sorry about the delay.

True, and that's not the whole story; after the engine is reset, we have 
to program the RING_MODE & RING_HWS_PGA registers again... so really the 
important thing is to call gen8_init_common_ring (which is part of 
engine->init_hw).

Then we also have workarounds that we don't want to lose after the reset.

So it's better if the comment says that we have to re-init the engine 
and program settings that were lost after the reset-engine.


>> +
>> +out:
>> +	return ret;
>>  }
>>
>>  static int i915_pm_suspend(struct device *kdev)
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index ab7e68626c49..efbf34318893 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -3022,6 +3022,8 @@ extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
>>  extern void i915_reset(struct drm_i915_private *dev_priv);
>>  extern int i915_reset_engine(struct intel_engine_cs *engine);
>>  extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
>> +extern int intel_reset_engine_start(struct intel_engine_cs *engine);
>> +extern void intel_reset_engine_cancel(struct intel_engine_cs *engine);
>>  extern int intel_guc_reset(struct drm_i915_private *dev_priv);
>>  extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
>>  extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
>> @@ -3410,7 +3412,6 @@ int __must_check i915_gem_set_global_seqno(struct drm_device *dev, u32 seqno);
>>
>>  struct drm_i915_gem_request *
>>  i915_gem_find_active_request(struct intel_engine_cs *engine);
>> -
>
> Nope. (find_active_request is not in the same group of operations as
> retire_requests.)
>
>>  void i915_gem_retire_requests(struct drm_i915_private *dev_priv);
>>
>>  static inline bool i915_reset_backoff(struct i915_gpu_error *error)
>> @@ -3438,11 +3439,16 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
>>  	return READ_ONCE(error->reset_count);
>>  }
>>
>> -int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
>> +int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
>> +int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
>> +			   unsigned int engine_mask);
>>  void i915_gem_reset(struct drm_i915_private *dev_priv);
>> -void i915_gem_reset_finish(struct drm_i915_private *dev_priv);
>> +void i915_gem_reset_finish_engine(struct intel_engine_cs *engine);
>> +void i915_gem_reset_finish(struct drm_i915_private *dev_priv,
>> +			   unsigned int engine_mask);
>>  void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
>>  bool i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
>> +void i915_gem_reset_engine(struct intel_engine_cs *engine);
>>
>>  void i915_gem_init_mmio(struct drm_i915_private *i915);
>>  int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>> index 33fb11cc5acc..bce38062f94e 100644
>> --- a/drivers/gpu/drm/i915/i915_gem.c
>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>> @@ -2793,48 +2793,57 @@ static bool engine_stalled(struct intel_engine_cs *engine)
>>  	return true;
>>  }
>>
>> -int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
>> +/* Ensure irq handler finishes, and not run again. */
>> +int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>>  {
>> -	struct intel_engine_cs *engine;
>> -	enum intel_engine_id id;
>> +	struct drm_i915_gem_request *request;
>>  	int err = 0;
>>
>> -	/* Ensure irq handler finishes, and not run again. */
>> -	for_each_engine(engine, dev_priv, id) {
>> -		struct drm_i915_gem_request *request;
>> -
>> -		/* Prevent the signaler thread from updating the request
>> -		 * state (by calling dma_fence_signal) as we are processing
>> -		 * the reset. The write from the GPU of the seqno is
>> -		 * asynchronous and the signaler thread may see a different
>> -		 * value to us and declare the request complete, even though
>> -		 * the reset routine have picked that request as the active
>> -		 * (incomplete) request. This conflict is not handled
>> -		 * gracefully!
>> -		 */
>> -		kthread_park(engine->breadcrumbs.signaler);
>> -
>> -		/* Prevent request submission to the hardware until we have
>> -		 * completed the reset in i915_gem_reset_finish(). If a request
>> -		 * is completed by one engine, it may then queue a request
>> -		 * to a second via its engine->irq_tasklet *just* as we are
>> -		 * calling engine->init_hw() and also writing the ELSP.
>> -		 * Turning off the engine->irq_tasklet until the reset is over
>> -		 * prevents the race.
>> -		 */
>> -		tasklet_kill(&engine->irq_tasklet);
>> -		tasklet_disable(&engine->irq_tasklet);
>>
>> -		if (engine->irq_seqno_barrier)
>> -			engine->irq_seqno_barrier(engine);
>> +	/* Prevent the signaler thread from updating the request
>> +	 * state (by calling dma_fence_signal) as we are processing
>> +	 * the reset. The write from the GPU of the seqno is
>> +	 * asynchronous and the signaler thread may see a different
>> +	 * value to us and declare the request complete, even though
>> +	 * the reset routine have picked that request as the active
>> +	 * (incomplete) request. This conflict is not handled
>> +	 * gracefully!
>> +	 */
>> +	kthread_park(engine->breadcrumbs.signaler);
>> +
>> +	/* Prevent request submission to the hardware until we have
>> +	 * completed the reset in i915_gem_reset_finish(). If a request
>> +	 * is completed by one engine, it may then queue a request
>> +	 * to a second via its engine->irq_tasklet *just* as we are
>> +	 * calling engine->init_hw() and also writing the ELSP.
>> +	 * Turning off the engine->irq_tasklet until the reset is over
>> +	 * prevents the race.
>> +	 */
>> +	tasklet_kill(&engine->irq_tasklet);
>> +	tasklet_disable(&engine->irq_tasklet);
>>
>> -		if (engine_stalled(engine)) {
>> -			request = i915_gem_find_active_request(engine);
>> -			if (request && request->fence.error == -EIO)
>> -				err = -EIO; /* Previous reset failed! */
>> -		}
>> +	if (engine->irq_seqno_barrier)
>> +		engine->irq_seqno_barrier(engine);
>> +
>> +	if (engine_stalled(engine)) {
>> +		request = i915_gem_find_active_request(engine);
>> +		if (request && request->fence.error == -EIO)
>> +			err = -EIO; /* Previous reset failed! */
>>  	}
>>
>> +	return err;
>> +}
>> +
>> +int i915_gem_reset_prepare(struct drm_i915_private *dev_priv,
>> +			   unsigned int engine_mask)
>> +{
>> +	struct intel_engine_cs *engine;
>> +	unsigned int tmp;
>> +	int err = 0;
>> +
>> +	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
>> +		err = i915_gem_reset_prepare_engine(engine);
>
> You are losing any earlier err.
>
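For reference, one way to keep the earlier error is to latch only the first
failure while still visiting every engine. The snippet below is a standalone
mock, not driver code: mock_prepare_engine(), mock_reset_prepare() and the
results table are made-up names that stand in for
i915_gem_reset_prepare_engine() and the for_each_engine_masked() loop.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for i915_gem_reset_prepare_engine(): the table
 * entry is the value the per-engine prepare step would have returned.
 */
static int mock_prepare_engine(const int *results, size_t id)
{
	assert(results != NULL);
	return results[id];
}

/*
 * Prepare every engine, but remember the first failure instead of letting
 * a later success overwrite it. All engines are still visited, so each one
 * goes through its prepare step regardless of earlier errors.
 */
static int mock_reset_prepare(const int *results, size_t count)
{
	int err = 0;
	size_t id;

	for (id = 0; id < count; id++) {
		int ret = mock_prepare_engine(results, id);

		if (ret && !err)
			err = ret;
	}

	return err;
}
```

Whether the real loop should keep the first error or stop early is a
separate question; the mock only shows the bookkeeping.
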
>> +
>>  	i915_gem_revoke_fences(dev_priv);
>>
>>  	return err;
>> @@ -2920,7 +2929,7 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
>>  	return guilty;
>>  }
>>
>> -static void i915_gem_reset_engine(struct intel_engine_cs *engine)
>> +void i915_gem_reset_engine(struct intel_engine_cs *engine)
>>  {
>>  	struct drm_i915_gem_request *request;
>>
>> @@ -2966,16 +2975,22 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
>>  	}
>>  }
>>
>> -void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
>> +void i915_gem_reset_finish_engine(struct intel_engine_cs *engine)
>> +{
>> +	tasklet_enable(&engine->irq_tasklet);
>> +	kthread_unpark(engine->breadcrumbs.signaler);
>> +}
>> +
>> +void i915_gem_reset_finish(struct drm_i915_private *dev_priv,
>> +			   unsigned int engine_mask)
>>  {
>>  	struct intel_engine_cs *engine;
>> -	enum intel_engine_id id;
>> +	unsigned int tmp;
>>
>>  	lockdep_assert_held(&dev_priv->drm.struct_mutex);
>>
>> -	for_each_engine(engine, dev_priv, id) {
>> -		tasklet_enable(&engine->irq_tasklet);
>> -		kthread_unpark(engine->breadcrumbs.signaler);
>> +	for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
>> +		i915_gem_reset_finish_engine(engine);
>>  	}
>>  }
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
>> index 6198f6997d05..f69a8c535d5f 100644
>> --- a/drivers/gpu/drm/i915/i915_gem_request.c
>> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
>> @@ -1216,7 +1216,7 @@ long i915_wait_request(struct drm_i915_gem_request *req,
>>  	return timeout;
>>  }
>>
>> -static void engine_retire_requests(struct intel_engine_cs *engine)
>> +void engine_retire_requests(struct intel_engine_cs *engine)
>
> Fortunately stray chunk. I was about to scream.
>
>>  {
>>  	struct drm_i915_gem_request *request, *next;
>>  	u32 seqno = intel_engine_get_seqno(engine);
>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>> index ab5bdd110ac3..3ebba6b2dd74 100644
>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>> @@ -1801,6 +1801,26 @@ int intel_guc_reset(struct drm_i915_private *dev_priv)
>>  	return ret;
>>  }
>>
>> +/*
>> + * On gen8+ a reset request has to be issued via the reset control register
>> + * before a GPU engine can be reset in order to stop the command streamer
>> + * and idle the engine. This replaces the legacy way of stopping an engine
>> + * by writing to the stop ring bit in the MI_MODE register.
>> + */
>> +int intel_reset_engine_start(struct intel_engine_cs *engine)
>> +{
>> +	return gen8_reset_engine_start(engine);
>> +}
>> +
>> +/*
>> + * It is possible to back off from a previously issued reset request by simply
>> + * clearing the reset request bit in the reset control register.
>> + */
>> +void intel_reset_engine_cancel(struct intel_engine_cs *engine)
>> +{
>> +	gen8_reset_engine_cancel(engine);
>> +}
>> +
>>  bool intel_uncore_unclaimed_mmio(struct drm_i915_private *dev_priv)
>>  {
>>  	return check_for_unclaimed_mmio(dev_priv);
>> --
>> 2.11.0
>>
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery
  2017-04-29 14:19   ` Chris Wilson
@ 2017-05-08 18:31     ` Michel Thierry
  2017-05-12 20:55       ` Michel Thierry
  0 siblings, 1 reply; 62+ messages in thread
From: Michel Thierry @ 2017-05-08 18:31 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Mika Kuoppala


On 4/29/2017 7:19 AM, Chris Wilson wrote:
> On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>
>> This is a preparatory patch which modifies error handler to do per engine
>> hang recovery. The actual patch which implements this sequence follows
>> later in the series. The aim is to prepare existing recovery function to
>> adapt to this new function where applicable (which fails at this point
>> because core implementation is lacking) and continue recovery using legacy
>> full gpu reset.
>>
>> A helper function is also added to query the availability of engine
>> reset.
>>
>> The error events behaviour that are used to notify user of reset are
>> adapted to engine reset such that it doesn't break users listening to these
>> events. In legacy we report an error event, a reset event before resetting
>> the gpu and a reset done event marking the completion of reset. The same
>> behaviour is adapted but reset event is only dispatched once even when
>> multiple engines are hung. Finally once reset is complete we send reset
>> done event as usual.
>>
>> Note that this implementation of engine reset is for i915 directly
>> submitting to the ELSP, where the driver manages the hang detection,
>> recovery and resubmission. With GuC submission these tasks are shared
>> between driver and firmware; i915 will still responsible for detecting a
>> hang, and when it does it will have to request GuC to reset that Engine and
>> remind the firmware about the outstanding submissions. This will be
>> added in different patch.
>>
>> v2: rebase, advertise engine reset availability in platform definition,
>> add note about GuC submission.
>> v3: s/*engine_reset*/*reset_engine*/. (Chris)
>> Handle reset as 2 level resets, by first going to engine only and fall
>> backing to full/chip reset as needed, i.e. reset_engine will need the
>> struct_mutex.
>> v4: Pass the engine mask to i915_reset. (Chris)
>> v5: Rebase, update selftests.
>> v6: Rebase, prepare for mutex-less reset engine.
>> v7: Pass reset_engine mask as a function parameter, and iterate over the
>> engine mask for reset_engine. (Chris)
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Ian Lister <ian.lister@intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.c     | 15 +++++++++++++++
>>  drivers/gpu/drm/i915/i915_drv.h     |  3 +++
>>  drivers/gpu/drm/i915/i915_irq.c     | 33 ++++++++++++++++++++++++++++++---
>>  drivers/gpu/drm/i915/i915_pci.c     |  5 ++++-
>>  drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++++
>>  5 files changed, 63 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index c7d68e789642..48c8b69d9bde 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1800,6 +1800,8 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>  	if (!test_bit(I915_RESET_HANDOFF, &error->flags))
>>  		return;
>>
>> +	DRM_DEBUG_DRIVER("resetting chip\n");
>
> This is redundant since we have a "Resetting chip" already here. Just
> kill it.
>
>> +
>>  	/* Clear any previous failed attempts at recovery. Time to try again. */
>>  	if (!i915_gem_unset_wedged(dev_priv))
>>  		goto wakeup;
>> @@ -1863,6 +1865,19 @@ void i915_reset(struct drm_i915_private *dev_priv)
>>  	goto finish;
>>  }
>>
>> +/**
>> + * i915_reset_engine - reset GPU engine to recover from a hang
>> + * @engine: engine to reset
>> + *
>> + * Reset a specific GPU engine. Useful if a hang is detected.
>> + * Returns zero on successful reset or otherwise an error code.
>> + */
>> +int i915_reset_engine(struct intel_engine_cs *engine)
>> +{
>> +	/* FIXME: replace me with engine reset sequence */
>> +	return -ENODEV;
>> +}
>> +
>>  static int i915_pm_suspend(struct device *kdev)
>>  {
>>  	struct pci_dev *pdev = to_pci_dev(kdev);
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index e06af46f5a57..ab7e68626c49 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -814,6 +814,7 @@ struct intel_csr {
>>  	func(has_ddi); \
>>  	func(has_decoupled_mmio); \
>>  	func(has_dp_mst); \
>> +	func(has_reset_engine); \
>>  	func(has_fbc); \
>>  	func(has_fpga_dbg); \
>>  	func(has_full_ppgtt); \
>> @@ -3019,6 +3020,8 @@ extern void i915_driver_unload(struct drm_device *dev);
>>  extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
>>  extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
>>  extern void i915_reset(struct drm_i915_private *dev_priv);
>> +extern int i915_reset_engine(struct intel_engine_cs *engine);
>> +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
>>  extern int intel_guc_reset(struct drm_i915_private *dev_priv);
>>  extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
>>  extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>> index fd97fe00cd0d..3a59ef1367ec 100644
>> --- a/drivers/gpu/drm/i915/i915_irq.c
>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>> @@ -2635,11 +2635,13 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
>>  /**
>>   * i915_reset_and_wakeup - do process context error handling work
>>   * @dev_priv: i915 device private
>> + * @engine_mask: engine(s) hung - for reset-engine only.
>>   *
>>   * Fire an error uevent so userspace can see that a hang or error
>>   * was detected.
>>   */
>> -static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>> +static void
>> +i915_reset_and_wakeup(struct drm_i915_private *dev_priv, u32 engine_mask)
>>  {
>>  	struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj;
>>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>> @@ -2648,9 +2650,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>>
>>  	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>>
>> -	DRM_DEBUG_DRIVER("resetting chip\n");
>> +	/*
>> +	 * This event needs to be sent before performing gpu reset. When
>> +	 * engine resets are supported we iterate through all engines and
>> +	 * reset hung engines individually. To keep the event dispatch
>> +	 * mechanism consistent with full gpu reset, this is only sent once
>> +	 * even when multiple engines are hung. It is also safe to move this
>> +	 * here because when we are in this function, we will definitely
>> +	 * perform gpu reset.
>> +	 */
>>  	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>>
>> +	/* try engine reset first, and continue if fails; look mom, no mutex! */
>> +	if (intel_has_reset_engine(dev_priv)) {
>> +		struct intel_engine_cs *engine;
>> +		unsigned int tmp;
>> +
>> +		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
>> +			if (i915_reset_engine(engine) == 0)
>> +				engine_mask &= ~intel_engine_flag(engine);
>> +		}
>> +
>> +		if (engine_mask)
>> +			DRM_WARN("per-engine reset failed, promoting to full gpu reset\n");
>> +		else
>> +			goto finish;
>
> This will look nicer if we did just try per-engine reset and then
> quitely promote (it's not that quiet as we do get logging) to global.
>
> for_each_engine_masked() {}
> if (!engine_mask)
>
>
>> +	}
>> +
>>  	intel_prepare_reset(dev_priv);
>>
>>  	set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
>> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>>  		kobject_uevent_env(kobj,
>>  				   KOBJ_CHANGE, reset_done_event);
>>
>> +finish:
>>  	/*
>>  	 * Note: The wake_up also serves as a memory barrier so that
>>  	 * waiters see the updated value of the dev_priv->gpu_error.
>> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
>>  			     &dev_priv->gpu_error.flags))
>>  		goto out;
>>
>> -	i915_reset_and_wakeup(dev_priv);
>> +	i915_reset_and_wakeup(dev_priv, engine_mask);
>
> ? You don't need to wakeup the struct_mutex so we don't need this after
> per-engine resets. Time to split up i915_reset_and_wakeup(), because we
> certainly shouldn't be calling intel_finish_reset() without first calling
> intel_prepare_reset(). Which is right here in my tree...
>

Looking at your tree, it wouldn't call finish_reset there either, only 
these two are called after a successful reset:

finish:
	clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
	wake_up_all(&dev_priv->gpu_error.reset_queue);

But you're right, we only need to clear the error flag, no need to call 
wake_up_all.

Should I move the per-engine reset to i915_handle_error, and then leave 
i915_reset_and_wakeup just for full resets?
That would also make the promotion from per-engine to global look a bit 
'clearer'.
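
To make the question concrete, the split could look roughly like the sketch
below. It is a standalone mock rather than driver code: try_reset_engines(),
mock_reset_engine(), handle_error_needs_full_reset() and the fail_mask
parameter are made-up names for illustration, and only the bit-mask
bookkeeping mirrors the loop in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_NUM_ENGINES 5

/*
 * Hypothetical stand-in for i915_reset_engine(): returns 0 on success.
 * Engines whose bit is set in fail_mask pretend that the reset failed.
 */
static int mock_reset_engine(unsigned int id, uint32_t fail_mask)
{
	assert(id < MOCK_NUM_ENGINES);
	return (fail_mask & (1u << id)) ? -1 : 0;
}

/*
 * Try a per-engine reset for every engine set in engine_mask and clear the
 * bit of each engine that recovered. The return value is the mask of
 * engines that still need recovery.
 */
static uint32_t try_reset_engines(uint32_t engine_mask, uint32_t fail_mask)
{
	unsigned int id;

	for (id = 0; id < MOCK_NUM_ENGINES; id++) {
		if (!(engine_mask & (1u << id)))
			continue;

		if (mock_reset_engine(id, fail_mask) == 0)
			engine_mask &= ~(1u << id);
	}

	return engine_mask;
}

/*
 * Assumed shape of the caller: per-engine resets first, and only a
 * non-empty leftover mask promotes to the full reset path.
 */
static int handle_error_needs_full_reset(uint32_t engine_mask,
					 uint32_t fail_mask)
{
	return try_reset_engines(engine_mask, fail_mask) != 0;
}
```

With this shape the full-reset function never sees the engine mask; it is
only entered when something is left over.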

Thanks,

>> +/*
>> + * When GuC submission is enabled, GuC manages ELSP and can initiate the
>> + * engine reset too. For now, fall back to full GPU reset if it is enabled.
>> + */
>> +bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
>> +{
>> +	return (dev_priv->info.has_reset_engine &&
>> +		!dev_priv->guc.execbuf_client &&
>> +		i915.reset == 2);
>
> i915.reset >= 2
> -Chris
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery
  2017-05-08 18:31     ` Michel Thierry
@ 2017-05-12 20:55       ` Michel Thierry
  2017-05-12 21:09         ` Chris Wilson
  0 siblings, 1 reply; 62+ messages in thread
From: Michel Thierry @ 2017-05-12 20:55 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 5/8/2017 11:31 AM, Michel Thierry wrote:
> On 4/29/2017 7:19 AM, Chris Wilson wrote:
>> On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
>>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>>
...
>>> +    }
>>> +
>>>      intel_prepare_reset(dev_priv);
>>>
>>>      set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
>>> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct
>>> drm_i915_private *dev_priv)
>>>          kobject_uevent_env(kobj,
>>>                     KOBJ_CHANGE, reset_done_event);
>>>
>>> +finish:
>>>      /*
>>>       * Note: The wake_up also serves as a memory barrier so that
>>>       * waiters see the updated value of the dev_priv->gpu_error.
>>> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private
>>> *dev_priv,
>>>                   &dev_priv->gpu_error.flags))
>>>          goto out;
>>>
>>> -    i915_reset_and_wakeup(dev_priv);
>>> +    i915_reset_and_wakeup(dev_priv, engine_mask);
>>
>> ? You don't need to wakeup the struct_mutex so we don't need this after
>> per-engine resets. Time to split up i915_reset_and_wakeup(), because we
>> certainly shouldn't be calling intel_finish_reset() without first calling
>> intel_prepare_reset(). Which is right here in my tree...
>>
>
> Looking at your tree, it wouldn't call finish_reset there either, only
> these two are called after a successful reset:
>
> finish:
>     clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
>     wake_up_all(&dev_priv->gpu_error.reset_queue);
>
> But you're right, we only need to clear the error flag, no need to call
> wake_up_all.
>
> Should I move the per-engine reset to i915_handle_error, and then leave
> i915_reset_and_wakeup just for full resets?
> That would also make the promotion from per-engine to global look a bit
> 'clearer'.
>

I just noticed an issue if I don't call wake_up_all. There can be 
someone else waiting for the reset to complete 
(i915_mutex_lock_interruptible -> i915_gem_wait_for_error).

I915_RESET_BACKOFF has/had 2 roles: stop any other user from grabbing the 
struct mutex (which we won't need in reset-engine) and prevent two 
concurrent reset attempts (which we still want). Time to add a new flag 
for the latter? (I915_RESET_ENGINE_IN_PROGRESS?)
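
As a rough illustration of the two roles living in separate bits, here is a
userspace mock using C11 atomics. None of it is driver code: the mock_*
names and the bit positions are invented, and atomic_fetch_or() only plays
the part that an atomic test-and-set on error->flags would play in the
driver.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative bit assignments only, not the driver's real flag layout. */
enum {
	MOCK_RESET_BACKOFF            = 1u << 0, /* keep others off struct_mutex */
	MOCK_RESET_ENGINE_IN_PROGRESS = 1u << 1, /* serialise engine resets */
};

struct mock_gpu_error {
	atomic_uint flags;
};

/*
 * Claim the engine-reset slot. Returns false if another engine reset is
 * already running, so the caller can back out without touching the engine.
 */
static bool mock_engine_reset_begin(struct mock_gpu_error *error)
{
	unsigned int old = atomic_fetch_or(&error->flags,
					   MOCK_RESET_ENGINE_IN_PROGRESS);

	return !(old & MOCK_RESET_ENGINE_IN_PROGRESS);
}

/* Release the slot; only the path that claimed it may call this. */
static void mock_engine_reset_end(struct mock_gpu_error *error)
{
	unsigned int old = atomic_fetch_and(&error->flags,
			~(unsigned int)MOCK_RESET_ENGINE_IN_PROGRESS);

	assert(old & MOCK_RESET_ENGINE_IN_PROGRESS);
	(void)old;
}

/* The backoff bit is untouched by engine resets in this scheme. */
static bool mock_reset_backoff(struct mock_gpu_error *error)
{
	return atomic_load(&error->flags) & MOCK_RESET_BACKOFF;
}
```

The point of the mock is that an engine reset never sets the backoff bit, so
nobody waiting on it has to be woken afterwards.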

Here's an example without calling wake_up_all (10s timeout):
[  126.816054] [drm:i915_reset_engine [i915]] resetting rcs0
...
[  137.499910] [IGT] gem_ringfill: exiting, ret=0

Compared to the one that does,
[   69.799519] [drm:i915_reset_engine [i915]] resetting rcs0
...
[   69.801335] [IGT] gem_tdr: exiting, ret=0

Thanks,

-Michel






^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery
  2017-05-12 20:55       ` Michel Thierry
@ 2017-05-12 21:09         ` Chris Wilson
  2017-05-12 21:23           ` Michel Thierry
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Wilson @ 2017-05-12 21:09 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Fri, May 12, 2017 at 01:55:11PM -0700, Michel Thierry wrote:
> On 5/8/2017 11:31 AM, Michel Thierry wrote:
> >On 4/29/2017 7:19 AM, Chris Wilson wrote:
> >>On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
> >>>From: Arun Siluvery <arun.siluvery@linux.intel.com>
> >>>
> ...
> >>>+    }
> >>>+
> >>>     intel_prepare_reset(dev_priv);
> >>>
> >>>     set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
> >>>@@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct
> >>>drm_i915_private *dev_priv)
> >>>         kobject_uevent_env(kobj,
> >>>                    KOBJ_CHANGE, reset_done_event);
> >>>
> >>>+finish:
> >>>     /*
> >>>      * Note: The wake_up also serves as a memory barrier so that
> >>>      * waiters see the updated value of the dev_priv->gpu_error.
> >>>@@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private
> >>>*dev_priv,
> >>>                  &dev_priv->gpu_error.flags))
> >>>         goto out;
> >>>
> >>>-    i915_reset_and_wakeup(dev_priv);
> >>>+    i915_reset_and_wakeup(dev_priv, engine_mask);
> >>
> >>? You don't need to wakeup the struct_mutex so we don't need this after
> >>per-engine resets. Time to split up i915_reset_and_wakeup(), because we
> >>certainly shouldn't be calling intel_finish_reset() without first calling
> >>intel_prepare_reset(). Which is right here in my tree...
> >>
> >
> >Looking at your tree, it wouldn't call finish_reset there either, only
> >these two are called after a successful reset:
> >
> >finish:
> >    clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
> >    wake_up_all(&dev_priv->gpu_error.reset_queue);
> >
> >But you're right, we only need to clear the error flag, no need to call
> >wake_up_all.
> >
> >Should I move the per-engine reset to i915_handle_error, and then leave
> >i915_reset_and_wakeup just for full resets?
> >That would also make the promotion from per-engine to global look a bit
> >'clearer'.
> >
> 
> I just noticed an issue if I don't call wake_up_all. There can be
> someone else waiting for the reset to complete
> (i915_mutex_lock_interruptible -> i915_gem_wait_for_error).
> 
> I915_RESET_BACKOFF has/had 2 roles, stop any other user to grab the
> struct mutex (which we won't need in reset-engine) and prevent two
> concurrent reset attempts (which we still want). Time to add a new
> flag for the later? (I915_RESET_ENGINE_IN_PROGRESS?)

Yes, that would be a good idea to avoid dual purposing the bits. Now
that we do direct resets along the wait path, we can completely drop the
i915_mutex_lock_interruptible(). (No one else should be holding the mutex
indefinitely.) I think that's a better approach -- I think we've already
moved all the EIO magic away to the ABI points where we deemed it
necessary.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery
  2017-05-12 21:09         ` Chris Wilson
@ 2017-05-12 21:23           ` Michel Thierry
  0 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-05-12 21:23 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx



On 5/12/2017 2:09 PM, Chris Wilson wrote:
> On Fri, May 12, 2017 at 01:55:11PM -0700, Michel Thierry wrote:
>> On 5/8/2017 11:31 AM, Michel Thierry wrote:
>>> On 4/29/2017 7:19 AM, Chris Wilson wrote:
>>>> On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote:
>>>>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>>>>
>> ...
>>>>> +    }
>>>>> +
>>>>>     intel_prepare_reset(dev_priv);
>>>>>
>>>>>     set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
>>>>> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct
>>>>> drm_i915_private *dev_priv)
>>>>>         kobject_uevent_env(kobj,
>>>>>                    KOBJ_CHANGE, reset_done_event);
>>>>>
>>>>> +finish:
>>>>>     /*
>>>>>      * Note: The wake_up also serves as a memory barrier so that
>>>>>      * waiters see the updated value of the dev_priv->gpu_error.
>>>>> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private
>>>>> *dev_priv,
>>>>>                  &dev_priv->gpu_error.flags))
>>>>>         goto out;
>>>>>
>>>>> -    i915_reset_and_wakeup(dev_priv);
>>>>> +    i915_reset_and_wakeup(dev_priv, engine_mask);
>>>>
>>>> ? You don't need to wakeup the struct_mutex so we don't need this after
>>>> per-engine resets. Time to split up i915_reset_and_wakeup(), because we
>>>> certainly shouldn't be calling intel_finish_reset() without first calling
>>>> intel_prepare_reset(). Which is right here in my tree...
>>>>
>>>
>>> Looking at your tree, it wouldn't call finish_reset there either, only
>>> these two are called after a successful reset:
>>>
>>> finish:
>>>    clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags);
>>>    wake_up_all(&dev_priv->gpu_error.reset_queue);
>>>
>>> But you're right, we only need to clear the error flag, no need to call
>>> wake_up_all.
>>>
>>> Should I move the per-engine reset to i915_handle_error, and then leave
>>> i915_reset_and_wakeup just for full resets?
>>> That would also make the promotion from per-engine to global look a bit
>>> 'clearer'.
>>>
>>
>> I just noticed an issue if I don't call wake_up_all. There can be
>> someone else waiting for the reset to complete
>> (i915_mutex_lock_interruptible -> i915_gem_wait_for_error).
>>
>> I915_RESET_BACKOFF has/had 2 roles, stop any other user to grab the
>> struct mutex (which we won't need in reset-engine) and prevent two
>> concurrent reset attempts (which we still want). Time to add a new
>> flag for the later? (I915_RESET_ENGINE_IN_PROGRESS?)
>
> Yes, that would be a good idea to avoid dual purposing the bits. Now
> that we do direct resets along the wait path, we can completely drop the
> i915_mutex_interruptible(). (No one else should be holding the mutex
> indefinitely.) I think that's a better approach -- I think we've already
> moved all the EIO magic aware to the ABI points where we deemed it
> necessary.

And it seems to work ok with the new flag and no wake_up. I'll run more 
tests.

Thanks

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 02/20] drm/i915: Modify error handler for per engine hang recovery
  2017-04-27 23:12 ` [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery Michel Thierry
  2017-04-29 14:19   ` Chris Wilson
@ 2017-05-15 21:14   ` Michel Thierry
  1 sibling, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-05-15 21:14 UTC (permalink / raw)
  To: intel-gfx

This is a preparatory patch which modifies error handler to do per engine
hang recovery. The actual patch which implements this sequence follows
later in the series. The aim is to prepare existing recovery function to
adapt to this new function where applicable (which fails at this point
because core implementation is lacking) and continue recovery using legacy
full gpu reset.

A helper function is also added to query the availability of engine
reset.

<OPEN>
The error events that are used to notify the user of a reset are currently
omitted in case of engine reset. In legacy we report an error event, a reset
event before resetting the gpu and a reset done event marking the completion
of reset. The same behaviour can be adapted, but the reset event will only be
dispatched once even when multiple engines are hung, and there needs to be
some control to prevent multiple events in case the reset-engine fails and
the driver falls back to full chip reset.
</OPEN>

Note that this implementation of engine reset is for i915 directly
submitting to the ELSP, where the driver manages the hang detection,
recovery and resubmission. With GuC submission these tasks are shared
between driver and firmware; i915 will still be responsible for detecting a
hang, and when it does it will have to request GuC to reset that Engine and
remind the firmware about the outstanding submissions. This will be
added in different patch.

v2: rebase, advertise engine reset availability in platform definition,
add note about GuC submission.
v3: s/*engine_reset*/*reset_engine*/. (Chris)
Handle reset as 2 level resets, by first going to engine only and falling
back to full/chip reset as needed, i.e. reset_engine will need the
struct_mutex.
v4: Pass the engine mask to i915_reset. (Chris)
v5: Rebase, update selftests.
v6: Rebase, prepare for mutex-less reset engine.
v7: Pass reset_engine mask as a function parameter, and iterate over the
engine mask for reset_engine. (Chris)
v8: Use i915.reset >=2 in has_reset_engine; remove redundant reset
logging; add a reset-engine-in-progress flag to prevent concurrent
resets, and avoid dual purposing of reset-backoff. (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c     | 13 +++++++++++++
 drivers/gpu/drm/i915/i915_drv.h     |  8 ++++++++
 drivers/gpu/drm/i915/i915_irq.c     | 24 ++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_pci.c     |  5 ++++-
 drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++++
 5 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 72fb47a439d2..fa21458d6e1e 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1877,6 +1877,19 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	goto finish;
 }
 
+/**
+ * i915_reset_engine - reset GPU engine to recover from a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected.
+ * Returns zero on successful reset or otherwise an error code.
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+	/* FIXME: replace me with engine reset sequence */
+	return -ENODEV;
+}
+
 static int i915_pm_suspend(struct device *kdev)
 {
 	struct pci_dev *pdev = to_pci_dev(kdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index ebc38f33723d..f0ff918e5d0b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -705,6 +705,7 @@ struct intel_csr {
 	func(has_ddi); \
 	func(has_decoupled_mmio); \
 	func(has_dp_mst); \
+	func(has_reset_engine); \
 	func(has_fbc); \
 	func(has_fpga_dbg); \
 	func(has_full_ppgtt); \
@@ -1498,6 +1499,10 @@ struct i915_gpu_error {
 	 * inspect the bit and do the reset directly, otherwise the worker
 	 * waits for the struct_mutex.
 	 *
+	 * #I915_RESET_ENGINE_IN_PROGRESS - Since the driver doesn't need to
+	 * acquire the struct_mutex to reset an engine, we need an explicit
+	 * flag to prevent two concurrent reset-engine attempts.
+	 *
 	 * #I915_WEDGED - If reset fails and we can no longer use the GPU,
 	 * we set the #I915_WEDGED bit. Prior to command submission, e.g.
 	 * i915_gem_request_alloc(), this bit is checked and the sequence
@@ -1506,6 +1511,7 @@ struct i915_gpu_error {
 	unsigned long flags;
 #define I915_RESET_BACKOFF	0
 #define I915_RESET_HANDOFF	1
+#define I915_RESET_ENGINE_IN_PROGRESS	2
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
 	/**
@@ -2989,6 +2995,8 @@ extern void i915_driver_unload(struct drm_device *dev);
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
+extern int i915_reset_engine(struct intel_engine_cs *engine);
+extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 9f5ae1e938be..958d56f49dad 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2741,6 +2741,30 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	if (!engine_mask)
 		goto out;
 
+	/* try engine reset first, and continue if fails */
+	if (intel_has_reset_engine(dev_priv)) {
+		struct intel_engine_cs *engine;
+		unsigned int tmp;
+
+		/* protect against concurrent reset attempts */
+		if (test_and_set_bit(I915_RESET_ENGINE_IN_PROGRESS,
+				     &dev_priv->gpu_error.flags))
+			goto out;
+
+		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
+			if (i915_reset_engine(engine) == 0)
+				engine_mask &= ~intel_engine_flag(engine);
+		}
+
+		/* clear unconditionally, full reset won't care about it */
+		clear_bit(I915_RESET_ENGINE_IN_PROGRESS,
+			  &dev_priv->gpu_error.flags);
+
+		if (!engine_mask)
+			goto out;
+	}
+
+	/* full reset needs the mutex, stop any other user trying to do so */
 	if (test_and_set_bit(I915_RESET_BACKOFF,
 			     &dev_priv->gpu_error.flags))
 		goto out;
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index f80db2ccd92f..4dfb400aef85 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -310,7 +310,8 @@ static const struct intel_device_info intel_haswell_info = {
 	BDW_COLORS, \
 	.has_logical_ring_contexts = 1, \
 	.has_full_48bit_ppgtt = 1, \
-	.has_64bit_reloc = 1
+	.has_64bit_reloc = 1, \
+	.has_reset_engine = 1
 
 static const struct intel_device_info intel_broadwell_info = {
 	BDW_FEATURES,
@@ -341,6 +342,7 @@ static const struct intel_device_info intel_cherryview_info = {
 	.has_gmch_display = 1,
 	.has_aliasing_ppgtt = 1,
 	.has_full_ppgtt = 1,
+	.has_reset_engine = 1,
 	.display_mmio_offset = VLV_DISPLAY_BASE,
 	GEN_CHV_PIPEOFFSETS,
 	CURSOR_OFFSETS,
@@ -389,6 +391,7 @@ static const struct intel_device_info intel_skylake_gt3_info = {
 	.has_aliasing_ppgtt = 1, \
 	.has_full_ppgtt = 1, \
 	.has_full_48bit_ppgtt = 1, \
+	.has_reset_engine = 1, \
 	GEN_DEFAULT_PIPEOFFSETS, \
 	IVB_CURSOR_OFFSETS, \
 	BDW_COLORS
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index a9a6933afda2..5f55cb57127a 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1776,6 +1776,17 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
+/*
+ * When GuC submission is enabled, GuC manages ELSP and can initiate the
+ * engine reset too. For now, fall back to full GPU reset if it is enabled.
+ */
+bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
+{
+	return (dev_priv->info.has_reset_engine &&
+		!dev_priv->guc.execbuf_client &&
+		i915.reset >= 2);
+}
+
 int intel_guc_reset(struct drm_i915_private *dev_priv)
 {
 	int ret;
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 03/20] drm/i915: Add support for per engine reset recovery
  2017-04-27 23:12 ` [PATCH v7 03/20] drm/i915: Add support for per engine reset recovery Michel Thierry
  2017-04-27 23:50   ` Chris Wilson
@ 2017-05-15 21:18   ` Michel Thierry
  1 sibling, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-05-15 21:18 UTC (permalink / raw)
  To: intel-gfx

This change implements support for per-engine reset as an initial, less
intrusive hang recovery option to be attempted before falling back to the
legacy full GPU reset recovery mode if necessary. This is only supported
from Gen8 onwards.

Hangcheck determines which engines are hung and invokes the error handler
to recover from it. The error handler schedules recovery for each of the
engines that are hung. The recovery procedure is as follows:
 - identify the request that caused the hang and drop it
 - force the engine to idle: this is done by issuing a reset request
 - reset the engine
 - re-init the engine to resume submissions.

If engine reset fails then we fall back to the heavyweight full gpu reset,
which resets all engines and reinitializes the complete state of HW and SW.
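
As a rough sketch of the ordering above, the following user-space C
records each step in a trace string instead of touching hardware. The
struct fake_engine type, step() and reset_engine_sketch() are
hypothetical; the step letters map loosely to the calls made in
i915_reset_engine() in the diff below. It follows the posted code in that
the error paths return without the finish/re-init steps, leaving the rest
to the full reset.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical engine model: knobs to inject failures, plus a trace. */
struct fake_engine {
	int fail_start;		/* make the reset request fail */
	int fail_reset;		/* make the reset itself fail */
	char trace[16];		/* one letter per executed step, in order */
};

/* Append one step letter to the trace. */
static void step(struct fake_engine *e, char c)
{
	size_t n = strlen(e->trace);

	assert(n + 1 < sizeof(e->trace));
	e->trace[n] = c;
	e->trace[n + 1] = '\0';
}

static int reset_engine_sketch(struct fake_engine *e)
{
	step(e, 'P');	/* prepare: park signaler, disable tasklet */
	step(e, 'D');	/* drop the hung request, move head past it */

	step(e, 'S');	/* issue the reset request to force idle */
	if (e->fail_start)
		return -EIO;

	step(e, 'R');	/* reset the engine */
	if (e->fail_reset) {
		step(e, 'C');	/* back off the reset request */
		return -EIO;
	}

	step(e, 'C');	/* make sure the request bit is cleared */
	step(e, 'F');	/* finish: re-enable tasklet, unpark signaler */
	step(e, 'I');	/* re-init the engine to resume submission */

	return 0;
}
```

A zero-initialised struct fake_engine yields the trace "PDSRCFI" and a
return value of 0; setting fail_reset stops after the cancel step and
returns -EIO, at which point the caller would promote to full reset.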

v2: Rebase.
v3: s/*engine_reset*/*reset_engine*/; freeze engine and irqs before
calling i915_gem_reset_engine (Chris).
v4: Rebase, modify i915_gem_reset_prepare to use a ring mask and
reuse the function for reset_engine.
v5: intel_reset_engine_start/cancel instead of request/unrequest_reset.
v6: Clean up reset_engine function to not require mutex, i.e. no need to call
revoke/restore_fences and _retire_requests (Chris).
v7: Remove leftovers from v5, i.e. no need to disable irq, hold
forcewake or wakeup the handoff bit (Chris).
v8: engine_retire_requests should be (and it was) static; explain that
we have to re-init the engine after reset, which is why the init_hw call
is needed; check reset-in-progress flag (Chris).

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c     | 58 ++++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_drv.h     |  5 ++
 drivers/gpu/drm/i915/i915_gem.c     | 92 +++++++++++++++++++++----------------
 drivers/gpu/drm/i915/intel_uncore.c | 20 ++++++++
 4 files changed, 133 insertions(+), 42 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index fa21458d6e1e..d62793805794 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1883,11 +1883,65 @@ void i915_reset(struct drm_i915_private *dev_priv)
  *
  * Reset a specific GPU engine. Useful if a hang is detected.
  * Returns zero on successful reset or otherwise an error code.
+ *
+ * Procedure is:
+ *  - identifies the request that caused the hang and it is dropped
+ *  - force engine to idle: this is done by issuing a reset request
+ *  - reset engine
+ *  - re-init/configure engine
  */
 int i915_reset_engine(struct intel_engine_cs *engine)
 {
-	/* FIXME: replace me with engine reset sequence */
-	return -ENODEV;
+	int ret;
+	struct drm_i915_private *dev_priv = engine->i915;
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+
+	GEM_BUG_ON(!test_bit(I915_RESET_ENGINE_IN_PROGRESS, &error->flags));
+
+	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
+
+	ret = i915_gem_reset_prepare_engine(engine);
+	if (ret) {
+		DRM_ERROR("Previous reset failed - promote to full reset\n");
+		goto out;
+	}
+
+	/*
+	 * the request that caused the hang is stuck on elsp, identify the
+	 * active request and drop it, adjust head to skip the offending
+	 * request to resume executing remaining requests in the queue.
+	 */
+	i915_gem_reset_engine(engine);
+
+	/* forcing engine to idle */
+	ret = intel_reset_engine_start(engine);
+	if (ret) {
+		DRM_ERROR("Failed to disable %s\n", engine->name);
+		goto out;
+	}
+
+	/* finally, reset engine */
+	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+	if (ret) {
+		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
+		intel_reset_engine_cancel(engine);
+		goto out;
+	}
+
+	/* be sure the request reset bit gets cleared */
+	intel_reset_engine_cancel(engine);
+
+	i915_gem_reset_finish_engine(engine);
+
+	/*
+	 * The engine and its registers (and workarounds in case of render)
+	 * have been reset to their default values. Follow the init_ring
+	 * process to program RING_MODE, HWSP and re-enable submission.
+	 */
+	ret = engine->init_hw(engine);
+
+out:
+	return ret;
 }
 
 static int i915_pm_suspend(struct device *kdev)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index f0ff918e5d0b..a5b9c666b3bf 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2997,6 +2997,8 @@ extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
 extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
+extern int intel_reset_engine_start(struct intel_engine_cs *engine);
+extern void intel_reset_engine_cancel(struct intel_engine_cs *engine);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
@@ -3368,11 +3370,14 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
+int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
 void i915_gem_reset(struct drm_i915_private *dev_priv);
+void i915_gem_reset_finish_engine(struct intel_engine_cs *engine);
 void i915_gem_reset_finish(struct drm_i915_private *dev_priv);
 void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
 bool i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
+void i915_gem_reset_engine(struct intel_engine_cs *engine);
 
 void i915_gem_init_mmio(struct drm_i915_private *i915);
 int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 0c1cbe98c994..b5dc073a5ddc 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2793,48 +2793,56 @@ static bool engine_stalled(struct intel_engine_cs *engine)
 	return true;
 }
 
-int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
+/* Ensure irq handler finishes, and not run again. */
+int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 {
-	struct intel_engine_cs *engine;
-	enum intel_engine_id id;
+	struct drm_i915_gem_request *request;
 	int err = 0;
 
-	/* Ensure irq handler finishes, and not run again. */
-	for_each_engine(engine, dev_priv, id) {
-		struct drm_i915_gem_request *request;
-
-		/* Prevent the signaler thread from updating the request
-		 * state (by calling dma_fence_signal) as we are processing
-		 * the reset. The write from the GPU of the seqno is
-		 * asynchronous and the signaler thread may see a different
-		 * value to us and declare the request complete, even though
-		 * the reset routine have picked that request as the active
-		 * (incomplete) request. This conflict is not handled
-		 * gracefully!
-		 */
-		kthread_park(engine->breadcrumbs.signaler);
-
-		/* Prevent request submission to the hardware until we have
-		 * completed the reset in i915_gem_reset_finish(). If a request
-		 * is completed by one engine, it may then queue a request
-		 * to a second via its engine->irq_tasklet *just* as we are
-		 * calling engine->init_hw() and also writing the ELSP.
-		 * Turning off the engine->irq_tasklet until the reset is over
-		 * prevents the race.
-		 */
-		tasklet_kill(&engine->irq_tasklet);
-		tasklet_disable(&engine->irq_tasklet);
 
-		if (engine->irq_seqno_barrier)
-			engine->irq_seqno_barrier(engine);
+	/* Prevent the signaler thread from updating the request
+	 * state (by calling dma_fence_signal) as we are processing
+	 * the reset. The write from the GPU of the seqno is
+	 * asynchronous and the signaler thread may see a different
+	 * value to us and declare the request complete, even though
+	 * the reset routine have picked that request as the active
+	 * (incomplete) request. This conflict is not handled
+	 * gracefully!
+	 */
+	kthread_park(engine->breadcrumbs.signaler);
+
+	/* Prevent request submission to the hardware until we have
+	 * completed the reset in i915_gem_reset_finish(). If a request
+	 * is completed by one engine, it may then queue a request
+	 * to a second via its engine->irq_tasklet *just* as we are
+	 * calling engine->init_hw() and also writing the ELSP.
+	 * Turning off the engine->irq_tasklet until the reset is over
+	 * prevents the race.
+	 */
+	tasklet_kill(&engine->irq_tasklet);
+	tasklet_disable(&engine->irq_tasklet);
 
-		if (engine_stalled(engine)) {
-			request = i915_gem_find_active_request(engine);
-			if (request && request->fence.error == -EIO)
-				err = -EIO; /* Previous reset failed! */
-		}
+	if (engine->irq_seqno_barrier)
+		engine->irq_seqno_barrier(engine);
+
+	if (engine_stalled(engine)) {
+		request = i915_gem_find_active_request(engine);
+		if (request && request->fence.error == -EIO)
+			err = -EIO; /* Previous reset failed! */
 	}
 
+	return err;
+}
+
+int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
+{
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	int err = 0;
+
+	for_each_engine(engine, dev_priv, id)
+		err = i915_gem_reset_prepare_engine(engine);
+
 	i915_gem_revoke_fences(dev_priv);
 
 	return err;
@@ -2920,7 +2928,7 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 	return guilty;
 }
 
-static void i915_gem_reset_engine(struct intel_engine_cs *engine)
+void i915_gem_reset_engine(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *request;
 
@@ -2966,6 +2974,12 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
 	}
 }
 
+void i915_gem_reset_finish_engine(struct intel_engine_cs *engine)
+{
+	tasklet_enable(&engine->irq_tasklet);
+	kthread_unpark(engine->breadcrumbs.signaler);
+}
+
 void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
 {
 	struct intel_engine_cs *engine;
@@ -2973,10 +2987,8 @@ void i915_gem_reset_finish(struct drm_i915_private *dev_priv)
 
 	lockdep_assert_held(&dev_priv->drm.struct_mutex);
 
-	for_each_engine(engine, dev_priv, id) {
-		tasklet_enable(&engine->irq_tasklet);
-		kthread_unpark(engine->breadcrumbs.signaler);
-	}
+	for_each_engine(engine, dev_priv, id)
+		i915_gem_reset_finish_engine(engine);
 }
 
 static void nop_submit_request(struct drm_i915_gem_request *request)
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 5f55cb57127a..55c2b9486f70 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1801,6 +1801,26 @@ int intel_guc_reset(struct drm_i915_private *dev_priv)
 	return ret;
 }
 
+/*
+ * On gen8+ a reset request has to be issued via the reset control register
+ * before a GPU engine can be reset in order to stop the command streamer
+ * and idle the engine. This replaces the legacy way of stopping an engine
+ * by writing to the stop ring bit in the MI_MODE register.
+ */
+int intel_reset_engine_start(struct intel_engine_cs *engine)
+{
+	return gen8_reset_engine_start(engine);
+}
+
+/*
+ * It is possible to back off from a previously issued reset request by simply
+ * clearing the reset request bit in the reset control register.
+ */
+void intel_reset_engine_cancel(struct intel_engine_cs *engine)
+{
+	gen8_reset_engine_cancel(engine);
+}
+
 bool intel_uncore_unclaimed_mmio(struct drm_i915_private *dev_priv)
 {
 	return check_for_unclaimed_mmio(dev_priv);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-04-27 23:12 ` [PATCH v7 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request Michel Thierry
  2017-04-29 14:26   ` Chris Wilson
@ 2017-05-15 21:20   ` Michel Thierry
  2017-05-15 21:31     ` Chris Wilson
  2017-05-17 20:41   ` [PATCH] " Michel Thierry
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 62+ messages in thread
From: Michel Thierry @ 2017-05-15 21:20 UTC (permalink / raw)
  To: intel-gfx

Before resetting an engine, check whether there is still an active request
and whether the _hung_ request has completed. If there is no active
request, or the hung request has completed, the seqno has moved after hang
declaration and we can skip the reset.

Also store the active request so that we only search for it once.
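
The three possible outcomes of the prepare step (pardon, promote to full
reset, or reset this engine) can be sketched in user-space C as below.
This is an assumption-laden illustration: err_ptr(), ptr_err() and
is_err() are local imitations of the kernel's ERR_PTR(), PTR_ERR() and
IS_ERR(), while struct fake_request, prepare_engine() and classify() are
hypothetical names that do not exist in the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space imitations of the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
static inline void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline bool is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical request model: only the two fields the check looks at. */
struct fake_request {
	int fence_error;	/* -EIO if a previous reset failed */
	bool completed;		/* seqno moved after hang declaration */
};

enum reset_action {
	RESET_PARDONED,		/* nothing to do, skip the reset */
	RESET_PROMOTE_TO_FULL,	/* per-engine reset is not safe */
	RESET_THIS_ENGINE,	/* go ahead with the engine reset */
};

/* Returns the request to drop, NULL to pardon, or an encoded error. */
static struct fake_request *prepare_engine(struct fake_request *active)
{
	if (!active)
		return NULL;
	if (active->fence_error == -EIO)
		return err_ptr(-EIO);
	if (active->completed)
		return NULL;
	return active;
}

/* Decode the three-way result the way the caller in the patch does. */
static enum reset_action classify(struct fake_request *active, long *err)
{
	struct fake_request *rq = prepare_engine(active);

	assert(err != NULL);
	*err = 0;

	if (!rq)
		return RESET_PARDONED;
	if (is_err(rq)) {
		*err = ptr_err(rq);
		return RESET_PROMOTE_TO_FULL;
	}
	return RESET_THIS_ENGINE;
}
```

Returning the request itself is what lets the caller pass it on to the
reset step without searching for it a second time.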

v2: Check for request completion inside _prepare_engine, don't use
ECANCELED, remove unnecessary null checks (Chris).

Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c | 21 +++++++++++++------
 drivers/gpu/drm/i915/i915_drv.h |  6 ++++--
 drivers/gpu/drm/i915/i915_gem.c | 45 ++++++++++++++++++++++++++++-------------
 3 files changed, 50 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index d62793805794..6ee60c1e17ee 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1895,23 +1895,28 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	int ret;
 	struct drm_i915_private *dev_priv = engine->i915;
 	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct drm_i915_gem_request *active_request;
 
 	GEM_BUG_ON(!test_bit(I915_RESET_ENGINE_IN_PROGRESS, &error->flags));
 
 	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
 
-	ret = i915_gem_reset_prepare_engine(engine);
-	if (ret) {
-		DRM_ERROR("Previous reset failed - promote to full reset\n");
+	active_request = i915_gem_reset_prepare_engine(engine);
+	if (!active_request) {
+		DRM_DEBUG_DRIVER("seqno moved after hang declaration, pardoned\n");
+		goto canceled;
+	} else if (IS_ERR(active_request)) {
+		DRM_DEBUG_DRIVER("Previous reset failed, promote to full reset\n");
+		ret = PTR_ERR(active_request);
 		goto out;
 	}
 
 	/*
-	 * the request that caused the hang is stuck on elsp, identify the
-	 * active request and drop it, adjust head to skip the offending
+	 * the request that caused the hang is stuck on elsp, we know the
+	 * active request and can drop it, adjust head to skip the offending
 	 * request to resume executing remaining requests in the queue.
 	 */
-	i915_gem_reset_engine(engine);
+	i915_gem_reset_engine(engine, active_request);
 
 	/* forcing engine to idle */
 	ret = intel_reset_engine_start(engine);
@@ -1942,6 +1947,10 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 
 out:
 	return ret;
+
+canceled:
+	i915_gem_reset_finish_engine(engine);
+	return 0;
 }
 
 static int i915_pm_suspend(struct device *kdev)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a5b9c666b3bf..f8cbd286f904 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3370,14 +3370,16 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
 void i915_gem_reset(struct drm_i915_private *dev_priv);
 void i915_gem_reset_finish_engine(struct intel_engine_cs *engine);
 void i915_gem_reset_finish(struct drm_i915_private *dev_priv);
 void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
 bool i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
-void i915_gem_reset_engine(struct intel_engine_cs *engine);
+void i915_gem_reset_engine(struct intel_engine_cs *engine,
+			   struct drm_i915_gem_request *request);
 
 void i915_gem_init_mmio(struct drm_i915_private *i915);
 int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index b5dc073a5ddc..2e47678315d4 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2793,12 +2793,15 @@ static bool engine_stalled(struct intel_engine_cs *engine)
 	return true;
 }
 
-/* Ensure irq handler finishes, and not run again. */
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
+/*
+ * Ensure irq handler finishes, and not run again.
+ * For reset-engine we also store the active request so that we only search
+ * for it once.
+ */
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 {
-	struct drm_i915_gem_request *request;
-	int err = 0;
-
+	struct drm_i915_gem_request *request = NULL;
 
 	/* Prevent the signaler thread from updating the request
 	 * state (by calling dma_fence_signal) as we are processing
@@ -2827,21 +2830,34 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 
 	if (engine_stalled(engine)) {
 		request = i915_gem_find_active_request(engine);
-		if (request && request->fence.error == -EIO)
-			err = -EIO; /* Previous reset failed! */
+
+		if (request) {
+			if (request->fence.error == -EIO)
+				return ERR_PTR(-EIO); /* Previous reset failed! */
+
+			if (__i915_gem_request_completed(request,
+							 engine->hangcheck.seqno))
+				return NULL; /* request completed, skip reset */
+		}
 	}
 
-	return err;
+	return request;
 }
 
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
 {
 	struct intel_engine_cs *engine;
+	struct drm_i915_gem_request *request;
 	enum intel_engine_id id;
 	int err = 0;
 
-	for_each_engine(engine, dev_priv, id)
-		err = i915_gem_reset_prepare_engine(engine);
+	for_each_engine(engine, dev_priv, id) {
+		request = i915_gem_reset_prepare_engine(engine);
+		if (IS_ERR(request)) {
+			err = PTR_ERR(request);
+			break;
+		}
+	}
 
 	i915_gem_revoke_fences(dev_priv);
 
@@ -2928,11 +2944,12 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 	return guilty;
 }
 
-void i915_gem_reset_engine(struct intel_engine_cs *engine)
+void i915_gem_reset_engine(struct intel_engine_cs *engine,
+			   struct drm_i915_gem_request *request)
 {
-	struct drm_i915_gem_request *request;
+	if (!request)
+		request = i915_gem_find_active_request(engine);
 
-	request = i915_gem_find_active_request(engine);
 	if (request && i915_gem_reset_request(request)) {
 		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
 				 engine->name, request->global_seqno);
@@ -2958,7 +2975,7 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
 	for_each_engine(engine, dev_priv, id) {
 		struct i915_gem_context *ctx;
 
-		i915_gem_reset_engine(engine);
+		i915_gem_reset_engine(engine, NULL);
 		ctx = fetch_and_zero(&engine->last_retired_context);
 		if (ctx)
 			engine->context_unpin(engine, ctx);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-05-15 21:20   ` [PATCH " Michel Thierry
@ 2017-05-15 21:31     ` Chris Wilson
  2017-05-15 21:47       ` Chris Wilson
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Wilson @ 2017-05-15 21:31 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Mon, May 15, 2017 at 02:20:01PM -0700, Michel Thierry wrote:
> @@ -2827,21 +2830,34 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>  
>  	if (engine_stalled(engine)) {
>  		request = i915_gem_find_active_request(engine);
> -		if (request && request->fence.error == -EIO)
> -			err = -EIO; /* Previous reset failed! */
> +
> +		if (request) {
> +			if (request->fence.error == -EIO)
> +				return ERR_PTR(-EIO); /* Previous reset failed! */
> +
> +			if (__i915_gem_request_completed(request,
> +							 engine->hangcheck.seqno))

This is not the seqno for the request, so this is incorrect. It will
judge that the request was preempted (as hangcheck.seqno must be less
than request->global_seqno) and so conclude that the request was never
completed.

You just want if (i915_gem_request_completed(request))
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 62+ messages in thread

* ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev5)
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (20 preceding siblings ...)
  2017-04-27 23:30 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev4) Patchwork
@ 2017-05-15 21:32 ` Patchwork
  2017-05-15 21:48 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev7) Patchwork
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Patchwork @ 2017-05-15 21:32 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

== Series Details ==

Series: Gen8+ engine-reset (rev5)
URL   : https://patchwork.freedesktop.org/series/21868/
State : success

== Summary ==

Series 21868v5 Gen8+ engine-reset
https://patchwork.freedesktop.org/api/1.0/series/21868/revisions/5/mbox/

Test gem_exec_flush:
        Subgroup basic-batch-kernel-default-uc:
                pass       -> FAIL       (fi-snb-2600) fdo#100007

fdo#100007 https://bugs.freedesktop.org/show_bug.cgi?id=100007

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time:446s
fi-bdw-gvtdvm    total:278  pass:256  dwarn:8   dfail:0   fail:0   skip:14  time:430s
fi-bsw-n3050     total:278  pass:242  dwarn:0   dfail:0   fail:0   skip:36  time:595s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time:513s
fi-byt-j1900     total:278  pass:254  dwarn:0   dfail:0   fail:0   skip:24  time:496s
fi-byt-n2820     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:491s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:420s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:410s
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50  time:423s
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:489s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:470s
fi-kbl-7500u     total:278  pass:255  dwarn:5   dfail:0   fail:0   skip:18  time:473s
fi-kbl-7560u     total:278  pass:263  dwarn:5   dfail:0   fail:0   skip:10  time:578s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:468s
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17  time:588s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time:461s
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:498s
fi-skl-gvtdvm    total:278  pass:265  dwarn:0   dfail:0   fail:0   skip:13  time:439s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:545s
fi-snb-2600      total:278  pass:248  dwarn:0   dfail:0   fail:1   skip:29  time:417s

9b25870f9fa4548ec2bb40e42fa28f35db2189e1 drm-tip: 2017y-05m-15d-15h-47m-31s UTC integration manifest
8001816 drm/i915: Cancel reset-engine if we couldn't find an active request
7430985 drm/i915: Skip reset request if there is one already
3297c86 drm/i915: Add support for per engine reset recovery
47d3cf1 drm/i915: Modify error handler for per engine hang recovery
3a9b15f drm/i915: Update i915.reset to handle engine resets

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4701/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-05-15 21:31     ` Chris Wilson
@ 2017-05-15 21:47       ` Chris Wilson
  2017-05-15 22:25         ` Michel Thierry
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Wilson @ 2017-05-15 21:47 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx

On Mon, May 15, 2017 at 10:31:58PM +0100, Chris Wilson wrote:
> On Mon, May 15, 2017 at 02:20:01PM -0700, Michel Thierry wrote:
> > @@ -2827,21 +2830,34 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
> >  
> >  	if (engine_stalled(engine)) {
> >  		request = i915_gem_find_active_request(engine);
> > -		if (request && request->fence.error == -EIO)
> > -			err = -EIO; /* Previous reset failed! */
> > +
> > +		if (request) {
> > +			if (request->fence.error == -EIO)
> > +				return ERR_PTR(-EIO); /* Previous reset failed! */
> > +
> > +			if (__i915_gem_request_completed(request,
> > +							 engine->hangcheck.seqno))
> 
> This is not the seqno for the request, so this is incorrect. It will
> judge that the request was preempted (as hangcheck.seqno must be less
> than request->global_seqno) and so conclude that the request was never
> completed.
> 
> You just want if (i915_gem_request_completed(request))

Also not here. This pardon check should be deferred to the caller just
before committing to the reset. In the case of global reset, we want to
gather up all the engines' active requests first, complete our
preparations and then double check the engine was hung.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
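
As an illustration of the seqno point raised in this review, here is a
minimal userspace sketch. It assumes the two-argument helper behaves as the
review describes: it reports completion only when the seqno passed in is the
request's own. Every sketch_* name and the seqno_request type are
hypothetical stand-ins, not driver code; only the wraparound-safe comparison
mirrors the idea behind i915_seqno_passed().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a request; not the driver's struct. */
struct seqno_request {
	uint32_t global_seqno;	/* seqno assigned to this request */
};

/* Wraparound-safe "a is at or after b", the idea behind i915_seqno_passed(). */
static inline bool sketch_seqno_passed(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/*
 * Assumed model of the two-argument check: the request counts as completed
 * only if the hardware has passed `seqno` AND `seqno` is the request's own.
 */
static inline bool sketch_completed_with(const struct seqno_request *rq,
					 uint32_t hw_seqno, uint32_t seqno)
{
	return sketch_seqno_passed(hw_seqno, seqno) &&
	       seqno == rq->global_seqno;
}

/* Model of the one-argument form: always compare against the request's seqno. */
static inline bool sketch_completed(const struct seqno_request *rq,
				    uint32_t hw_seqno)
{
	return sketch_completed_with(rq, hw_seqno, rq->global_seqno);
}
```

Under these assumptions, passing a hangcheck seqno that differs from the
request's own seqno can never report completion, however far the hardware
has advanced, which is why the one-argument form is the one suggested.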

^ permalink raw reply	[flat|nested] 62+ messages in thread

* ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev7)
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (21 preceding siblings ...)
  2017-05-15 21:32 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev5) Patchwork
@ 2017-05-15 21:48 ` Patchwork
  2017-05-17 21:09 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev8) Patchwork
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 62+ messages in thread
From: Patchwork @ 2017-05-15 21:48 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

== Series Details ==

Series: Gen8+ engine-reset (rev7)
URL   : https://patchwork.freedesktop.org/series/21868/
State : success

== Summary ==

Series 21868v7 Gen8+ engine-reset
https://patchwork.freedesktop.org/api/1.0/series/21868/revisions/7/mbox/

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time:448s
fi-bdw-gvtdvm    total:278  pass:256  dwarn:8   dfail:0   fail:0   skip:14  time:432s
fi-bsw-n3050     total:278  pass:242  dwarn:0   dfail:0   fail:0   skip:36  time:587s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time:510s
fi-byt-j1900     total:278  pass:254  dwarn:0   dfail:0   fail:0   skip:24  time:498s
fi-byt-n2820     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:486s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:423s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:418s
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50  time:420s
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:502s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:465s
fi-kbl-7500u     total:278  pass:255  dwarn:5   dfail:0   fail:0   skip:18  time:460s
fi-kbl-7560u     total:278  pass:263  dwarn:5   dfail:0   fail:0   skip:10  time:573s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:457s
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17  time:584s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time:464s
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:499s
fi-skl-gvtdvm    total:278  pass:265  dwarn:0   dfail:0   fail:0   skip:13  time:439s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:539s
fi-snb-2600      total:278  pass:249  dwarn:0   dfail:0   fail:0   skip:29  time:405s

9b25870f9fa4548ec2bb40e42fa28f35db2189e1 drm-tip: 2017y-05m-15d-15h-47m-31s UTC integration manifest
568b6eb drm/i915: Cancel reset-engine if we couldn't find an active request
a581675 drm/i915: Skip reset request if there is one already
ce0b6a2 drm/i915: Add support for per engine reset recovery
dbc521d drm/i915: Modify error handler for per engine hang recovery
a421731 drm/i915: Update i915.reset to handle engine resets

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4702/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-05-15 21:47       ` Chris Wilson
@ 2017-05-15 22:25         ` Michel Thierry
  2017-05-16  7:54           ` Chris Wilson
  0 siblings, 1 reply; 62+ messages in thread
From: Michel Thierry @ 2017-05-15 22:25 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 5/15/2017 2:47 PM, Chris Wilson wrote:
> On Mon, May 15, 2017 at 10:31:58PM +0100, Chris Wilson wrote:
>> On Mon, May 15, 2017 at 02:20:01PM -0700, Michel Thierry wrote:
>>> @@ -2827,21 +2830,34 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>>>
>>>  	if (engine_stalled(engine)) {
>>>  		request = i915_gem_find_active_request(engine);
>>> -		if (request && request->fence.error == -EIO)
>>> -			err = -EIO; /* Previous reset failed! */
>>> +
>>> +		if (request) {
>>> +			if (request->fence.error == -EIO)
>>> +				return ERR_PTR(-EIO); /* Previous reset failed! */
>>> +
>>> +			if (__i915_gem_request_completed(request,
>>> +							 engine->hangcheck.seqno))
>>
>> This is not the seqno for the request, so this is incorrect. It will
>> judge that the request was preempted (as hangcheck.seqno must be less
>> than request->global_seqno) and so conclude that the request was never
>> completed.
>>
>> You just want if (i915_gem_request_completed(request))

Thanks, I'll change the function.

>
> Also not here. This pardon check should be deferred to the caller just
> before committing to the reset. In the case of global reset, we want to
> gather up all the engines' active requests first, complete our
> preparations and then double check the engine was hung.

i915_reset_engine calls this directly, but 'full reset' [from 
i915_gem_reset_prepare()] would not be affected and it won't pardon 
anything... i915_gem_reset_engine is doing the double check you mention.

-Michel

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-05-15 22:25         ` Michel Thierry
@ 2017-05-16  7:54           ` Chris Wilson
  2017-05-17  0:13             ` Michel Thierry
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Wilson @ 2017-05-16  7:54 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Mon, May 15, 2017 at 03:25:27PM -0700, Michel Thierry wrote:
> On 5/15/2017 2:47 PM, Chris Wilson wrote:
> >On Mon, May 15, 2017 at 10:31:58PM +0100, Chris Wilson wrote:
> >>On Mon, May 15, 2017 at 02:20:01PM -0700, Michel Thierry wrote:
> >>>@@ -2827,21 +2830,34 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
> >>>
> >>> 	if (engine_stalled(engine)) {
> >>> 		request = i915_gem_find_active_request(engine);
> >>>-		if (request && request->fence.error == -EIO)
> >>>-			err = -EIO; /* Previous reset failed! */
> >>>+
> >>>+		if (request) {
> >>>+			if (request->fence.error == -EIO)
> >>>+				return ERR_PTR(-EIO); /* Previous reset failed! */
> >>>+
> >>>+			if (__i915_gem_request_completed(request,
> >>>+							 engine->hangcheck.seqno))
> >>
> >>This is not the seqno for the request, so this is incorrect. It will
> >>judge that the request was preempted (as hangcheck.seqno must be less
> >>than request->global_seqno) and so conclude that the request was never
> >>completed.
> >>
> >>You just want if (i915_gem_request_completed(request))
> 
> Thanks, I'll change the function.
> 
> >
> >Also not here. This pardon check should be deferred to the caller just
> >before committing to the reset. In the case of global reset, we want to
> >gather up all the engines' active requests first, complete our
> >preparations and then double check the engine was hung.
> 
> i915_reset_engine calls this directly, but 'full reset' [from
> i915_gem_reset_prepare()] would not be affected and it won't pardon
> anything... i915_gem_reset_engine is doing the double check you
> mention.

Aye, but in the long run I was thinking of capturing this request in
engine->hangcheck.active_request and then we reuse that info in the later
phases.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-05-16  7:54           ` Chris Wilson
@ 2017-05-17  0:13             ` Michel Thierry
  2017-05-17  7:19               ` Chris Wilson
  0 siblings, 1 reply; 62+ messages in thread
From: Michel Thierry @ 2017-05-17  0:13 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 16/05/17 00:54, Chris Wilson wrote:
> On Mon, May 15, 2017 at 03:25:27PM -0700, Michel Thierry wrote:
>> On 5/15/2017 2:47 PM, Chris Wilson wrote:
>>> On Mon, May 15, 2017 at 10:31:58PM +0100, Chris Wilson wrote:
>>>> On Mon, May 15, 2017 at 02:20:01PM -0700, Michel Thierry wrote:
>>>>> @@ -2827,21 +2830,34 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>>>>>
>>>>> 	if (engine_stalled(engine)) {
>>>>> 		request = i915_gem_find_active_request(engine);
>>>>> -		if (request && request->fence.error == -EIO)
>>>>> -			err = -EIO; /* Previous reset failed! */
>>>>> +
>>>>> +		if (request) {
>>>>> +			if (request->fence.error == -EIO)
>>>>> +				return ERR_PTR(-EIO); /* Previous reset failed! */
>>>>> +
>>>>> +			if (__i915_gem_request_completed(request,
>>>>> +							 engine->hangcheck.seqno))
>>>>
>>>> This is not the seqno for the request, so this is incorrect. It will
>>>> judge that the request was preempted (as hangcheck.seqno must be less
>>>> than request->global_seqno) and so conclude that the request was never
>>>> completed.
>>>>
>>>> You just want if (i915_gem_request_completed(request))
>>
>> Thanks, I'll change the function.
>>
>>>
>>> Also not here. This pardon check should be deferred to the caller just
>>> before committing to the reset. In the case of global reset, we want to
>>> gather up all the engines' active requests first, complete our
>>> preparations and then double check the engine was hung.
>>
>> i915_reset_engine calls this directly, but 'full reset' [from
>> i915_gem_reset_prepare()] would not be affected and it won't pardon
>> anything... i915_gem_reset_engine is doing the double check you
>> mention.
>
> Aye, but in the long run I was thinking of capturing this request in
> engine->hangcheck.active_request and then we reuse that info in the later
> phases.

Capture hangcheck.active_request during hangcheck_declare_hang? Or still 
here in reset_prepare?

Thanks

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-05-17  0:13             ` Michel Thierry
@ 2017-05-17  7:19               ` Chris Wilson
  0 siblings, 0 replies; 62+ messages in thread
From: Chris Wilson @ 2017-05-17  7:19 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Tue, May 16, 2017 at 05:13:58PM -0700, Michel Thierry wrote:
> On 16/05/17 00:54, Chris Wilson wrote:
> >On Mon, May 15, 2017 at 03:25:27PM -0700, Michel Thierry wrote:
> >>On 5/15/2017 2:47 PM, Chris Wilson wrote:
> >>>On Mon, May 15, 2017 at 10:31:58PM +0100, Chris Wilson wrote:
> >>>>On Mon, May 15, 2017 at 02:20:01PM -0700, Michel Thierry wrote:
> >>>>>@@ -2827,21 +2830,34 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
> >>>>>
> >>>>>	if (engine_stalled(engine)) {
> >>>>>		request = i915_gem_find_active_request(engine);
> >>>>>-		if (request && request->fence.error == -EIO)
> >>>>>-			err = -EIO; /* Previous reset failed! */
> >>>>>+
> >>>>>+		if (request) {
> >>>>>+			if (request->fence.error == -EIO)
> >>>>>+				return ERR_PTR(-EIO); /* Previous reset failed! */
> >>>>>+
> >>>>>+			if (__i915_gem_request_completed(request,
> >>>>>+							 engine->hangcheck.seqno))
> >>>>
> >>>>This is not the seqno for the request, so this is incorrect. It will
> >>>>judge that the request was preempted (as hangcheck.seqno must be less
> >>>>than request->global_seqno) and so conclude that the request was never
> >>>>completed.
> >>>>
> >>>>You just want if (i915_gem_request_completed(request))
> >>
> >>Thanks, I'll change the function.
> >>
> >>>
> >>>Also not here. This pardon check should be deferred to the caller just
> >>>before committing to the reset. In the case of global reset, we want to
> >>>gather up all the engines' active requests first, complete our
> >>>preparations and then double check the engine was hung.
> >>
> >>i915_reset_engine calls this directly, but 'full reset' [from
> >>i915_gem_reset_prepare()] would not be affected and it won't pardon
> >>anything... i915_gem_reset_engine is doing the double check you
> >>mention.
> >
> >Aye, but in the long run I was thinking of capturing this request in
> >engine->hangcheck.active_request and then we reuse that info in the later
> >phases.
> 
> Capture hangcheck.active_request during hangcheck_declare_hang? Or
> still here in reset_prepare?

Not in the hangcheck worker itself, since we want that to be as lockless
as we can make it. Putting it inside the reset works just as well, so we
should do it there.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-04-27 23:12 ` [PATCH v7 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request Michel Thierry
  2017-04-29 14:26   ` Chris Wilson
  2017-05-15 21:20   ` [PATCH " Michel Thierry
@ 2017-05-17 20:41   ` Michel Thierry
  2017-05-17 20:52     ` Chris Wilson
  2017-05-18 18:22   ` [PATCH] drm/i915: Look for active requests earlier in the reset path Michel Thierry
  2017-05-18 21:11   ` Michel Thierry
  4 siblings, 1 reply; 62+ messages in thread
From: Michel Thierry @ 2017-05-17 20:41 UTC (permalink / raw)
  To: intel-gfx

Before resetting an engine, check whether there is still an active request
and whether the _hung_ request has completed. If there is no active request,
or the hung request has completed, the seqno has moved since the hang was
declared and we can skip the reset.

Also store the active request so that we only search for it once; this
applies to both reset-engine and full reset.

v2: Check for request completion inside _prepare_engine, don't use
ECANCELED, remove unnecessary null checks (Chris).

v3: Capture active requests during reset_prepare and store them in the
engine hangcheck object (Chris).

Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c         | 18 ++++++++++----
 drivers/gpu/drm/i915/i915_drv.h         |  3 ++-
 drivers/gpu/drm/i915/i915_gem.c         | 42 +++++++++++++++++++++++----------
 drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
 4 files changed, 46 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index d62793805794..771857258292 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1900,15 +1900,19 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 
 	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
 
-	ret = i915_gem_reset_prepare_engine(engine);
-	if (ret) {
-		DRM_ERROR("Previous reset failed - promote to full reset\n");
+	engine->hangcheck.active_request = i915_gem_reset_prepare_engine(engine);
+	if (!engine->hangcheck.active_request) {
+		DRM_DEBUG_DRIVER("seqno moved after hang declaration, pardoned\n");
+		goto canceled;
+	} else if (IS_ERR(engine->hangcheck.active_request)) {
+		DRM_DEBUG_DRIVER("Previous reset failed, promote to full reset\n");
+		ret = PTR_ERR(engine->hangcheck.active_request);
 		goto out;
 	}
 
 	/*
-	 * the request that caused the hang is stuck on elsp, identify the
-	 * active request and drop it, adjust head to skip the offending
+	 * the request that caused the hang is stuck on elsp, we know the
+	 * active request and can drop it, adjust head to skip the offending
 	 * request to resume executing remaining requests in the queue.
 	 */
 	i915_gem_reset_engine(engine);
@@ -1942,6 +1946,10 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 
 out:
 	return ret;
+
+canceled:
+	i915_gem_reset_finish_engine(engine);
+	return 0;
 }
 
 static int i915_pm_suspend(struct device *kdev)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a5b9c666b3bf..6cbfeaa02246 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3370,7 +3370,8 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
 void i915_gem_reset(struct drm_i915_private *dev_priv);
 void i915_gem_reset_finish_engine(struct intel_engine_cs *engine);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index b5dc073a5ddc..5ec454dafb9f 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2793,12 +2793,14 @@ static bool engine_stalled(struct intel_engine_cs *engine)
 	return true;
 }
 
-/* Ensure irq handler finishes, and not run again. */
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
+/*
+ * Ensure irq handler finishes, and not run again.
+ * Also store the active request so that we only search for it once.
+ */
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 {
-	struct drm_i915_gem_request *request;
-	int err = 0;
-
+	struct drm_i915_gem_request *request = NULL;
 
 	/* Prevent the signaler thread from updating the request
 	 * state (by calling dma_fence_signal) as we are processing
@@ -2827,21 +2829,35 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 
 	if (engine_stalled(engine)) {
 		request = i915_gem_find_active_request(engine);
-		if (request && request->fence.error == -EIO)
-			err = -EIO; /* Previous reset failed! */
+
+		if (request) {
+			if (request->fence.error == -EIO)
+				return ERR_PTR(-EIO); /* Previous reset failed! */
+
+			if (i915_gem_request_completed(request))
+				return NULL; /* request completed, skip it */
+		}
 	}
 
-	return err;
+	return request;
 }
 
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
 {
 	struct intel_engine_cs *engine;
+	struct drm_i915_gem_request *request;
 	enum intel_engine_id id;
 	int err = 0;
 
-	for_each_engine(engine, dev_priv, id)
-		err = i915_gem_reset_prepare_engine(engine);
+	for_each_engine(engine, dev_priv, id) {
+		request = i915_gem_reset_prepare_engine(engine);
+		if (IS_ERR(request)) {
+			err = PTR_ERR(request);
+			break;
+		}
+
+		engine->hangcheck.active_request = request;
+	}
 
 	i915_gem_revoke_fences(dev_priv);
 
@@ -2930,9 +2946,11 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 
 void i915_gem_reset_engine(struct intel_engine_cs *engine)
 {
-	struct drm_i915_gem_request *request;
+	struct drm_i915_gem_request *request = engine->hangcheck.active_request;
+
+	if (!request)
+		request = i915_gem_find_active_request(engine);
 
-	request = i915_gem_find_active_request(engine);
 	if (request && i915_gem_reset_request(request)) {
 		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
 				 engine->name, request->global_seqno);
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index ec16fb6fde62..f850c4b12337 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -121,6 +121,7 @@ struct intel_engine_hangcheck {
 	unsigned long action_timestamp;
 	int deadlock;
 	struct intel_instdone instdone;
+	struct drm_i915_gem_request *active_request;
 	bool stalled;
 };
 
-- 
2.11.0
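
The return convention this patch gives i915_gem_reset_prepare_engine() (a
valid request, NULL when the hang is pardoned, or an error pointer when a
previous reset failed) can be modelled in userspace. This is a minimal
sketch, not the driver's code: the sketch_* helpers imitate the kernel's
ERR_PTR()/IS_ERR()/PTR_ERR() from include/linux/err.h, and the struct, enum
and function names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace imitations of the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(). */
#define SKETCH_MAX_ERRNO 4095

static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline bool sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Hypothetical stand-in for a request; not the driver's struct. */
struct sketch_request {
	int fence_error;	/* -EIO if a previous reset already failed */
	bool completed;		/* finished after the hang was declared */
};

enum sketch_action { SKETCH_PARDON, SKETCH_FULL_RESET, SKETCH_ENGINE_RESET };

/* Mirrors the patch's convention: request, NULL, or an error pointer. */
static inline struct sketch_request *
sketch_prepare_engine(struct sketch_request *active)
{
	if (!active)
		return NULL;			/* nothing active: pardon */
	if (active->fence_error == -EIO)
		return sketch_err_ptr(-EIO);	/* previous reset failed */
	if (active->completed)
		return NULL;			/* request completed: pardon */
	return active;
}

/* Caller side: NULL and error pointers are disjoint, so either test may go first. */
static inline enum sketch_action sketch_decide(struct sketch_request *active)
{
	struct sketch_request *rq = sketch_prepare_engine(active);

	if (!rq)
		return SKETCH_PARDON;
	if (sketch_is_err(rq))
		return SKETCH_FULL_RESET;
	return SKETCH_ENGINE_RESET;
}
```

Encoding the error in the pointer lets one return value carry all three
outcomes, at the cost that every caller must test for both NULL and an error
pointer before dereferencing the result.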


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-05-17 20:41   ` [PATCH] " Michel Thierry
@ 2017-05-17 20:52     ` Chris Wilson
  2017-05-18  1:11       ` Michel Thierry
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Wilson @ 2017-05-17 20:52 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Wed, May 17, 2017 at 01:41:34PM -0700, Michel Thierry wrote:
> @@ -2827,21 +2829,35 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>  
>  	if (engine_stalled(engine)) {
>  		request = i915_gem_find_active_request(engine);
> -		if (request && request->fence.error == -EIO)
> -			err = -EIO; /* Previous reset failed! */
> +
> +		if (request) {
> +			if (request->fence.error == -EIO)
> +				return ERR_PTR(-EIO); /* Previous reset failed! */
> +
> +			if (i915_gem_request_completed(request))
> +				return NULL; /* request completed, skip it */

This check is pointless here. We are just a few cycles since it was
known to be true. Both paths should be doing it just before the actual
reset for symmetry.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

^ permalink raw reply	[flat|nested] 62+ messages in thread

* ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev8)
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (22 preceding siblings ...)
  2017-05-15 21:48 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev7) Patchwork
@ 2017-05-17 21:09 ` Patchwork
  2017-05-18 18:40 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev9) Patchwork
  2017-05-18 21:29 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev10) Patchwork
  25 siblings, 0 replies; 62+ messages in thread
From: Patchwork @ 2017-05-17 21:09 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

== Series Details ==

Series: Gen8+ engine-reset (rev8)
URL   : https://patchwork.freedesktop.org/series/21868/
State : success

== Summary ==

Series 21868v8 Gen8+ engine-reset
https://patchwork.freedesktop.org/api/1.0/series/21868/revisions/8/mbox/

Test gem_exec_flush:
        Subgroup basic-batch-kernel-default-uc:
                pass       -> FAIL       (fi-snb-2600) fdo#100007

fdo#100007 https://bugs.freedesktop.org/show_bug.cgi?id=100007

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time:444s
fi-bsw-n3050     total:278  pass:242  dwarn:0   dfail:0   fail:0   skip:36  time:585s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time:516s
fi-byt-j1900     total:278  pass:254  dwarn:0   dfail:0   fail:0   skip:24  time:493s
fi-byt-n2820     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:483s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:418s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:408s
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50  time:416s
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:498s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:464s
fi-kbl-7500u     total:278  pass:255  dwarn:5   dfail:0   fail:0   skip:18  time:467s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:460s
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17  time:578s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time:468s
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:510s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:529s
fi-snb-2600      total:278  pass:248  dwarn:0   dfail:0   fail:1   skip:29  time:397s

eb3549d312620118dec3a69200894ac8a8fff358 drm-tip: 2017y-05m-17d-13h-53m-40s UTC integration manifest
efa35f9 drm/i915: Cancel reset-engine if we couldn't find an active request
67b883b drm/i915: Skip reset request if there is one already
2fe8d13 drm/i915: Add support for per engine reset recovery
a60dcd2 drm/i915: Modify error handler for per engine hang recovery
eb1aebb drm/i915: Update i915.reset to handle engine resets

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4731/

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-05-17 20:52     ` Chris Wilson
@ 2017-05-18  1:11       ` Michel Thierry
  2017-05-18  7:56         ` Chris Wilson
  0 siblings, 1 reply; 62+ messages in thread
From: Michel Thierry @ 2017-05-18  1:11 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 17/05/17 13:52, Chris Wilson wrote:
> On Wed, May 17, 2017 at 01:41:34PM -0700, Michel Thierry wrote:
>> @@ -2827,21 +2829,35 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>>
>>  	if (engine_stalled(engine)) {
>>  		request = i915_gem_find_active_request(engine);
>> -		if (request && request->fence.error == -EIO)
>> -			err = -EIO; /* Previous reset failed! */
>> +
>> +		if (request) {
>> +			if (request->fence.error == -EIO)
>> +				return ERR_PTR(-EIO); /* Previous reset failed! */
>> +
>> +			if (i915_gem_request_completed(request))
>> +				return NULL; /* request completed, skip it */
>
> This check is pointless here. We are just a few cycles since it was
> known to be true. Both paths should be doing it just before the actual
> reset for symmetry.

As you said, in gem_reset_request, 'guilty' should check for 
i915_gem_request_completed instead of engine_stalled... but at that 
point it's too late to cancel the reset (intel_gpu_reset has already 
been called).


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-05-18  1:11       ` Michel Thierry
@ 2017-05-18  7:56         ` Chris Wilson
  2017-05-18 17:19           ` Michel Thierry
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Wilson @ 2017-05-18  7:56 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Wed, May 17, 2017 at 06:11:06PM -0700, Michel Thierry wrote:
> On 17/05/17 13:52, Chris Wilson wrote:
> >On Wed, May 17, 2017 at 01:41:34PM -0700, Michel Thierry wrote:
> >>@@ -2827,21 +2829,35 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
> >>
> >> 	if (engine_stalled(engine)) {
> >> 		request = i915_gem_find_active_request(engine);
> >>-		if (request && request->fence.error == -EIO)
> >>-			err = -EIO; /* Previous reset failed! */
> >>+
> >>+		if (request) {
> >>+			if (request->fence.error == -EIO)
> >>+				return ERR_PTR(-EIO); /* Previous reset failed! */
> >>+
> >>+			if (i915_gem_request_completed(request))
> >>+				return NULL; /* request completed, skip it */
> >
> >This check is pointless here. We are just a few cycles since it was
> >known to be true. Both paths should be doing it just before the actual
> >reset for symmetry.
> 
> As you said, in gem_reset_request, 'guilty' should check for
> i915_gem_request_completed instead of engine_stalled... but at that
> point it's too late to cancel the reset (intel_gpu_reset has already
> been called).

Ok. At that point we are just deciding between skipping the request or
replaying it. The motivation behind carrying forward the active_request
was to avoid the repeated searches + engine_stalled() checks (since any
future check can then just confirm the active_request is still
incomplete).
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
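
The motivation described in this reply (search for the active request once,
then let later phases only confirm it is still incomplete) can be sketched
as follows. This is a simplified userspace model under assumed names: the
sketch_* types and functions are hypothetical, and the `searches` counter
exists only to make the saved work visible.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a request; not the driver's struct. */
struct sketch_rq {
	bool completed;
};

struct sketch_engine {
	struct sketch_rq *queue;		/* requests in submission order */
	size_t count;
	struct sketch_rq *active_request;	/* cached by the prepare phase */
	unsigned int searches;			/* number of queue walks so far */
};

/* Walk the queue for the oldest incomplete request (the repeated cost). */
static inline struct sketch_rq *sketch_find_active(struct sketch_engine *e)
{
	e->searches++;
	for (size_t i = 0; i < e->count; i++)
		if (!e->queue[i].completed)
			return &e->queue[i];
	return NULL;
}

/* Prepare phase: search once and remember the result. */
static inline void sketch_reset_prepare(struct sketch_engine *e)
{
	e->active_request = sketch_find_active(e);
}

/* Later phase: no new search, only confirm the cached request is incomplete. */
static inline bool sketch_still_guilty(const struct sketch_engine *e)
{
	return e->active_request && !e->active_request->completed;
}
```

In this model any number of later checks cost one pointer test and one flag
read, and the queue is walked a single time per reset.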

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH] drm/i915: Cancel reset-engine if we couldn't find an active request
  2017-05-18  7:56         ` Chris Wilson
@ 2017-05-18 17:19           ` Michel Thierry
  0 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-05-18 17:19 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 5/18/2017 12:56 AM, Chris Wilson wrote:
> On Wed, May 17, 2017 at 06:11:06PM -0700, Michel Thierry wrote:
>> On 17/05/17 13:52, Chris Wilson wrote:
>>> On Wed, May 17, 2017 at 01:41:34PM -0700, Michel Thierry wrote:
>>>> @@ -2827,21 +2829,35 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>>>>
>>>> 	if (engine_stalled(engine)) {
>>>> 		request = i915_gem_find_active_request(engine);
>>>> -		if (request && request->fence.error == -EIO)
>>>> -			err = -EIO; /* Previous reset failed! */
>>>> +
>>>> +		if (request) {
>>>> +			if (request->fence.error == -EIO)
>>>> +				return ERR_PTR(-EIO); /* Previous reset failed! */
>>>> +
>>>> +			if (i915_gem_request_completed(request))
>>>> +				return NULL; /* request completed, skip it */
>>>
>>> This check is pointless here. We are just a few cycles since it was
>>> known to be true. Both paths should be doing it just before the actual
>>> reset for symmetry.
>>
>> As you said, in gem_reset_request, 'guilty' should check for
>> i915_gem_request_completed instead of engine_stalled... but at that
>> point it's too late to cancel the reset (intel_gpu_reset has already
>> been called).
>
> Ok. At that point we are just deciding between skipping the request or
> replaying it. The motivation behind carrying forward the active_request
> was to avoid the repeated searches + engine_stalled() checks (since any
> future check can then just confirm the active_request is still
> incomplete).

Agreed, we'll still avoid the repeated searches + engine_stalled.
Let me send that.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH] drm/i915: Look for active requests earlier in the reset path
  2017-04-27 23:12 ` [PATCH v7 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request Michel Thierry
                     ` (2 preceding siblings ...)
  2017-05-17 20:41   ` [PATCH] " Michel Thierry
@ 2017-05-18 18:22   ` Michel Thierry
  2017-05-18 18:26     ` Michel Thierry
  2017-05-18 19:55     ` Chris Wilson
  2017-05-18 21:11   ` Michel Thierry
  4 siblings, 2 replies; 62+ messages in thread
From: Michel Thierry @ 2017-05-18 18:22 UTC (permalink / raw)
  To: intel-gfx

Also store the active request so that we only search for it once; this
applies to both reset-engine and full reset.

v2: Check for request completion inside _prepare_engine, don't use
ECANCELED, remove unnecessary null checks (Chris).

v3: Capture active requests during reset_prepare and store them in the
engine hangcheck object.

v4: Rename commit, change i915_gem_reset_request to just confirm the
active_request is still incomplete, instead of checking engine_stalled (Chris).

Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>

fixes

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c         | 11 ++++++-----
 drivers/gpu/drm/i915/i915_drv.h         |  3 ++-
 drivers/gpu/drm/i915/i915_gem.c         | 34 +++++++++++++++++++++------------
 drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
 4 files changed, 31 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index d62793805794..ec719376fc24 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1900,15 +1900,16 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 
 	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
 
-	ret = i915_gem_reset_prepare_engine(engine);
-	if (ret) {
-		DRM_ERROR("Previous reset failed - promote to full reset\n");
+	engine->hangcheck.active_request = i915_gem_reset_prepare_engine(engine);
+	if (IS_ERR(engine->hangcheck.active_request)) {
+		DRM_DEBUG_DRIVER("Previous reset failed, promote to full reset\n");
+		ret = PTR_ERR(engine->hangcheck.active_request);
 		goto out;
 	}
 
 	/*
-	 * the request that caused the hang is stuck on elsp, identify the
-	 * active request and drop it, adjust head to skip the offending
+	 * the request that caused the hang is stuck on elsp, we know the
+	 * active request and can drop it, adjust head to skip the offending
 	 * request to resume executing remaining requests in the queue.
 	 */
 	i915_gem_reset_engine(engine);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a5b9c666b3bf..6cbfeaa02246 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3370,7 +3370,8 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
 void i915_gem_reset(struct drm_i915_private *dev_priv);
 void i915_gem_reset_finish_engine(struct intel_engine_cs *engine);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index b5dc073a5ddc..c9f139b322d2 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2793,12 +2793,14 @@ static bool engine_stalled(struct intel_engine_cs *engine)
 	return true;
 }
 
-/* Ensure irq handler finishes, and not run again. */
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
+/*
+ * Ensure irq handler finishes, and not run again.
+ * Also store the active request so that we only search for it once.
+ */
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 {
-	struct drm_i915_gem_request *request;
-	int err = 0;
-
+	struct drm_i915_gem_request *request = NULL;
 
 	/* Prevent the signaler thread from updating the request
 	 * state (by calling dma_fence_signal) as we are processing
@@ -2827,21 +2829,30 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 
 	if (engine_stalled(engine)) {
 		request = i915_gem_find_active_request(engine);
+
 		if (request && request->fence.error == -EIO)
-			err = -EIO; /* Previous reset failed! */
+			return ERR_PTR(-EIO); /* Previous reset failed! */
 	}
 
-	return err;
+	return request;
 }
 
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
 {
 	struct intel_engine_cs *engine;
+	struct drm_i915_gem_request *request;
 	enum intel_engine_id id;
 	int err = 0;
 
-	for_each_engine(engine, dev_priv, id)
-		err = i915_gem_reset_prepare_engine(engine);
+	for_each_engine(engine, dev_priv, id) {
+		request = i915_gem_reset_prepare_engine(engine);
+		if (IS_ERR(request)) {
+			err = PTR_ERR(request);
+			break;
+		}
+
+		engine->hangcheck.active_request = request;
+	}
 
 	i915_gem_revoke_fences(dev_priv);
 
@@ -2894,7 +2905,7 @@ static void engine_skip_context(struct drm_i915_gem_request *request)
 static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 {
 	/* Read once and return the resolution */
-	const bool guilty = engine_stalled(request->engine);
+	const bool guilty = !i915_gem_request_completed(request);
 
 	/* The guilty request will get skipped on a hung engine.
 	 *
@@ -2930,9 +2941,8 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 
 void i915_gem_reset_engine(struct intel_engine_cs *engine)
 {
-	struct drm_i915_gem_request *request;
+	struct drm_i915_gem_request *request = engine->hangcheck.active_request;
 
-	request = i915_gem_find_active_request(engine);
 	if (request && i915_gem_reset_request(request)) {
 		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
 				 engine->name, request->global_seqno);
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index ec16fb6fde62..f850c4b12337 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -121,6 +121,7 @@ struct intel_engine_hangcheck {
 	unsigned long action_timestamp;
 	int deadlock;
 	struct intel_instdone instdone;
+	struct drm_i915_gem_request *active_request;
 	bool stalled;
 };
 
-- 
2.11.0


* Re: [PATCH] drm/i915: Look for active requests earlier in the reset path
  2017-05-18 18:22   ` [PATCH] drm/i915: Look for active requests earlier in the reset path Michel Thierry
@ 2017-05-18 18:26     ` Michel Thierry
  2017-05-18 19:55     ` Chris Wilson
  1 sibling, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-05-18 18:26 UTC (permalink / raw)
  To: intel-gfx

On 5/18/2017 11:22 AM, Michel Thierry wrote:
> fixes
>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>

rebase mistake

* ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev9)
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (23 preceding siblings ...)
  2017-05-17 21:09 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev8) Patchwork
@ 2017-05-18 18:40 ` Patchwork
  2017-05-18 21:29 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev10) Patchwork
  25 siblings, 0 replies; 62+ messages in thread
From: Patchwork @ 2017-05-18 18:40 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

== Series Details ==

Series: Gen8+ engine-reset (rev9)
URL   : https://patchwork.freedesktop.org/series/21868/
State : success

== Summary ==

Series 21868v9 Gen8+ engine-reset
https://patchwork.freedesktop.org/api/1.0/series/21868/revisions/9/mbox/

Test gem_exec_flush:
        Subgroup basic-batch-kernel-default-uc:
                fail       -> PASS       (fi-snb-2600) fdo#100007

fdo#100007 https://bugs.freedesktop.org/show_bug.cgi?id=100007

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time:443s
fi-bdw-gvtdvm    total:278  pass:256  dwarn:8   dfail:0   fail:0   skip:14  time:437s
fi-bsw-n3050     total:278  pass:242  dwarn:0   dfail:0   fail:0   skip:36  time:583s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time:516s
fi-byt-j1900     total:278  pass:254  dwarn:0   dfail:0   fail:0   skip:24  time:495s
fi-byt-n2820     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:490s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:425s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:409s
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50  time:418s
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:500s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:470s
fi-kbl-7500u     total:278  pass:255  dwarn:5   dfail:0   fail:0   skip:18  time:461s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:459s
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17  time:578s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time:460s
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:500s
fi-skl-gvtdvm    total:278  pass:265  dwarn:0   dfail:0   fail:0   skip:13  time:442s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:532s
fi-snb-2600      total:278  pass:249  dwarn:0   dfail:0   fail:0   skip:29  time:405s

ab08cb2750e769d074b2f147c8298ccd0cd08340 drm-tip: 2017y-05m-18d-15h-36m-17s UTC integration manifest
e1b2c9d drm/i915: Look for active requests earlier in the reset path
6a89f35 drm/i915: Skip reset request if there is one already
7b53b92 drm/i915: Add support for per engine reset recovery
16115e1 drm/i915: Modify error handler for per engine hang recovery
aef6c1a drm/i915: Update i915.reset to handle engine resets

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4747/

* Re: [PATCH] drm/i915: Look for active requests earlier in the reset path
  2017-05-18 18:22   ` [PATCH] drm/i915: Look for active requests earlier in the reset path Michel Thierry
  2017-05-18 18:26     ` Michel Thierry
@ 2017-05-18 19:55     ` Chris Wilson
  1 sibling, 0 replies; 62+ messages in thread
From: Chris Wilson @ 2017-05-18 19:55 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Thu, May 18, 2017 at 11:22:57AM -0700, Michel Thierry wrote:
> And store the active request so that we only search for it once; this
> applies for reset-engine and full reset.
> 
> v2: Check for request completion inside _prepare_engine, don't use
> ECANCELED, remove unnecessary null checks (Chris).
> 
> v3: Capture the active request during reset_prepare and store it in the
> engine hangcheck obj.
> 
> v4: Rename commit, change i915_gem_reset_request to just confirm the
> active_request is still incomplete, instead of engine_stalled (Chris).
> 
> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> 
> fixes
> 
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c         | 11 ++++++-----
>  drivers/gpu/drm/i915/i915_drv.h         |  3 ++-
>  drivers/gpu/drm/i915/i915_gem.c         | 34 +++++++++++++++++++++------------
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
>  4 files changed, 31 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index d62793805794..ec719376fc24 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1900,15 +1900,16 @@ int i915_reset_engine(struct intel_engine_cs *engine)
>  
>  	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
>  
> -	ret = i915_gem_reset_prepare_engine(engine);
> -	if (ret) {
> -		DRM_ERROR("Previous reset failed - promote to full reset\n");
> +	engine->hangcheck.active_request = i915_gem_reset_prepare_engine(engine);

Whilst this is not wrong (since we are serialising the per-engine and
global resets), I would suggest we avoid storing the request in the
hangcheck here and just pass the request along to
i915_gem_reset_engine.

No strong reason, just less magic state passing between functions.
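
A rough userspace sketch of the two styles being compared here, assuming simplified hypothetical types (struct request, struct engine) and helpers (prepare_engine, reset_engine, reset_one, reset_all) that are not the driver's own. The per-engine path can keep the request in a local variable and pass it as an argument. A global path that prepares every engine before resetting any of them still needs somewhere per-engine to park the pointer between the two loops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
struct request {
	int skipped;			/* set once the reset path handled it */
};

struct engine {
	struct request *inflight;	/* what a queue search would find */
	struct request *active_request;	/* stash used only by the global path */
};

/* Prepare: find the active request and return it to the caller. */
static struct request *prepare_engine(struct engine *e)
{
	assert(e != NULL);
	return e->inflight;
}

/* Reset: act on the request handed in, with no hidden state lookup. */
static void reset_engine(struct engine *e, struct request *request)
{
	(void)e;
	if (request)
		request->skipped = 1;
}

/* Per-engine reset: a local variable carries the request across. */
static void reset_one(struct engine *e)
{
	struct request *active = prepare_engine(e);

	reset_engine(e, active);
}

/* Global reset: prepare all engines first, then reset all of them. */
static void reset_all(struct engine *engines, size_t n)
{
	for (size_t i = 0; i < n; i++)
		engines[i].active_request = prepare_engine(&engines[i]);

	for (size_t i = 0; i < n; i++)
		reset_engine(&engines[i], engines[i].active_request);
}
```

In this sketch reset_one() never touches the active_request field, which is the "less magic state" variant; only reset_all() relies on the stored pointer.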

> +	if (IS_ERR(engine->hangcheck.active_request)) {
> +		DRM_DEBUG_DRIVER("Previous reset failed, promote to full reset\n");
> +		ret = PTR_ERR(engine->hangcheck.active_request);
>  		goto out;
>  	}
>  
> index b5dc073a5ddc..c9f139b322d2 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2793,12 +2793,14 @@ static bool engine_stalled(struct intel_engine_cs *engine)
>  	return true;
>  }
>  
> -/* Ensure irq handler finishes, and not run again. */
> -int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
> +/*
> + * Ensure irq handler finishes, and not run again.
> + * Also store the active request so that we only search for it once.
> + */
> +struct drm_i915_gem_request *
> +i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>  {
> -	struct drm_i915_gem_request *request;
> -	int err = 0;
> -
> +	struct drm_i915_gem_request *request = NULL;
>  
>  	/* Prevent the signaler thread from updating the request
>  	 * state (by calling dma_fence_signal) as we are processing
> @@ -2827,21 +2829,30 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>  
>  	if (engine_stalled(engine)) {
>  		request = i915_gem_find_active_request(engine);
> +

If we neuter the return beneath the if, this blank line can also go.

>  		if (request && request->fence.error == -EIO)
> -			err = -EIO; /* Previous reset failed! */
> +			return ERR_PTR(-EIO); /* Previous reset failed! */

request = ERR_PTR(-EIO); and then keep the single return.
-Chris
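
The single-return shape suggested above can be tried outside the kernel with a small sketch. ERR_PTR(), PTR_ERR() and IS_ERR() below are userspace re-implementations written for this example (the kernel's versions live in include/linux/err.h), and struct request, struct engine, prepare_engine() and reset_engine() are hypothetical simplifications rather than the i915 code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel helpers; errno values fit in the
 * top 4095 addresses, which are never valid object pointers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical, simplified stand-ins for the driver structures. */
struct request {
	int fence_error;		/* -EIO if a previous reset failed */
};

struct engine {
	bool stalled;
	struct request *active;		/* what a queue search would find */
};

/* One variable, one return: NULL, a valid request, or an encoded error. */
static struct request *prepare_engine(struct engine *e)
{
	struct request *request = NULL;

	assert(e != NULL);
	if (e->stalled) {
		request = e->active;
		if (request && request->fence_error == -EIO)
			request = ERR_PTR(-EIO); /* previous reset failed */
	}

	return request;
}

/* Caller side: decode the error, otherwise carry on (NULL is allowed). */
static int reset_engine(struct engine *e)
{
	struct request *active = prepare_engine(e);

	if (IS_ERR(active))
		return (int)PTR_ERR(active);

	return 0;
}
```

Note that a NULL return is not an error in this scheme: it only means no active request was found, so the caller has to test IS_ERR() rather than a plain NULL check.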

-- 
Chris Wilson, Intel Open Source Technology Centre

* [PATCH] drm/i915: Look for active requests earlier in the reset path
  2017-04-27 23:12 ` [PATCH v7 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request Michel Thierry
                     ` (3 preceding siblings ...)
  2017-05-18 18:22   ` [PATCH] drm/i915: Look for active requests earlier in the reset path Michel Thierry
@ 2017-05-18 21:11   ` Michel Thierry
  2017-05-18 21:16     ` Chris Wilson
  4 siblings, 1 reply; 62+ messages in thread
From: Michel Thierry @ 2017-05-18 21:11 UTC (permalink / raw)
  To: intel-gfx

And store the active request so that we only search for it once; this
applies for reset-engine and full reset.

v2: Check for request completion inside _prepare_engine, don't use
ECANCELED, remove unnecessary null checks (Chris).

v3: Capture the active request during reset_prepare and store it in the
engine hangcheck obj.

v4: Rename commit, change i915_gem_reset_request to just confirm the
active_request is still incomplete, instead of engine_stalled (Chris).

v5: With style; pass the active request to gem_reset_engine, keep single
return in reset_prepare_engine (Chris).

Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c         | 14 ++++++------
 drivers/gpu/drm/i915/i915_drv.h         |  6 ++++--
 drivers/gpu/drm/i915/i915_gem.c         | 38 ++++++++++++++++++++-------------
 drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
 4 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index d62793805794..2ba288e9311c 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1895,23 +1895,25 @@ int i915_reset_engine(struct intel_engine_cs *engine)
 	int ret;
 	struct drm_i915_private *dev_priv = engine->i915;
 	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct drm_i915_gem_request *active_request;
 
 	GEM_BUG_ON(!test_bit(I915_RESET_ENGINE_IN_PROGRESS, &error->flags));
 
 	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
 
-	ret = i915_gem_reset_prepare_engine(engine);
-	if (ret) {
-		DRM_ERROR("Previous reset failed - promote to full reset\n");
+	active_request = i915_gem_reset_prepare_engine(engine);
+	if (IS_ERR(active_request)) {
+		DRM_DEBUG_DRIVER("Previous reset failed, promote to full reset\n");
+		ret = PTR_ERR(active_request);
 		goto out;
 	}
 
 	/*
-	 * the request that caused the hang is stuck on elsp, identify the
-	 * active request and drop it, adjust head to skip the offending
+	 * the request that caused the hang is stuck on elsp, we know the
+	 * active request and can drop it, adjust head to skip the offending
 	 * request to resume executing remaining requests in the queue.
 	 */
-	i915_gem_reset_engine(engine);
+	i915_gem_reset_engine(engine, active_request);
 
 	/* forcing engine to idle */
 	ret = intel_reset_engine_start(engine);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a5b9c666b3bf..f8cbd286f904 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3370,14 +3370,16 @@ static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
 void i915_gem_reset(struct drm_i915_private *dev_priv);
 void i915_gem_reset_finish_engine(struct intel_engine_cs *engine);
 void i915_gem_reset_finish(struct drm_i915_private *dev_priv);
 void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
 bool i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
-void i915_gem_reset_engine(struct intel_engine_cs *engine);
+void i915_gem_reset_engine(struct intel_engine_cs *engine,
+			   struct drm_i915_gem_request *request);
 
 void i915_gem_init_mmio(struct drm_i915_private *i915);
 int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index b5dc073a5ddc..6e14bf039aed 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2793,12 +2793,14 @@ static bool engine_stalled(struct intel_engine_cs *engine)
 	return true;
 }
 
-/* Ensure irq handler finishes, and not run again. */
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
+/*
+ * Ensure irq handler finishes, and not run again.
+ * Also store the active request so that we only search for it once.
+ */
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 {
-	struct drm_i915_gem_request *request;
-	int err = 0;
-
+	struct drm_i915_gem_request *request = NULL;
 
 	/* Prevent the signaler thread from updating the request
 	 * state (by calling dma_fence_signal) as we are processing
@@ -2828,20 +2830,28 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 	if (engine_stalled(engine)) {
 		request = i915_gem_find_active_request(engine);
 		if (request && request->fence.error == -EIO)
-			err = -EIO; /* Previous reset failed! */
+			request = ERR_PTR(-EIO); /* Previous reset failed! */
 	}
 
-	return err;
+	return request;
 }
 
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
 {
 	struct intel_engine_cs *engine;
+	struct drm_i915_gem_request *request;
 	enum intel_engine_id id;
 	int err = 0;
 
-	for_each_engine(engine, dev_priv, id)
-		err = i915_gem_reset_prepare_engine(engine);
+	for_each_engine(engine, dev_priv, id) {
+		request = i915_gem_reset_prepare_engine(engine);
+		if (IS_ERR(request)) {
+			err = PTR_ERR(request);
+			break;
+		}
+
+		engine->hangcheck.active_request = request;
+	}
 
 	i915_gem_revoke_fences(dev_priv);
 
@@ -2894,7 +2904,7 @@ static void engine_skip_context(struct drm_i915_gem_request *request)
 static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 {
 	/* Read once and return the resolution */
-	const bool guilty = engine_stalled(request->engine);
+	const bool guilty = !i915_gem_request_completed(request);
 
 	/* The guilty request will get skipped on a hung engine.
 	 *
@@ -2928,11 +2938,9 @@ static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 	return guilty;
 }
 
-void i915_gem_reset_engine(struct intel_engine_cs *engine)
+void i915_gem_reset_engine(struct intel_engine_cs *engine,
+			   struct drm_i915_gem_request *request)
 {
-	struct drm_i915_gem_request *request;
-
-	request = i915_gem_find_active_request(engine);
 	if (request && i915_gem_reset_request(request)) {
 		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
 				 engine->name, request->global_seqno);
@@ -2958,7 +2966,7 @@ void i915_gem_reset(struct drm_i915_private *dev_priv)
 	for_each_engine(engine, dev_priv, id) {
 		struct i915_gem_context *ctx;
 
-		i915_gem_reset_engine(engine);
+		i915_gem_reset_engine(engine, engine->hangcheck.active_request);
 		ctx = fetch_and_zero(&engine->last_retired_context);
 		if (ctx)
 			engine->context_unpin(engine, ctx);
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index ec16fb6fde62..f850c4b12337 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -121,6 +121,7 @@ struct intel_engine_hangcheck {
 	unsigned long action_timestamp;
 	int deadlock;
 	struct intel_instdone instdone;
+	struct drm_i915_gem_request *active_request;
 	bool stalled;
 };
 
-- 
2.11.0


* Re: [PATCH] drm/i915: Look for active requests earlier in the reset path
  2017-05-18 21:11   ` Michel Thierry
@ 2017-05-18 21:16     ` Chris Wilson
  2017-05-18 21:34       ` Michel Thierry
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Wilson @ 2017-05-18 21:16 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Thu, May 18, 2017 at 02:11:15PM -0700, Michel Thierry wrote:
> And store the active request so that we only search for it once; this
> applies for reset-engine and full reset.
> 
> v2: Check for request completion inside _prepare_engine, don't use
> ECANCELED, remove unnecessary null checks (Chris).
> 
> v3: Capture the active request during reset_prepare and store it in the
> engine hangcheck obj.
> 
> v4: Rename commit, change i915_gem_reset_request to just confirm the
> active_request is still incomplete, instead of engine_stalled (Chris).
> 
> v5: With style; pass the active request to gem_reset_engine, keep single
> return in reset_prepare_engine (Chris).
> 
> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

I would order this earlier in the series, i.e. make the change to store
the active_request and pass the request onwards in the global reset
handler first.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev10)
  2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
                   ` (24 preceding siblings ...)
  2017-05-18 18:40 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev9) Patchwork
@ 2017-05-18 21:29 ` Patchwork
  25 siblings, 0 replies; 62+ messages in thread
From: Patchwork @ 2017-05-18 21:29 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

== Series Details ==

Series: Gen8+ engine-reset (rev10)
URL   : https://patchwork.freedesktop.org/series/21868/
State : success

== Summary ==

Series 21868v10 Gen8+ engine-reset
https://patchwork.freedesktop.org/api/1.0/series/21868/revisions/10/mbox/

Test gem_exec_flush:
        Subgroup basic-batch-kernel-default-uc:
                fail       -> PASS       (fi-snb-2600) fdo#100007

fdo#100007 https://bugs.freedesktop.org/show_bug.cgi?id=100007

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11  time:450s
fi-bdw-gvtdvm    total:278  pass:256  dwarn:8   dfail:0   fail:0   skip:14  time:443s
fi-bsw-n3050     total:278  pass:242  dwarn:0   dfail:0   fail:0   skip:36  time:601s
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19  time:514s
fi-byt-j1900     total:278  pass:254  dwarn:0   dfail:0   fail:0   skip:24  time:496s
fi-byt-n2820     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:488s
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:420s
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16  time:415s
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50  time:416s
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:500s
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18  time:465s
fi-kbl-7500u     total:278  pass:255  dwarn:5   dfail:0   fail:0   skip:18  time:459s
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:473s
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17  time:580s
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18  time:461s
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10  time:500s
fi-skl-gvtdvm    total:278  pass:265  dwarn:0   dfail:0   fail:0   skip:13  time:439s
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28  time:539s
fi-snb-2600      total:278  pass:249  dwarn:0   dfail:0   fail:0   skip:29  time:408s

ab08cb2750e769d074b2f147c8298ccd0cd08340 drm-tip: 2017y-05m-18d-15h-36m-17s UTC integration manifest
8a637b4 drm/i915: Look for active requests earlier in the reset path
3fddaa3 drm/i915: Skip reset request if there is one already
b2ce28a drm/i915: Add support for per engine reset recovery
093b0d1 drm/i915: Modify error handler for per engine hang recovery
a3a49b9 drm/i915: Update i915.reset to handle engine resets

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_4750/

* Re: [PATCH] drm/i915: Look for active requests earlier in the reset path
  2017-05-18 21:16     ` Chris Wilson
@ 2017-05-18 21:34       ` Michel Thierry
  0 siblings, 0 replies; 62+ messages in thread
From: Michel Thierry @ 2017-05-18 21:34 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

On 5/18/2017 2:16 PM, Chris Wilson wrote:
> On Thu, May 18, 2017 at 02:11:15PM -0700, Michel Thierry wrote:
>> And store the active request so that we only search for it once; this
>> applies for reset-engine and full reset.
>>
>> v2: Check for request completion inside _prepare_engine, don't use
>> ECANCELED, remove unnecessary null checks (Chris).
>>
>> v3: Capture the active request during reset_prepare and store it in the
>> engine hangcheck obj.
>>
>> v4: Rename commit, change i915_gem_reset_request to just confirm the
>> active_request is still incomplete, instead of engine_stalled (Chris).
>>
>> v5: With style; pass the active request to gem_reset_engine, keep single
>> return in reset_prepare_engine (Chris).
>>
>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>
> I would order this earlier in the series, i.e. make the change to store
> the active_request and pass the request onwards in the global reset
> handler first.
> -Chris
>

ok, I'll move it to the beginning of the reset-engine series (which I 
plan to resend soon).

Thanks for the review.

Thread overview: 62+ messages
2017-04-27 23:12 [PATCH v7 00/20] Gen8+ engine-reset Michel Thierry
2017-04-27 23:12 ` [PATCH v7 01/20] drm/i915: Update i915.reset to handle engine resets Michel Thierry
2017-04-27 23:12 ` [PATCH v7 02/20] drm/i915: Modify error handler for per engine hang recovery Michel Thierry
2017-04-29 14:19   ` Chris Wilson
2017-05-08 18:31     ` Michel Thierry
2017-05-12 20:55       ` Michel Thierry
2017-05-12 21:09         ` Chris Wilson
2017-05-12 21:23           ` Michel Thierry
2017-05-15 21:14   ` [PATCH " Michel Thierry
2017-04-27 23:12 ` [PATCH v7 03/20] drm/i915: Add support for per engine reset recovery Michel Thierry
2017-04-27 23:50   ` Chris Wilson
2017-04-28 21:59     ` Michel Thierry
2017-05-04  0:26     ` Michel Thierry
2017-05-15 21:18   ` [PATCH " Michel Thierry
2017-04-27 23:12 ` [PATCH v7 04/20] drm/i915: Skip reset request if there is one already Michel Thierry
2017-04-29 14:21   ` Chris Wilson
2017-05-01 21:15     ` Michel Thierry
2017-04-27 23:12 ` [PATCH v7 05/20] drm/i915: Cancel reset-engine if we couldn't find an active request Michel Thierry
2017-04-29 14:26   ` Chris Wilson
2017-05-15 21:20   ` [PATCH " Michel Thierry
2017-05-15 21:31     ` Chris Wilson
2017-05-15 21:47       ` Chris Wilson
2017-05-15 22:25         ` Michel Thierry
2017-05-16  7:54           ` Chris Wilson
2017-05-17  0:13             ` Michel Thierry
2017-05-17  7:19               ` Chris Wilson
2017-05-17 20:41   ` [PATCH] " Michel Thierry
2017-05-17 20:52     ` Chris Wilson
2017-05-18  1:11       ` Michel Thierry
2017-05-18  7:56         ` Chris Wilson
2017-05-18 17:19           ` Michel Thierry
2017-05-18 18:22   ` [PATCH] drm/i915: Look for active requests earlier in the reset path Michel Thierry
2017-05-18 18:26     ` Michel Thierry
2017-05-18 19:55     ` Chris Wilson
2017-05-18 21:11   ` Michel Thierry
2017-05-18 21:16     ` Chris Wilson
2017-05-18 21:34       ` Michel Thierry
2017-04-27 23:12 ` [PATCH v7 06/20] drm/i915: Add engine reset count to error state Michel Thierry
2017-04-27 23:12 ` [PATCH v7 07/20] drm/i915: Export per-engine reset count info to debugfs Michel Thierry
2017-04-27 23:12 ` [PATCH v7 08/20] drm/i915: Enable Engine reset and recovery support Michel Thierry
2017-04-27 23:12 ` [PATCH v7 09/20] drm/i915: Add engine reset count in get-reset-stats ioctl Michel Thierry
2017-04-27 23:12 ` [PATCH v7 10/20] drm/i915/selftests: reset engine self tests Michel Thierry
2017-04-27 23:12 ` [PATCH v7 11/20] drm/i915/guc: fix mmio whitelist mmio_start offset and add reminder Michel Thierry
2017-04-27 23:12 ` [PATCH v7 12/20] drm/i915/guc: Provide register list to be saved/restored during engine reset Michel Thierry
2017-04-27 23:58   ` Chris Wilson
2017-04-28 15:36     ` Michel Thierry
2017-04-27 23:12 ` [PATCH v7 13/20] drm/i915/guc: Rename the function that resets the GuC Michel Thierry
2017-04-28  7:40   ` Tvrtko Ursulin
2017-05-01 20:09     ` Michel Thierry
2017-04-27 23:12 ` [PATCH v7 14/20] drm/i915/guc: Add support for reset engine using GuC commands Michel Thierry
2017-04-27 23:12 ` [PATCH v7 15/20] drm/i915: Watchdog timeout: Pass GuC shared data structure during param load Michel Thierry
2017-04-27 23:12 ` [PATCH v7 16/20] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
2017-04-27 23:12 ` [PATCH v7 17/20] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
2017-04-27 23:12 ` [PATCH v7 18/20] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
2017-04-27 23:12 ` [PATCH v7 19/20] drm/i915: Watchdog timeout: Include threshold value in error state Michel Thierry
2017-04-27 23:13 ` [PATCH v7 20/20] drm/i915: Watchdog timeout: Export media reset count from GuC to debugfs Michel Thierry
2017-04-27 23:30 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev4) Patchwork
2017-05-15 21:32 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev5) Patchwork
2017-05-15 21:48 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev7) Patchwork
2017-05-17 21:09 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev8) Patchwork
2017-05-18 18:40 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev9) Patchwork
2017-05-18 21:29 ` ✓ Fi.CI.BAT: success for Gen8+ engine-reset (rev10) Patchwork
