All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] drm/i915: Specify which engines to reset following semaphore/event lockups
@ 2018-03-20 10:04 Chris Wilson
  2018-03-20 10:04 ` [PATCH 2/2] drm/i915: Add control flags to i915_handle_error() Chris Wilson
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Chris Wilson @ 2018-03-20 10:04 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

If the GPU is stuck waiting for an event or for a semaphore, we need to
reset the GPU in order to recover. We have to tell the reset routine
which engines we want reset, but we were still using the old interface
and declaring it as "not-fatal".

Fixes: 14b730fcb8d9 ("drm/i915/tdr: Prepare error handler to accept mask of hung engines")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/intel_hangcheck.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index 42e45ae87393..c8ea510629fa 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -246,7 +246,7 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
 	 */
 	tmp = I915_READ_CTL(engine);
 	if (tmp & RING_WAIT) {
-		i915_handle_error(dev_priv, 0,
+		i915_handle_error(dev_priv, BIT(engine->id),
 				  "Kicking stuck wait on %s",
 				  engine->name);
 		I915_WRITE_CTL(engine, tmp);
@@ -258,7 +258,7 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
 		default:
 			return ENGINE_DEAD;
 		case 1:
-			i915_handle_error(dev_priv, 0,
+			i915_handle_error(dev_priv, ALL_ENGINES,
 					  "Kicking stuck semaphore on %s",
 					  engine->name);
 			I915_WRITE_CTL(engine, tmp);
-- 
2.16.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/2] drm/i915: Add control flags to i915_handle_error()
  2018-03-20 10:04 [PATCH 1/2] drm/i915: Specify which engines to reset following semaphore/event lockups Chris Wilson
@ 2018-03-20 10:04 ` Chris Wilson
  2018-03-20 14:11   ` Michel Thierry
  2018-03-20 10:12 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/2] drm/i915: Specify which engines to reset following semaphore/event lockups Patchwork
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Chris Wilson @ 2018-03-20 10:04 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

Not all callers want the GPU error to handled in the same way, so expose
a control parameter. In the first instance, some callers do not want the
heavyweight error capture so add a bit to request the state to be
captured and saved.

v2: Pass msg down to i915_reset/i915_reset_engine so that we include the
reason for the reset in the dev_notice(), superseding the earlier option
to not print that notice.
v3: Stash the reason inside the i915->gpu_error to handover to the direct
reset from the blocking waiter.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jeff McGee <jeff.mcgee@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c              |  4 +-
 drivers/gpu/drm/i915/i915_drv.c                  | 17 ++++----
 drivers/gpu/drm/i915/i915_drv.h                  | 10 ++---
 drivers/gpu/drm/i915/i915_gpu_error.h            |  3 ++
 drivers/gpu/drm/i915/i915_irq.c                  | 55 ++++++++++++++----------
 drivers/gpu/drm/i915/i915_request.c              |  2 +-
 drivers/gpu/drm/i915/intel_hangcheck.c           | 13 +++---
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 13 +++---
 8 files changed, 62 insertions(+), 55 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 964ea1a12357..7816cd53100a 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -4011,8 +4011,8 @@ i915_wedged_set(void *data, u64 val)
 		engine->hangcheck.stalled = true;
 	}
 
-	i915_handle_error(i915, val, "Manually set wedged engine mask = %llx",
-			  val);
+	i915_handle_error(i915, val, I915_ERROR_CAPTURE,
+			  "Manually set wedged engine mask = %llx", val);
 
 	wait_on_bit(&i915->gpu_error.flags,
 		    I915_RESET_HANDOFF,
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 1021bf40e236..6b04cc3be513 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1869,7 +1869,6 @@ static int i915_resume_switcheroo(struct drm_device *dev)
 /**
  * i915_reset - reset chip after a hang
  * @i915: #drm_i915_private to reset
- * @flags: Instructions
  *
  * Reset the chip.  Useful if a hang is detected. Marks the device as wedged
  * on failure.
@@ -1884,7 +1883,7 @@ static int i915_resume_switcheroo(struct drm_device *dev)
  *   - re-init interrupt state
  *   - re-init display
  */
-void i915_reset(struct drm_i915_private *i915, unsigned int flags)
+void i915_reset(struct drm_i915_private *i915)
 {
 	struct i915_gpu_error *error = &i915->gpu_error;
 	int ret;
@@ -1901,8 +1900,9 @@ void i915_reset(struct drm_i915_private *i915, unsigned int flags)
 	if (!i915_gem_unset_wedged(i915))
 		goto wakeup;
 
-	if (!(flags & I915_RESET_QUIET))
-		dev_notice(i915->drm.dev, "Resetting chip after gpu hang\n");
+	if (error->reason)
+		dev_notice(i915->drm.dev,
+			   "Resetting chip for %s\n", error->reason);
 	error->reset_count++;
 
 	disable_irq(i915->drm.irq);
@@ -2003,7 +2003,7 @@ static inline int intel_gt_reset_engine(struct drm_i915_private *dev_priv,
 /**
  * i915_reset_engine - reset GPU engine to recover from a hang
  * @engine: engine to reset
- * @flags: options
+ * @msg: reason for GPU reset; or NULL for no dev_notice()
  *
  * Reset a specific GPU engine. Useful if a hang is detected.
  * Returns zero on successful reset or otherwise an error code.
@@ -2013,7 +2013,7 @@ static inline int intel_gt_reset_engine(struct drm_i915_private *dev_priv,
  *  - reset engine (which will force the engine to idle)
  *  - re-init/configure engine
  */
-int i915_reset_engine(struct intel_engine_cs *engine, unsigned int flags)
+int i915_reset_engine(struct intel_engine_cs *engine, const char *msg)
 {
 	struct i915_gpu_error *error = &engine->i915->gpu_error;
 	struct i915_request *active_request;
@@ -2028,10 +2028,9 @@ int i915_reset_engine(struct intel_engine_cs *engine, unsigned int flags)
 		goto out;
 	}
 
-	if (!(flags & I915_RESET_QUIET)) {
+	if (msg)
 		dev_notice(engine->i915->drm.dev,
-			   "Resetting %s after gpu hang\n", engine->name);
-	}
+			   "Resetting %s for %s\n", engine->name, msg);
 	error->reset_engine_count[engine->id]++;
 
 	if (!engine->i915->guc.execbuf_client)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e27ba8fb64e6..c9c3b2ba6a86 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2700,10 +2700,8 @@ extern void i915_driver_unload(struct drm_device *dev);
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 
-#define I915_RESET_QUIET BIT(0)
-extern void i915_reset(struct drm_i915_private *i915, unsigned int flags);
-extern int i915_reset_engine(struct intel_engine_cs *engine,
-			     unsigned int flags);
+extern void i915_reset(struct drm_i915_private *i915);
+extern int i915_reset_engine(struct intel_engine_cs *engine, const char *msg);
 
 extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_reset_guc(struct drm_i915_private *dev_priv);
@@ -2751,10 +2749,12 @@ static inline void i915_queue_hangcheck(struct drm_i915_private *dev_priv)
 			   &dev_priv->gpu_error.hangcheck_work, delay);
 }
 
-__printf(3, 4)
+__printf(4, 5)
 void i915_handle_error(struct drm_i915_private *dev_priv,
 		       u32 engine_mask,
+		       unsigned long flags,
 		       const char *fmt, ...);
+#define I915_ERROR_CAPTURE BIT(0)
 
 extern void intel_irq_init(struct drm_i915_private *dev_priv);
 extern void intel_irq_fini(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index ebbdf37e2879..ac5760673cc9 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -269,6 +269,9 @@ struct i915_gpu_error {
 	/** Number of times an engine has been reset */
 	u32 reset_engine_count[I915_NUM_ENGINES];
 
+	/** Reason for the current *global* reset */
+	const char *reason;
+
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
 	 * to release the struct_mutex for the reset to procede.
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 44eef355e12c..fa7310766217 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2877,15 +2877,10 @@ static irqreturn_t gen11_irq_handler(int irq, void *arg)
 	return IRQ_HANDLED;
 }
 
-/**
- * i915_reset_device - do process context error handling work
- * @dev_priv: i915 device private
- *
- * Fire an error uevent so userspace can see that a hang or error
- * was detected.
- */
-static void i915_reset_device(struct drm_i915_private *dev_priv)
+static void i915_reset_device(struct drm_i915_private *dev_priv,
+			      const char *msg)
 {
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
 	struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj;
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
 	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
@@ -2901,29 +2896,32 @@ static void i915_reset_device(struct drm_i915_private *dev_priv)
 	i915_wedge_on_timeout(&w, dev_priv, 5*HZ) {
 		intel_prepare_reset(dev_priv);
 
+		error->reason = msg;
+
 		/* Signal that locked waiters should reset the GPU */
-		set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
-		wake_up_all(&dev_priv->gpu_error.wait_queue);
+		set_bit(I915_RESET_HANDOFF, &error->flags);
+		wake_up_all(&error->wait_queue);
 
 		/* Wait for anyone holding the lock to wakeup, without
 		 * blocking indefinitely on struct_mutex.
 		 */
 		do {
 			if (mutex_trylock(&dev_priv->drm.struct_mutex)) {
-				i915_reset(dev_priv, 0);
+				i915_reset(dev_priv);
 				mutex_unlock(&dev_priv->drm.struct_mutex);
 			}
-		} while (wait_on_bit_timeout(&dev_priv->gpu_error.flags,
+		} while (wait_on_bit_timeout(&error->flags,
 					     I915_RESET_HANDOFF,
 					     TASK_UNINTERRUPTIBLE,
 					     1));
 
+		error->reason = NULL;
+
 		intel_finish_reset(dev_priv);
 	}
 
-	if (!test_bit(I915_WEDGED, &dev_priv->gpu_error.flags))
-		kobject_uevent_env(kobj,
-				   KOBJ_CHANGE, reset_done_event);
+	if (!test_bit(I915_WEDGED, &error->flags))
+		kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
 }
 
 static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
@@ -2955,6 +2953,7 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
  * i915_handle_error - handle a gpu error
  * @dev_priv: i915 device private
  * @engine_mask: mask representing engines that are hung
+ * @flags: control flags
  * @fmt: Error message format string
  *
  * Do some basic checking of register state at error time and
@@ -2965,16 +2964,23 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
  */
 void i915_handle_error(struct drm_i915_private *dev_priv,
 		       u32 engine_mask,
+		       unsigned long flags,
 		       const char *fmt, ...)
 {
 	struct intel_engine_cs *engine;
 	unsigned int tmp;
-	va_list args;
 	char error_msg[80];
+	char *msg = NULL;
 
-	va_start(args, fmt);
-	vscnprintf(error_msg, sizeof(error_msg), fmt, args);
-	va_end(args);
+	if (fmt) {
+		va_list args;
+
+		va_start(args, fmt);
+		vscnprintf(error_msg, sizeof(error_msg), fmt, args);
+		va_end(args);
+
+		msg = error_msg;
+	}
 
 	/*
 	 * In most cases it's guaranteed that we get here with an RPM
@@ -2986,8 +2992,11 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	intel_runtime_pm_get(dev_priv);
 
 	engine_mask &= INTEL_INFO(dev_priv)->ring_mask;
-	i915_capture_error_state(dev_priv, engine_mask, error_msg);
-	i915_clear_error_registers(dev_priv);
+
+	if (flags & I915_ERROR_CAPTURE) {
+		i915_capture_error_state(dev_priv, engine_mask, msg);
+		i915_clear_error_registers(dev_priv);
+	}
 
 	/*
 	 * Try engine reset when available. We fall back to full reset if
@@ -3000,7 +3009,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 					     &dev_priv->gpu_error.flags))
 				continue;
 
-			if (i915_reset_engine(engine, 0) == 0)
+			if (i915_reset_engine(engine, msg) == 0)
 				engine_mask &= ~intel_engine_flag(engine);
 
 			clear_bit(I915_RESET_ENGINE + engine->id,
@@ -3030,7 +3039,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 				    TASK_UNINTERRUPTIBLE);
 	}
 
-	i915_reset_device(dev_priv);
+	i915_reset_device(dev_priv, msg);
 
 	for_each_engine(engine, dev_priv, tmp) {
 		clear_bit(I915_RESET_ENGINE + engine->id,
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 43c7134a9b93..2325886d1d55 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1229,7 +1229,7 @@ static bool __i915_wait_request_check_and_reset(struct i915_request *request)
 		return false;
 
 	__set_current_state(TASK_RUNNING);
-	i915_reset(request->i915, 0);
+	i915_reset(request->i915);
 	return true;
 }
 
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index c8ea510629fa..fd0ffb8328d0 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -246,9 +246,8 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
 	 */
 	tmp = I915_READ_CTL(engine);
 	if (tmp & RING_WAIT) {
-		i915_handle_error(dev_priv, BIT(engine->id),
-				  "Kicking stuck wait on %s",
-				  engine->name);
+		i915_handle_error(dev_priv, BIT(engine->id), 0,
+				  "stuck wait on %s", engine->name);
 		I915_WRITE_CTL(engine, tmp);
 		return ENGINE_WAIT_KICK;
 	}
@@ -258,8 +257,8 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
 		default:
 			return ENGINE_DEAD;
 		case 1:
-			i915_handle_error(dev_priv, ALL_ENGINES,
-					  "Kicking stuck semaphore on %s",
+			i915_handle_error(dev_priv, ALL_ENGINES, 0,
+					  "stuck semaphore on %s",
 					  engine->name);
 			I915_WRITE_CTL(engine, tmp);
 			return ENGINE_WAIT_KICK;
@@ -386,13 +385,13 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
 	if (stuck != hung)
 		hung &= ~stuck;
 	len = scnprintf(msg, sizeof(msg),
-			"%s on ", stuck == hung ? "No progress" : "Hang");
+			"%s on ", stuck == hung ? "no progress" : "hang");
 	for_each_engine_masked(engine, i915, hung, tmp)
 		len += scnprintf(msg + len, sizeof(msg) - len,
 				 "%s, ", engine->name);
 	msg[len-2] = '\0';
 
-	return i915_handle_error(i915, hung, "%s", msg);
+	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
 }
 
 /*
diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index df7898c8edcb..4372826998aa 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -433,7 +433,7 @@ static int igt_global_reset(void *arg)
 	mutex_lock(&i915->drm.struct_mutex);
 	reset_count = i915_reset_count(&i915->gpu_error);
 
-	i915_reset(i915, I915_RESET_QUIET);
+	i915_reset(i915);
 
 	if (i915_reset_count(&i915->gpu_error) == reset_count) {
 		pr_err("No GPU reset recorded!\n");
@@ -518,7 +518,7 @@ static int __igt_reset_engine(struct drm_i915_private *i915, bool active)
 			engine->hangcheck.seqno =
 				intel_engine_get_seqno(engine);
 
-			err = i915_reset_engine(engine, I915_RESET_QUIET);
+			err = i915_reset_engine(engine, NULL);
 			if (err) {
 				pr_err("i915_reset_engine failed\n");
 				break;
@@ -725,7 +725,7 @@ static int __igt_reset_engine_others(struct drm_i915_private *i915,
 			engine->hangcheck.seqno =
 				intel_engine_get_seqno(engine);
 
-			err = i915_reset_engine(engine, I915_RESET_QUIET);
+			err = i915_reset_engine(engine, NULL);
 			if (err) {
 				pr_err("i915_reset_engine(%s:%s) failed, err=%d\n",
 				       engine->name, active ? "active" : "idle", err);
@@ -865,7 +865,6 @@ static int igt_wait_reset(void *arg)
 		       __func__, rq->fence.seqno, hws_seqno(&h, rq));
 		intel_engine_dump(rq->engine, &p, "%s\n", rq->engine->name);
 
-		i915_reset(i915, 0);
 		i915_gem_set_wedged(i915);
 
 		err = -EIO;
@@ -962,7 +961,6 @@ static int igt_reset_queue(void *arg)
 				i915_request_put(rq);
 				i915_request_put(prev);
 
-				i915_reset(i915, 0);
 				i915_gem_set_wedged(i915);
 
 				err = -EIO;
@@ -971,7 +969,7 @@ static int igt_reset_queue(void *arg)
 
 			reset_count = fake_hangcheck(prev);
 
-			i915_reset(i915, I915_RESET_QUIET);
+			i915_reset(i915);
 
 			GEM_BUG_ON(test_bit(I915_RESET_HANDOFF,
 					    &i915->gpu_error.flags));
@@ -1069,7 +1067,6 @@ static int igt_handle_error(void *arg)
 		       __func__, rq->fence.seqno, hws_seqno(&h, rq));
 		intel_engine_dump(rq->engine, &p, "%s\n", rq->engine->name);
 
-		i915_reset(i915, 0);
 		i915_gem_set_wedged(i915);
 
 		err = -EIO;
@@ -1084,7 +1081,7 @@ static int igt_handle_error(void *arg)
 	engine->hangcheck.stalled = true;
 	engine->hangcheck.seqno = intel_engine_get_seqno(engine);
 
-	i915_handle_error(i915, intel_engine_flag(engine), "%s", __func__);
+	i915_handle_error(i915, intel_engine_flag(engine), 0, NULL);
 
 	xchg(&i915->gpu_error.first_error, error);
 
-- 
2.16.2

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/2] drm/i915: Specify which engines to reset following semaphore/event lockups
  2018-03-20 10:04 [PATCH 1/2] drm/i915: Specify which engines to reset following semaphore/event lockups Chris Wilson
  2018-03-20 10:04 ` [PATCH 2/2] drm/i915: Add control flags to i915_handle_error() Chris Wilson
@ 2018-03-20 10:12 ` Patchwork
  2018-03-20 10:27 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Patchwork @ 2018-03-20 10:12 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/2] drm/i915: Specify which engines to reset following semaphore/event lockups
URL   : https://patchwork.freedesktop.org/series/40265/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
02b19c71acb4 drm/i915: Specify which engines to reset following semaphore/event lockups
caf8f3ffe736 drm/i915: Add control flags to i915_handle_error()
-:113: CHECK:AVOID_EXTERNS: extern prototypes should be avoided in .h files
#113: FILE: drivers/gpu/drm/i915/i915_drv.h:2703:
+extern void i915_reset(struct drm_i915_private *i915);

-:114: CHECK:AVOID_EXTERNS: extern prototypes should be avoided in .h files
#114: FILE: drivers/gpu/drm/i915/i915_drv.h:2704:
+extern int i915_reset_engine(struct intel_engine_cs *engine, const char *msg);

total: 0 errors, 0 warnings, 2 checks, 323 lines checked

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 8+ messages in thread

* ✓ Fi.CI.BAT: success for series starting with [1/2] drm/i915: Specify which engines to reset following semaphore/event lockups
  2018-03-20 10:04 [PATCH 1/2] drm/i915: Specify which engines to reset following semaphore/event lockups Chris Wilson
  2018-03-20 10:04 ` [PATCH 2/2] drm/i915: Add control flags to i915_handle_error() Chris Wilson
  2018-03-20 10:12 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/2] drm/i915: Specify which engines to reset following semaphore/event lockups Patchwork
@ 2018-03-20 10:27 ` Patchwork
  2018-03-20 12:23 ` ✗ Fi.CI.IGT: failure " Patchwork
  2018-03-20 14:08 ` [PATCH 1/2] " Michel Thierry
  4 siblings, 0 replies; 8+ messages in thread
From: Patchwork @ 2018-03-20 10:27 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/2] drm/i915: Specify which engines to reset following semaphore/event lockups
URL   : https://patchwork.freedesktop.org/series/40265/
State : success

== Summary ==

Series 40265v1 series starting with [1/2] drm/i915: Specify which engines to reset following semaphore/event lockups
https://patchwork.freedesktop.org/api/1.0/series/40265/revisions/1/mbox/

---- Known issues:

Test gem_mmap_gtt:
        Subgroup basic-small-bo-tiledx:
                fail       -> PASS       (fi-gdg-551) fdo#102575
Test kms_chamelium:
        Subgroup dp-edid-read:
                pass       -> FAIL       (fi-kbl-7500u) fdo#102505
Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-b:
                incomplete -> PASS       (fi-snb-2520m) fdo#103713

fdo#102575 https://bugs.freedesktop.org/show_bug.cgi?id=102575
fdo#102505 https://bugs.freedesktop.org/show_bug.cgi?id=102505
fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713

fi-bdw-5557u     total:285  pass:264  dwarn:0   dfail:0   fail:0   skip:21  time:430s
fi-bdw-gvtdvm    total:285  pass:261  dwarn:0   dfail:0   fail:0   skip:24  time:442s
fi-blb-e6850     total:285  pass:220  dwarn:1   dfail:0   fail:0   skip:64  time:382s
fi-bsw-n3050     total:285  pass:239  dwarn:0   dfail:0   fail:0   skip:46  time:535s
fi-bwr-2160      total:285  pass:180  dwarn:0   dfail:0   fail:0   skip:105 time:301s
fi-bxt-dsi       total:285  pass:255  dwarn:0   dfail:0   fail:0   skip:30  time:513s
fi-bxt-j4205     total:285  pass:256  dwarn:0   dfail:0   fail:0   skip:29  time:515s
fi-byt-j1900     total:285  pass:250  dwarn:0   dfail:0   fail:0   skip:35  time:518s
fi-byt-n2820     total:285  pass:246  dwarn:0   dfail:0   fail:0   skip:39  time:507s
fi-cfl-8700k     total:285  pass:257  dwarn:0   dfail:0   fail:0   skip:28  time:414s
fi-cfl-s2        total:285  pass:259  dwarn:0   dfail:0   fail:0   skip:26  time:583s
fi-cfl-u         total:285  pass:259  dwarn:0   dfail:0   fail:0   skip:26  time:512s
fi-cnl-drrs      total:285  pass:254  dwarn:3   dfail:0   fail:0   skip:28  time:522s
fi-elk-e7500     total:285  pass:225  dwarn:1   dfail:0   fail:0   skip:59  time:427s
fi-gdg-551       total:285  pass:177  dwarn:0   dfail:0   fail:0   skip:108 time:317s
fi-glk-1         total:285  pass:257  dwarn:0   dfail:0   fail:0   skip:28  time:541s
fi-hsw-4770      total:285  pass:258  dwarn:0   dfail:0   fail:0   skip:27  time:405s
fi-ilk-650       total:285  pass:225  dwarn:0   dfail:0   fail:0   skip:60  time:419s
fi-ivb-3520m     total:285  pass:256  dwarn:0   dfail:0   fail:0   skip:29  time:472s
fi-ivb-3770      total:285  pass:252  dwarn:0   dfail:0   fail:0   skip:33  time:431s
fi-kbl-7500u     total:285  pass:259  dwarn:1   dfail:0   fail:1   skip:24  time:470s
fi-kbl-7567u     total:285  pass:265  dwarn:0   dfail:0   fail:0   skip:20  time:465s
fi-kbl-r         total:285  pass:258  dwarn:0   dfail:0   fail:0   skip:27  time:512s
fi-pnv-d510      total:285  pass:219  dwarn:1   dfail:0   fail:0   skip:65  time:659s
fi-skl-6260u     total:285  pass:265  dwarn:0   dfail:0   fail:0   skip:20  time:439s
fi-skl-6600u     total:285  pass:258  dwarn:0   dfail:0   fail:0   skip:27  time:534s
fi-skl-6700hq    total:285  pass:259  dwarn:0   dfail:0   fail:0   skip:26  time:544s
fi-skl-6700k2    total:285  pass:261  dwarn:0   dfail:0   fail:0   skip:24  time:503s
fi-skl-6770hq    total:285  pass:265  dwarn:0   dfail:0   fail:0   skip:20  time:499s
fi-skl-guc       total:285  pass:257  dwarn:0   dfail:0   fail:0   skip:28  time:435s
fi-skl-gvtdvm    total:285  pass:262  dwarn:0   dfail:0   fail:0   skip:23  time:446s
fi-snb-2520m     total:285  pass:245  dwarn:0   dfail:0   fail:0   skip:40  time:588s
fi-snb-2600      total:285  pass:245  dwarn:0   dfail:0   fail:0   skip:40  time:399s

141def2a45f4a3ad7c7e9144cd26e97bb1298397 drm-tip: 2018y-03m-19d-23h-48m-43s UTC integration manifest
caf8f3ffe736 drm/i915: Add control flags to i915_handle_error()
02b19c71acb4 drm/i915: Specify which engines to reset following semaphore/event lockups

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8412/issues.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 8+ messages in thread

* ✗ Fi.CI.IGT: failure for series starting with [1/2] drm/i915: Specify which engines to reset following semaphore/event lockups
  2018-03-20 10:04 [PATCH 1/2] drm/i915: Specify which engines to reset following semaphore/event lockups Chris Wilson
                   ` (2 preceding siblings ...)
  2018-03-20 10:27 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-03-20 12:23 ` Patchwork
  2018-03-20 14:08 ` [PATCH 1/2] " Michel Thierry
  4 siblings, 0 replies; 8+ messages in thread
From: Patchwork @ 2018-03-20 12:23 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/2] drm/i915: Specify which engines to reset following semaphore/event lockups
URL   : https://patchwork.freedesktop.org/series/40265/
State : failure

== Summary ==

---- Possible new issues:

Test kms_flip_tiling:
        Subgroup flip-changes-tiling:
                pass       -> FAIL       (shard-apl)

---- Known issues:

Test kms_cursor_legacy:
        Subgroup 2x-long-flip-vs-cursor-legacy:
                incomplete -> PASS       (shard-hsw) fdo#104873
Test kms_flip:
        Subgroup dpms-vs-vblank-race:
                fail       -> PASS       (shard-hsw) fdo#103060
        Subgroup flip-vs-absolute-wf_vblank:
                fail       -> PASS       (shard-hsw) fdo#100368
Test kms_flip_tiling:
        Subgroup flip-x-tiled:
                fail       -> PASS       (shard-apl) fdo#103822
Test kms_pipe_crc_basic:
        Subgroup suspend-read-crc-pipe-a:
                skip       -> PASS       (shard-snb) fdo#103375 +1
Test kms_plane:
        Subgroup plane-panning-top-left-pipe-c-planes:
                pass       -> FAIL       (shard-apl) fdo#103166 +1
Test perf:
        Subgroup blocking:
                pass       -> FAIL       (shard-hsw) fdo#102252

fdo#104873 https://bugs.freedesktop.org/show_bug.cgi?id=104873
fdo#103060 https://bugs.freedesktop.org/show_bug.cgi?id=103060
fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
fdo#103822 https://bugs.freedesktop.org/show_bug.cgi?id=103822
fdo#103375 https://bugs.freedesktop.org/show_bug.cgi?id=103375
fdo#103166 https://bugs.freedesktop.org/show_bug.cgi?id=103166
fdo#102252 https://bugs.freedesktop.org/show_bug.cgi?id=102252

shard-apl        total:3478 pass:1812 dwarn:1   dfail:0   fail:9   skip:1655 time:13075s
shard-hsw        total:3478 pass:1767 dwarn:1   dfail:0   fail:2   skip:1707 time:11720s
shard-snb        total:3478 pass:1357 dwarn:1   dfail:0   fail:2   skip:2118 time:7157s
Blacklisted hosts:
shard-kbl        total:3478 pass:1937 dwarn:1   dfail:1   fail:9   skip:1530 time:9785s

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8412/shards.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] drm/i915: Specify which engines to reset following semaphore/event lockups
  2018-03-20 10:04 [PATCH 1/2] drm/i915: Specify which engines to reset following semaphore/event lockups Chris Wilson
                   ` (3 preceding siblings ...)
  2018-03-20 12:23 ` ✗ Fi.CI.IGT: failure " Patchwork
@ 2018-03-20 14:08 ` Michel Thierry
  4 siblings, 0 replies; 8+ messages in thread
From: Michel Thierry @ 2018-03-20 14:08 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Mika Kuoppala

On 3/20/2018 3:04 AM, Chris Wilson wrote:
> If the GPU is stuck waiting for an event or for a semaphore, we need to
> reset the GPU in order to recover. We have to tell the reset routine
> which engines we want reset, but we were still using the old interface
> and declaring it as "not-fatal".
> 
> Fixes: 14b730fcb8d9 ("drm/i915/tdr: Prepare error handler to accept mask of hung engines")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>

Reviewed-by: Michel Thierry <michel.thierry@intel.com>

> ---
>   drivers/gpu/drm/i915/intel_hangcheck.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index 42e45ae87393..c8ea510629fa 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -246,7 +246,7 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
>   	 */
>   	tmp = I915_READ_CTL(engine);
>   	if (tmp & RING_WAIT) {
> -		i915_handle_error(dev_priv, 0,
> +		i915_handle_error(dev_priv, BIT(engine->id),
>   				  "Kicking stuck wait on %s",
>   				  engine->name);
>   		I915_WRITE_CTL(engine, tmp);
> @@ -258,7 +258,7 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
>   		default:
>   			return ENGINE_DEAD;
>   		case 1:
> -			i915_handle_error(dev_priv, 0,
> +			i915_handle_error(dev_priv, ALL_ENGINES,
>   					  "Kicking stuck semaphore on %s",
>   					  engine->name);
>   			I915_WRITE_CTL(engine, tmp);
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] drm/i915: Add control flags to i915_handle_error()
  2018-03-20 10:04 ` [PATCH 2/2] drm/i915: Add control flags to i915_handle_error() Chris Wilson
@ 2018-03-20 14:11   ` Michel Thierry
  2018-03-20 15:04     ` Chris Wilson
  0 siblings, 1 reply; 8+ messages in thread
From: Michel Thierry @ 2018-03-20 14:11 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Mika Kuoppala

On 3/20/2018 3:04 AM, Chris Wilson wrote:
> Not all callers want the GPU error to handled in the same way, so expose
> a control parameter. In the first instance, some callers do not want the
> heavyweight error capture so add a bit to request the state to be
> captured and saved.
> 
> v2: Pass msg down to i915_reset/i915_reset_engine so that we include the
> reason for the reset in the dev_notice(), superseding the earlier option
> to not print that notice.
> v3: Stash the reason inside the i915->gpu_error to handover to the direct
> reset from the blocking waiter.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jeff McGee <jeff.mcgee@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>

Reviewed-by: Michel Thierry <michel.thierry@intel.com>

> ---
>   drivers/gpu/drm/i915/i915_debugfs.c              |  4 +-
>   drivers/gpu/drm/i915/i915_drv.c                  | 17 ++++----
>   drivers/gpu/drm/i915/i915_drv.h                  | 10 ++---
>   drivers/gpu/drm/i915/i915_gpu_error.h            |  3 ++
>   drivers/gpu/drm/i915/i915_irq.c                  | 55 ++++++++++++++----------
>   drivers/gpu/drm/i915/i915_request.c              |  2 +-
>   drivers/gpu/drm/i915/intel_hangcheck.c           | 13 +++---
>   drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 13 +++---
>   8 files changed, 62 insertions(+), 55 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 964ea1a12357..7816cd53100a 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -4011,8 +4011,8 @@ i915_wedged_set(void *data, u64 val)
>   		engine->hangcheck.stalled = true;
>   	}
>   
> -	i915_handle_error(i915, val, "Manually set wedged engine mask = %llx",
> -			  val);
> +	i915_handle_error(i915, val, I915_ERROR_CAPTURE,
> +			  "Manually set wedged engine mask = %llx", val);
>   
>   	wait_on_bit(&i915->gpu_error.flags,
>   		    I915_RESET_HANDOFF,
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 1021bf40e236..6b04cc3be513 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1869,7 +1869,6 @@ static int i915_resume_switcheroo(struct drm_device *dev)
>   /**
>    * i915_reset - reset chip after a hang
>    * @i915: #drm_i915_private to reset
> - * @flags: Instructions
>    *
>    * Reset the chip.  Useful if a hang is detected. Marks the device as wedged
>    * on failure.
> @@ -1884,7 +1883,7 @@ static int i915_resume_switcheroo(struct drm_device *dev)
>    *   - re-init interrupt state
>    *   - re-init display
>    */
> -void i915_reset(struct drm_i915_private *i915, unsigned int flags)
> +void i915_reset(struct drm_i915_private *i915)
>   {
>   	struct i915_gpu_error *error = &i915->gpu_error;
>   	int ret;
> @@ -1901,8 +1900,9 @@ void i915_reset(struct drm_i915_private *i915, unsigned int flags)
>   	if (!i915_gem_unset_wedged(i915))
>   		goto wakeup;
>   
> -	if (!(flags & I915_RESET_QUIET))
> -		dev_notice(i915->drm.dev, "Resetting chip after gpu hang\n");
> +	if (error->reason)
> +		dev_notice(i915->drm.dev,
> +			   "Resetting chip for %s\n", error->reason);
>   	error->reset_count++;
>   
>   	disable_irq(i915->drm.irq);
> @@ -2003,7 +2003,7 @@ static inline int intel_gt_reset_engine(struct drm_i915_private *dev_priv,
>   /**
>    * i915_reset_engine - reset GPU engine to recover from a hang
>    * @engine: engine to reset
> - * @flags: options
> + * @msg: reason for GPU reset; or NULL for no dev_notice()
>    *
>    * Reset a specific GPU engine. Useful if a hang is detected.
>    * Returns zero on successful reset or otherwise an error code.
> @@ -2013,7 +2013,7 @@ static inline int intel_gt_reset_engine(struct drm_i915_private *dev_priv,
>    *  - reset engine (which will force the engine to idle)
>    *  - re-init/configure engine
>    */
> -int i915_reset_engine(struct intel_engine_cs *engine, unsigned int flags)
> +int i915_reset_engine(struct intel_engine_cs *engine, const char *msg)
>   {
>   	struct i915_gpu_error *error = &engine->i915->gpu_error;
>   	struct i915_request *active_request;
> @@ -2028,10 +2028,9 @@ int i915_reset_engine(struct intel_engine_cs *engine, unsigned int flags)
>   		goto out;
>   	}
>   
> -	if (!(flags & I915_RESET_QUIET)) {
> +	if (msg)
>   		dev_notice(engine->i915->drm.dev,
> -			   "Resetting %s after gpu hang\n", engine->name);
> -	}
> +			   "Resetting %s for %s\n", engine->name, msg);
>   	error->reset_engine_count[engine->id]++;
>   
>   	if (!engine->i915->guc.execbuf_client)
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index e27ba8fb64e6..c9c3b2ba6a86 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2700,10 +2700,8 @@ extern void i915_driver_unload(struct drm_device *dev);
>   extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
>   extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
>   
> -#define I915_RESET_QUIET BIT(0)
> -extern void i915_reset(struct drm_i915_private *i915, unsigned int flags);
> -extern int i915_reset_engine(struct intel_engine_cs *engine,
> -			     unsigned int flags);
> +extern void i915_reset(struct drm_i915_private *i915);
> +extern int i915_reset_engine(struct intel_engine_cs *engine, const char *msg);
>   
>   extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
>   extern int intel_reset_guc(struct drm_i915_private *dev_priv);
> @@ -2751,10 +2749,12 @@ static inline void i915_queue_hangcheck(struct drm_i915_private *dev_priv)
>   			   &dev_priv->gpu_error.hangcheck_work, delay);
>   }
>   
> -__printf(3, 4)
> +__printf(4, 5)
>   void i915_handle_error(struct drm_i915_private *dev_priv,
>   		       u32 engine_mask,
> +		       unsigned long flags,
>   		       const char *fmt, ...);
> +#define I915_ERROR_CAPTURE BIT(0)
>   
>   extern void intel_irq_init(struct drm_i915_private *dev_priv);
>   extern void intel_irq_fini(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
> index ebbdf37e2879..ac5760673cc9 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> @@ -269,6 +269,9 @@ struct i915_gpu_error {
>   	/** Number of times an engine has been reset */
>   	u32 reset_engine_count[I915_NUM_ENGINES];
>   
> +	/** Reason for the current *global* reset */
> +	const char *reason;
> +
>   	/**
>   	 * Waitqueue to signal when a hang is detected. Used to for waiters
>   	 * to release the struct_mutex for the reset to procede.
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 44eef355e12c..fa7310766217 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2877,15 +2877,10 @@ static irqreturn_t gen11_irq_handler(int irq, void *arg)
>   	return IRQ_HANDLED;
>   }
>   
> -/**
> - * i915_reset_device - do process context error handling work
> - * @dev_priv: i915 device private
> - *
> - * Fire an error uevent so userspace can see that a hang or error
> - * was detected.
> - */
> -static void i915_reset_device(struct drm_i915_private *dev_priv)
> +static void i915_reset_device(struct drm_i915_private *dev_priv,
> +			      const char *msg)
>   {
> +	struct i915_gpu_error *error = &dev_priv->gpu_error;
>   	struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj;
>   	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>   	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
> @@ -2901,29 +2896,32 @@ static void i915_reset_device(struct drm_i915_private *dev_priv)
>   	i915_wedge_on_timeout(&w, dev_priv, 5*HZ) {
>   		intel_prepare_reset(dev_priv);
>   
> +		error->reason = msg;
> +
>   		/* Signal that locked waiters should reset the GPU */
> -		set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
> -		wake_up_all(&dev_priv->gpu_error.wait_queue);
> +		set_bit(I915_RESET_HANDOFF, &error->flags);
> +		wake_up_all(&error->wait_queue);
>   
>   		/* Wait for anyone holding the lock to wakeup, without
>   		 * blocking indefinitely on struct_mutex.
>   		 */
>   		do {
>   			if (mutex_trylock(&dev_priv->drm.struct_mutex)) {
> -				i915_reset(dev_priv, 0);
> +				i915_reset(dev_priv);
>   				mutex_unlock(&dev_priv->drm.struct_mutex);
>   			}
> -		} while (wait_on_bit_timeout(&dev_priv->gpu_error.flags,
> +		} while (wait_on_bit_timeout(&error->flags,
>   					     I915_RESET_HANDOFF,
>   					     TASK_UNINTERRUPTIBLE,
>   					     1));
>   
> +		error->reason = NULL;
> +
>   		intel_finish_reset(dev_priv);
>   	}
>   
> -	if (!test_bit(I915_WEDGED, &dev_priv->gpu_error.flags))
> -		kobject_uevent_env(kobj,
> -				   KOBJ_CHANGE, reset_done_event);
> +	if (!test_bit(I915_WEDGED, &error->flags))
> +		kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
>   }
>   
>   static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
> @@ -2955,6 +2953,7 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
>    * i915_handle_error - handle a gpu error
>    * @dev_priv: i915 device private
>    * @engine_mask: mask representing engines that are hung
> + * @flags: control flags
>    * @fmt: Error message format string
>    *
>    * Do some basic checking of register state at error time and
> @@ -2965,16 +2964,23 @@ static void i915_clear_error_registers(struct drm_i915_private *dev_priv)
>    */
>   void i915_handle_error(struct drm_i915_private *dev_priv,
>   		       u32 engine_mask,
> +		       unsigned long flags,
>   		       const char *fmt, ...)
>   {
>   	struct intel_engine_cs *engine;
>   	unsigned int tmp;
> -	va_list args;
>   	char error_msg[80];
> +	char *msg = NULL;
>   
> -	va_start(args, fmt);
> -	vscnprintf(error_msg, sizeof(error_msg), fmt, args);
> -	va_end(args);
> +	if (fmt) {
> +		va_list args;
> +
> +		va_start(args, fmt);
> +		vscnprintf(error_msg, sizeof(error_msg), fmt, args);
> +		va_end(args);
> +
> +		msg = error_msg;
> +	}
>   
>   	/*
>   	 * In most cases it's guaranteed that we get here with an RPM
> @@ -2986,8 +2992,11 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
>   	intel_runtime_pm_get(dev_priv);
>   
>   	engine_mask &= INTEL_INFO(dev_priv)->ring_mask;
> -	i915_capture_error_state(dev_priv, engine_mask, error_msg);
> -	i915_clear_error_registers(dev_priv);
> +
> +	if (flags & I915_ERROR_CAPTURE) {
> +		i915_capture_error_state(dev_priv, engine_mask, msg);
> +		i915_clear_error_registers(dev_priv);
> +	}
>   
>   	/*
>   	 * Try engine reset when available. We fall back to full reset if
> @@ -3000,7 +3009,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
>   					     &dev_priv->gpu_error.flags))
>   				continue;
>   
> -			if (i915_reset_engine(engine, 0) == 0)
> +			if (i915_reset_engine(engine, msg) == 0)
>   				engine_mask &= ~intel_engine_flag(engine);
>   
>   			clear_bit(I915_RESET_ENGINE + engine->id,
> @@ -3030,7 +3039,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
>   				    TASK_UNINTERRUPTIBLE);
>   	}
>   
> -	i915_reset_device(dev_priv);
> +	i915_reset_device(dev_priv, msg);
>   
>   	for_each_engine(engine, dev_priv, tmp) {
>   		clear_bit(I915_RESET_ENGINE + engine->id,
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 43c7134a9b93..2325886d1d55 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1229,7 +1229,7 @@ static bool __i915_wait_request_check_and_reset(struct i915_request *request)
>   		return false;
>   
>   	__set_current_state(TASK_RUNNING);
> -	i915_reset(request->i915, 0);
> +	i915_reset(request->i915);
>   	return true;
>   }
>   
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index c8ea510629fa..fd0ffb8328d0 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -246,9 +246,8 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
>   	 */
>   	tmp = I915_READ_CTL(engine);
>   	if (tmp & RING_WAIT) {
> -		i915_handle_error(dev_priv, BIT(engine->id),
> -				  "Kicking stuck wait on %s",
> -				  engine->name);
> +		i915_handle_error(dev_priv, BIT(engine->id), 0,
> +				  "stuck wait on %s", engine->name);
>   		I915_WRITE_CTL(engine, tmp);
>   		return ENGINE_WAIT_KICK;
>   	}
> @@ -258,8 +257,8 @@ engine_stuck(struct intel_engine_cs *engine, u64 acthd)
>   		default:
>   			return ENGINE_DEAD;
>   		case 1:
> -			i915_handle_error(dev_priv, ALL_ENGINES,
> -					  "Kicking stuck semaphore on %s",
> +			i915_handle_error(dev_priv, ALL_ENGINES, 0,
> +					  "stuck semaphore on %s",
>   					  engine->name);
>   			I915_WRITE_CTL(engine, tmp);
>   			return ENGINE_WAIT_KICK;
> @@ -386,13 +385,13 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
>   	if (stuck != hung)
>   		hung &= ~stuck;
>   	len = scnprintf(msg, sizeof(msg),
> -			"%s on ", stuck == hung ? "No progress" : "Hang");
> +			"%s on ", stuck == hung ? "no progress" : "hang");
>   	for_each_engine_masked(engine, i915, hung, tmp)
>   		len += scnprintf(msg + len, sizeof(msg) - len,
>   				 "%s, ", engine->name);
>   	msg[len-2] = '\0';
>   
> -	return i915_handle_error(i915, hung, "%s", msg);
> +	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
>   }
>   
>   /*
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index df7898c8edcb..4372826998aa 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -433,7 +433,7 @@ static int igt_global_reset(void *arg)
>   	mutex_lock(&i915->drm.struct_mutex);
>   	reset_count = i915_reset_count(&i915->gpu_error);
>   
> -	i915_reset(i915, I915_RESET_QUIET);
> +	i915_reset(i915);
>   
>   	if (i915_reset_count(&i915->gpu_error) == reset_count) {
>   		pr_err("No GPU reset recorded!\n");
> @@ -518,7 +518,7 @@ static int __igt_reset_engine(struct drm_i915_private *i915, bool active)
>   			engine->hangcheck.seqno =
>   				intel_engine_get_seqno(engine);
>   
> -			err = i915_reset_engine(engine, I915_RESET_QUIET);
> +			err = i915_reset_engine(engine, NULL);
>   			if (err) {
>   				pr_err("i915_reset_engine failed\n");
>   				break;
> @@ -725,7 +725,7 @@ static int __igt_reset_engine_others(struct drm_i915_private *i915,
>   			engine->hangcheck.seqno =
>   				intel_engine_get_seqno(engine);
>   
> -			err = i915_reset_engine(engine, I915_RESET_QUIET);
> +			err = i915_reset_engine(engine, NULL);
>   			if (err) {
>   				pr_err("i915_reset_engine(%s:%s) failed, err=%d\n",
>   				       engine->name, active ? "active" : "idle", err);
> @@ -865,7 +865,6 @@ static int igt_wait_reset(void *arg)
>   		       __func__, rq->fence.seqno, hws_seqno(&h, rq));
>   		intel_engine_dump(rq->engine, &p, "%s\n", rq->engine->name);
>   
> -		i915_reset(i915, 0);
>   		i915_gem_set_wedged(i915);
>   
>   		err = -EIO;
> @@ -962,7 +961,6 @@ static int igt_reset_queue(void *arg)
>   				i915_request_put(rq);
>   				i915_request_put(prev);
>   
> -				i915_reset(i915, 0);
>   				i915_gem_set_wedged(i915);
>   
>   				err = -EIO;
> @@ -971,7 +969,7 @@ static int igt_reset_queue(void *arg)
>   
>   			reset_count = fake_hangcheck(prev);
>   
> -			i915_reset(i915, I915_RESET_QUIET);
> +			i915_reset(i915);
>   
>   			GEM_BUG_ON(test_bit(I915_RESET_HANDOFF,
>   					    &i915->gpu_error.flags));
> @@ -1069,7 +1067,6 @@ static int igt_handle_error(void *arg)
>   		       __func__, rq->fence.seqno, hws_seqno(&h, rq));
>   		intel_engine_dump(rq->engine, &p, "%s\n", rq->engine->name);
>   
> -		i915_reset(i915, 0);
>   		i915_gem_set_wedged(i915);
>   
>   		err = -EIO;
> @@ -1084,7 +1081,7 @@ static int igt_handle_error(void *arg)
>   	engine->hangcheck.stalled = true;
>   	engine->hangcheck.seqno = intel_engine_get_seqno(engine);
>   
> -	i915_handle_error(i915, intel_engine_flag(engine), "%s", __func__);
> +	i915_handle_error(i915, intel_engine_flag(engine), 0, NULL);
>   
>   	xchg(&i915->gpu_error.first_error, error);
>   
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] drm/i915: Add control flags to i915_handle_error()
  2018-03-20 14:11   ` Michel Thierry
@ 2018-03-20 15:04     ` Chris Wilson
  0 siblings, 0 replies; 8+ messages in thread
From: Chris Wilson @ 2018-03-20 15:04 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx; +Cc: Mika Kuoppala

Quoting Michel Thierry (2018-03-20 14:11:03)
> On 3/20/2018 3:04 AM, Chris Wilson wrote:
> > Not all callers want the GPU error to handled in the same way, so expose
> > a control parameter. In the first instance, some callers do not want the
> > heavyweight error capture so add a bit to request the state to be
> > captured and saved.
> > 
> > v2: Pass msg down to i915_reset/i915_reset_engine so that we include the
> > reason for the reset in the dev_notice(), superseding the earlier option
> > to not print that notice.
> > v3: Stash the reason inside the i915->gpu_error to handover to the direct
> > reset from the blocking waiter.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Jeff McGee <jeff.mcgee@intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > Cc: Michel Thierry <michel.thierry@intel.com>
> 
> Reviewed-by: Michel Thierry <michel.thierry@intel.com>

Thanks for the pointing out the regression and the review,
pushed.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-03-20 15:04 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-20 10:04 [PATCH 1/2] drm/i915: Specify which engines to reset following semaphore/event lockups Chris Wilson
2018-03-20 10:04 ` [PATCH 2/2] drm/i915: Add control flags to i915_handle_error() Chris Wilson
2018-03-20 14:11   ` Michel Thierry
2018-03-20 15:04     ` Chris Wilson
2018-03-20 10:12 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/2] drm/i915: Specify which engines to reset following semaphore/event lockups Patchwork
2018-03-20 10:27 ` ✓ Fi.CI.BAT: success " Patchwork
2018-03-20 12:23 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-03-20 14:08 ` [PATCH 1/2] " Michel Thierry

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.