* [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support
@ 2019-02-21  2:58 Carlos Santa
  2019-02-21  2:58 ` [PATCH v4 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Carlos Santa @ 2019-02-21  2:58 UTC (permalink / raw)
  To: intel-gfx

This is a rebase of the original patch series from Michel Thierry,
which can be found here:

https://patchwork.freedesktop.org/series/21868

Note that this series is limited to the GPU watchdog timeout for
execlists; support for GuC-based submission is left for a later time.

PATCH v4 of this series was successfully tested from userspace
through the IGT test gem_watchdog --run-subtest basic-bsd1;
that test is not yet upstream.

Also, the changes on the i965 media userspace driver are currently
under review at

https://github.com/intel/intel-vaapi-driver/pull/429/files

The testbed used for this series included an SKL-based NUC with
2 BSD rings as well as a KBL-based Chromebook with 1 BSD ring.

Michel Thierry (5):
  drm/i915: Add engine reset count in get-reset-stats ioctl
  drm/i915: Watchdog timeout: IRQ handler for gen8+
  drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  drm/i915: Watchdog timeout: Include threshold value in error state

 drivers/gpu/drm/i915/i915_drv.h         |  56 ++++++++++
 drivers/gpu/drm/i915/i915_gem_context.c | 103 ++++++++++++++++-
 drivers/gpu/drm/i915/i915_gem_context.h |   4 +
 drivers/gpu/drm/i915/i915_gpu_error.c   |  12 +-
 drivers/gpu/drm/i915/i915_gpu_error.h   |   5 +
 drivers/gpu/drm/i915/i915_irq.c         |  12 +-
 drivers/gpu/drm/i915/i915_reg.h         |   6 +
 drivers/gpu/drm/i915/intel_engine_cs.c  |   3 +
 drivers/gpu/drm/i915/intel_hangcheck.c  |  17 ++-
 drivers/gpu/drm/i915/intel_lrc.c        | 142 +++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_lrc.h        |   2 +
 drivers/gpu/drm/i915/intel_ringbuffer.h |  25 ++++-
 include/uapi/drm/i915_drm.h             |   7 +-
 13 files changed, 374 insertions(+), 20 deletions(-)

-- 
2.17.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH v4 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl
  2019-02-21  2:58 [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
@ 2019-02-21  2:58 ` Carlos Santa
  2019-02-25 13:34   ` Tvrtko Ursulin
  2019-02-21  2:58 ` [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Carlos Santa @ 2019-02-21  2:58 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

Users/tests relying on the total reset count will start seeing a smaller
number, since most hangs can now be handled by engine reset.
Note that if we reset engine x, a context running on engine y will be
unaware of and unaffected by the reset.

To start the discussion, include just a total engine reset count. If it
is deemed useful, it can be extended to report each engine separately.

Our IGT gem_reset_stats test will need changes to ignore the pad field,
since it can now return reset_engine_count.

v2: s/engine_reset/reset_engine/, use union in uapi to not break compatibility.
v3: Keep rejecting attempts to use pad as input (Antonio)
v4: Rebased.
v5: Rebased.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 12 ++++++++++--
 include/uapi/drm/i915_drm.h             |  6 +++++-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 459f8eae1c39..cbfe8f2eb3f2 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -1889,6 +1889,8 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_reset_stats *args = data;
 	struct i915_gem_context *ctx;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
 	int ret;
 
 	if (args->flags || args->pad)
@@ -1907,10 +1909,16 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	 * we should wrap the hangstats with a seqlock.
 	 */
 
-	if (capable(CAP_SYS_ADMIN))
+	if (capable(CAP_SYS_ADMIN)) {
 		args->reset_count = i915_reset_count(&dev_priv->gpu_error);
-	else
+		for_each_engine(engine, dev_priv, id)
+			args->reset_engine_count +=
+				i915_reset_engine_count(&dev_priv->gpu_error,
+							engine);
+	} else {
 		args->reset_count = 0;
+		args->reset_engine_count = 0;
+	}
 
 	args->batch_active = atomic_read(&ctx->guilty_count);
 	args->batch_pending = atomic_read(&ctx->active_count);
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index cc03ef9f885f..3f2c89740b0e 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1642,7 +1642,11 @@ struct drm_i915_reset_stats {
 	/* Number of batches lost pending for execution, for this context */
 	__u32 batch_pending;
 
-	__u32 pad;
+	union {
+		__u32 pad;
+		/* Engine resets since boot/module reload, for all contexts */
+		__u32 reset_engine_count;
+	};
 };
 
 struct drm_i915_gem_userptr {
-- 
2.17.1



* [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-02-21  2:58 [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
  2019-02-21  2:58 ` [PATCH v4 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
@ 2019-02-21  2:58 ` Carlos Santa
  2019-02-28 17:38   ` Tvrtko Ursulin
  2019-03-01  9:36   ` Chris Wilson
  2019-02-21  2:58 ` [PATCH v4 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 23+ messages in thread
From: Carlos Santa @ 2019-02-21  2:58 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

*** General ***

Watchdog timeout (or "media engine reset") is a feature that allows
userland applications to enable hang detection on individual batch buffers.
The detection mechanism itself is mostly implemented in hardware; to
support this form of hang detection, the driver only needs to implement the
interrupt handling and to emit the watchdog commands before and after the
batch buffer start instruction in the ring buffer.

The principle of the hang detection mechanism is as follows:

1. Once the decision has been made to enable watchdog timeout for a
particular batch buffer, the driver, while emitting the batch buffer start
instruction into the ring buffer, also emits a watchdog timer start
instruction before it and a watchdog timer cancellation instruction after
it.

2. Once GPU execution reaches the watchdog timer start instruction, the
hardware starts the watchdog counter. The counter keeps counting until it
either reaches a previously configured threshold value or the timer
cancellation instruction is executed.

2a. If the counter reaches the threshold value, the hardware fires a
watchdog interrupt that is picked up by the watchdog interrupt handler.
This means that a hang has been detected, and the driver needs to deal
with it the same way it would deal with an engine hang detected by the
periodic hang checker. The only difference between the two is that we have
already blamed the active request (to ensure an engine reset).

2b. If the batch buffer completes and execution reaches the watchdog
cancellation instruction before the counter reaches its threshold value,
the watchdog is cancelled and nothing more comes of it. No hang is
detected.

Note about future interaction with preemption: preemption could happen in
a command sequence before the watchdog counter is disabled, resulting in
the watchdog being triggered after preemption (e.g. when the watchdog had
been enabled in the low-priority batch). The driver will need to
explicitly disable the watchdog counter as part of the preemption
sequence.

*** This patch introduces: ***

1. IRQ handler code for watchdog timeout allowing direct hang recovery
based on hardware-driven hang detection, which then integrates directly
with the hang recovery path. This is independent of having per-engine reset
or just full gpu reset.

2. Watchdog specific register information.

Currently the render engine and all available media engines support
watchdog timeout (VECS only from GEN9 onwards). The specifications allude
to the BCS engine being supported as well, but that is not covered by this
commit.

Note that the value to stop the counter is different between render and
non-render engines in GEN8; GEN9 onwards it's the same.

v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
against false-positives.

v3: Don't use high priority tasklet, use engine_last_submit while
checking for false-positives. From GEN9 onwards, the stop counter bit is
the same for all engines.

v4: Remove unnecessary brackets, use current_seqno to mark the request
as guilty in the hangcheck/capture code.

v5: Rebased after RESET_ENGINEs flag.

v6: Don't capture error state in case of watchdog timeout. The capture
process is time consuming and this will align to what happens when we
use GuC to handle the watchdog timeout. (Chris)

v7: Rebase.

v8: Rebase, use HZ to reschedule.

v9: Rebase, get forcewake domains in function (no longer in execlists
struct).

v10: Rebase.

v11: Rebase,
     remove extra braces (Tvrtko),
     implement watchdog_to_clock_counts helper (Tvrtko),
     Move tasklet_kill(watchdog_tasklet) inside intel_engines (Tvrtko),
     Use a global heartbeat seqno instead of engine seqno (Chris)
     Make all engines checks all class based checks (Tvrtko)

Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |  8 +++
 drivers/gpu/drm/i915/i915_gpu_error.h   |  4 ++
 drivers/gpu/drm/i915/i915_irq.c         | 12 ++++-
 drivers/gpu/drm/i915/i915_reg.h         |  6 +++
 drivers/gpu/drm/i915/intel_engine_cs.c  |  1 +
 drivers/gpu/drm/i915/intel_hangcheck.c  | 17 +++++--
 drivers/gpu/drm/i915/intel_lrc.c        | 65 +++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++
 8 files changed, 114 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 63a008aebfcd..0fcb2df869a2 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3120,6 +3120,14 @@ i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id)
 	return ctx;
 }
 
+/* Stub: the us -> clock-counts conversion is filled in by a later patch */
+static inline u32
+watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us)
+{
+	u64 threshold = 0;
+
+	return threshold;
+}
+
 int i915_perf_open_ioctl(struct drm_device *dev, void *data,
 			 struct drm_file *file);
 int i915_perf_add_config_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index f408060e0667..bd1821c73ecd 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -233,6 +233,9 @@ struct i915_gpu_error {
 	 * i915_mutex_lock_interruptible()?). I915_RESET_BACKOFF serves a
 	 * secondary role in preventing two concurrent global reset attempts.
 	 *
+	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
+	 * I915_RESET_WATCHDOG to report the hang detection cause accurately.
+	 *
 	 * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't need to
 	 * acquire the struct_mutex to reset an engine, we need an explicit
 	 * flag to prevent two concurrent reset attempts in the same engine.
@@ -248,6 +251,7 @@ struct i915_gpu_error {
 #define I915_RESET_BACKOFF	0
 #define I915_RESET_MODESET	1
 #define I915_RESET_ENGINE	2
+#define I915_RESET_WATCHDOG	3
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
 	/** Number of times an engine has been reset */
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 4b23b2fd1fad..e2a1a07b0f2c 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1456,6 +1456,9 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
 
 	if (tasklet)
 		tasklet_hi_schedule(&engine->execlists.tasklet);
+
+	if (iir & GT_GEN8_WATCHDOG_INTERRUPT)
+		tasklet_schedule(&engine->execlists.watchdog_tasklet);
 }
 
 static void gen8_gt_irq_ack(struct drm_i915_private *i915,
@@ -3883,17 +3886,24 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	u32 gt_interrupts[] = {
 		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT,
 		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
-			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
+			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
 		0,
 		GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT
 		};
 
+	/* VECS watchdog is only available in skl+ */
+	if (INTEL_GEN(dev_priv) >= 9)
+		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
+
 	dev_priv->pm_ier = 0x0;
 	dev_priv->pm_imr = ~dev_priv->pm_ier;
 	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 1eca166d95bb..a0e101bbcbce 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -2335,6 +2335,11 @@ enum i915_power_well_id {
 #define RING_START(base)	_MMIO((base) + 0x38)
 #define RING_CTL(base)		_MMIO((base) + 0x3c)
 #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
+#define RING_CNTR(base)		_MMIO((base) + 0x178)
+#define   GEN8_WATCHDOG_ENABLE		0
+#define   GEN8_WATCHDOG_DISABLE		1
+#define   GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 & non-render only */
+#define RING_THRESH(base)	_MMIO((base) + 0x17C)
 #define RING_SYNC_0(base)	_MMIO((base) + 0x40)
 #define RING_SYNC_1(base)	_MMIO((base) + 0x44)
 #define RING_SYNC_2(base)	_MMIO((base) + 0x48)
@@ -2894,6 +2899,7 @@ enum i915_power_well_id {
 #define GT_BSD_USER_INTERRUPT			(1 << 12)
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
 #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
+#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
 #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
 #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 7ae753358a6d..74f563d23cc8 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -1106,6 +1106,7 @@ void intel_engines_park(struct drm_i915_private *i915)
 		/* Flush the residual irq tasklets first. */
 		intel_engine_disarm_breadcrumbs(engine);
 		tasklet_kill(&engine->execlists.tasklet);
+		tasklet_kill(&engine->execlists.watchdog_tasklet);
 
 		/*
 		 * We are committed now to parking the engines, make sure there
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index 58b6ff8453dc..bc10acb24d9a 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -218,7 +218,8 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
 
 static void hangcheck_declare_hang(struct drm_i915_private *i915,
 				   unsigned int hung,
-				   unsigned int stuck)
+				   unsigned int stuck,
+				   unsigned int watchdog)
 {
 	struct intel_engine_cs *engine;
 	char msg[80];
@@ -231,13 +232,16 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
 	if (stuck != hung)
 		hung &= ~stuck;
 	len = scnprintf(msg, sizeof(msg),
-			"%s on ", stuck == hung ? "no progress" : "hang");
+			"%s on ", watchdog ? "watchdog timeout" :
+				  stuck == hung ? "no progress" : "hang");
 	for_each_engine_masked(engine, i915, hung, tmp)
 		len += scnprintf(msg + len, sizeof(msg) - len,
 				 "%s, ", engine->name);
 	msg[len-2] = '\0';
 
-	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
+	return i915_handle_error(i915, hung,
+				 watchdog ? 0 : I915_ERROR_CAPTURE,
+				 "%s", msg);
 }
 
 /*
@@ -255,7 +259,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			     gpu_error.hangcheck_work.work);
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
-	unsigned int hung = 0, stuck = 0, wedged = 0;
+	unsigned int hung = 0, stuck = 0, wedged = 0, watchdog = 0;
 
 	if (!i915_modparams.enable_hangcheck)
 		return;
@@ -266,6 +270,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	if (i915_terminally_wedged(&dev_priv->gpu_error))
 		return;
 
+	if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags))
+		watchdog = 1;
+
 	/* As enabling the GPU requires fairly extensive mmio access,
 	 * periodically arm the mmio checker to see if we are triggering
 	 * any invalid access.
@@ -311,7 +318,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	}
 
 	if (hung)
-		hangcheck_declare_hang(dev_priv, hung, stuck);
+		hangcheck_declare_hang(dev_priv, hung, stuck, watchdog);
 
 	/* Reset timer in case GPU hangs without another request being added */
 	i915_queue_hangcheck(dev_priv);
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 9ca7dc7a6fa5..c38b239ab39e 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2352,6 +2352,53 @@ static int gen8_emit_flush_render(struct i915_request *request,
 	return 0;
 }
 
+/* From GEN9 onwards, all engines use the same RING_CNTR format */
+static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
+{
+	if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9)
+		return GEN8_WATCHDOG_DISABLE;
+	else
+		return GEN8_XCS_WATCHDOG_DISABLE;
+}
+
+#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)
+static void gen8_watchdog_irq_handler(unsigned long data)
+{
+	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
+	struct drm_i915_private *dev_priv = engine->i915;
+	unsigned int hung = 0;
+	u32 current_seqno = 0;
+	char msg[80];
+	unsigned int tmp;
+	int len;
+
+	/* Stop the counter to prevent further timeout interrupts */
+	I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));
+
+	/* Re-read the heartbeat seqno to check whether we are really stuck */
+	current_seqno = intel_engine_get_hangcheck_seqno(engine);
+
+	if (current_seqno == engine->current_seqno) {
+		hung |= engine->mask;
+
+		len = scnprintf(msg, sizeof(msg), "%s on ", "watchdog timeout");
+		for_each_engine_masked(engine, dev_priv, hung, tmp)
+			len += scnprintf(msg + len, sizeof(msg) - len,
+					 "%s, ", engine->name);
+		msg[len-2] = '\0';
+
+		i915_handle_error(dev_priv, hung, 0, "%s", msg);
+
+		/* Reset timer in case GPU hangs without another request being added */
+		i915_queue_hangcheck(dev_priv);
+	} else {
+		/* Re-start the counter; if really hung, it will expire again */
+		I915_WRITE_FW(RING_THRESH(engine->mmio_base),
+			      GEN8_WATCHDOG_1000US(dev_priv));
+		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
+	}
+}
+
 /*
  * Reserve space for 2 NOOPs at the end of each request to be
  * used as a workaround for not being allowed to do lite
@@ -2539,6 +2586,21 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
 
 	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
 	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
+
+	switch (engine->class) {
+	default:
+		/* BCS engine does not support hw watchdog */
+		break;
+	case RENDER_CLASS:
+	case VIDEO_DECODE_CLASS:
+		engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT << shift;
+		break;
+	case VIDEO_ENHANCEMENT_CLASS:
+		if (INTEL_GEN(engine->i915) >= 9)
+			engine->irq_keep_mask |=
+				GT_GEN8_WATCHDOG_INTERRUPT << shift;
+		break;
+	}
 }
 
 static int
@@ -2556,6 +2618,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
 	tasklet_init(&engine->execlists.tasklet,
 		     execlists_submission_tasklet, (unsigned long)engine);
 
+	tasklet_init(&engine->execlists.watchdog_tasklet,
+		     gen8_watchdog_irq_handler, (unsigned long)engine);
+
 	logical_ring_default_vfuncs(engine);
 	logical_ring_default_irqs(engine);
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 465094e38d32..17250ba0246f 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -122,6 +122,7 @@ struct intel_engine_hangcheck {
 	u64 acthd;
 	u32 last_seqno;
 	u32 next_seqno;
+	u32 watchdog;
 	unsigned long action_timestamp;
 	struct intel_instdone instdone;
 };
@@ -222,6 +223,11 @@ struct intel_engine_execlists {
 	 */
 	struct tasklet_struct tasklet;
 
+	/**
+	 * @watchdog_tasklet: stop counter and re-schedule hangcheck_work asap
+	 */
+	struct tasklet_struct watchdog_tasklet;
+
 	/**
 	 * @default_priolist: priority list for I915_PRIORITY_NORMAL
 	 */
@@ -353,6 +359,7 @@ struct intel_engine_cs {
 	unsigned int hw_id;
 	unsigned int guc_id;
 	unsigned long mask;
+	u32 current_seqno;
 
 	u8 uabi_class;
 
-- 
2.17.1



* [PATCH v4 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2019-02-21  2:58 [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
  2019-02-21  2:58 ` [PATCH v4 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
  2019-02-21  2:58 ` [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
@ 2019-02-21  2:58 ` Carlos Santa
  2019-02-21  2:58 ` [PATCH v4 4/5] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Carlos Santa
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Carlos Santa @ 2019-02-21  2:58 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

Emit the required commands into the ring buffer for starting and
stopping the watchdog timer before/after batch buffer start during
batch buffer submission.

v2: Support watchdog threshold per context engine, merge lri commands,
and move watchdog commands emission to emit_bb_start. Request space of
combined start_watchdog, bb_start and stop_watchdog to avoid any error
after emitting bb_start.

v3: There were too many req->engine in emit_bb_start.
Use GEM_BUG_ON instead of returning a very late EINVAL in the unlikely
case of watchdog misprogramming; set correct LRI cmd size in
emit_stop_watchdog. (Chris)

v4: Rebase.
v5: use to_intel_context instead of ctx->engine.
v6: Rebase.
v7: Rebase,
    Store gpu watchdog capability in engine flag (Tvrtko)
    Store WATCHDOG_DISABLE magic # in engine (Tvrtko)
    No need to declare emit_{start|stop}_watchdog as vfuncs (Tvrtko)
    Replace flag watchdog_running with enable_watchdog (Tvrtko)
    Emit a single MI_NOOP by conditionally checking whether the #
    of emitted OPs is odd (Tvrtko)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.h |  4 ++
 drivers/gpu/drm/i915/intel_engine_cs.c  |  2 +
 drivers/gpu/drm/i915/intel_lrc.c        | 79 +++++++++++++++++++++++--
 drivers/gpu/drm/i915/intel_lrc.h        |  2 +
 drivers/gpu/drm/i915/intel_ringbuffer.h | 18 ++++--
 5 files changed, 97 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index b1eeac64da8b..dcf4e98666a6 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -183,6 +183,10 @@ struct i915_gem_context {
 		u32 *lrc_reg_state;
 		u64 lrc_desc;
 		int pin_count;
+		/**
+		 * @watchdog_threshold: hw watchdog threshold value,
+		 * in clock counts
+		 */
+		u32 watchdog_threshold;
 
 		/**
 		 * active_tracker: Active tracker for the external rq activity
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 74f563d23cc8..438bf93a4340 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -324,6 +324,8 @@ intel_engine_setup(struct drm_i915_private *dev_priv,
 	if (engine->context_size)
 		DRIVER_CAPS(dev_priv)->has_logical_contexts = true;
 
+	engine->watchdog_disable_id = get_watchdog_disable(engine);
+
 	/* Nothing to do here, execute in order of dependencies */
 	engine->schedule = NULL;
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index c38b239ab39e..9406d3f2b789 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2193,16 +2193,75 @@ static void execlists_reset_finish(struct intel_engine_cs *engine)
 		  atomic_read(&execlists->tasklet.count));
 }
 
+static u32 *gen8_emit_start_watchdog(struct i915_request *rq, u32 *cs)
+{
+	struct intel_engine_cs *engine = rq->engine;
+	struct i915_gem_context *ctx = rq->gem_context;
+	struct intel_context *ce = to_intel_context(ctx, engine);
+
+	GEM_BUG_ON(!intel_engine_supports_watchdog(engine));
+
+	/*
+	 * watchdog register must never be programmed to zero. This would
+	 * cause the watchdog counter to exceed and not allow the engine to
+	 * go into IDLE state
+	 */
+	GEM_BUG_ON(ce->watchdog_threshold == 0);
+
+	/* Set counter period */
+	*cs++ = MI_LOAD_REGISTER_IMM(2);
+	*cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base));
+	*cs++ = ce->watchdog_threshold;
+	/* Start counter */
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = GEN8_WATCHDOG_ENABLE;
+
+	return cs;
+}
+
+static u32 *gen8_emit_stop_watchdog(struct i915_request *rq, u32 *cs)
+{
+	struct intel_engine_cs *engine = rq->engine;
+
+	GEM_BUG_ON(!intel_engine_supports_watchdog(engine));
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = engine->watchdog_disable_id;
+
+	return cs;
+}
+
 static int gen8_emit_bb_start(struct i915_request *rq,
 			      u64 offset, u32 len,
 			      const unsigned int flags)
 {
+	struct intel_engine_cs *engine = rq->engine;
 	u32 *cs;
+	u32 num_dwords;
+	bool enable_watchdog = false;
 
-	cs = intel_ring_begin(rq, 6);
+	/* bb_start only */
+	num_dwords = 6;
+
+	/* check if the watchdog will be required */
+	if (to_intel_context(rq->gem_context, engine)->watchdog_threshold != 0) {
+		/* + start_watchdog (6) + stop_watchdog (4) */
+		num_dwords += 10;
+		enable_watchdog = true;
+	}
+
+	cs = intel_ring_begin(rq, num_dwords);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
+	if (enable_watchdog) {
+		/* Start watchdog timer */
+		cs = gen8_emit_start_watchdog(rq, cs);
+		engine->current_seqno = intel_engine_get_hangcheck_seqno(engine);
+	}
+
 	/*
 	 * WaDisableCtxRestoreArbitration:bdw,chv
 	 *
@@ -2229,10 +2288,16 @@ static int gen8_emit_bb_start(struct i915_request *rq,
 	*cs++ = upper_32_bits(offset);
 
 	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
-	*cs++ = MI_NOOP;
 
-	intel_ring_advance(rq, cs);
+	if (enable_watchdog) {
+		/* Cancel watchdog timer */
+		cs = gen8_emit_stop_watchdog(rq, cs);
+	}
+
+	/* Pad to an even number of dwords; the ring tail must be qword aligned */
+	if (intel_ring_offset(rq, cs) & 7)
+		*cs++ = MI_NOOP;
 
+	intel_ring_advance(rq, cs);
 	return 0;
 }
 
@@ -2353,7 +2418,7 @@ static int gen8_emit_flush_render(struct i915_request *request,
 }
 
 /* From GEN9 onwards, all engines use the same RING_CNTR format */
-static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
+u32 get_watchdog_disable(struct intel_engine_cs *engine)
 {
 	if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9)
 		return GEN8_WATCHDOG_DISABLE;
@@ -2532,6 +2597,9 @@ void intel_execlists_set_default_submission(struct intel_engine_cs *engine)
 		I915_SCHEDULER_CAP_PRIORITY;
 	if (intel_engine_has_preemption(engine))
 		engine->i915->caps.scheduler |= I915_SCHEDULER_CAP_PREEMPTION;
+
+	if (engine->id != BCS)
+		engine->flags |= I915_ENGINE_SUPPORTS_WATCHDOG;
 }
 
 static void
@@ -2710,6 +2778,9 @@ int logical_xcs_ring_init(struct intel_engine_cs *engine)
 	if (err)
 		return err;
 
+	/* BCS engine does not have a watchdog-expired irq */
+	GEM_BUG_ON(engine->id == BCS &&
+		   intel_engine_supports_watchdog(engine));
+
 	return logical_ring_init(engine);
 }
 
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 5779e776cc3f..9db4f6369574 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -120,4 +120,6 @@ void intel_virtual_engine_put(struct intel_engine_cs *engine);
 
 u32 gen8_make_rpcs(struct drm_i915_private *i915, struct intel_sseu *ctx_sseu);
 
+u32 get_watchdog_disable(struct intel_engine_cs *engine);
+
 #endif /* _INTEL_LRC_H_ */
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 17250ba0246f..ec0b7d3c6315 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -360,6 +360,7 @@ struct intel_engine_cs {
 	unsigned int guc_id;
 	unsigned long mask;
 	u32 current_seqno;
+	u32 watchdog_disable_id;
 
 	u8 uabi_class;
 
@@ -463,6 +464,7 @@ struct intel_engine_cs {
 	int		(*init_context)(struct i915_request *rq);
 
 	int		(*emit_flush)(struct i915_request *request, u32 mode);
+
 #define EMIT_INVALIDATE	BIT(0)
 #define EMIT_FLUSH	BIT(1)
 #define EMIT_BARRIER	(EMIT_INVALIDATE | EMIT_FLUSH)
@@ -520,10 +522,12 @@ struct intel_engine_cs {
 
 	struct intel_engine_hangcheck hangcheck;
 
-#define I915_ENGINE_NEEDS_CMD_PARSER BIT(0)
-#define I915_ENGINE_SUPPORTS_STATS   BIT(1)
-#define I915_ENGINE_HAS_PREEMPTION   BIT(2)
-#define I915_ENGINE_IS_VIRTUAL       BIT(3)
+#define I915_ENGINE_NEEDS_CMD_PARSER  BIT(0)
+#define I915_ENGINE_SUPPORTS_STATS    BIT(1)
+#define I915_ENGINE_HAS_PREEMPTION    BIT(2)
+#define I915_ENGINE_IS_VIRTUAL        BIT(3)
+#define I915_ENGINE_SUPPORTS_WATCHDOG BIT(4)
+
 	unsigned int flags;
 
 	/*
@@ -612,6 +616,12 @@ intel_engine_is_virtual(const struct intel_engine_cs *engine)
 	return engine->flags & I915_ENGINE_IS_VIRTUAL;
 }
 
+static inline bool
+intel_engine_supports_watchdog(const struct intel_engine_cs *engine)
+{
+	return engine->flags & I915_ENGINE_SUPPORTS_WATCHDOG;
+}
+
 static inline void
 execlists_set_active(struct intel_engine_execlists *execlists,
 		     unsigned int bit)
-- 
2.17.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v4 4/5] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2019-02-21  2:58 [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
                   ` (2 preceding siblings ...)
  2019-02-21  2:58 ` [PATCH v4 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
@ 2019-02-21  2:58 ` Carlos Santa
  2019-02-28 17:22   ` Tvrtko Ursulin
  2019-02-21  2:58 ` [PATCH v4 5/5] drm/i915: Watchdog timeout: Include threshold value in error state Carlos Santa
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Carlos Santa @ 2019-02-21  2:58 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

Final enablement patch for GPU hang detection using watchdog timeout.
Using the gem_context_setparam ioctl, users can specify the desired
timeout value in microseconds, and the driver will do the conversion to
'timestamps'.

The recommended default watchdog threshold for video engines is 60000 us,
since this has been _empirically determined_ to be a good compromise between
low-latency requirements and a low rate of false positives. The default
register value is ~106000 us and the theoretical max value (all 1s) is
353 seconds.

[1] http://patchwork.freedesktop.org/patch/msgid/20170329135831.30254-2-chris@chris-wilson.co.uk

v2: Fixed get api to return values in microseconds. Threshold updated to
be per context engine. Check for u32 overflow. Capture ctx threshold
value in error state.

v3: Add a way to get array size, short-cut to disable all thresholds,
return EFAULT / EINVAL as needed. Move the capture of the threshold
value in the error state into a new patch. BXT has a different
timestamp base (because why not?).

v4: Checking if watchdog is available should be the first thing to
do, instead of giving false hopes to abi users; remove unnecessary & in
set_watchdog; ignore args->size in getparam.

v5: GEN9-LP platforms have a different crystal clock frequency, use the
right timestamp base for them (magic 8-ball predicts this will change
again later on, so future-proof it). (Daniele)

v6: Rebase, no more mutex BLK in getparam_ioctl.

v7: use to_intel_context instead of ctx->engine.

v8: Rebase, remove extra mutex from i915_gem_context_set_watchdog (Tvrtko),
Update UAPI to use engine class while keeping thresholds per
engine class (Michel).

v9: Rebase,
    Remove outdated comment from the commit message (Tvrtko)
    Use the engine->flag to verify for gpu watchdog support (Tvrtko)
    Use the standard copy_to_user() instead (Tvrtko)
    Use the correct type when declaring engine class iterator (Tvrtko)
    Remove yet another unnecessary mutex_lock (Tvrtko)

Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         | 50 +++++++++++++-
 drivers/gpu/drm/i915/i915_gem_context.c | 91 +++++++++++++++++++++++++
 include/uapi/drm/i915_drm.h             |  1 +
 3 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 0fcb2df869a2..aaa5810ba76c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1582,6 +1582,9 @@ struct drm_i915_private {
 	struct drm_i915_fence_reg fence_regs[I915_MAX_NUM_FENCES]; /* assume 965 */
 	int num_fence_regs; /* 8 on pre-965, 16 otherwise */
 
+	/* Command stream timestamp base - helps define watchdog threshold */
+	u32 cs_timestamp_base;
+
 	unsigned int fsb_freq, mem_freq, is_ddr3;
 	unsigned int skl_preferred_vco_freq;
 	unsigned int max_cdclk_freq;
@@ -3120,10 +3123,55 @@ i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id)
 	return ctx;
 }
 
+/*
+ * BDW, CHV & SKL+ Timestamp timer resolution = 0.080 uSec,
+ * or 12500000 counts per second, or ~12 counts per microsecond.
+ *
+ * But BXT/GLK Timestamp timer resolution is different, 0.052 uSec,
+ * or 19200000 counts per second, or ~19 counts per microsecond.
+ *
+ * Future-proofing, some day it won't be as simple as just GEN & IS_LP.
+ */
+#define GEN8_TIMESTAMP_CNTS_PER_USEC 12
+#define GEN9_LP_TIMESTAMP_CNTS_PER_USEC 19
+static inline u32 cs_timestamp_in_us(struct drm_i915_private *dev_priv)
+{
+	u32 cs_timestamp_base = dev_priv->cs_timestamp_base;
+
+	if (cs_timestamp_base)
+		return cs_timestamp_base;
+
+	switch (INTEL_GEN(dev_priv)) {
+	default:
+		MISSING_CASE(INTEL_GEN(dev_priv));
+		/* fall through */
+	case 9:
+		cs_timestamp_base = IS_GEN9_LP(dev_priv) ?
+					GEN9_LP_TIMESTAMP_CNTS_PER_USEC :
+					GEN8_TIMESTAMP_CNTS_PER_USEC;
+		break;
+	case 8:
+		cs_timestamp_base = GEN8_TIMESTAMP_CNTS_PER_USEC;
+		break;
+	}
+
+	dev_priv->cs_timestamp_base = cs_timestamp_base;
+	return cs_timestamp_base;
+}
+
+static inline u32
+watchdog_to_us(struct drm_i915_private *dev_priv, u32 value_in_clock_counts)
+{
+	return value_in_clock_counts / cs_timestamp_in_us(dev_priv);
+}
+
 static inline u32
 watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us)
 {
-	u64 threshold = 0;
+	u64 threshold = value_in_us * cs_timestamp_in_us(dev_priv);
+
+	if (overflows_type(threshold, u32))
+		return -EINVAL;
 
 	return threshold;
 }
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index cbfe8f2eb3f2..e1abca28140b 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -1573,6 +1573,89 @@ get_engines(struct i915_gem_context *ctx,
 	return err;
 }
 
+/* Return the timer count threshold in microseconds. */
+int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *dev_priv = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	u32 threshold_in_us[OTHER_CLASS];
+
+	if (!intel_engine_supports_watchdog(dev_priv->engine[VCS]))
+		return -ENODEV;
+
+	for_each_engine(engine, dev_priv, id) {
+		struct intel_context *ce = to_intel_context(ctx, engine);
+
+		threshold_in_us[engine->class] = watchdog_to_us(dev_priv,
+								ce->watchdog_threshold);
+	}
+
+	if (copy_to_user(u64_to_user_ptr(args->value),
+			   &threshold_in_us,
+			   sizeof(threshold_in_us))) {
+		return -EFAULT;
+	}
+
+	args->size = sizeof(threshold_in_us);
+
+	return 0;
+}
+
+/*
+ * Convert a timeout value in microseconds (us) into the timer count
+ * thresholds needed, based on the command stream timestamp frequency.
+ * Setting a value of 0 disables the watchdog.
+ */
+int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *dev_priv = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	int i;
+	u32 threshold[OTHER_CLASS];
+
+	if (!intel_engine_supports_watchdog(dev_priv->engine[VCS]))
+		return -ENODEV;
+
+	memset(threshold, 0, sizeof(threshold));
+
+	/* shortcut to disable in all engines */
+	if (args->size == 0)
+		goto set_watchdog;
+
+	if (args->size < sizeof(threshold))
+		return -EFAULT;
+
+	if (copy_from_user(threshold,
+			   u64_to_user_ptr(args->value),
+			   sizeof(threshold))) {
+		return -EFAULT;
+	}
+
+	/* not supported in blitter engine */
+	if (threshold[COPY_ENGINE_CLASS] != 0)
+		return -EINVAL;
+
+	for (i = RENDER_CLASS; i < OTHER_CLASS; i++) {
+		threshold[i] = watchdog_to_clock_counts(dev_priv, threshold[i]);
+
+		if (threshold[i] == -EINVAL)
+			return -EINVAL;
+	}
+
+set_watchdog:
+	for_each_engine(engine, dev_priv, id) {
+		struct intel_context *ce = to_intel_context(ctx, engine);
+
+		ce->watchdog_threshold = threshold[engine->class];
+	}
+
+	return 0;
+}
+
 static int ctx_setparam(struct i915_gem_context *ctx,
 			struct drm_i915_gem_context_param *args)
 {
@@ -1640,6 +1723,10 @@ static int ctx_setparam(struct i915_gem_context *ctx,
 		ret = set_engines(ctx, args);
 		break;
 
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_set_watchdog(ctx, args);
+		break;
+
 	case I915_CONTEXT_PARAM_BAN_PERIOD:
 	default:
 		ret = -EINVAL;
@@ -1843,6 +1930,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 		args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
 		break;
 
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_get_watchdog(ctx, args);
+		break;
+
 	case I915_CONTEXT_PARAM_SSEU:
 		ret = get_sseu(ctx, args);
 		break;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 3f2c89740b0e..7dabdb3e0fad 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1492,6 +1492,7 @@ struct drm_i915_gem_context_param {
  * See struct i915_context_param_engines.
  */
 #define I915_CONTEXT_PARAM_ENGINES	0x9
+#define I915_CONTEXT_PARAM_WATCHDOG	0x10
 
 	__u64 value;
 };
-- 
2.17.1


* [PATCH v4 5/5] drm/i915: Watchdog timeout: Include threshold value in error state
  2019-02-21  2:58 [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
                   ` (3 preceding siblings ...)
  2019-02-21  2:58 ` [PATCH v4 4/5] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Carlos Santa
@ 2019-02-21  2:58 ` Carlos Santa
  2019-02-21  2:58 ` drm/i915: Replace global_seqno with a hangcheck heartbeat seqno Carlos Santa
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Carlos Santa @ 2019-02-21  2:58 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

Save the watchdog threshold (in us) as part of the engine state.

v2: Only do it for gen8+ (and prevent a missing-case warn).
v3: use ctx->__engine.
v4: Rebase.
v5: Rebase.

Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 12 ++++++++----
 drivers/gpu/drm/i915/i915_gpu_error.h |  1 +
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 8792ad12373d..a2dddaaeb215 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -460,10 +460,12 @@ static void error_print_context(struct drm_i915_error_state_buf *m,
 				const char *header,
 				const struct drm_i915_error_context *ctx)
 {
-	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, ban score %d%s guilty %d active %d\n",
+	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, ban score %d%s guilty %d active %d, watchdog %dus\n",
 		   header, ctx->comm, ctx->pid, ctx->handle, ctx->hw_id,
 		   ctx->sched_attr.priority, ctx->ban_score, bannable(ctx),
-		   ctx->guilty, ctx->active);
+		   ctx->guilty, ctx->active,
+		   INTEL_GEN(m->i915) >= 8 ?
+			watchdog_to_us(m->i915, ctx->watchdog_threshold) : 0);
 }
 
 static void error_print_engine(struct drm_i915_error_state_buf *m,
@@ -1348,7 +1350,8 @@ static void error_record_engine_execlists(struct intel_engine_cs *engine,
 }
 
 static void record_context(struct drm_i915_error_context *e,
-			   struct i915_gem_context *ctx)
+			   struct i915_gem_context *ctx,
+			   u32 engine_id)
 {
 	if (ctx->pid) {
 		struct task_struct *task;
@@ -1369,6 +1372,7 @@ static void record_context(struct drm_i915_error_context *e,
 	e->bannable = i915_gem_context_is_bannable(ctx);
 	e->guilty = atomic_read(&ctx->guilty_count);
 	e->active = atomic_read(&ctx->active_count);
+	e->watchdog_threshold = ctx->__engine[engine_id].watchdog_threshold;
 }
 
 static void request_record_user_bo(struct i915_request *request,
@@ -1452,7 +1456,7 @@ static void gem_record_rings(struct i915_gpu_state *error)
 
 			ee->vm = ctx->ppgtt ? &ctx->ppgtt->vm : &ggtt->vm;
 
-			record_context(&ee->context, ctx);
+			record_context(&ee->context, ctx, engine->id);
 
 			/* We need to copy these to an anonymous buffer
 			 * as the simplest method to avoid being overwritten
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index bd1821c73ecd..454707848248 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -122,6 +122,7 @@ struct i915_gpu_state {
 			int ban_score;
 			int active;
 			int guilty;
+			int watchdog_threshold;
 			bool bannable;
 			struct i915_sched_attr sched_attr;
 		} context;
-- 
2.17.1


* drm/i915: Replace global_seqno with a hangcheck heartbeat seqno
  2019-02-21  2:58 [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
                   ` (4 preceding siblings ...)
  2019-02-21  2:58 ` [PATCH v4 5/5] drm/i915: Watchdog timeout: Include threshold value in error state Carlos Santa
@ 2019-02-21  2:58 ` Carlos Santa
  2019-02-21  3:24 ` ✗ Fi.CI.BAT: failure for drm/i915: Replace global_seqno with a hangcheck heartbeat seqno (rev3) Patchwork
  2019-03-11 11:54 ` [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Chris Wilson
  7 siblings, 0 replies; 23+ messages in thread
From: Carlos Santa @ 2019-02-21  2:58 UTC (permalink / raw)
  To: intel-gfx

From: Chris Wilson <chris@chris-wilson.co.uk>

To determine whether an engine has 'stuck', we simply check whether or
not it is still on the same seqno for several seconds. To keep this simple
mechanism intact over the loss of a global seqno, we can simply add a
new global heartbeat seqno instead. As we cannot know the sequence in
which requests will then be completed, we use a primitive random number
generator instead (with a cycle long enough to not matter over an
interval of a few thousand requests between hangcheck samples).

The alternative to using a dedicated seqno on every request is to issue
a heartbeat request and query its progress through the system. Sadly
this requires us to reduce struct_mutex so that we can issue requests
without requiring that bkl.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_debugfs.c     |  7 ++++---
 drivers/gpu/drm/i915/intel_engine_cs.c  |  5 +++--
 drivers/gpu/drm/i915/intel_hangcheck.c  |  6 +++---
 drivers/gpu/drm/i915/intel_lrc.c        | 15 +++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.c | 20 ++++++++++++++++++--
 drivers/gpu/drm/i915/intel_ringbuffer.h | 19 ++++++++++++++++++-
 6 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index ea15e6336515..a6cbd7dfa64c 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1298,7 +1298,7 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 	with_intel_runtime_pm(dev_priv, wakeref) {
 		for_each_engine(engine, dev_priv, id) {
 			acthd[id] = intel_engine_get_active_head(engine);
-			seqno[id] = intel_engine_get_seqno(engine);
+			seqno[id] = intel_engine_get_hangcheck_seqno(engine);
 		}
 
 		intel_engine_get_instdone(dev_priv->engine[RCS], &instdone);
@@ -1318,8 +1318,9 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 	for_each_engine(engine, dev_priv, id) {
 		seq_printf(m, "%s:\n", engine->name);
 		seq_printf(m, "\tseqno = %x [current %x, last %x], %dms ago\n",
-			   engine->hangcheck.seqno, seqno[id],
-			   intel_engine_last_submit(engine),
+			   engine->hangcheck.last_seqno,
+			   seqno[id],
+			   engine->hangcheck.next_seqno,
 			   jiffies_to_msecs(jiffies -
 					    engine->hangcheck.action_timestamp));
 
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 4b4004d56e53..fe29ec0c008b 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -1498,10 +1498,11 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 	if (i915_terminally_wedged(&engine->i915->gpu_error))
 		drm_printf(m, "*** WEDGED ***\n");
 
-	drm_printf(m, "\tcurrent seqno %x, last %x, hangcheck %x [%d ms]\n",
+	drm_printf(m, "\tcurrent seqno %x, last %x, hangcheck %x/%x [%d ms]\n",
 		   intel_engine_get_seqno(engine),
 		   intel_engine_last_submit(engine),
-		   engine->hangcheck.seqno,
+		   engine->hangcheck.last_seqno,
+		   engine->hangcheck.next_seqno,
 		   jiffies_to_msecs(jiffies - engine->hangcheck.action_timestamp));
 	drm_printf(m, "\tReset count: %d (global %d)\n",
 		   i915_reset_engine_count(error, engine),
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index a219c796e56d..e04b2560369e 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -133,21 +133,21 @@ static void hangcheck_load_sample(struct intel_engine_cs *engine,
 				  struct hangcheck *hc)
 {
 	hc->acthd = intel_engine_get_active_head(engine);
-	hc->seqno = intel_engine_get_seqno(engine);
+	hc->seqno = intel_engine_get_hangcheck_seqno(engine);
 }
 
 static void hangcheck_store_sample(struct intel_engine_cs *engine,
 				   const struct hangcheck *hc)
 {
 	engine->hangcheck.acthd = hc->acthd;
-	engine->hangcheck.seqno = hc->seqno;
+	engine->hangcheck.last_seqno = hc->seqno;
 }
 
 static enum intel_engine_hangcheck_action
 hangcheck_get_action(struct intel_engine_cs *engine,
 		     const struct hangcheck *hc)
 {
-	if (engine->hangcheck.seqno != hc->seqno)
+	if (engine->hangcheck.last_seqno != hc->seqno)
 		return ENGINE_ACTIVE_SEQNO;
 
 	if (intel_engine_is_idle(engine))
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 4fcee493dddb..3c8b11f6e830 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -180,6 +180,12 @@ static inline u32 intel_hws_seqno_address(struct intel_engine_cs *engine)
 		I915_GEM_HWS_INDEX_ADDR);
 }
 
+static inline u32 intel_hws_hangcheck_address(struct intel_engine_cs *engine)
+{
+	return (i915_ggtt_offset(engine->status_page.vma) +
+		I915_GEM_HWS_HANGCHECK_ADDR);
+}
+
 static inline struct i915_priolist *to_priolist(struct rb_node *rb)
 {
 	return rb_entry(rb, struct i915_priolist, node);
@@ -2209,6 +2215,10 @@ static u32 *gen8_emit_fini_breadcrumb(struct i915_request *request, u32 *cs)
 				  request->fence.seqno,
 				  request->timeline->hwsp_offset);
 
+	cs = gen8_emit_ggtt_write(cs,
+				  intel_engine_next_hangcheck_seqno(request->engine),
+				  intel_hws_hangcheck_address(request->engine));
+
 	cs = gen8_emit_ggtt_write(cs,
 				  request->global_seqno,
 				  intel_hws_seqno_address(request->engine));
@@ -2233,6 +2243,11 @@ static u32 *gen8_emit_fini_breadcrumb_rcs(struct i915_request *request, u32 *cs)
 				      PIPE_CONTROL_FLUSH_ENABLE |
 				      PIPE_CONTROL_CS_STALL);
 
+	cs = gen8_emit_ggtt_write_rcs(cs,
+				      intel_engine_next_hangcheck_seqno(request->engine),
+				      intel_hws_hangcheck_address(request->engine),
+				      PIPE_CONTROL_CS_STALL);
+
 	cs = gen8_emit_ggtt_write_rcs(cs,
 				      request->global_seqno,
 				      intel_hws_seqno_address(request->engine),
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index b889b27f8aeb..a1c85a338d50 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -460,12 +460,15 @@ static u32 *gen6_xcs_emit_breadcrumb(struct i915_request *rq, u32 *cs)
 	*cs++ = I915_GEM_HWS_SEQNO_ADDR | MI_FLUSH_DW_USE_GTT;
 	*cs++ = rq->fence.seqno;
 
+	*cs++ = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
+	*cs++ = I915_GEM_HWS_HANGCHECK_ADDR | MI_FLUSH_DW_USE_GTT;
+	*cs++ = intel_engine_next_hangcheck_seqno(rq->engine);
+
 	*cs++ = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
 	*cs++ = I915_GEM_HWS_INDEX_ADDR | MI_FLUSH_DW_USE_GTT;
 	*cs++ = rq->global_seqno;
 
 	*cs++ = MI_USER_INTERRUPT;
-	*cs++ = MI_NOOP;
 
 	rq->tail = intel_ring_offset(rq, cs);
 	assert_ring_tail_valid(rq->ring, rq->tail);
@@ -485,6 +488,10 @@ static u32 *gen7_xcs_emit_breadcrumb(struct i915_request *rq, u32 *cs)
 	*cs++ = I915_GEM_HWS_SEQNO_ADDR | MI_FLUSH_DW_USE_GTT;
 	*cs++ = rq->fence.seqno;
 
+	*cs++ = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
+	*cs++ = I915_GEM_HWS_HANGCHECK_ADDR | MI_FLUSH_DW_USE_GTT;
+	*cs++ = intel_engine_next_hangcheck_seqno(rq->engine);
+
 	*cs++ = MI_FLUSH_DW | MI_FLUSH_DW_OP_STOREDW | MI_FLUSH_DW_STORE_INDEX;
 	*cs++ = I915_GEM_HWS_INDEX_ADDR | MI_FLUSH_DW_USE_GTT;
 	*cs++ = rq->global_seqno;
@@ -500,6 +507,7 @@ static u32 *gen7_xcs_emit_breadcrumb(struct i915_request *rq, u32 *cs)
 	*cs++ = 0;
 
 	*cs++ = MI_USER_INTERRUPT;
+	*cs++ = MI_NOOP;
 
 	rq->tail = intel_ring_offset(rq, cs);
 	assert_ring_tail_valid(rq->ring, rq->tail);
@@ -943,11 +951,16 @@ static u32 *i9xx_emit_breadcrumb(struct i915_request *rq, u32 *cs)
 	*cs++ = I915_GEM_HWS_SEQNO_ADDR;
 	*cs++ = rq->fence.seqno;
 
+	*cs++ = MI_STORE_DWORD_INDEX;
+	*cs++ = I915_GEM_HWS_HANGCHECK_ADDR;
+	*cs++ = intel_engine_next_hangcheck_seqno(rq->engine);
+
 	*cs++ = MI_STORE_DWORD_INDEX;
 	*cs++ = I915_GEM_HWS_INDEX_ADDR;
 	*cs++ = rq->global_seqno;
 
 	*cs++ = MI_USER_INTERRUPT;
+	*cs++ = MI_NOOP;
 
 	rq->tail = intel_ring_offset(rq, cs);
 	assert_ring_tail_valid(rq->ring, rq->tail);
@@ -969,6 +982,10 @@ static u32 *gen5_emit_breadcrumb(struct i915_request *rq, u32 *cs)
 	*cs++ = I915_GEM_HWS_SEQNO_ADDR;
 	*cs++ = rq->fence.seqno;
 
+	*cs++ = MI_STORE_DWORD_INDEX;
+	*cs++ = I915_GEM_HWS_HANGCHECK_ADDR;
+	*cs++ = intel_engine_next_hangcheck_seqno(rq->engine);
+
 	BUILD_BUG_ON(GEN5_WA_STORES < 1);
 	for (i = 0; i < GEN5_WA_STORES; i++) {
 		*cs++ = MI_STORE_DWORD_INDEX;
@@ -977,7 +994,6 @@ static u32 *gen5_emit_breadcrumb(struct i915_request *rq, u32 *cs)
 	}
 
 	*cs++ = MI_USER_INTERRUPT;
-	*cs++ = MI_NOOP;
 
 	rq->tail = intel_ring_offset(rq, cs);
 	assert_ring_tail_valid(rq->ring, rq->tail);
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 8bbdf9fba196..752794cd0fb5 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -6,6 +6,7 @@
 
 #include <linux/hashtable.h>
 #include <linux/irq_work.h>
+#include <linux/random.h>
 #include <linux/seqlock.h>
 
 #include "i915_gem_batch_pool.h"
@@ -119,7 +120,8 @@ struct intel_instdone {
 
 struct intel_engine_hangcheck {
 	u64 acthd;
-	u32 seqno;
+	u32 last_seqno;
+	u32 next_seqno;
 	unsigned long action_timestamp;
 	struct intel_instdone instdone;
 };
@@ -712,6 +714,8 @@ intel_write_status_page(struct intel_engine_cs *engine, int reg, u32 value)
 #define I915_GEM_HWS_INDEX_ADDR		(I915_GEM_HWS_INDEX * sizeof(u32))
 #define I915_GEM_HWS_PREEMPT		0x32
 #define I915_GEM_HWS_PREEMPT_ADDR	(I915_GEM_HWS_PREEMPT * sizeof(u32))
+#define I915_GEM_HWS_HANGCHECK		0x34
+#define I915_GEM_HWS_HANGCHECK_ADDR	(I915_GEM_HWS_HANGCHECK * sizeof(u32))
 #define I915_GEM_HWS_SEQNO		0x40
 #define I915_GEM_HWS_SEQNO_ADDR		(I915_GEM_HWS_SEQNO * sizeof(u32))
 #define I915_GEM_HWS_SCRATCH		0x80
@@ -1060,4 +1064,17 @@ static inline bool inject_preempt_hang(struct intel_engine_execlists *execlists)
 
 #endif
 
+static inline u32 intel_engine_next_hangcheck_seqno(struct intel_engine_cs *engine)
+{
+	engine->hangcheck.next_seqno =
+		next_pseudo_random32(engine->hangcheck.next_seqno);
+
+	return engine->hangcheck.next_seqno;
+}
+
+static inline u32 intel_engine_get_hangcheck_seqno(struct intel_engine_cs *engine)
+{
+	return intel_read_status_page(engine, I915_GEM_HWS_HANGCHECK);
+}
+
 #endif /* _INTEL_RINGBUFFER_H_ */
-- 
2.17.1


* ✗ Fi.CI.BAT: failure for drm/i915: Replace global_seqno with a hangcheck heartbeat seqno (rev3)
  2019-02-21  2:58 [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
                   ` (5 preceding siblings ...)
  2019-02-21  2:58 ` drm/i915: Replace global_seqno with a hangcheck heartbeat seqno Carlos Santa
@ 2019-02-21  3:24 ` Patchwork
  2019-03-11 11:54 ` [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Chris Wilson
  7 siblings, 0 replies; 23+ messages in thread
From: Patchwork @ 2019-02-21  3:24 UTC (permalink / raw)
  To: intel-gfx

== Series Details ==

Series: drm/i915: Replace global_seqno with a hangcheck heartbeat seqno (rev3)
URL   : https://patchwork.freedesktop.org/series/56587/
State : failure

== Summary ==

Applying: drm/i915: Add engine reset count in get-reset-stats ioctl
Applying: drm/i915: Watchdog timeout: IRQ handler for gen8+
Using index info to reconstruct a base tree...
M	drivers/gpu/drm/i915/i915_drv.h
M	drivers/gpu/drm/i915/i915_gpu_error.h
M	drivers/gpu/drm/i915/i915_irq.c
M	drivers/gpu/drm/i915/i915_reg.h
M	drivers/gpu/drm/i915/intel_engine_cs.c
M	drivers/gpu/drm/i915/intel_hangcheck.c
M	drivers/gpu/drm/i915/intel_lrc.c
M	drivers/gpu/drm/i915/intel_ringbuffer.h
Falling back to patching base and 3-way merge...
Auto-merging drivers/gpu/drm/i915/intel_ringbuffer.h
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/intel_ringbuffer.h
Auto-merging drivers/gpu/drm/i915/intel_lrc.c
Auto-merging drivers/gpu/drm/i915/intel_hangcheck.c
Auto-merging drivers/gpu/drm/i915/intel_engine_cs.c
Auto-merging drivers/gpu/drm/i915/i915_reg.h
Auto-merging drivers/gpu/drm/i915/i915_irq.c
Auto-merging drivers/gpu/drm/i915/i915_gpu_error.h
Auto-merging drivers/gpu/drm/i915/i915_drv.h
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch' to see the failed patch
Patch failed at 0002 drm/i915: Watchdog timeout: IRQ handler for gen8+
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


* Re: [PATCH v4 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl
  2019-02-21  2:58 ` [PATCH v4 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
@ 2019-02-25 13:34   ` Tvrtko Ursulin
  2019-03-06 23:08     ` Carlos Santa
  0 siblings, 1 reply; 23+ messages in thread
From: Tvrtko Ursulin @ 2019-02-25 13:34 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx


On 21/02/2019 02:58, Carlos Santa wrote:
> From: Michel Thierry <michel.thierry@intel.com>
> 
> Users/tests relying on the total reset count will start seeing a smaller
> number since most of the hangs can be handled by engine reset.
> Note that if reset engine x, context a running on engine y will be unaware
> and unaffected.
> 
> To start the discussion, include just a total engine reset count. If it
> is deemed useful, it can be extended to report each engine separately.
> 
> Our igt's gem_reset_stats test will need changes to ignore the pad field,
> since it can now return reset_engine_count.
> 
> v2: s/engine_reset/reset_engine/, use union in uapi to not break compatibility.
> v3: Keep rejecting attempts to use pad as input (Antonio)
> v4: Rebased.
> v5: Rebased.
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gem_context.c | 12 ++++++++++--
>   include/uapi/drm/i915_drm.h             |  6 +++++-
>   2 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 459f8eae1c39..cbfe8f2eb3f2 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -1889,6 +1889,8 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
>   	struct drm_i915_private *dev_priv = to_i915(dev);
>   	struct drm_i915_reset_stats *args = data;
>   	struct i915_gem_context *ctx;
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
>   	int ret;
>   
>   	if (args->flags || args->pad)
> @@ -1907,10 +1909,16 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
>   	 * we should wrap the hangstats with a seqlock.
>   	 */
>   
> -	if (capable(CAP_SYS_ADMIN))
> +	if (capable(CAP_SYS_ADMIN)) {
>   		args->reset_count = i915_reset_count(&dev_priv->gpu_error);
> -	else
> +		for_each_engine(engine, dev_priv, id)
> +			args->reset_engine_count +=
> +				i915_reset_engine_count(&dev_priv->gpu_error,
> +							engine);

If access to the global GPU reset count is privileged, why is access to 
the global engine reset count not? It seems to be fundamentally the same 
level of data leakage.

If we wanted to provide some numbers to unprivileged users I think we 
would need to store some counters per file_priv/context and return those 
when !CAP_SYS_ADMIN.

> +	} else {
>   		args->reset_count = 0;
> +		args->reset_engine_count = 0;
> +	}
>   
>   	args->batch_active = atomic_read(&ctx->guilty_count);
>   	args->batch_pending = atomic_read(&ctx->active_count);
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index cc03ef9f885f..3f2c89740b0e 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1642,7 +1642,11 @@ struct drm_i915_reset_stats {
>   	/* Number of batches lost pending for execution, for this context */
>   	__u32 batch_pending;
>   
> -	__u32 pad;
> +	union {
> +		__u32 pad;
> +		/* Engine resets since boot/module reload, for all contexts */
> +		__u32 reset_engine_count;
> +	};

Chris pointed out in some other review that anonymous unions are not 
friendly towards C++ compilers.

Not sure what is the best option here. Renaming the field could break 
old userspace building against newer headers. Is that acceptable?

>   };
>   
>   struct drm_i915_gem_userptr {
> 

Regards,

Tvrtko

* Re: [PATCH v4 4/5] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2019-02-21  2:58 ` [PATCH v4 4/5] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Carlos Santa
@ 2019-02-28 17:22   ` Tvrtko Ursulin
  0 siblings, 0 replies; 23+ messages in thread
From: Tvrtko Ursulin @ 2019-02-28 17:22 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx


On 21/02/2019 02:58, Carlos Santa wrote:
> From: Michel Thierry <michel.thierry@intel.com>
> 
> Final enablement patch for GPU hang detection using watchdog timeout.
> Using the gem_context_setparam ioctl, users can specify the desired
> timeout value in microseconds, and the driver will do the conversion to
> 'timestamps'.
> 
> The recommended default watchdog threshold for video engines is 60000 us,
> since this has been _empirically determined_ to be a good compromise for
> low-latency requirements and low rate of false positives. The default
> register value is ~106000us and the theoretical max value (all 1s) is
> 353 seconds.
> 
> [1] http://patchwork.freedesktop.org/patch/msgid/20170329135831.30254-2-chris@chris-wilson.co.uk
> 
> v2: Fixed get api to return values in microseconds. Threshold updated to
> be per context engine. Check for u32 overflow. Capture ctx threshold
> value in error state.
> 
> v3: Add a way to get array size, short-cut to disable all thresholds,
> return EFAULT / EINVAL as needed. Move the capture of the threshold
> value in the error state into a new patch. BXT has a different
> timestamp base (because why not?).
> 
> v4: Checking if watchdog is available should be the first thing to
> do, instead of giving false hopes to abi users; remove unnecessary & in
> set_watchdog; ignore args->size in getparam.
> 
> v5: GEN9-LP platforms have a different crystal clock frequency, use the
> right timestamp base for them (magic 8-ball predicts this will change
> again later on, so future-proof it). (Daniele)
> 
> v6: Rebase, no more mutex BLK in getparam_ioctl.
> 
> v7: use to_intel_context instead of ctx->engine.
> 
> v8: Rebase, remove extra mutex from i915_gem_context_set_watchdog (Tvrtko),
> Update UAPI to use engine class while keeping thresholds per
> engine class (Michel).
> 
> v9: Rebase,
>      Remove outdated comment from the commit message (Tvrtko)
>      Use the engine->flag to verify for gpu watchdog support (Tvrtko)
>      Use the standard copy_to_user() instead (Tvrtko)
>      Use the correct type when declaring engine class iterator (Tvrtko)
>      Remove yet another unnecessary mutex_lock (Tvrtko)
> 
> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         | 50 +++++++++++++-
>   drivers/gpu/drm/i915/i915_gem_context.c | 91 +++++++++++++++++++++++++
>   include/uapi/drm/i915_drm.h             |  1 +
>   3 files changed, 141 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 0fcb2df869a2..aaa5810ba76c 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1582,6 +1582,9 @@ struct drm_i915_private {
>   	struct drm_i915_fence_reg fence_regs[I915_MAX_NUM_FENCES]; /* assume 965 */
>   	int num_fence_regs; /* 8 on pre-965, 16 otherwise */
>   
> +	/* Command stream timestamp base - helps define watchdog threshold */
> +	u32 cs_timestamp_base;
> +
>   	unsigned int fsb_freq, mem_freq, is_ddr3;
>   	unsigned int skl_preferred_vco_freq;
>   	unsigned int max_cdclk_freq;
> @@ -3120,10 +3123,55 @@ i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id)
>   	return ctx;
>   }
>   
> +/*
> + * BDW, CHV & SKL+ Timestamp timer resolution = 0.080 uSec,
> + * or 12500000 counts per second, or ~12 counts per microsecond.
> + *
> + * But BXT/GLK Timestamp timer resolution is different, 0.052 uSec,
> + * or 19200000 counts per second, or ~19 counts per microsecond.
> + *
> + * Future-proofing, some day it won't be as simple as just GEN & IS_LP.
> + */
> +#define GEN8_TIMESTAMP_CNTS_PER_USEC 12
> +#define GEN9_LP_TIMESTAMP_CNTS_PER_USEC 19
> +static inline u32 cs_timestamp_in_us(struct drm_i915_private *dev_priv)

Probably let the compiler decide on the inline.

And s/dev_priv/i915/ is preferred in the GEM areas unless there are 
pesky I915_READ/WRITE around.

> +{
> +	u32 cs_timestamp_base = dev_priv->cs_timestamp_base;
> +
> +	if (cs_timestamp_base)
> +		return cs_timestamp_base;
> +
> +	switch (INTEL_GEN(dev_priv)) {
> +	default:
> +		MISSING_CASE(INTEL_GEN(dev_priv));
> +		/* fall through */
> +	case 9:
> +		cs_timestamp_base = IS_GEN9_LP(dev_priv) ?
> +					GEN9_LP_TIMESTAMP_CNTS_PER_USEC :
> +					GEN8_TIMESTAMP_CNTS_PER_USEC;
> +		break;
> +	case 8:
> +		cs_timestamp_base = GEN8_TIMESTAMP_CNTS_PER_USEC;
> +		break;
> +	}
> +
> +	dev_priv->cs_timestamp_base = cs_timestamp_base;
> +	return cs_timestamp_base;
> +}
> +
> +static inline u32
> +watchdog_to_us(struct drm_i915_private *dev_priv, u32 value_in_clock_counts)

Drop the inline as well.

> +{
> +	return value_in_clock_counts / cs_timestamp_in_us(dev_priv);
> +}
> +
>   static inline u32
>   watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us)

Here as well.

u32 for value_in_us should be enough according to the caller.

>   {
> -	u64 threshold = 0;
> +	u64 threshold = value_in_us * cs_timestamp_in_us(dev_priv);
> +
> +	if (overflows_type(threshold, u32))
> +		return -EINVAL;

You could use a u64 local for checking the overflow. It is also a bit 
unusual to return -EINVAL in a u32.

Maybe return an int success/error, or a bool, and pass the value back 
via a pointer argument? Or return zero on overflow and have the caller 
check for it if the passed-in value was non-zero?
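A minimal standalone sketch of the pointer-argument variant suggested here (illustrative names, using the GEN8 counts-per-microsecond value from the patch):

```c
#include <stdbool.h>
#include <stdint.h>

#define GEN8_TIMESTAMP_CNTS_PER_USEC 12

/* Hypothetical shape: report success separately and return the
 * converted value via a pointer, so no error code has to be smuggled
 * through an unsigned return type. */
static bool watchdog_us_to_clock_counts(uint32_t value_in_us, uint32_t *out)
{
	/* Widen before multiplying so the overflow check is meaningful. */
	uint64_t threshold =
		(uint64_t)value_in_us * GEN8_TIMESTAMP_CNTS_PER_USEC;

	if (threshold > UINT32_MAX)
		return false;	/* would overflow the 32-bit threshold register */

	*out = (uint32_t)threshold;
	return true;
}
```

With this shape the ioctl handler can simply return -EINVAL when the helper reports failure, instead of overloading the converted value.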

>   
>   	return threshold;
>   }
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index cbfe8f2eb3f2..e1abca28140b 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -1573,6 +1573,89 @@ get_engines(struct i915_gem_context *ctx,
>   	return err;
>   }
>   
> +/* Return the timer count threshold in microseconds. */

Not really true even if it is talking about what it copies back to 
userspace.

> +int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
> +				  struct drm_i915_gem_context_param *args)
> +{
> +	struct drm_i915_private *dev_priv = ctx->i915;
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	u32 threshold_in_us[OTHER_CLASS];
> +
> +	if(!intel_engine_supports_watchdog(dev_priv->engine[VCS]))
> +		return -ENODEV;
> +
> +	for_each_engine(engine, dev_priv, id) {
> +		struct intel_context *ce = to_intel_context(ctx, engine);
> +
> +		threshold_in_us[engine->class] = watchdog_to_us(dev_priv,
> +								ce->watchdog_threshold);
> +	}
> +
> +	if (copy_to_user(u64_to_user_ptr(args->value),
> +			   &threshold_in_us,
> +			   sizeof(threshold_in_us))) {
> +		return -EFAULT;
> +	}
> +
> +	args->size = sizeof(threshold_in_us);
> +
> +	return 0;
> +}
> +
> +/*
> + * Based on time out value in microseconds (us) calculate
> + * timer count thresholds needed based on core frequency.
> + * Watchdog can be disabled by setting it to 0.
> + */
> +int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
> +				  struct drm_i915_gem_context_param *args)
> +{
> +	struct drm_i915_private *dev_priv = ctx->i915;
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	int i;
> +	u32 threshold[OTHER_CLASS];
> +
> +	if(!intel_engine_supports_watchdog(dev_priv->engine[VCS]))
> +		return -ENODEV;

You could check for each engine if the suggested uAPI was considered.

> +
> +	memset(threshold, 0, sizeof(threshold));
> +
> +	/* shortcut to disable in all engines */
> +	if (args->size == 0)
> +		goto set_watchdog;
> +
> +	if (args->size < sizeof(threshold))
> +		return -EFAULT;
> +
> +	if (copy_from_user(threshold,
> +			   u64_to_user_ptr(args->value),
> +			   sizeof(threshold))) {
> +		return -EFAULT;
> +	}
> +
> +	/* not supported in blitter engine */
> +	if (threshold[COPY_ENGINE_CLASS] != 0)
> +		return -EINVAL;

You added engine_supports_watchdog.

> +
> +	for (i = RENDER_CLASS; i < OTHER_CLASS; i++) {
> +		threshold[i] = watchdog_to_clock_counts(dev_priv, threshold[i]);
> +
> +		if (threshold[i] == -EINVAL)
> +			return -EINVAL;
> +	}
> +
> +set_watchdog:
> +	for_each_engine(engine, dev_priv, id) {
> +		struct intel_context *ce = to_intel_context(ctx, engine);
> +
> +		ce->watchdog_threshold = threshold[engine->class];
> +	}
> +
> +	return 0;
> +}
> +
>   static int ctx_setparam(struct i915_gem_context *ctx,
>   			struct drm_i915_gem_context_param *args)
>   {
> @@ -1640,6 +1723,10 @@ static int ctx_setparam(struct i915_gem_context *ctx,
>   		ret = set_engines(ctx, args);
>   		break;
>   
> +	case I915_CONTEXT_PARAM_WATCHDOG:
> +		ret = i915_gem_context_set_watchdog(ctx, args);
> +		break;
> +
>   	case I915_CONTEXT_PARAM_BAN_PERIOD:
>   	default:
>   		ret = -EINVAL;
> @@ -1843,6 +1930,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   		args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
>   		break;
>   
> +	case I915_CONTEXT_PARAM_WATCHDOG:
> +		ret = i915_gem_context_get_watchdog(ctx, args);
> +		break;
> +
>   	case I915_CONTEXT_PARAM_SSEU:
>   		ret = get_sseu(ctx, args);
>   		break;
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 3f2c89740b0e..7dabdb3e0fad 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1492,6 +1492,7 @@ struct drm_i915_gem_context_param {
>    * See struct i915_context_param_engines.
>    */
>   #define I915_CONTEXT_PARAM_ENGINES	0x9
> +#define I915_CONTEXT_PARAM_WATCHDOG	0x10
>   
>   	__u64 value;
>   };
> 

uAPI is still not documented in i915_drm.h and you have not considered 
the suggestion to use the array of structs, which would match the other 
bits of new uAPI we are adding. Like making ctx set param args->value 
point to an array of:

struct drm_i915_watchdog_timeout {
     struct {
         __u16 class;
         __u16 instance;
     };
     __u32 timeout_us;
};
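Purely for illustration, userspace usage of that suggested layout might look like this (fields flattened and renamed to `engine_class`/`engine_instance` so this stands alone as valid C; not actual i915 uAPI):

```c
#include <stdint.h>

/* Hypothetical mirror of the suggested per-engine uAPI: one entry per
 * engine the context wants a watchdog on, addressed by class:instance. */
struct drm_i915_watchdog_timeout {
	uint16_t engine_class;
	uint16_t engine_instance;
	uint32_t timeout_us;
};

/* Example payload: a 60 ms watchdog on two video decode engines only;
 * engines not listed would keep the watchdog disabled. args->value
 * would point at this array and args->size carry its byte size. */
static const struct drm_i915_watchdog_timeout example_timeouts[] = {
	{ 2 /* e.g. video decode class */, 0, 60000 },
	{ 2 /* e.g. video decode class */, 1, 60000 },
};
```

The advantage over a class-indexed array is that the kernel can reject unknown class/instance pairs explicitly, and the format extends naturally as engine topology grows.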

Regards,

Tvrtko

* Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-02-21  2:58 ` [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
@ 2019-02-28 17:38   ` Tvrtko Ursulin
  2019-03-01  1:51     ` Carlos Santa
  2019-03-01  9:36   ` Chris Wilson
  1 sibling, 1 reply; 23+ messages in thread
From: Tvrtko Ursulin @ 2019-02-28 17:38 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 21/02/2019 02:58, Carlos Santa wrote:
> From: Michel Thierry <michel.thierry@intel.com>
> 
> *** General ***
> 
> Watchdog timeout (or "media engine reset") is a feature that allows
> userland applications to enable hang detection on individual batch buffers.
> The detection mechanism itself is mostly bound to the hardware and the only
> thing that the driver needs to do to support this form of hang detection
> is to implement the interrupt handling support as well as watchdog command
> emission before and after the emitted batch buffer start instruction in the
> ring buffer.
> 
> The principle of the hang detection mechanism is as follows:
> 
> 1. Once the decision has been made to enable watchdog timeout for a
> particular batch buffer and the driver is in the process of emitting the
> batch buffer start instruction into the ring buffer it also emits a
> watchdog timer start instruction before and a watchdog timer cancellation
> instruction after the batch buffer start instruction in the ring buffer.
> 
> 2. Once the GPU execution reaches the watchdog timer start instruction
> the hardware watchdog counter is started by the hardware. The counter
> keeps counting until either reaching a previously configured threshold
> value or the timer cancellation instruction is executed.
> 
> 2a. If the counter reaches the threshold value the hardware fires a
> watchdog interrupt that is picked up by the watchdog interrupt handler.
> This means that a hang has been detected and the driver needs to deal with
> it the same way it would deal with an engine hang detected by the periodic
> hang checker. The only difference between the two is that we already blamed
> the active request (to ensure an engine reset).
> 
> 2b. If the batch buffer completes and the execution reaches the watchdog
> cancellation instruction before the watchdog counter reaches its
> threshold value the watchdog is cancelled and nothing more comes of it.
> No hang is detected.
> 
> Note about future interaction with preemption: Preemption could happen
> in a command sequence prior to watchdog counter getting disabled,
> resulting in watchdog being triggered following preemption (e.g. when
> watchdog had been enabled in the low priority batch). The driver will
> need to explicitly disable the watchdog counter as part of the
> preemption sequence.
> 
> *** This patch introduces: ***
> 
> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
> based on hardware-driven hang detection, which then integrates directly
> with the hang recovery path. This is independent of having per-engine reset
> or just full gpu reset.
> 
> 2. Watchdog specific register information.
> 
> Currently the render engine and all available media engines support
> watchdog timeout (VECS is only supported in GEN9). The specifications allude
> to the BCS engine being supported but that is currently not supported by
> this commit.
> 
> Note that the value to stop the counter is different between render and
> non-render engines in GEN8; GEN9 onwards it's the same.
> 
> v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
> against false-positives.
> 
> v3: Don't use high priority tasklet, use engine_last_submit while
> checking for false-positives. From GEN9 onwards, the stop counter bit is
> the same for all engines.
> 
> v4: Remove unnecessary brackets, use current_seqno to mark the request
> as guilty in the hangcheck/capture code.
> 
> v5: Rebased after RESET_ENGINEs flag.
> 
> v6: Don't capture error state in case of watchdog timeout. The capture
> process is time consuming and this will align to what happens when we
> use GuC to handle the watchdog timeout. (Chris)
> 
> v7: Rebase.
> 
> v8: Rebase, use HZ to reschedule.
> 
> v9: Rebase, get forcewake domains in function (no longer in execlists
> struct).
> 
> v10: Rebase.
> 
> v11: Rebase,
>       remove extra braces (Tvrtko),
>       implement watchdog_to_clock_counts helper (Tvrtko),
>       Move tasklet_kill(watchdog_tasklet) inside intel_engines (Tvrtko),
>       Use a global heartbeat seqno instead of engine seqno (Chris)
>       Make all engines checks all class based checks (Tvrtko)
> 
> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         |  8 +++
>   drivers/gpu/drm/i915/i915_gpu_error.h   |  4 ++
>   drivers/gpu/drm/i915/i915_irq.c         | 12 ++++-
>   drivers/gpu/drm/i915/i915_reg.h         |  6 +++
>   drivers/gpu/drm/i915/intel_engine_cs.c  |  1 +
>   drivers/gpu/drm/i915/intel_hangcheck.c  | 17 +++++--
>   drivers/gpu/drm/i915/intel_lrc.c        | 65 +++++++++++++++++++++++++
>   drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++
>   8 files changed, 114 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 63a008aebfcd..0fcb2df869a2 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -3120,6 +3120,14 @@ i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id)
>   	return ctx;
>   }
>   
> +static inline u32
> +watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us)
> +{
> +	u64 threshold = 0;
> +
> +	return threshold;
> +}
> +
>   int i915_perf_open_ioctl(struct drm_device *dev, void *data,
>   			 struct drm_file *file);
>   int i915_perf_add_config_ioctl(struct drm_device *dev, void *data,
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
> index f408060e0667..bd1821c73ecd 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> @@ -233,6 +233,9 @@ struct i915_gpu_error {
>   	 * i915_mutex_lock_interruptible()?). I915_RESET_BACKOFF serves a
>   	 * secondary role in preventing two concurrent global reset attempts.
>   	 *
> +	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
> +	 * I915_RESET_WATCHDOG to report the hang detection cause accurately.
> +	 *
>   	 * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't need to
>   	 * acquire the struct_mutex to reset an engine, we need an explicit
>   	 * flag to prevent two concurrent reset attempts in the same engine.
> @@ -248,6 +251,7 @@ struct i915_gpu_error {
>   #define I915_RESET_BACKOFF	0
>   #define I915_RESET_MODESET	1
>   #define I915_RESET_ENGINE	2
> +#define I915_RESET_WATCHDOG	3
>   #define I915_WEDGED		(BITS_PER_LONG - 1)
>   
>   	/** Number of times an engine has been reset */
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 4b23b2fd1fad..e2a1a07b0f2c 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1456,6 +1456,9 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
>   
>   	if (tasklet)
>   		tasklet_hi_schedule(&engine->execlists.tasklet);
> +
> +	if (iir & GT_GEN8_WATCHDOG_INTERRUPT)
> +		tasklet_schedule(&engine->execlists.watchdog_tasklet);
>   }
>   
>   static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> @@ -3883,17 +3886,24 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
>   	u32 gt_interrupts[] = {
>   		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
>   			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> +			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
>   			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
>   			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT,
>   		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
>   			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> +			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
>   			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
> -			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
> +			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
> +			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
>   		0,
>   		GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
>   			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT
>   		};
>   
> +	/* VECS watchdog is only available in skl+ */
> +	if (INTEL_GEN(dev_priv) >= 9)
> +		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
> +
>   	dev_priv->pm_ier = 0x0;
>   	dev_priv->pm_imr = ~dev_priv->pm_ier;
>   	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 1eca166d95bb..a0e101bbcbce 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -2335,6 +2335,11 @@ enum i915_power_well_id {
>   #define RING_START(base)	_MMIO((base) + 0x38)
>   #define RING_CTL(base)		_MMIO((base) + 0x3c)
>   #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
> +#define RING_CNTR(base)		_MMIO((base) + 0x178)
> +#define   GEN8_WATCHDOG_ENABLE		0
> +#define   GEN8_WATCHDOG_DISABLE		1
> +#define   GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 & non-render only */
> +#define RING_THRESH(base)	_MMIO((base) + 0x17C)
>   #define RING_SYNC_0(base)	_MMIO((base) + 0x40)
>   #define RING_SYNC_1(base)	_MMIO((base) + 0x44)
>   #define RING_SYNC_2(base)	_MMIO((base) + 0x48)
> @@ -2894,6 +2899,7 @@ enum i915_power_well_id {
>   #define GT_BSD_USER_INTERRUPT			(1 << 12)
>   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
>   #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
> +#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
>   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
>   #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
>   #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index 7ae753358a6d..74f563d23cc8 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -1106,6 +1106,7 @@ void intel_engines_park(struct drm_i915_private *i915)
>   		/* Flush the residual irq tasklets first. */
>   		intel_engine_disarm_breadcrumbs(engine);
>   		tasklet_kill(&engine->execlists.tasklet);
> +		tasklet_kill(&engine->execlists.watchdog_tasklet);
>   
>   		/*
>   		 * We are committed now to parking the engines, make sure there
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index 58b6ff8453dc..bc10acb24d9a 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -218,7 +218,8 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
>   
>   static void hangcheck_declare_hang(struct drm_i915_private *i915,
>   				   unsigned int hung,
> -				   unsigned int stuck)
> +				   unsigned int stuck,
> +				   unsigned int watchdog)
>   {
>   	struct intel_engine_cs *engine;
>   	char msg[80];
> @@ -231,13 +232,16 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
>   	if (stuck != hung)
>   		hung &= ~stuck;
>   	len = scnprintf(msg, sizeof(msg),
> -			"%s on ", stuck == hung ? "no progress" : "hang");
> +			"%s on ", watchdog ? "watchdog timeout" :
> +				  stuck == hung ? "no progress" : "hang");
>   	for_each_engine_masked(engine, i915, hung, tmp)
>   		len += scnprintf(msg + len, sizeof(msg) - len,
>   				 "%s, ", engine->name);
>   	msg[len-2] = '\0';
>   
> -	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
> +	return i915_handle_error(i915, hung,
> +				 watchdog ? 0 : I915_ERROR_CAPTURE,
> +				 "%s", msg);
>   }
>   
>   /*
> @@ -255,7 +259,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>   			     gpu_error.hangcheck_work.work);
>   	struct intel_engine_cs *engine;
>   	enum intel_engine_id id;
> -	unsigned int hung = 0, stuck = 0, wedged = 0;
> +	unsigned int hung = 0, stuck = 0, wedged = 0, watchdog = 0;
>   
>   	if (!i915_modparams.enable_hangcheck)
>   		return;
> @@ -266,6 +270,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>   	if (i915_terminally_wedged(&dev_priv->gpu_error))
>   		return;
>   
> +	if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags))
> +		watchdog = 1;
> +
>   	/* As enabling the GPU requires fairly extensive mmio access,
>   	 * periodically arm the mmio checker to see if we are triggering
>   	 * any invalid access.
> @@ -311,7 +318,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>   	}
>   
>   	if (hung)
> -		hangcheck_declare_hang(dev_priv, hung, stuck);
> +		hangcheck_declare_hang(dev_priv, hung, stuck, watchdog);
>   
>   	/* Reset timer in case GPU hangs without another request being added */
>   	i915_queue_hangcheck(dev_priv);
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 9ca7dc7a6fa5..c38b239ab39e 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -2352,6 +2352,53 @@ static int gen8_emit_flush_render(struct i915_request *request,
>   	return 0;
>   }
>   
> +/* From GEN9 onwards, all engines use the same RING_CNTR format */
> +static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)

I'd let the compiler decide on the inline or not.

> +{
> +	if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9)
> +		return GEN8_WATCHDOG_DISABLE;
> +	else
> +		return GEN8_XCS_WATCHDOG_DISABLE;
> +}
> +
> +#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)

Not sure the macro is useful.

> +static void gen8_watchdog_irq_handler(unsigned long data)

gen8_watchdog_tasklet I guess.

> +{
> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> +	struct drm_i915_private *dev_priv = engine->i915;
> +	unsigned int hung = 0;
> +	u32 current_seqno=0;

Coding style.

> +	char msg[80];
> +	unsigned int tmp;
> +	int len;
> +
> +	/* Stop the counter to prevent further timeout interrupts */
> +	I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));

These registers do not need forcewake?

> +
> +	/* Read the heartbeat seqno once again to check if we are stuck? */
> +	current_seqno = intel_engine_get_hangcheck_seqno(engine);
> +
> +    if (current_seqno == engine->current_seqno) {
> +		hung |= engine->mask;
> +
> +		len = scnprintf(msg, sizeof(msg), "%s on ", "watchdog timeout");
> +		for_each_engine_masked(engine, dev_priv, hung, tmp)
> +			len += scnprintf(msg + len, sizeof(msg) - len,
> +					 "%s, ", engine->name);
> +		msg[len-2] = '\0';

Copy/paste from intel_hangcheck.c? Moving it to a common helper would be good.
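As a rough illustration of what such a shared helper might look like (the function name and the plain engine-name array standing in for engine iteration are hypothetical, not from the patch):

```c
#include <stdio.h>

/* Hypothetical shared helper: format "<reason> on <engine>, <engine>"
 * into a caller-provided buffer, as both the periodic hangcheck and
 * the watchdog tasklet need. Real code would iterate the engine mask
 * with for_each_engine_masked(); a string array stands in here. */
static int format_hang_msg(char *msg, unsigned int size, const char *reason,
			   const char *const *engines, unsigned int count)
{
	int len = snprintf(msg, size, "%s on ", reason);
	unsigned int i;

	for (i = 0; i < count; i++)
		len += snprintf(msg + len, size - len, "%s, ", engines[i]);

	msg[len - 2] = '\0';	/* drop the trailing ", " */
	return len - 2;
}
```

Both call sites could then pass their own reason string ("watchdog timeout", "no progress", "hang") and the hung-engine set, removing the duplicated scnprintf loop.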

> +
> +		i915_handle_error(dev_priv, hung, 0, "%s", msg);
> +
> +		/* Reset timer in case GPU hangs without another request being added */
> +		i915_queue_hangcheck(dev_priv);

Mis-indented block.

> +    }else{

Coding style.

> +		/* Re-start the counter, if really hung, it will expire again */
> +		I915_WRITE_FW(RING_THRESH(engine->mmio_base),
> +			      GEN8_WATCHDOG_1000US(dev_priv));
> +		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
> +    }
> +}
> +
>   /*
>    * Reserve space for 2 NOOPs at the end of each request to be
>    * used as a workaround for not being allowed to do lite
> @@ -2539,6 +2586,21 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
>   
>   	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
>   	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
> +
> +	switch (engine->class) {
> +	default:
> +		/* BCS engine does not support hw watchdog */
> +		break;
> +	case RENDER_CLASS:
> +	case VIDEO_DECODE_CLASS:
> +		engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT << shift;
> +		break;
> +	case VIDEO_ENHANCEMENT_CLASS:
> +		if (INTEL_GEN(engine->i915) >= 9)
> +			engine->irq_keep_mask |=
> +				GT_GEN8_WATCHDOG_INTERRUPT << shift;
> +		break;
> +	}
>   }
>   
>   static int
> @@ -2556,6 +2618,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
>   	tasklet_init(&engine->execlists.tasklet,
>   		     execlists_submission_tasklet, (unsigned long)engine);
>   
> +	tasklet_init(&engine->execlists.watchdog_tasklet,
> +		     gen8_watchdog_irq_handler, (unsigned long)engine);
> +
>   	logical_ring_default_vfuncs(engine);
>   	logical_ring_default_irqs(engine);
>   
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 465094e38d32..17250ba0246f 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -122,6 +122,7 @@ struct intel_engine_hangcheck {
>   	u64 acthd;
>   	u32 last_seqno;
>   	u32 next_seqno;
> +	u32 watchdog;

Looks unused.

>   	unsigned long action_timestamp;
>   	struct intel_instdone instdone;
>   };
> @@ -222,6 +223,11 @@ struct intel_engine_execlists {
>   	 */
>   	struct tasklet_struct tasklet;
>   
> +	/**
> +	 * @watchdog_tasklet: stop counter and re-schedule hangcheck_work asap
> +	 */
> +	struct tasklet_struct watchdog_tasklet;
> +
>   	/**
>   	 * @default_priolist: priority list for I915_PRIORITY_NORMAL
>   	 */
> @@ -353,6 +359,7 @@ struct intel_engine_cs {
>   	unsigned int hw_id;
>   	unsigned int guc_id;
>   	unsigned long mask;
> +	u32 current_seqno;

I don't see where this is set in this patch?

And I'd recommend calling it watchdog_last_seqno or something along 
those lines so it is obvious it is not a fundamental part of the engine.

>   
>   	u8 uabi_class;
>   
> 

Regards,

Tvrtko

* Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-02-28 17:38   ` Tvrtko Ursulin
@ 2019-03-01  1:51     ` Carlos Santa
  0 siblings, 0 replies; 23+ messages in thread
From: Carlos Santa @ 2019-03-01  1:51 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: Michel Thierry

On Thu, 2019-02-28 at 17:38 +0000, Tvrtko Ursulin wrote:
> On 21/02/2019 02:58, Carlos Santa wrote:
> > From: Michel Thierry <michel.thierry@intel.com>
> > 
> > *** General ***
> > 
> > Watchdog timeout (or "media engine reset") is a feature that allows
> > userland applications to enable hang detection on individual batch
> > buffers.
> > The detection mechanism itself is mostly bound to the hardware and
> > the only
> > thing that the driver needs to do to support this form of hang
> > detection
> > is to implement the interrupt handling support as well as watchdog
> > command
> > emission before and after the emitted batch buffer start
> > instruction in the
> > ring buffer.
> > 
> > The principle of the hang detection mechanism is as follows:
> > 
> > 1. Once the decision has been made to enable watchdog timeout for a
> > particular batch buffer and the driver is in the process of
> > emitting the
> > batch buffer start instruction into the ring buffer it also emits a
> > watchdog timer start instruction before and a watchdog timer
> > cancellation
> > instruction after the batch buffer start instruction in the ring
> > buffer.
> > 
> > 2. Once the GPU execution reaches the watchdog timer start
> > instruction
> > the hardware watchdog counter is started by the hardware. The
> > counter
> > keeps counting until either reaching a previously configured
> > threshold
> > value or the timer cancellation instruction is executed.
> > 
> > 2a. If the counter reaches the threshold value the hardware fires a
> > watchdog interrupt that is picked up by the watchdog interrupt
> > handler.
> > This means that a hang has been detected and the driver needs to
> > deal with
> > it the same way it would deal with an engine hang detected by the
> > periodic
> > hang checker. The only difference between the two is that we
> > already blamed
> > the active request (to ensure an engine reset).
> > 
> > 2b. If the batch buffer completes and the execution reaches the
> > watchdog
> > cancellation instruction before the watchdog counter reaches its
> > threshold value the watchdog is cancelled and nothing more comes of
> > it.
> > No hang is detected.
> > 
> > Note about future interaction with preemption: Preemption could
> > happen
> > in a command sequence prior to watchdog counter getting disabled,
> > resulting in watchdog being triggered following preemption (e.g.
> > when
> > watchdog had been enabled in the low priority batch). The driver
> > will
> > need to explicitly disable the watchdog counter as part of the
> > preemption sequence.
> > 
> > *** This patch introduces: ***
> > 
> > 1. IRQ handler code for watchdog timeout allowing direct hang
> > recovery
> > based on hardware-driven hang detection, which then integrates
> > directly
> > with the hang recovery path. This is independent of having per-
> > engine reset
> > or just full gpu reset.
> > 
> > 2. Watchdog specific register information.
> > 
> > Currently the render engine and all available media engines support
> > watchdog timeout (VECS is only supported in GEN9). The
> > specifications allude
> > to the BCS engine being supported but that is currently not
> > supported by
> > this commit.
> > 
> > Note that the value to stop the counter is different between render
> > and
> > non-render engines in GEN8; GEN9 onwards it's the same.
> > 
> > v2: Move irq handler to tasklet, arm watchdog for a 2nd time to
> > check
> > against false-positives.
> > 
> > v3: Don't use high priority tasklet, use engine_last_submit while
> > checking for false-positives. From GEN9 onwards, the stop counter
> > bit is
> > the same for all engines.
> > 
> > v4: Remove unnecessary brackets, use current_seqno to mark the
> > request
> > as guilty in the hangcheck/capture code.
> > 
> > v5: Rebased after RESET_ENGINEs flag.
> > 
> > v6: Don't capture error state in case of watchdog timeout. The
> > capture
> > process is time consuming and this will align to what happens when
> > we
> > use GuC to handle the watchdog timeout. (Chris)
> > 
> > v7: Rebase.
> > 
> > v8: Rebase, use HZ to reschedule.
> > 
> > v9: Rebase, get forcewake domains in function (no longer in
> > execlists
> > struct).
> > 
> > v10: Rebase.
> > 
> > v11: Rebase,
> >       remove extra braces (Tvrtko),
> >       implement watchdog_to_clock_counts helper (Tvrtko),
> >       Move tasklet_kill(watchdog_tasklet) inside intel_engines
> > (Tvrtko),
> >       Use a global heartbeat seqno instead of engine seqno (Chris)
> >       Make all engine checks class-based checks (Tvrtko)
> > 
> > Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> > Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_drv.h         |  8 +++
> >   drivers/gpu/drm/i915/i915_gpu_error.h   |  4 ++
> >   drivers/gpu/drm/i915/i915_irq.c         | 12 ++++-
> >   drivers/gpu/drm/i915/i915_reg.h         |  6 +++
> >   drivers/gpu/drm/i915/intel_engine_cs.c  |  1 +
> >   drivers/gpu/drm/i915/intel_hangcheck.c  | 17 +++++--
> >   drivers/gpu/drm/i915/intel_lrc.c        | 65
> > +++++++++++++++++++++++++
> >   drivers/gpu/drm/i915/intel_ringbuffer.h |  7 +++
> >   8 files changed, 114 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h
> > b/drivers/gpu/drm/i915/i915_drv.h
> > index 63a008aebfcd..0fcb2df869a2 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -3120,6 +3120,14 @@ i915_gem_context_lookup(struct
> > drm_i915_file_private *file_priv, u32 id)
> >   	return ctx;
> >   }
> >   
> > +static inline u32
> > +watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64
> > value_in_us)
> > +{
> > +	u64 threshold = 0;
> > +
> > +	return threshold;
> > +}
> > +
> >   int i915_perf_open_ioctl(struct drm_device *dev, void *data,
> >   			 struct drm_file *file);
> >   int i915_perf_add_config_ioctl(struct drm_device *dev, void
> > *data,
> > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h
> > b/drivers/gpu/drm/i915/i915_gpu_error.h
> > index f408060e0667..bd1821c73ecd 100644
> > --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> > +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> > @@ -233,6 +233,9 @@ struct i915_gpu_error {
> >   	 * i915_mutex_lock_interruptible()?). I915_RESET_BACKOFF serves
> > a
> >   	 * secondary role in preventing two concurrent global reset
> > attempts.
> >   	 *
> > +	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we
> > can use
> > +	 * I915_RESET_WATCHDOG to report the hang detection cause
> > accurately.
> > +	 *
> >   	 * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't
> > need to
> >   	 * acquire the struct_mutex to reset an engine, we need an
> > explicit
> >   	 * flag to prevent two concurrent reset attempts in the same
> > engine.
> > @@ -248,6 +251,7 @@ struct i915_gpu_error {
> >   #define I915_RESET_BACKOFF	0
> >   #define I915_RESET_MODESET	1
> >   #define I915_RESET_ENGINE	2
> > +#define I915_RESET_WATCHDOG	3
> >   #define I915_WEDGED		(BITS_PER_LONG - 1)
> >   
> >   	/** Number of times an engine has been reset */
> > diff --git a/drivers/gpu/drm/i915/i915_irq.c
> > b/drivers/gpu/drm/i915/i915_irq.c
> > index 4b23b2fd1fad..e2a1a07b0f2c 100644
> > --- a/drivers/gpu/drm/i915/i915_irq.c
> > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > @@ -1456,6 +1456,9 @@ gen8_cs_irq_handler(struct intel_engine_cs
> > *engine, u32 iir)
> >   
> >   	if (tasklet)
> >   		tasklet_hi_schedule(&engine->execlists.tasklet);
> > +
> > +	if (iir & GT_GEN8_WATCHDOG_INTERRUPT)
> > +		tasklet_schedule(&engine->execlists.watchdog_tasklet);
> >   }
> >   
> >   static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> > @@ -3883,17 +3886,24 @@ static void gen8_gt_irq_postinstall(struct
> > drm_i915_private *dev_priv)
> >   	u32 gt_interrupts[] = {
> >   		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> >   			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_RCS_IRQ_SHIFT |
> > +			GT_GEN8_WATCHDOG_INTERRUPT <<
> > GEN8_RCS_IRQ_SHIFT |
> >   			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT
> > |
> >   			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_BCS_IRQ_SHIFT,
> >   		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> >   			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_VCS1_IRQ_SHIFT |
> > +			GT_GEN8_WATCHDOG_INTERRUPT <<
> > GEN8_VCS1_IRQ_SHIFT |
> >   			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT
> > |
> > -			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_VCS2_IRQ_SHIFT,
> > +			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_VCS2_IRQ_SHIFT |
> > +			GT_GEN8_WATCHDOG_INTERRUPT <<
> > GEN8_VCS2_IRQ_SHIFT,
> >   		0,
> >   		GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
> >   			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_VECS_IRQ_SHIFT
> >   		};
> >   
> > +	/* VECS watchdog is only available in skl+ */
> > +	if (INTEL_GEN(dev_priv) >= 9)
> > +		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
> > +
> >   	dev_priv->pm_ier = 0x0;
> >   	dev_priv->pm_imr = ~dev_priv->pm_ier;
> >   	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
> > diff --git a/drivers/gpu/drm/i915/i915_reg.h
> > b/drivers/gpu/drm/i915/i915_reg.h
> > index 1eca166d95bb..a0e101bbcbce 100644
> > --- a/drivers/gpu/drm/i915/i915_reg.h
> > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > @@ -2335,6 +2335,11 @@ enum i915_power_well_id {
> >   #define RING_START(base)	_MMIO((base) + 0x38)
> >   #define RING_CTL(base)		_MMIO((base) + 0x3c)
> >   #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in
> > bytes -> pages */
> > +#define RING_CNTR(base)		_MMIO((base) + 0x178)
> > +#define   GEN8_WATCHDOG_ENABLE		0
> > +#define   GEN8_WATCHDOG_DISABLE		1
> > +#define   GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 &
> > non-render only */
> > +#define RING_THRESH(base)	_MMIO((base) + 0x17C)
> >   #define RING_SYNC_0(base)	_MMIO((base) + 0x40)
> >   #define RING_SYNC_1(base)	_MMIO((base) + 0x44)
> >   #define RING_SYNC_2(base)	_MMIO((base) + 0x48)
> > @@ -2894,6 +2899,7 @@ enum i915_power_well_id {
> >   #define GT_BSD_USER_INTERRUPT			(1 << 12)
> >   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /*
> > hsw+; rsvd on snb, ivb, vlv */
> >   #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
> > +#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
> >   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /*
> > !snb */
> >   #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
> >   #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
> > diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c
> > b/drivers/gpu/drm/i915/intel_engine_cs.c
> > index 7ae753358a6d..74f563d23cc8 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> > @@ -1106,6 +1106,7 @@ void intel_engines_park(struct
> > drm_i915_private *i915)
> >   		/* Flush the residual irq tasklets first. */
> >   		intel_engine_disarm_breadcrumbs(engine);
> >   		tasklet_kill(&engine->execlists.tasklet);
> > +		tasklet_kill(&engine->execlists.watchdog_tasklet);
> >   
> >   		/*
> >   		 * We are committed now to parking the engines, make
> > sure there
> > diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c
> > b/drivers/gpu/drm/i915/intel_hangcheck.c
> > index 58b6ff8453dc..bc10acb24d9a 100644
> > --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> > @@ -218,7 +218,8 @@ static void hangcheck_accumulate_sample(struct
> > intel_engine_cs *engine,
> >   
> >   static void hangcheck_declare_hang(struct drm_i915_private *i915,
> >   				   unsigned int hung,
> > -				   unsigned int stuck)
> > +				   unsigned int stuck,
> > +				   unsigned int watchdog)
> >   {
> >   	struct intel_engine_cs *engine;
> >   	char msg[80];
> > @@ -231,13 +232,16 @@ static void hangcheck_declare_hang(struct
> > drm_i915_private *i915,
> >   	if (stuck != hung)
> >   		hung &= ~stuck;
> >   	len = scnprintf(msg, sizeof(msg),
> > -			"%s on ", stuck == hung ? "no progress" :
> > "hang");
> > +			"%s on ", watchdog ? "watchdog timeout" :
> > +				  stuck == hung ? "no progress" :
> > "hang");
> >   	for_each_engine_masked(engine, i915, hung, tmp)
> >   		len += scnprintf(msg + len, sizeof(msg) - len,
> >   				 "%s, ", engine->name);
> >   	msg[len-2] = '\0';
> >   
> > -	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s",
> > msg);
> > +	return i915_handle_error(i915, hung,
> > +				 watchdog ? 0 : I915_ERROR_CAPTURE,
> > +				 "%s", msg);
> >   }
> >   
> >   /*
> > @@ -255,7 +259,7 @@ static void i915_hangcheck_elapsed(struct
> > work_struct *work)
> >   			     gpu_error.hangcheck_work.work);
> >   	struct intel_engine_cs *engine;
> >   	enum intel_engine_id id;
> > -	unsigned int hung = 0, stuck = 0, wedged = 0;
> > +	unsigned int hung = 0, stuck = 0, wedged = 0, watchdog = 0;
> >   
> >   	if (!i915_modparams.enable_hangcheck)
> >   		return;
> > @@ -266,6 +270,9 @@ static void i915_hangcheck_elapsed(struct
> > work_struct *work)
> >   	if (i915_terminally_wedged(&dev_priv->gpu_error))
> >   		return;
> >   
> > +	if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv-
> > >gpu_error.flags))
> > +		watchdog = 1;
> > +
> >   	/* As enabling the GPU requires fairly extensive mmio access,
> >   	 * periodically arm the mmio checker to see if we are
> > triggering
> >   	 * any invalid access.
> > @@ -311,7 +318,7 @@ static void i915_hangcheck_elapsed(struct
> > work_struct *work)
> >   	}
> >   
> >   	if (hung)
> > -		hangcheck_declare_hang(dev_priv, hung, stuck);
> > +		hangcheck_declare_hang(dev_priv, hung, stuck,
> > watchdog);
> >   
> >   	/* Reset timer in case GPU hangs without another request being
> > added */
> >   	i915_queue_hangcheck(dev_priv);
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c
> > b/drivers/gpu/drm/i915/intel_lrc.c
> > index 9ca7dc7a6fa5..c38b239ab39e 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -2352,6 +2352,53 @@ static int gen8_emit_flush_render(struct
> > i915_request *request,
> >   	return 0;
> >   }
> >   
> > +/* From GEN9 onwards, all engines use the same RING_CNTR format */
> > +static inline u32 get_watchdog_disable(struct intel_engine_cs
> > *engine)
> 
> I'd let the compiler decide on the inline or not.
> 
> > +{
> > +	if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9)
> > +		return GEN8_WATCHDOG_DISABLE;
> > +	else
> > +		return GEN8_XCS_WATCHDOG_DISABLE;
> > +}
> > +
> > +#define GEN8_WATCHDOG_1000US(dev_priv)
> > watchdog_to_clock_counts(dev_priv, 1000)
> 
> Not sure macro is useful.
> 
> > +static void gen8_watchdog_irq_handler(unsigned long data)
> 
> gen8_watchdog_tasklet I guess.
> 
> > +{
> > +	struct intel_engine_cs *engine = (struct intel_engine_cs
> > *)data;
> > +	struct drm_i915_private *dev_priv = engine->i915;
> > +	unsigned int hung = 0;
> > +	u32 current_seqno=0;
> 
> Coding style.
> 
> > +	char msg[80];
> > +	unsigned int tmp;
> > +	int len;
> > +
> > +	/* Stop the counter to prevent further timeout interrupts */
> > +	I915_WRITE_FW(RING_CNTR(engine->mmio_base),
> > get_watchdog_disable(engine));
> 
> These registers do not need forcewake?
> 
> > +
> > +	/* Read the heartbeat seqno once again to check if we are
> > stuck? */
> > +	current_seqno = intel_engine_get_hangcheck_seqno(engine);
> > +
> > +    if (current_seqno == engine->current_seqno) {
> > +		hung |= engine->mask;
> > +
> > +		len = scnprintf(msg, sizeof(msg), "%s on ", "watchdog
> > timeout");
> > +		for_each_engine_masked(engine, dev_priv, hung, tmp)
> > +			len += scnprintf(msg + len, sizeof(msg) - len,
> > +					 "%s, ", engine->name);
> > +		msg[len-2] = '\0';
> 
> Copy/paste from intel_hangcheck.c ? Moving to common helper would be
> good.
> 
> > +
> > +		i915_handle_error(dev_priv, hung, 0, "%s", msg);
> > +
> > +		/* Reset timer in case GPU hangs without another
> > request being added */
> > +		i915_queue_hangcheck(dev_priv);
> 
> Mis-indented block.
> 
> > +    }else{
> 
> Coding style.
> 
> > +		/* Re-start the counter, if really hung, it will expire
> > again */
> > +		I915_WRITE_FW(RING_THRESH(engine->mmio_base),
> > +			      GEN8_WATCHDOG_1000US(dev_priv));
> > +		I915_WRITE_FW(RING_CNTR(engine->mmio_base),
> > GEN8_WATCHDOG_ENABLE);
> > +    }
> > +}
> > +
> >   /*
> >    * Reserve space for 2 NOOPs at the end of each request to be
> >    * used as a workaround for not being allowed to do lite
> > @@ -2539,6 +2586,21 @@ logical_ring_default_irqs(struct
> > intel_engine_cs *engine)
> >   
> >   	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
> >   	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
> > +
> > +	switch (engine->class) {
> > +	default:
> > +		/* BCS engine does not support hw watchdog */
> > +		break;
> > +	case RENDER_CLASS:
> > +	case VIDEO_DECODE_CLASS:
> > +		engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT <<
> > shift;
> > +		break;
> > +	case VIDEO_ENHANCEMENT_CLASS:
> > +		if (INTEL_GEN(engine->i915) >= 9)
> > +			engine->irq_keep_mask |=
> > +				GT_GEN8_WATCHDOG_INTERRUPT << shift;
> > +		break;
> > +	}
> >   }
> >   
> >   static int
> > @@ -2556,6 +2618,9 @@ logical_ring_setup(struct intel_engine_cs
> > *engine)
> >   	tasklet_init(&engine->execlists.tasklet,
> >   		     execlists_submission_tasklet, (unsigned
> > long)engine);
> >   
> > +	tasklet_init(&engine->execlists.watchdog_tasklet,
> > +		     gen8_watchdog_irq_handler, (unsigned long)engine);
> > +
> >   	logical_ring_default_vfuncs(engine);
> >   	logical_ring_default_irqs(engine);
> >   
> > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h
> > b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > index 465094e38d32..17250ba0246f 100644
> > --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> > @@ -122,6 +122,7 @@ struct intel_engine_hangcheck {
> >   	u64 acthd;
> >   	u32 last_seqno;
> >   	u32 next_seqno;
> > +	u32 watchdog;
> 
> Looks unused.
> 
> >   	unsigned long action_timestamp;
> >   	struct intel_instdone instdone;
> >   };
> > @@ -222,6 +223,11 @@ struct intel_engine_execlists {
> >   	 */
> >   	struct tasklet_struct tasklet;
> >   
> > +	/**
> > +	 * @watchdog_tasklet: stop counter and re-schedule
> > hangcheck_work asap
> > +	 */
> > +	struct tasklet_struct watchdog_tasklet;
> > +
> >   	/**
> >   	 * @default_priolist: priority list for I915_PRIORITY_NORMAL
> >   	 */
> > @@ -353,6 +359,7 @@ struct intel_engine_cs {
> >   	unsigned int hw_id;
> >   	unsigned int guc_id;
> >   	unsigned long mask;
> > +	u32 current_seqno;
> 
> I don't see where this is set in this patch?

It was declared here but assigned in patch #3 of the series, right
after the watchdog timer is started. The idea was to store the seqno we
are currently working on (before we hang) and then cross-check it once
again right before we reset inside the watchdog irq.

/* Read the heartbeat seqno once again to check if we are stuck? */
current_seqno = intel_engine_get_hangcheck_seqno(engine);

if (current_seqno == engine->current_seqno) {

> 
> And I'd recommend calling it watchdog_last_seqno or something along 
> those lines so it is obvious it is not a fundamental part of the
> engine.
> 
> >   
> >   	u8 uabi_class;
> >   
> > 
> 
> Regards,
> 
> Tvrtko

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-02-21  2:58 ` [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
  2019-02-28 17:38   ` Tvrtko Ursulin
@ 2019-03-01  9:36   ` Chris Wilson
  2019-03-02  2:08     ` Carlos Santa
  2019-03-08  3:16     ` Carlos Santa
  1 sibling, 2 replies; 23+ messages in thread
From: Chris Wilson @ 2019-03-01  9:36 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx; +Cc: Michel Thierry

Quoting Carlos Santa (2019-02-21 02:58:16)
> +#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)
> +static void gen8_watchdog_irq_handler(unsigned long data)
> +{
> +       struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> +       struct drm_i915_private *dev_priv = engine->i915;
> +       unsigned int hung = 0;
> +       u32 current_seqno=0;
> +       char msg[80];
> +       unsigned int tmp;
> +       int len;
> +
> +       /* Stop the counter to prevent further timeout interrupts */
> +       I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));
> +
> +       /* Read the heartbeat seqno once again to check if we are stuck? */
> +       current_seqno = intel_engine_get_hangcheck_seqno(engine);

I have said this before, but this doesn't exist either, it's just a
temporary glitch in the matrix.

> +    if (current_seqno == engine->current_seqno) {
> +               hung |= engine->mask;
> +
> +               len = scnprintf(msg, sizeof(msg), "%s on ", "watchdog timeout");
> +               for_each_engine_masked(engine, dev_priv, hung, tmp)
> +                       len += scnprintf(msg + len, sizeof(msg) - len,
> +                                        "%s, ", engine->name);
> +               msg[len-2] = '\0';
> +
> +               i915_handle_error(dev_priv, hung, 0, "%s", msg);
> +
> +               /* Reset timer in case GPU hangs without another request being added */
> +               i915_queue_hangcheck(dev_priv);

You still haven't explained why we are not just resetting the engine
immediately. Have you looked at the preempt-timeout patches that need to
do the same thing from timer-irq context?

Resending the same old stuff over and over again is just exasperating.
-Chris

* Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-03-01  9:36   ` Chris Wilson
@ 2019-03-02  2:08     ` Carlos Santa
  2019-03-08  3:16     ` Carlos Santa
  1 sibling, 0 replies; 23+ messages in thread
From: Carlos Santa @ 2019-03-02  2:08 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Michel Thierry

On Fri, 2019-03-01 at 09:36 +0000, Chris Wilson wrote:
> Quoting Carlos Santa (2019-02-21 02:58:16)
> > +#define GEN8_WATCHDOG_1000US(dev_priv)
> > watchdog_to_clock_counts(dev_priv, 1000)
> > +static void gen8_watchdog_irq_handler(unsigned long data)
> > +{
> > +       struct intel_engine_cs *engine = (struct intel_engine_cs
> > *)data;
> > +       struct drm_i915_private *dev_priv = engine->i915;
> > +       unsigned int hung = 0;
> > +       u32 current_seqno=0;
> > +       char msg[80];
> > +       unsigned int tmp;
> > +       int len;
> > +
> > +       /* Stop the counter to prevent further timeout interrupts
> > */
> > +       I915_WRITE_FW(RING_CNTR(engine->mmio_base),
> > get_watchdog_disable(engine));
> > +
> > +       /* Read the heartbeat seqno once again to check if we are
> > stuck? */
> > +       current_seqno = intel_engine_get_hangcheck_seqno(engine);
> 
> I have said this before, but this doesn't exist either, it's just a
> temporary glitch in the matrix.

That was my only way to check for the "guilty" seqno right before
resetting during smoke testing... Will reach out again before sending a
new rev to cross-check the new approach you mentioned today.

> 
> > +    if (current_seqno == engine->current_seqno) {
> > +               hung |= engine->mask;
> > +
> > +               len = scnprintf(msg, sizeof(msg), "%s on ",
> > "watchdog timeout");
> > +               for_each_engine_masked(engine, dev_priv, hung, tmp)
> > +                       len += scnprintf(msg + len, sizeof(msg) -
> > len,
> > +                                        "%s, ", engine->name);
> > +               msg[len-2] = '\0';
> > +
> > +               i915_handle_error(dev_priv, hung, 0, "%s", msg);
> > +
> > +               /* Reset timer in case GPU hangs without another
> > request being added */
> > +               i915_queue_hangcheck(dev_priv);
> 
> You still haven't explained why we are not just resetting the engine
> immediately. Have you looked at the preempt-timeout patches that need
> to
> do the same thing from timer-irq context?
> 
> Resending the same old stuff over and over again is just
> exasperating.
> -Chris

Oops, I had the wrong assumption, as I honestly thought removing the
workqueue from v3 would allow for an immediate reset. Thanks for the
feedback on the preempt-timeout series... will rework this. 

Carlos


* Re: [PATCH v4 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl
  2019-02-25 13:34   ` Tvrtko Ursulin
@ 2019-03-06 23:08     ` Carlos Santa
  2019-03-07  7:27       ` Tvrtko Ursulin
  0 siblings, 1 reply; 23+ messages in thread
From: Carlos Santa @ 2019-03-06 23:08 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

On Mon, 2019-02-25 at 13:34 +0000, Tvrtko Ursulin wrote:
> On 21/02/2019 02:58, Carlos Santa wrote:
> > From: Michel Thierry <michel.thierry@intel.com>
> > 
> > Users/tests relying on the total reset count will start seeing a
> > smaller
> > number since most of the hangs can be handled by engine reset.
> > Note that if we reset engine x, context a running on engine y will
> > be unaware
> > and unaffected.
> > 
> > To start the discussion, include just a total engine reset count.
> > If it
> > is deemed useful, it can be extended to report each engine
> > separately.
> > 
> > Our igt's gem_reset_stats test will need changes to ignore the pad
> > field,
> > since it can now return reset_engine_count.
> > 
> > v2: s/engine_reset/reset_engine/, use union in uapi to not break
> > compatibility.
> > v3: Keep rejecting attempts to use pad as input (Antonio)
> > v4: Rebased.
> > v5: Rebased.
> > 
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> > Cc: Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> > Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_gem_context.c | 12 ++++++++++--
> >   include/uapi/drm/i915_drm.h             |  6 +++++-
> >   2 files changed, 15 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
> > b/drivers/gpu/drm/i915/i915_gem_context.c
> > index 459f8eae1c39..cbfe8f2eb3f2 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_context.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> > @@ -1889,6 +1889,8 @@ int i915_gem_context_reset_stats_ioctl(struct
> > drm_device *dev,
> >   	struct drm_i915_private *dev_priv = to_i915(dev);
> >   	struct drm_i915_reset_stats *args = data;
> >   	struct i915_gem_context *ctx;
> > +	struct intel_engine_cs *engine;
> > +	enum intel_engine_id id;
> >   	int ret;
> >   
> >   	if (args->flags || args->pad)
> > @@ -1907,10 +1909,16 @@ int
> > i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
> >   	 * we should wrap the hangstats with a seqlock.
> >   	 */
> >   
> > -	if (capable(CAP_SYS_ADMIN))
> > +	if (capable(CAP_SYS_ADMIN)) {
> >   		args->reset_count = i915_reset_count(&dev_priv-
> > >gpu_error);
> > -	else
> > +		for_each_engine(engine, dev_priv, id)
> > +			args->reset_engine_count +=
> > +				i915_reset_engine_count(&dev_priv-
> > >gpu_error,
> > +							engine);
> 
> If access to global GPU reset count is privileged, why is access to 
> global engine reset count not? It seems to be fundamentally same
> level 
> of data leakage.

But access to the global engine reset count (i915_reset_engine_count) is
indeed privileged. They are both inside if (CAP_SYS_ADMIN) {...}, or
maybe I am missing something?

> 
> If we wanted to provide some numbers to unprivileged users I think
> we 
> would need to store some counters per file_priv/context and return
> those 
> when !CAP_SYS_ADMIN.

The question would be why access to the global GPU reset count is
privileged then? I can't think of a reason why it should be
privileged. I think the new counter (per engine) should fall in the
same category as the global GPU reset one, right? So, can we make them
both unprivileged?


> 
> > +	} else {
> >   		args->reset_count = 0;
> > +		args->reset_engine_count = 0;
> > +	}
> >   
> >   	args->batch_active = atomic_read(&ctx->guilty_count);
> >   	args->batch_pending = atomic_read(&ctx->active_count);
> > diff --git a/include/uapi/drm/i915_drm.h
> > b/include/uapi/drm/i915_drm.h
> > index cc03ef9f885f..3f2c89740b0e 100644
> > --- a/include/uapi/drm/i915_drm.h
> > +++ b/include/uapi/drm/i915_drm.h
> > @@ -1642,7 +1642,11 @@ struct drm_i915_reset_stats {
> >   	/* Number of batches lost pending for execution, for this
> > context */
> >   	__u32 batch_pending;
> >   
> > -	__u32 pad;
> > +	union {
> > +		__u32 pad;
> > +		/* Engine resets since boot/module reload, for all
> > contexts */
> > +		__u32 reset_engine_count;
> > +	};
> 
> Chris pointed out in some other review that anonymous unions are not 
> friendly towards C++ compilers.
> 
> Not sure what is the best option here. Renaming the field could
> break 
> old userspace building against newer headers. Is that acceptable?
> 

I dug up some old comments from Chris and he stated that recycling the
union like that would be a bad idea, since it would make the pad field
an output-only parameter, thus invalidating gem_reset_stats...

Why can't we simply add a new field, __u32 reset_engine_count, as part
of the drm_i915_reset_stats struct?

Regards,
Carlos

> >   };
> >   
> >   struct drm_i915_gem_userptr {
> > 
> 
> Regards,
> 
> Tvrtko


* Re: [PATCH v4 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl
  2019-03-06 23:08     ` Carlos Santa
@ 2019-03-07  7:27       ` Tvrtko Ursulin
  0 siblings, 0 replies; 23+ messages in thread
From: Tvrtko Ursulin @ 2019-03-07  7:27 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx


On 06/03/2019 23:08, Carlos Santa wrote:
> On Mon, 2019-02-25 at 13:34 +0000, Tvrtko Ursulin wrote:
>> On 21/02/2019 02:58, Carlos Santa wrote:
>>> From: Michel Thierry <michel.thierry@intel.com>
>>>
>>> Users/tests relying on the total reset count will start seeing a
>>> smaller
>>> number since most of the hangs can be handled by engine reset.
>>> Note that if we reset engine x, context a running on engine y will be
>>> unaware
>>> and unaffected.
>>>
>>> To start the discussion, include just a total engine reset count.
>>> If it
>>> is deemed useful, it can be extended to report each engine
>>> separately.
>>>
>>> Our igt's gem_reset_stats test will need changes to ignore the pad
>>> field,
>>> since it can now return reset_engine_count.
>>>
>>> v2: s/engine_reset/reset_engine/, use union in uapi to not break
>>> compatibility.
>>> v3: Keep rejecting attempts to use pad as input (Antonio)
>>> v4: Rebased.
>>> v5: Rebased.
>>>
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
>>> Cc: Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>>> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_gem_context.c | 12 ++++++++++--
>>>    include/uapi/drm/i915_drm.h             |  6 +++++-
>>>    2 files changed, 15 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c
>>> b/drivers/gpu/drm/i915/i915_gem_context.c
>>> index 459f8eae1c39..cbfe8f2eb3f2 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_context.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
>>> @@ -1889,6 +1889,8 @@ int i915_gem_context_reset_stats_ioctl(struct
>>> drm_device *dev,
>>>    	struct drm_i915_private *dev_priv = to_i915(dev);
>>>    	struct drm_i915_reset_stats *args = data;
>>>    	struct i915_gem_context *ctx;
>>> +	struct intel_engine_cs *engine;
>>> +	enum intel_engine_id id;
>>>    	int ret;
>>>    
>>>    	if (args->flags || args->pad)
>>> @@ -1907,10 +1909,16 @@ int
>>> i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
>>>    	 * we should wrap the hangstats with a seqlock.
>>>    	 */
>>>    
>>> -	if (capable(CAP_SYS_ADMIN))
>>> +	if (capable(CAP_SYS_ADMIN)) {
>>>    		args->reset_count = i915_reset_count(&dev_priv-
>>>> gpu_error);
>>> -	else
>>> +		for_each_engine(engine, dev_priv, id)
>>> +			args->reset_engine_count +=
>>> +				i915_reset_engine_count(&dev_priv-
>>>> gpu_error,
>>> +							engine);
>>
>> If access to global GPU reset count is privileged, why is access to
>> global engine reset count not? It seems to be fundamentally same
>> level
>> of data leakage.
> 
> But access to global engine reset count (i915_reset_engine_count) is
> indeed privileged. They are both inside if (CAP_SYS_ADMIN) {...}, or
> maybe I am missing something?

Looks like I misread the diff, sorry. Been processing a lot of patches 
lately.

Regards,

Tvrtko

>>
>> If we wanted to provide some numbers to unprivileged users I think
>> we
>> would need to store some counters per file_priv/context and return
>> those
>> when !CAP_SYS_ADMIN.
> 
> The question would be why access to the global GPU reset count is
> privileged then? I can't think of a reason why it should be
> privileged. I think the new counter (per engine) should fall in the
> same category as the global GPU reset one, right? So, can we make them
> both unprivileged?
> 
> 
>>
>>> +	} else {
>>>    		args->reset_count = 0;
>>> +		args->reset_engine_count = 0;
>>> +	}
>>>    
>>>    	args->batch_active = atomic_read(&ctx->guilty_count);
>>>    	args->batch_pending = atomic_read(&ctx->active_count);
>>> diff --git a/include/uapi/drm/i915_drm.h
>>> b/include/uapi/drm/i915_drm.h
>>> index cc03ef9f885f..3f2c89740b0e 100644
>>> --- a/include/uapi/drm/i915_drm.h
>>> +++ b/include/uapi/drm/i915_drm.h
>>> @@ -1642,7 +1642,11 @@ struct drm_i915_reset_stats {
>>>    	/* Number of batches lost pending for execution, for this
>>> context */
>>>    	__u32 batch_pending;
>>>    
>>> -	__u32 pad;
>>> +	union {
>>> +		__u32 pad;
>>> +		/* Engine resets since boot/module reload, for all
>>> contexts */
>>> +		__u32 reset_engine_count;
>>> +	};
>>
>> Chris pointed out in some other review that anonymous unions are not
>> friendly towards C++ compilers.
>>
>> Not sure what is the best option here. Renaming the field could
>> break
>> old userspace building against newer headers. Is that acceptable?
>>
> 
> I dug up some old comments from Chris and he stated that recycling the
> union like that would be a bad idea since that would make the pad field
> an output only parameter thus invalidating gem_reset_stats...
> 
> Why can't we simply add a new field __u32 reset_engine_count; as part
> of the drm_i915_reset_stats struct?
> 
> Regards,
> Carlos
> 
>>>    };
>>>    
>>>    struct drm_i915_gem_userptr {
>>>
>>
>> Regards,
>>
>> Tvrtko
> 
> 

* Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-03-01  9:36   ` Chris Wilson
  2019-03-02  2:08     ` Carlos Santa
@ 2019-03-08  3:16     ` Carlos Santa
  2019-03-11 10:39       ` Tvrtko Ursulin
  1 sibling, 1 reply; 23+ messages in thread
From: Carlos Santa @ 2019-03-08  3:16 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, Tvrtko Ursulin; +Cc: Michel Thierry

On Fri, 2019-03-01 at 09:36 +0000, Chris Wilson wrote:
> > 
> Quoting Carlos Santa (2019-02-21 02:58:16)
> > +#define GEN8_WATCHDOG_1000US(dev_priv)
> > watchdog_to_clock_counts(dev_priv, 1000)
> > +static void gen8_watchdog_irq_handler(unsigned long data)
> > +{
> > +       struct intel_engine_cs *engine = (struct intel_engine_cs
> > *)data;
> > +       struct drm_i915_private *dev_priv = engine->i915;
> > +       unsigned int hung = 0;
> > +       u32 current_seqno=0;
> > +       char msg[80];
> > +       unsigned int tmp;
> > +       int len;
> > +
> > +       /* Stop the counter to prevent further timeout interrupts
> > */
> > +       I915_WRITE_FW(RING_CNTR(engine->mmio_base),
> > get_watchdog_disable(engine));
> > +
> > +       /* Read the heartbeat seqno once again to check if we are
> > stuck? */
> > +       current_seqno = intel_engine_get_hangcheck_seqno(engine);
> 
> I have said this before, but this doesn't exist either, it's just a
> temporary glitch in the matrix.
> 

Chris, Tvrtko, I need some guidance on how to find the guilty seqno
during a hang, can you please advise here what to do?

Thanks,
Carlos


* Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-03-08  3:16     ` Carlos Santa
@ 2019-03-11 10:39       ` Tvrtko Ursulin
  2019-03-18  0:15         ` Carlos Santa
  0 siblings, 1 reply; 23+ messages in thread
From: Tvrtko Ursulin @ 2019-03-11 10:39 UTC (permalink / raw)
  To: Carlos Santa, Chris Wilson, intel-gfx; +Cc: Michel Thierry


On 08/03/2019 03:16, Carlos Santa wrote:
> On Fri, 2019-03-01 at 09:36 +0000, Chris Wilson wrote:
>>>
>> Quoting Carlos Santa (2019-02-21 02:58:16)
>>> +#define GEN8_WATCHDOG_1000US(dev_priv)
>>> watchdog_to_clock_counts(dev_priv, 1000)
>>> +static void gen8_watchdog_irq_handler(unsigned long data)
>>> +{
>>> +       struct intel_engine_cs *engine = (struct intel_engine_cs
>>> *)data;
>>> +       struct drm_i915_private *dev_priv = engine->i915;
>>> +       unsigned int hung = 0;
>>> +       u32 current_seqno=0;
>>> +       char msg[80];
>>> +       unsigned int tmp;
>>> +       int len;
>>> +
>>> +       /* Stop the counter to prevent further timeout interrupts
>>> */
>>> +       I915_WRITE_FW(RING_CNTR(engine->mmio_base),
>>> get_watchdog_disable(engine));
>>> +
>>> +       /* Read the heartbeat seqno once again to check if we are
>>> stuck? */
>>> +       current_seqno = intel_engine_get_hangcheck_seqno(engine);
>>
>> I have said this before, but this doesn't exist either, it's just a
>> temporary glitch in the matrix.
>>
> 
> Chris, Tvrtko, I need some guidance on how to find the guilty seqno
> during a hang, can you please advise here what to do?

When an interrupt fires you need to ascertain whether the same request 
which enabled the watchdog is running, correct?

So I think you would need this, with a disclaimer that I haven't thought 
about the details really:

1. Take a reference to timeline hwsp when setting up the watchdog for a 
request.

2. Store the initial seqno associated with this request.

3. Force enable user interrupts.

4. When timeout fires, inspect the HWSP seqno to see if the request 
completed or not.

5. Reset the engine if not completed.

6. Put the timeline/hwsp reference.

If the user interrupt fires with the request completed cancel the above 
operations.

There could be an inherent race between inspecting the seqno and 
deciding to reset. Not sure at the moment what to do. Maybe just call it 
bad luck?
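As a sanity check of the flow, the six steps above can be modelled
entirely in userspace; everything below (the struct and function names,
and the single-u32 HWSP model) is illustrative only and not the real
i915 API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the proposed flow; names are made up for illustration.
 * The HWSP is reduced to a single u32 that the "GPU" advances as
 * requests complete. */
struct model_hwsp { uint32_t seqno; };

struct model_watchdog {
	struct model_hwsp *hwsp; /* step 1: reference to the timeline HWSP */
	uint32_t fence_seqno;    /* step 2: seqno of the guarded request */
	bool armed;              /* step 3: stands in for user interrupts */
};

/* Wraparound-safe "seq1 is at or after seq2", as in i915_seqno_passed(). */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

static void watchdog_setup(struct model_watchdog *wd,
			   struct model_hwsp *hwsp, uint32_t seqno)
{
	wd->hwsp = hwsp;
	wd->fence_seqno = seqno;
	wd->armed = true;
}

/* Steps 4-6: on timeout, check the HWSP and report whether the engine
 * should be reset; the reference is dropped either way. */
static bool watchdog_expired(struct model_watchdog *wd)
{
	if (!wd->armed)
		return false;
	wd->armed = false;
	return !seqno_passed(wd->hwsp->seqno, wd->fence_seqno);
}
```

The comparison mirrors i915_seqno_passed(); the rest just encodes the
arm/expire/disarm lifecycle described above.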

I also think for the software implementation you need to force no 
request coalescing for contexts with timeout set. Because you want to 
have 100% defined borders for request in and out - since the timeout is 
defined per request.

In this case you don't need the user interrupt for the trailing edge 
signal but can use context complete. Maybe putting hooks into 
context_in/out in intel_lrc.c would work under these circumstances.

Also if preempted you need to cancel the timer setup and store elapsed 
execution time.

Or it may make sense to just disable preemption for these contexts. 
Otherwise there is no point in trying to mandate the timeout?

But it is also kind of bad since non-privileged contexts can make 
themselves non-preemptable by setting the watchdog timeout.

Maybe as a compromise we need to automatically apply an elevated 
priority level, but not as high to be completely non-preemptable. Sounds 
like a hard question.

Regards,

Tvrtko

* Re: [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support
  2019-02-21  2:58 [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
                   ` (6 preceding siblings ...)
  2019-02-21  3:24 ` ✗ Fi.CI.BAT: failure for drm/i915: Replace global_seqno with a hangcheck heartbeat seqno (rev3) Patchwork
@ 2019-03-11 11:54 ` Chris Wilson
  7 siblings, 0 replies; 23+ messages in thread
From: Chris Wilson @ 2019-03-11 11:54 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx

Quoting Carlos Santa (2019-02-21 02:58:14)
> This is a rebase of the original patch series from Michel Thierry
> that can be found here:
> 
> https://patchwork.freedesktop.org/series/21868
> 
> Note that this series is limited to the GPU Watchdog timeout for
> execlists; support for GuC-based submission is left for a later time.

We should also mention that using the watchdog disables idle cycle
detection, and it is recommended not to use semaphore waits in
conjunction with the watchdog. I also wonder what impact this has on rc6
and rps?
-Chris

* Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-03-11 10:39       ` Tvrtko Ursulin
@ 2019-03-18  0:15         ` Carlos Santa
  2019-03-19 12:39           ` Tvrtko Ursulin
  0 siblings, 1 reply; 23+ messages in thread
From: Carlos Santa @ 2019-03-18  0:15 UTC (permalink / raw)
  To: Tvrtko Ursulin, Chris Wilson, intel-gfx; +Cc: Michel Thierry

On Mon, 2019-03-11 at 10:39 +0000, Tvrtko Ursulin wrote:
> On 08/03/2019 03:16, Carlos Santa wrote:
> > On Fri, 2019-03-01 at 09:36 +0000, Chris Wilson wrote:
> > > > 
> > > 
> > > Quoting Carlos Santa (2019-02-21 02:58:16)
> > > > +#define GEN8_WATCHDOG_1000US(dev_priv)
> > > > watchdog_to_clock_counts(dev_priv, 1000)
> > > > +static void gen8_watchdog_irq_handler(unsigned long data)
> > > > +{
> > > > +       struct intel_engine_cs *engine = (struct
> > > > intel_engine_cs
> > > > *)data;
> > > > +       struct drm_i915_private *dev_priv = engine->i915;
> > > > +       unsigned int hung = 0;
> > > > +       u32 current_seqno=0;
> > > > +       char msg[80];
> > > > +       unsigned int tmp;
> > > > +       int len;
> > > > +
> > > > +       /* Stop the counter to prevent further timeout
> > > > interrupts
> > > > */
> > > > +       I915_WRITE_FW(RING_CNTR(engine->mmio_base),
> > > > get_watchdog_disable(engine));
> > > > +
> > > > +       /* Read the heartbeat seqno once again to check if we
> > > > are
> > > > stuck? */
> > > > +       current_seqno =
> > > > intel_engine_get_hangcheck_seqno(engine);
> > > 
> > > I have said this before, but this doesn't exist either, it's just
> > > a
> > > temporary glitch in the matrix.
> > > 
> > 
> > Chris, Tvrtko, I need some guidance on how to find the guilty seqno
> > during a hang, can you please advise here what to do?
> 
> When an interrupt fires you need to ascertain whether the same
> request 
> which enabled the watchdog is running, correct?
> 
> So I think you would need this, with a disclaimer that I haven't
> thought 
> about the details really:
> 
> 1. Take a reference to timeline hwsp when setting up the watchdog for
> a 
> request.
> 
> 2. Store the initial seqno associated with this request.
> 
> 3. Force enable user interrupts.
> 
> 4. When timeout fires, inspect the HWSP seqno to see if the request 
> completed or not.
> 
> 5. Reset the engine if not completed.
> 
> 6. Put the timeline/hwsp reference.


static int gen8_emit_bb_start(struct i915_request *rq,
			      u64 offset, u32 len,
			      const unsigned int flags)
{
	struct i915_timeline *tl;
	u32 seqno;

	if (enable_watchdog) {
		/* Start watchdog timer */
		cs = gen8_emit_start_watchdog(rq, cs);
		tl = ce->ring->timeline;
		i915_timeline_get_seqno(tl, rq, &seqno);
		/* Store initial hwsp seqno associated with this request */
		engine->watchdog_hwsp_seqno = tl->hwsp_seqno;
	}

}

static void gen8_watchdog_tasklet(unsigned long data)
{
	struct i915_request *rq;

	rq = intel_engine_find_active_request(engine);

	/* Inspect the watchdog seqno once again for completion? */
	if (!i915_seqno_passed(engine->watchdog_hwsp_seqno, rq->fence.seqno)) {
		//Reset Engine
	}
}

Tvrtko, is the above an acceptable way to inspect whether the seqno
has completed?

I noticed there's a helper function i915_request_completed(struct
i915_request *rq) but it will require me to modify it in order to pass
2 different seqnos.
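For what it's worth, i915_seqno_passed() itself is just a
wraparound-safe u32 comparison, so comparing two arbitrary seqnos does
not need a modified helper; a standalone copy of the same expression
(hedged: this mirrors the in-tree helper at the time of writing):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same expression as i915_seqno_passed(): true when seq1 is at or
 * after seq2, with the subtraction cast to signed so the comparison
 * keeps working across the u32 wrap. */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}
```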

Regards,
Carlos

> 
> If the user interrupt fires with the request completed cancel the
> above 
> operations.
> 
> There could be an inherent race between inspecting the seqno and 
> deciding to reset. Not sure at the moment what to do. Maybe just call
> it 
> bad luck?
> 
> I also think for the software implementation you need to force no 
> request coalescing for contexts with timeout set. Because you want
> to 
> have 100% defined borders for request in and out - since the timeout
> is 
> defined per request.
> 
> In this case you don't need the user interrupt for the trailing edge 
> signal but can use context complete. Maybe putting hooks into 
> context_in/out in intel_lrc.c would work under these circumstances.
> 
> Also if preempted you need to cancel the timer setup and store
> elapsed 
> execution time.
> 
> Or it may make sense to just disable preemption for these contexts. 
> Otherwise there is no point in trying to mandate the timeout?
> 
> But it is also kind of bad since non-privileged contexts can make 
> themselves non-preemptable by setting the watchdog timeout.
> 
> Maybe as a compromise we need to automatically apply an elevated 
> priority level, but not as high to be completely non-preemptable.
> Sounds 
> like a hard question.
> 
> Regards,
> 
> Tvrtko


* Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-03-18  0:15         ` Carlos Santa
@ 2019-03-19 12:39           ` Tvrtko Ursulin
  2019-03-19 12:46             ` Tvrtko Ursulin
  0 siblings, 1 reply; 23+ messages in thread
From: Tvrtko Ursulin @ 2019-03-19 12:39 UTC (permalink / raw)
  To: Carlos Santa, Chris Wilson, intel-gfx; +Cc: Michel Thierry


On 18/03/2019 00:15, Carlos Santa wrote:
> On Mon, 2019-03-11 at 10:39 +0000, Tvrtko Ursulin wrote:
>> On 08/03/2019 03:16, Carlos Santa wrote:
>>> On Fri, 2019-03-01 at 09:36 +0000, Chris Wilson wrote:
>>>>>
>>>>
>>>> Quoting Carlos Santa (2019-02-21 02:58:16)
>>>>> +#define GEN8_WATCHDOG_1000US(dev_priv)
>>>>> watchdog_to_clock_counts(dev_priv, 1000)
>>>>> +static void gen8_watchdog_irq_handler(unsigned long data)
>>>>> +{
>>>>> +       struct intel_engine_cs *engine = (struct
>>>>> intel_engine_cs
>>>>> *)data;
>>>>> +       struct drm_i915_private *dev_priv = engine->i915;
>>>>> +       unsigned int hung = 0;
>>>>> +       u32 current_seqno=0;
>>>>> +       char msg[80];
>>>>> +       unsigned int tmp;
>>>>> +       int len;
>>>>> +
>>>>> +       /* Stop the counter to prevent further timeout
>>>>> interrupts
>>>>> */
>>>>> +       I915_WRITE_FW(RING_CNTR(engine->mmio_base),
>>>>> get_watchdog_disable(engine));
>>>>> +
>>>>> +       /* Read the heartbeat seqno once again to check if we
>>>>> are
>>>>> stuck? */
>>>>> +       current_seqno =
>>>>> intel_engine_get_hangcheck_seqno(engine);
>>>>
>>>> I have said this before, but this doesn't exist either, it's just
>>>> a
>>>> temporary glitch in the matrix.
>>>>
>>>
>>> Chris, Tvrtko, I need some guidance on how to find the guilty seqno
>>> during a hang, can you please advise here what to do?
>>
>> When an interrupt fires you need to ascertain whether the same
>> request
>> which enabled the watchdog is running, correct?
>>
>> So I think you would need this, with a disclaimer that I haven't
>> thought
>> about the details really:
>>
>> 1. Take a reference to timeline hwsp when setting up the watchdog for
>> a
>> request.
>>
>> 2. Store the initial seqno associated with this request.
>>
>> 3. Force enable user interrupts.
>>
>> 4. When timeout fires, inspect the HWSP seqno to see if the request
>> completed or not.
>>
>> 5. Reset the engine if not completed.
>>
>> 6. Put the timeline/hwsp reference.
> 
> 
> static int gen8_emit_bb_start(struct i915_request *rq,
> 							u64 offset, u32
> len,
> 							const unsigned
> int flags)
> {
> 	struct i915_timeline *tl;
> 	u32 seqno;
> 
> 	if (enable_watchdog) {
> 		/* Start watchdog timer */
> 		cs = gen8_emit_start_watchdog(rq, cs);
> 		tl = ce->ring->timeline;
> 		i915_timeline_get_seqno(tl, rq, &seqno);
> 		/* Store initial hwsp seqno associated with this request */
> 		engine->watchdog_hwsp_seqno = tl->hwsp_seqno;

You should not need to allocate a new seqno, and having something
stored per engine does not make clear how you will solve out of order.

Maybe you just set up the timer, then lets see below..

Also, are you not trying to do the software implementation to start with?

> 	}
> 
> }
> 
> static void gen8_watchdog_tasklet(unsigned long data)
> {
> 		struct i915_request *rq;
> 
> 		rq = intel_engine_find_active_request(engine);
> 
> 		/* Inspect the watchdog seqno once again for
> completion? */
> 		if (!i915_seqno_passed(engine->watchdog_hwsp_seqno, rq->fence.seqno)) {
> 			//Reset Engine
> 		}
> }

What happens if you simply reset without checking anything? You know hw 
timer wouldn't have fired if the context wasn't running, correct?

(Ignoring the race condition between interrupt raised -> hw interrupt 
delivered -> serviced -> tasklet scheduled -> tasklet running. Which may 
mean request has completed in the meantime and you reset the engine for 
nothing. But this is probably not 100% solvable.)

Regards,

Tvrtko

> Tvrtko, is the above acceptable to inspect whether the seqno has
> completed?
> 
> I noticed there's a helper function i915_request_completed(struct
> i915_request *rq) but it will require me to modify it in order to pass
> 2 different seqnos.
> 
> Regards,
> Carlos
> 
>>
>> If the user interrupt fires with the request completed cancel the
>> above
>> operations.
>>
>> There could be an inherent race between inspecting the seqno and
>> deciding to reset. Not sure at the moment what to do. Maybe just call
>> it
>> bad luck?
>>
>> I also think for the software implementation you need to force no
>> request coalescing for contexts with timeout set. Because you want
>> to
>> have 100% defined borders for request in and out - since the timeout
>> is
>> defined per request.
>>
>> In this case you don't need the user interrupt for the trailing edge
>> signal but can use context complete. Maybe putting hooks into
>> context_in/out in intel_lrc.c would work under these circumstances.
>>
>> Also if preempted you need to cancel the timer setup and store
>> elapsed
>> execution time.
>>
>> Or it may make sense to just disable preemption for these contexts.
>> Otherwise there is no point in trying to mandate the timeout?
>>
>> But it is also kind of bad since non-privileged contexts can make
>> themselves non-preemptable by setting the watchdog timeout.
>>
>> Maybe as a compromise we need to automatically apply an elevated
>> priority level, but not as high to be completely non-preemptable.
>> Sounds
>> like a hard question.
>>
>> Regards,
>>
>> Tvrtko
> 
> 

* Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-03-19 12:39           ` Tvrtko Ursulin
@ 2019-03-19 12:46             ` Tvrtko Ursulin
  2019-03-19 17:52               ` Carlos Santa
  0 siblings, 1 reply; 23+ messages in thread
From: Tvrtko Ursulin @ 2019-03-19 12:46 UTC (permalink / raw)
  To: Carlos Santa, Chris Wilson, intel-gfx; +Cc: Michel Thierry


On 19/03/2019 12:39, Tvrtko Ursulin wrote:
> 
> On 18/03/2019 00:15, Carlos Santa wrote:
>> On Mon, 2019-03-11 at 10:39 +0000, Tvrtko Ursulin wrote:
>>> On 08/03/2019 03:16, Carlos Santa wrote:
>>>> On Fri, 2019-03-01 at 09:36 +0000, Chris Wilson wrote:
>>>>>>
>>>>>
>>>>> Quoting Carlos Santa (2019-02-21 02:58:16)
>>>>>> +#define GEN8_WATCHDOG_1000US(dev_priv)
>>>>>> watchdog_to_clock_counts(dev_priv, 1000)
>>>>>> +static void gen8_watchdog_irq_handler(unsigned long data)
>>>>>> +{
>>>>>> +       struct intel_engine_cs *engine = (struct
>>>>>> intel_engine_cs
>>>>>> *)data;
>>>>>> +       struct drm_i915_private *dev_priv = engine->i915;
>>>>>> +       unsigned int hung = 0;
>>>>>> +       u32 current_seqno=0;
>>>>>> +       char msg[80];
>>>>>> +       unsigned int tmp;
>>>>>> +       int len;
>>>>>> +
>>>>>> +       /* Stop the counter to prevent further timeout
>>>>>> interrupts
>>>>>> */
>>>>>> +       I915_WRITE_FW(RING_CNTR(engine->mmio_base),
>>>>>> get_watchdog_disable(engine));
>>>>>> +
>>>>>> +       /* Read the heartbeat seqno once again to check if we
>>>>>> are
>>>>>> stuck? */
>>>>>> +       current_seqno =
>>>>>> intel_engine_get_hangcheck_seqno(engine);
>>>>>
>>>>> I have said this before, but this doesn't exist either, it's just
>>>>> a
>>>>> temporary glitch in the matrix.
>>>>>
>>>>
>>>> Chris, Tvrtko, I need some guidance on how to find the guilty seqno
>>>> during a hang, can you please advise here what to do?
>>>
>>> When an interrupt fires you need to ascertain whether the same
>>> request
>>> which enabled the watchdog is running, correct?
>>>
>>> So I think you would need this, with a disclaimer that I haven't
>>> thought
>>> about the details really:
>>>
>>> 1. Take a reference to timeline hwsp when setting up the watchdog for
>>> a
>>> request.
>>>
>>> 2. Store the initial seqno associated with this request.
>>>
>>> 3. Force enable user interrupts.
>>>
>>> 4. When timeout fires, inspect the HWSP seqno to see if the request
>>> completed or not.
>>>
>>> 5. Reset the engine if not completed.
>>>
>>> 6. Put the timeline/hwsp reference.
>>
>>
>> static int gen8_emit_bb_start(struct i915_request *rq,
>>                             u64 offset, u32
>> len,
>>                             const unsigned
>> int flags)
>> {
>>     struct i915_timeline *tl;
>>     u32 seqno;
>>
>>     if (enable_watchdog) {
>>         /* Start watchdog timer */
>>         cs = gen8_emit_start_watchdog(rq, cs);
>>         tl = ce->ring->timeline;
>>         i915_timeline_get_seqno(tl, rq, &seqno);
>>         /* Store initial hwsp seqno associated with this request */
>>         engine->watchdog_hwsp_seqno = tl->hwsp_seqno;
> 
> You should not need to allocate a new seqno, and having something
> stored per engine does not make clear how you will solve out of order.
> 
> Maybe you just set up the timer, then lets see below..
> 
> Also, are you not trying to do the software implementation to start with?
> 
>>     }
>>
>> }
>>
>> static void gen8_watchdog_tasklet(unsigned long data)
>> {
>>         struct i915_request *rq;
>>
>>         rq = intel_engine_find_active_request(engine);
>>
>>         /* Inspect the watchdog seqno once again for
>> completion? */
>>         if (!i915_seqno_passed(engine->watchdog_hwsp_seqno, rq->fence.seqno)) {
>>             //Reset Engine
>>         }
>> }
> 
> What happens if you simply reset without checking anything? You know hw 
> timer wouldn't have fired if the context wasn't running, correct?
> 
> (Ignoring the race condition between interrupt raised -> hw interrupt 
> delivered -> serviced -> tasklet scheduled -> tasklet running. Which may 
> mean request has completed in the meantime and you reset the engine for 
> nothing. But this is probably not 100% solvable.)

Good idea would be to write some tests to exercise some normal and more 
edge case scenarios like coalesced requests, preemption etc. Checking 
which request got reset etc.

Regards,

Tvrtko

> Regards,
> 
> Tvrtko
> 
>> Tvrtko, is the above acceptable to inspect whether the seqno has
>> completed?
>>
>> I noticed there's a helper function i915_request_completed(struct
>> i915_request *rq) but it will require me to modify it in order to pass
>> 2 different seqnos.
>>
>> Regards,
>> Carlos
>>
>>>
>>> If the user interrupt fires with the request completed cancel the
>>> above
>>> operations.
>>>
>>> There could be an inherent race between inspecting the seqno and
>>> deciding to reset. Not sure at the moment what to do. Maybe just call
>>> it
>>> bad luck?
>>>
>>> I also think for the software implementation you need to force no
>>> request coalescing for contexts with timeout set. Because you want
>>> to
>>> have 100% defined borders for request in and out - since the timeout
>>> is
>>> defined per request.
>>>
>>> In this case you don't need the user interrupt for the trailing edge
>>> signal but can use context complete. Maybe putting hooks into
>>> context_in/out in intel_lrc.c would work under these circumstances.
>>>
>>> Also if preempted you need to cancel the timer setup and store
>>> elapsed
>>> execution time.
>>>
>>> Or it may make sense to just disable preemption for these contexts.
>>> Otherwise there is no point in trying to mandate the timeout?
>>>
>>> But it is also kind of bad since non-privileged contexts can make
>>> themselves non-preemptable by setting the watchdog timeout.
>>>
>>> Maybe as a compromise we need to automatically apply an elevated
>>> priority level, but not as high to be completely non-preemptable.
>>> Sounds
>>> like a hard question.
>>>
>>> Regards,
>>>
>>> Tvrtko
>>
>>

* Re: [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-03-19 12:46             ` Tvrtko Ursulin
@ 2019-03-19 17:52               ` Carlos Santa
  0 siblings, 0 replies; 23+ messages in thread
From: Carlos Santa @ 2019-03-19 17:52 UTC (permalink / raw)
  To: Tvrtko Ursulin, Chris Wilson, intel-gfx; +Cc: Michel Thierry

On Tue, 2019-03-19 at 12:46 +0000, Tvrtko Ursulin wrote:
> On 19/03/2019 12:39, Tvrtko Ursulin wrote:
> > 
> > On 18/03/2019 00:15, Carlos Santa wrote:
> > > On Mon, 2019-03-11 at 10:39 +0000, Tvrtko Ursulin wrote:
> > > > On 08/03/2019 03:16, Carlos Santa wrote:
> > > > > On Fri, 2019-03-01 at 09:36 +0000, Chris Wilson wrote:
> > > > > > > 
> > > > > > 
> > > > > > Quoting Carlos Santa (2019-02-21 02:58:16)
> > > > > > > +#define GEN8_WATCHDOG_1000US(dev_priv)
> > > > > > > watchdog_to_clock_counts(dev_priv, 1000)
> > > > > > > +static void gen8_watchdog_irq_handler(unsigned long
> > > > > > > data)
> > > > > > > +{
> > > > > > > +       struct intel_engine_cs *engine = (struct
> > > > > > > intel_engine_cs
> > > > > > > *)data;
> > > > > > > +       struct drm_i915_private *dev_priv = engine->i915;
> > > > > > > +       unsigned int hung = 0;
> > > > > > > +       u32 current_seqno=0;
> > > > > > > +       char msg[80];
> > > > > > > +       unsigned int tmp;
> > > > > > > +       int len;
> > > > > > > +
> > > > > > > +       /* Stop the counter to prevent further timeout
> > > > > > > interrupts
> > > > > > > */
> > > > > > > +       I915_WRITE_FW(RING_CNTR(engine->mmio_base),
> > > > > > > get_watchdog_disable(engine));
> > > > > > > +
> > > > > > > +       /* Read the heartbeat seqno once again to check
> > > > > > > if we
> > > > > > > are
> > > > > > > stuck? */
> > > > > > > +       current_seqno =
> > > > > > > intel_engine_get_hangcheck_seqno(engine);
> > > > > > 
> > > > > > I have said this before, but this doesn't exist either,
> > > > > > it's just
> > > > > > a
> > > > > > temporary glitch in the matrix.
> > > > > > 
> > > > > 
> > > > > Chris, Tvrtko, I need some guidance on how to find the guilty
> > > > > seqno
> > > > > during a hang, can you please advise here what to do?
> > > > 
> > > > When an interrupt fires you need to ascertain whether the same
> > > > request
> > > > which enabled the watchdog is running, correct?
> > > > 
> > > > So I think you would need this, with a disclaimer that I
> > > > haven't
> > > > thought
> > > > about the details really:
> > > > 
> > > > 1. Take a reference to timeline hwsp when setting up the
> > > > watchdog for
> > > > a
> > > > request.
> > > > 
> > > > 2. Store the initial seqno associated with this request.
> > > > 
> > > > 3. Force enable user interrupts.
> > > > 
> > > > 4. When timeout fires, inspect the HWSP seqno to see if the
> > > > request
> > > > completed or not.
> > > > 
> > > > 5. Reset the engine if not completed.
> > > > 
> > > > 6. Put the timeline/hwsp reference.
> > > 
> > > 
> > > static int gen8_emit_bb_start(struct i915_request *rq,
> > >                             u64 offset, u32
> > > len,
> > >                             const unsigned
> > > int flags)
> > > {
> > >     struct i915_timeline *tl;
> > >     u32 seqno;
> > > 
> > >     if (enable_watchdog) {
> > >         /* Start watchdog timer */
> > >         cs = gen8_emit_start_watchdog(rq, cs);
> > >         tl = ce->ring->timeline;
> > >         i915_timeline_get_seqno(tl, rq, &seqno);
> > >         /* Store initial hwsp seqno associated with this request */
> > >         engine->watchdog_hwsp_seqno = tl->hwsp_seqno;
> > 
> > You should not need to allocate a new seqno, and having something
> > stored per engine does not make clear how you will solve out of
> > order.

Understood, I missed that there's a convenience pointer available to us
per request (i.e.,  *hwsp_seqno). On step #1 above you have said to
take a reference to the timeline so I was trying to make a link between
the timeline and the seqno but if the request comes already with a
convenience pointer then we may not need the timeline after all...

However, in v4 of the series I was using
intel_engine_get_hangcheck_seqno(engine) for this purpose, and even
though Chris was against it, I saw that it recently landed in the
tree...

> > 
> > Maybe you just set up the timer, then lets see below..

I think you're suggesting not to bother checking for the guilty seqno
in the tasklet and simply resetting...

> > 
> > Also, are you not trying to do the software implementation to start
> > with?

Trying to keep it simple with just the h/w timers for now... adding a
front/back end to accommodate the s/w timers would just muddy the waters?
Will get to it once we agree on what to do here...
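For reference, the h/w timer is programmed in command-streamer
timestamp ticks, so the watchdog_to_clock_counts() helper in the series
boils down to a conversion like the sketch below. The helper name here
and the 12 MHz example frequency are assumptions for illustration; the
real driver reads the platform's timestamp frequency from hardware:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical standalone microseconds -> timestamp-tick conversion;
 * freq_hz is the platform's command streamer timestamp frequency. */
static uint64_t watchdog_us_to_counts(uint64_t freq_hz, uint64_t us)
{
	return freq_hz * us / 1000000;
}
```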

> > 
> > >     }
> > > 
> > > }
> > > 
> > > static void gen8_watchdog_tasklet(unsigned long data)
> > > {
> > >         struct i915_request *rq;
> > > 
> > >         rq = intel_engine_find_active_request(engine);
> > > 
> > >         /* Inspect the watchdog seqno once again for
> > > completion? */
> > >         if (!i915_seqno_passed(engine->watchdog_hwsp_seqno, rq->fence.seqno)) {
> > >             //Reset Engine
> > >         }
> > > }
> > 
> > What happens if you simply reset without checking anything? You
> > know hw 
> > timer wouldn't have fired if the context wasn't running, correct?

Need to verify this by running some tests then...

> > 
> > (Ignoring the race condition between interrupt raised -> hw
> > interrupt 
> > delivered -> serviced -> tasklet scheduled -> tasklet running.
> > Which may 
> > mean request has completed in the meantime and you reset the engine
> > for 
> > nothing. But this is probably not 100% solvable.)
> 
> Good idea would be to write some tests to exercise some normal and
> more 
> edge case scenarios like coalesced requests, preemption etc.
> Checking 
> which request got reset etc.

Ok, need to try some test cases then.

Regards,
Carlos

> 
> Regards,
> 
> Tvrtko
> 
> > Regards,
> > 
> > Tvrtko
> > 
> > > Tvrtko, is the above acceptable to inspect whether the seqno has
> > > completed?
> > > 
> > > I noticed there's a helper function i915_request_completed(struct
> > > i915_request *rq) but it will require me to modify it in order to
> > > pass
> > > 2 different seqnos.
> > > 
> > > Regards,
> > > Carlos
> > > 
> > > > 
> > > > If the user interrupt fires with the request completed cancel
> > > > the
> > > > above
> > > > operations.
> > > > 
> > > > There could be an inherent race between inspecting the seqno
> > > > and
> > > > deciding to reset. Not sure at the moment what to do. Maybe
> > > > just call
> > > > it
> > > > bad luck?
> > > > 
> > > > I also think for the software implementation you need to force
> > > > no
> > > > request coalescing for contexts with timeout set. Because you
> > > > want
> > > > to
> > > > have 100% defined borders for request in and out - since the
> > > > timeout
> > > > is
> > > > defined per request.
> > > > 
> > > > In this case you don't need the user interrupt for the trailing
> > > > edge
> > > > signal but can use context complete. Maybe putting hooks into
> > > > context_in/out in intel_lrc.c would work under these
> > > > circumstances.
> > > > 
> > > > Also if preempted you need to cancel the timer setup and store
> > > > elapsed
> > > > execution time.
> > > > 
> > > > Or it may make sense to just disable preemption for these
> > > > contexts.
> > > > Otherwise there is no point in trying to mandate the timeout?
> > > > 
> > > > But it is also kind of bad since non-privileged contexts can
> > > > make
> > > > themselves non-preemptable by setting the watchdog timeout.
> > > > 
> > > > Maybe as a compromise we need to automatically apply an
> > > > elevated
> > > > priority level, but not as high to be completely non-
> > > > preemptable.
> > > > Sounds
> > > > like a hard question.
> > > > 
> > > > Regards,
> > > > 
> > > > Tvrtko
> > > 
> > > 


end of thread

Thread overview: 23+ messages:
2019-02-21  2:58 [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
2019-02-21  2:58 ` [PATCH v4 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
2019-02-25 13:34   ` Tvrtko Ursulin
2019-03-06 23:08     ` Carlos Santa
2019-03-07  7:27       ` Tvrtko Ursulin
2019-02-21  2:58 ` [PATCH v4 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
2019-02-28 17:38   ` Tvrtko Ursulin
2019-03-01  1:51     ` Carlos Santa
2019-03-01  9:36   ` Chris Wilson
2019-03-02  2:08     ` Carlos Santa
2019-03-08  3:16     ` Carlos Santa
2019-03-11 10:39       ` Tvrtko Ursulin
2019-03-18  0:15         ` Carlos Santa
2019-03-19 12:39           ` Tvrtko Ursulin
2019-03-19 12:46             ` Tvrtko Ursulin
2019-03-19 17:52               ` Carlos Santa
2019-02-21  2:58 ` [PATCH v4 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
2019-02-21  2:58 ` [PATCH v4 4/5] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Carlos Santa
2019-02-28 17:22   ` Tvrtko Ursulin
2019-02-21  2:58 ` [PATCH v4 5/5] drm/i915: Watchdog timeout: Include threshold value in error state Carlos Santa
2019-02-21  2:58 ` drm/i915: Replace global_seqno with a hangcheck heartbeat seqno Carlos Santa
2019-02-21  3:24 ` ✗ Fi.CI.BAT: failure for drm/i915: Replace global_seqno with a hangcheck heartbeat seqno (rev3) Patchwork
2019-03-11 11:54 ` [PATCH v4 0/5] GEN8+ GPU Watchdog Reset Support Chris Wilson
