* [PATCH v5 0/5] GEN8+ GPU Watchdog Reset Support
@ 2019-03-22 23:41 Carlos Santa
  2019-03-22 23:41 ` [PATCH v5 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Carlos Santa @ 2019-03-22 23:41 UTC (permalink / raw)
  To: intel-gfx

This is a rebase of the original patch series from Michel Thierry:
https://patchwork.freedesktop.org/series/21868

Note that this series is limited to the GPU watchdog timeout for
execlists; support for GuC-based submission is left for later.

PATCH v5 of this series was tested from userspace with an IGT test,
gem_watchdog --run-subtest basic-bsd1, that is not upstream yet.

The corresponding changes on the i965 media userspace are also under
review: https://github.com/intel/intel-vaapi-driver/pull/429/files

Michel Thierry (5):
  drm/i915: Add engine reset count in get-reset-stats ioctl
  drm/i915: Watchdog timeout: IRQ handler for gen8+
  drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  drm/i915: Watchdog timeout: Include threshold value in error state

 drivers/gpu/drm/i915/i915_drv.h            |   5 +
 drivers/gpu/drm/i915/i915_gem_context.c    | 162 ++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_gpu_error.c      |  14 +-
 drivers/gpu/drm/i915/i915_gpu_error.h      |   5 +
 drivers/gpu/drm/i915/i915_irq.c            |  14 +-
 drivers/gpu/drm/i915/i915_reg.h            |   6 +
 drivers/gpu/drm/i915/i915_reset.c          |  20 +++
 drivers/gpu/drm/i915/i915_reset.h          |   6 +
 drivers/gpu/drm/i915/intel_context_types.h |   4 +
 drivers/gpu/drm/i915/intel_engine_cs.c     |   3 +
 drivers/gpu/drm/i915/intel_engine_types.h  |  22 ++-
 drivers/gpu/drm/i915/intel_hangcheck.c     |  11 +-
 drivers/gpu/drm/i915/intel_lrc.c           | 141 +++++++++++++++++-
 drivers/gpu/drm/i915/intel_lrc.h           |   2 +
 include/uapi/drm/i915_drm.h                |  21 +++
 15 files changed, 410 insertions(+), 26 deletions(-)

-- 
2.17.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH v5 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl
  2019-03-22 23:41 [PATCH v5 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
@ 2019-03-22 23:41 ` Carlos Santa
  2019-03-30  8:45   ` Chris Wilson
  2019-03-22 23:41 ` [PATCH v5 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Carlos Santa @ 2019-03-22 23:41 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

Users/tests relying on the total reset count will start seeing a smaller
number since most of the hangs can be handled by engine reset.
Note that if engine x is reset, context a running on engine y will be
unaware and unaffected.

To start the discussion, include just a total engine reset count. If it
is deemed useful, it can be extended to report each engine separately.
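
As a rough illustration of the reporting rule described above, the following
standalone sketch models how a total engine reset count could be filled in
next to the global reset count. It is not the driver code: the struct
reset_stats_sketch type, the fill_reset_stats() helper and its parameters are
hypothetical names invented for this example, and the "privileged" flag stands
in for the CAP_SYS_ADMIN check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters reported to userspace. */
struct reset_stats_sketch {
	uint32_t reset_count;        /* full GPU resets */
	uint32_t reset_engine_count; /* sum of per-engine resets */
};

/*
 * Hypothetical helper: unprivileged callers see zeroes, privileged
 * callers see the global count plus the sum over all engines.
 */
static void fill_reset_stats(struct reset_stats_sketch *out, bool privileged,
			     uint32_t global_resets,
			     const uint32_t *engine_resets, size_t num_engines)
{
	size_t i;

	/* Start from zero so the sum never depends on caller input. */
	out->reset_count = 0;
	out->reset_engine_count = 0;
	if (!privileged)
		return;

	out->reset_count = global_resets;
	for (i = 0; i < num_engines; i++)
		out->reset_engine_count += engine_resets[i];
}
```

The sketch zeroes both fields before summing, so the reported total is
independent of whatever the caller left in the structure.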

Our igt's gem_reset_stats test will need changes to ignore the pad field,
since it can now return reset_engine_count.

v2: s/engine_reset/reset_engine/, use union in uapi to not break compatibility.
v3: Keep rejecting attempts to use pad as input (Antonio)
v4: Rebased.
v5: Rebased.
    Get rid of the union to store pad/engine count (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 12 ++++++++++--
 include/uapi/drm/i915_drm.h             |  4 ++++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 21208a865380..9625b5f7faf7 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -1350,6 +1350,8 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_reset_stats *args = data;
 	struct i915_gem_context *ctx;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
 	int ret;
 
 	if (args->flags || args->pad)
@@ -1368,10 +1370,16 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	 * we should wrap the hangstats with a seqlock.
 	 */
 
-	if (capable(CAP_SYS_ADMIN))
+	if (capable(CAP_SYS_ADMIN)) {
 		args->reset_count = i915_reset_count(&dev_priv->gpu_error);
-	else
+		for_each_engine(engine, dev_priv, id)
+			args->reset_engine_count +=
+				i915_reset_engine_count(&dev_priv->gpu_error,
+							engine);
+	} else {
 		args->reset_count = 0;
+		args->reset_engine_count = 0;
+	}
 
 	args->batch_active = atomic_read(&ctx->guilty_count);
 	args->batch_pending = atomic_read(&ctx->active_count);
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index aa2d4c73a97d..5e7bc6412880 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1459,6 +1459,9 @@ struct drm_i915_reset_stats {
 	/* All resets since boot/module reload, for all contexts */
 	__u32 reset_count;
 
+	/* Engine resets since boot/module reload, for all contexts */
+	__u32 reset_engine_count;
+
 	/* Number of batches lost when active in GPU, for this context */
 	__u32 batch_active;
 
@@ -1466,6 +1469,7 @@ struct drm_i915_reset_stats {
 	__u32 batch_pending;
 
 	__u32 pad;
+
 };
 
 struct drm_i915_gem_userptr {
-- 
2.17.1


* [PATCH v5 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-03-22 23:41 [PATCH v5 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
  2019-03-22 23:41 ` [PATCH v5 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
@ 2019-03-22 23:41 ` Carlos Santa
  2019-03-25 10:00   ` Tvrtko Ursulin
  2019-03-22 23:41 ` [PATCH v5 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 14+ messages in thread
From: Carlos Santa @ 2019-03-22 23:41 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

*** General ***

Watchdog timeout (or "media engine reset") is a feature that allows
userland applications to enable hang detection on individual batch buffers.
The detection mechanism itself is mostly bound to the hardware and the only
thing that the driver needs to do to support this form of hang detection
is to implement the interrupt handling support as well as watchdog command
emission before and after the emitted batch buffer start instruction in the
ring buffer.

The principle of the hang detection mechanism is as follows:

1. Once the decision has been made to enable watchdog timeout for a
particular batch buffer and the driver is in the process of emitting the
batch buffer start instruction into the ring buffer it also emits a
watchdog timer start instruction before and a watchdog timer cancellation
instruction after the batch buffer start instruction in the ring buffer.

2. Once the GPU execution reaches the watchdog timer start instruction
the hardware watchdog counter is started by the hardware. The counter
keeps counting until either reaching a previously configured threshold
value or the timer cancellation instruction is executed.

2a. If the counter reaches the threshold value the hardware fires a
watchdog interrupt that is picked up by the watchdog interrupt handler.
This means that a hang has been detected and the driver needs to deal with
it the same way it would deal with an engine hang detected by the periodic
hang checker. The only difference between the two is that we already blamed
the active request (to ensure an engine reset).

2b. If the batch buffer completes and the execution reaches the watchdog
cancellation instruction before the watchdog counter reaches its
threshold value the watchdog is cancelled and nothing more comes of it.
No hang is detected.
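
The sequence in steps 1, 2, 2a and 2b can be modelled in software as a small
counter state machine. The sketch below is only a model for illustration, not
how the hardware or the driver is implemented; struct wd_model and the wd_*()
helpers are hypothetical names, and one loop iteration stands in for one
counter tick.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the hardware watchdog counter. */
struct wd_model {
	uint32_t threshold; /* counts before the "interrupt" fires */
	uint32_t counter;
	bool running;
	bool fired;
};

/* Steps 1 and 2: the timer start instruction arms the counter. */
static void wd_start(struct wd_model *wd, uint32_t threshold)
{
	wd->threshold = threshold;
	wd->counter = 0;
	wd->running = true;
	wd->fired = false;
}

/* Step 2b: the cancellation instruction stops the counter. */
static void wd_cancel(struct wd_model *wd)
{
	wd->running = false;
}

/* One counter tick; step 2a fires once the threshold is reached. */
static void wd_tick(struct wd_model *wd)
{
	if (!wd->running)
		return;
	if (++wd->counter >= wd->threshold) {
		wd->fired = true;
		wd->running = false;
	}
}

/*
 * Run a batch that needs batch_ticks ticks, bracketed by start and
 * cancel. Returns true if the model would report a hang.
 */
static bool wd_run_batch(uint32_t threshold, uint32_t batch_ticks)
{
	struct wd_model wd;
	uint32_t i;

	wd_start(&wd, threshold);
	for (i = 0; i < batch_ticks && !wd.fired; i++)
		wd_tick(&wd);
	wd_cancel(&wd);
	return wd.fired;
}
```

In this model a batch that finishes before the threshold is reached returns
false (case 2b), while a batch that runs at least as long as the threshold
returns true (case 2a).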

Note about future interaction with preemption: Preemption could happen
in a command sequence prior to watchdog counter getting disabled,
resulting in watchdog being triggered following preemption (e.g. when
watchdog had been enabled in the low priority batch). The driver will
need to explicitly disable the watchdog counter as part of the
preemption sequence.

*** This patch introduces: ***

1. IRQ handler code for watchdog timeout allowing direct hang recovery
based on hardware-driven hang detection, which then integrates directly
with the hang recovery path. This is independent of having per-engine reset
or just full gpu reset.

2. Watchdog specific register information.

Currently the render engine and all available media engines support
watchdog timeout (VECS is only supported from GEN9 onwards). The
specifications allude to the BCS engine being supported, but this commit
does not support it.

Note that the value to stop the counter is different between render and
non-render engines in GEN8; GEN9 onwards it's the same.

v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
against false-positives.

v3: Don't use high priority tasklet, use engine_last_submit while
checking for false-positives. From GEN9 onwards, the stop counter bit is
the same for all engines.

v4: Remove unnecessary brackets, use current_seqno to mark the request
as guilty in the hangcheck/capture code.

v5: Rebased after RESET_ENGINEs flag.

v6: Don't capture error state in case of watchdog timeout. The capture
process is time consuming and this will align to what happens when we
use GuC to handle the watchdog timeout. (Chris)

v7: Rebase.

v8: Rebase, use HZ to reschedule.

v9: Rebase, get forcewake domains in function (no longer in execlists
struct).

v10: Rebase.

v11: Rebase,
     remove extra braces (Tvrtko),
     implement watchdog_to_clock_counts helper (Tvrtko),
     Move tasklet_kill(watchdog_tasklet) inside intel_engines (Tvrtko),
     Use a global heartbeat seqno instead of engine seqno (Chris)
     Make all engines checks all class based checks (Tvrtko)

v12: Rebase,
     Reset immediately upon entering the IRQ (Chris)
     Make reset_engine_to_str a helper (Tvrtko)
     Rename watchdog_irq_handler as watchdog_tasklet (Tvrtko)
     Let the compiler itself do the inline (Tvrtko)

v13: Rebase
v14: Rebase, skip checking for the guilty seqno in the tasklet (Tvrtko)

Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.h     |  4 ++
 drivers/gpu/drm/i915/i915_irq.c           | 14 ++++--
 drivers/gpu/drm/i915/i915_reg.h           |  6 +++
 drivers/gpu/drm/i915/i915_reset.c         | 20 +++++++++
 drivers/gpu/drm/i915/i915_reset.h         |  6 +++
 drivers/gpu/drm/i915/intel_engine_cs.c    |  1 +
 drivers/gpu/drm/i915/intel_engine_types.h |  5 +++
 drivers/gpu/drm/i915/intel_hangcheck.c    | 11 +----
 drivers/gpu/drm/i915/intel_lrc.c          | 52 +++++++++++++++++++++++
 9 files changed, 107 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index 99d6b7b270c2..6cf6a8679b26 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -203,6 +203,9 @@ struct i915_gpu_error {
 	 * any global resources that may be clobber by the reset (such as
 	 * FENCE registers).
 	 *
+	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
+	 * I915_RESET_WATCHDOG to report the hang detection cause accurately.
+	 *
 	 * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't need to
 	 * acquire the struct_mutex to reset an engine, we need an explicit
 	 * flag to prevent two concurrent reset attempts in the same engine.
@@ -218,6 +221,7 @@ struct i915_gpu_error {
 #define I915_RESET_BACKOFF	0
 #define I915_RESET_MODESET	1
 #define I915_RESET_ENGINE	2
+#define I915_RESET_WATCHDOG	3
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
 	/** Number of times the device has been reset (global) */
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 82d487189a34..e64994be25c3 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1466,6 +1466,9 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
 
 	if (tasklet)
 		tasklet_hi_schedule(&engine->execlists.tasklet);
+
+	if (iir & GT_GEN8_WATCHDOG_INTERRUPT)
+		tasklet_schedule(&engine->execlists.watchdog_tasklet);
 }
 
 static void gen8_gt_irq_ack(struct drm_i915_private *i915,
@@ -3892,20 +3895,25 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	u32 gt_interrupts[] = {
 		(GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
+		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 		 GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
 		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT),
-
 		(GT_RENDER_USER_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
 		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
+		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
 		 GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
-		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT),
-
+		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
+		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT),
 		0,
 
 		(GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
 		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT)
 	};
 
+	/* VECS watchdog is only available in skl+ */
+	if (INTEL_GEN(dev_priv) >= 9)
+		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
+
 	dev_priv->pm_ier = 0x0;
 	dev_priv->pm_imr = ~dev_priv->pm_ier;
 	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 9b69cec21f7b..ac8d984e16ae 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -2363,6 +2363,11 @@ enum i915_power_well_id {
 #define RING_START(base)	_MMIO((base) + 0x38)
 #define RING_CTL(base)		_MMIO((base) + 0x3c)
 #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
+#define RING_CNTR(base)		_MMIO((base) + 0x178)
+#define GEN8_WATCHDOG_ENABLE		0
+#define GEN8_WATCHDOG_DISABLE		1
+#define GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 & non-render only */
+#define RING_THRESH(base)	_MMIO((base) + 0x17C)
 #define RING_SYNC_0(base)	_MMIO((base) + 0x40)
 #define RING_SYNC_1(base)	_MMIO((base) + 0x44)
 #define RING_SYNC_2(base)	_MMIO((base) + 0x48)
@@ -2925,6 +2930,7 @@ enum i915_power_well_id {
 #define GT_BSD_USER_INTERRUPT			(1 << 12)
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
 #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
+#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
 #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
 #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
index 861fe083e383..739fa5ad1a8d 100644
--- a/drivers/gpu/drm/i915/i915_reset.c
+++ b/drivers/gpu/drm/i915/i915_reset.c
@@ -1208,6 +1208,26 @@ void i915_clear_error_registers(struct drm_i915_private *dev_priv)
 	}
 }
 
+void engine_reset_error_to_str(struct drm_i915_private *i915,
+	           char *msg,
+	           size_t sz,
+	           unsigned int hung,
+	           unsigned int stuck,
+	           unsigned int watchdog)
+{
+	int len;
+	unsigned int tmp;
+	struct intel_engine_cs *engine;
+
+	len = scnprintf(msg, sz,
+			"%s on ", watchdog ? "watchdog timeout" :
+				stuck == hung ? "no_progress" : "hang");
+	for_each_engine_masked(engine, i915, hung, tmp)
+		len += scnprintf(msg + len, sz - len,
+				"%s, ", engine->name);
+	msg[len-2] = '\0';
+}
+
 /**
  * i915_handle_error - handle a gpu error
  * @i915: i915 device private
diff --git a/drivers/gpu/drm/i915/i915_reset.h b/drivers/gpu/drm/i915/i915_reset.h
index 16f2389f656f..8582d1242248 100644
--- a/drivers/gpu/drm/i915/i915_reset.h
+++ b/drivers/gpu/drm/i915/i915_reset.h
@@ -20,6 +20,12 @@ void i915_handle_error(struct drm_i915_private *i915,
 		       u32 engine_mask,
 		       unsigned long flags,
 		       const char *fmt, ...);
+void engine_reset_error_to_str(struct drm_i915_private *i915,
+               char *str,
+               size_t sz,
+               unsigned int hung,
+               unsigned int stuck,
+               unsigned int watchdog);
 #define I915_ERROR_CAPTURE BIT(0)
 
 void i915_clear_error_registers(struct drm_i915_private *i915);
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 652c1b3ba190..88cf0fc07623 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -1149,6 +1149,7 @@ void intel_engines_park(struct drm_i915_private *i915)
 		/* Flush the residual irq tasklets first. */
 		intel_engine_disarm_breadcrumbs(engine);
 		tasklet_kill(&engine->execlists.tasklet);
+		tasklet_kill(&engine->execlists.watchdog_tasklet);
 
 		/*
 		 * We are committed now to parking the engines, make sure there
diff --git a/drivers/gpu/drm/i915/intel_engine_types.h b/drivers/gpu/drm/i915/intel_engine_types.h
index b0aa1f0d4e47..c4f66b774e7c 100644
--- a/drivers/gpu/drm/i915/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/intel_engine_types.h
@@ -124,6 +124,11 @@ struct intel_engine_execlists {
 	 */
 	struct tasklet_struct tasklet;
 
+	/*
+	 * @watchdog_tasklet: stop counter and reschedule hangcheck_work asap
+	 */
+	struct tasklet_struct watchdog_tasklet;
+
 	/**
 	 * @default_priolist: priority list for I915_PRIORITY_NORMAL
 	 */
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index 57ed49dc19c4..4bf26863678c 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -220,22 +220,15 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
 				   unsigned int hung,
 				   unsigned int stuck)
 {
-	struct intel_engine_cs *engine;
 	char msg[80];
-	unsigned int tmp;
-	int len;
+	size_t len = sizeof(msg);
 
 	/* If some rings hung but others were still busy, only
 	 * blame the hanging rings in the synopsis.
 	 */
 	if (stuck != hung)
 		hung &= ~stuck;
-	len = scnprintf(msg, sizeof(msg),
-			"%s on ", stuck == hung ? "no progress" : "hang");
-	for_each_engine_masked(engine, i915, hung, tmp)
-		len += scnprintf(msg + len, sizeof(msg) - len,
-				 "%s, ", engine->name);
-	msg[len-2] = '\0';
+	engine_reset_error_to_str(i915, msg, len, hung, stuck, 0);
 
 	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
 }
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index e54e0064b2d6..85785a94f6ae 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2195,6 +2195,40 @@ static int gen8_emit_flush_render(struct i915_request *request,
 	return 0;
 }
 
+static void gen8_watchdog_tasklet(unsigned long data)
+{
+	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
+	struct drm_i915_private *dev_priv = engine->i915;
+	enum forcewake_domains fw_domains;
+	char msg[80];
+	size_t len = sizeof(msg);
+	unsigned long *lock = &engine->i915->gpu_error.flags;
+	unsigned int bit = I915_RESET_ENGINE + engine->id;
+
+	switch (engine->class) {
+	default:
+		MISSING_CASE(engine->id);
+		/* fall through */
+	case RENDER_CLASS:
+		fw_domains = FORCEWAKE_RENDER;
+		break;
+	case VIDEO_DECODE_CLASS:
+	case VIDEO_ENHANCEMENT_CLASS:
+		fw_domains = FORCEWAKE_MEDIA;
+		break;
+	}
+
+	intel_uncore_forcewake_get(dev_priv, fw_domains);
+
+	if (!test_and_set_bit(bit, lock)) {
+		unsigned int hung = engine->mask;
+		engine_reset_error_to_str(dev_priv, msg, len, hung, 0, 1);
+		i915_reset_engine(engine, msg);
+		clear_bit(bit, lock);
+		wake_up_bit(lock, bit);
+	}
+}
+
 /*
  * Reserve space for 2 NOOPs at the end of each request to be
  * used as a workaround for not being allowed to do lite
@@ -2377,6 +2411,21 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
 
 	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
 	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
+
+	switch (engine->class) {
+	default:
+		/* BCS engine does not support hw watchdog */
+		break;
+	case RENDER_CLASS:
+	case VIDEO_DECODE_CLASS:
+		engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT << shift;
+		break;
+	case VIDEO_ENHANCEMENT_CLASS:
+		if (INTEL_GEN(engine->i915) >= 9)
+			engine->irq_keep_mask |=
+				GT_GEN8_WATCHDOG_INTERRUPT << shift;
+		break;
+	}
 }
 
 static int
@@ -2394,6 +2443,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
 	tasklet_init(&engine->execlists.tasklet,
 		     execlists_submission_tasklet, (unsigned long)engine);
 
+	tasklet_init(&engine->execlists.watchdog_tasklet,
+		     gen8_watchdog_tasklet, (unsigned long)engine);
+
 	logical_ring_default_vfuncs(engine);
 	logical_ring_default_irqs(engine);
 
-- 
2.17.1


* [PATCH v5 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2019-03-22 23:41 [PATCH v5 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
  2019-03-22 23:41 ` [PATCH v5 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
  2019-03-22 23:41 ` [PATCH v5 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
@ 2019-03-22 23:41 ` Carlos Santa
  2019-03-30  8:49   ` Chris Wilson
  2019-03-30  9:01   ` Chris Wilson
  2019-03-22 23:41 ` [PATCH v5 4/5] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Carlos Santa
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 14+ messages in thread
From: Carlos Santa @ 2019-03-22 23:41 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

Emit the required commands into the ring buffer for starting and
stopping the watchdog timer before/after batch buffer start during
batch buffer submission.

v2: Support watchdog threshold per context engine, merge lri commands,
and move watchdog commands emission to emit_bb_start. Request space of
combined start_watchdog, bb_start and stop_watchdog to avoid any error
after emitting bb_start.

v3: There were too many req->engine in emit_bb_start.
Use GEM_BUG_ON instead of returning a very late EINVAL in the remote
case of watchdog misprogramming; set correct LRI cmd size in
emit_stop_watchdog. (Chris)

v4: Rebase.
v5: use to_intel_context instead of ctx->engine.
v6: Rebase.
v7: Rebase,
    Store gpu watchdog capability in engine flag (Tvrtko)
    Store WATCHDOG_DISABLE magic # in engine (Tvrtko)
    No need to declare emit_{start|stop}_watchdog as vfuncs (Tvrtko)
    Replace flag watchdog_running with enable_watchdog (Tvrtko)
    Emit a single MI_NOOP by conditionally checking whether the #
    of emitted OPs is odd (Tvrtko)
v8: Rebase
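
One possible way to realise the "emit a single MI_NOOP when the number of
emitted OPs is odd" item from v7 is to derive the parity from the number of
dwords written since the start of the emission, rather than from the value
stored at the current write pointer. The sketch below is a standalone
illustration and not the code in this patch: pad_to_even_dwords(),
dwords_needed() and SKETCH_NOOP are hypothetical names. The dword counts
assume what the diff shows: bb_start writes 5 dwords once its trailing
MI_NOOP is removed, the watchdog start sequence writes 5 and the stop
sequence writes 3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder value standing in for MI_NOOP in this sketch. */
#define SKETCH_NOOP 0u

/*
 * Hypothetical helper: append one no-op if an odd number of dwords
 * has been written between start and cs, so the total stays even.
 */
static uint32_t *pad_to_even_dwords(const uint32_t *start, uint32_t *cs)
{
	if ((cs - start) & 1)
		*cs++ = SKETCH_NOOP;
	return cs;
}

/*
 * Hypothetical helper: number of dwords to reserve up front.
 * bb_start (5) plus, optionally, watchdog start (5) and stop (3),
 * rounded up to an even count to leave room for the padding no-op.
 */
static size_t dwords_needed(int enable_watchdog)
{
	size_t n = 5;

	if (enable_watchdog)
		n += 5 + 3;
	return (n + 1) & ~(size_t)1;
}
```

Under these assumptions the reservation is 6 dwords without the watchdog and
14 with it, and the padded write pointer always lands exactly at the end of
the reserved space.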

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/intel_context_types.h |  4 +
 drivers/gpu/drm/i915/intel_engine_cs.c     |  2 +
 drivers/gpu/drm/i915/intel_engine_types.h  | 17 ++++-
 drivers/gpu/drm/i915/intel_lrc.c           | 89 +++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_lrc.h           |  2 +
 5 files changed, 106 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_context_types.h b/drivers/gpu/drm/i915/intel_context_types.h
index 6dc9b4b9067b..e56fc263568e 100644
--- a/drivers/gpu/drm/i915/intel_context_types.h
+++ b/drivers/gpu/drm/i915/intel_context_types.h
@@ -51,6 +51,10 @@ struct intel_context {
 	u64 lrc_desc;
 
 	atomic_t pin_count;
+	/** watchdog_threshold: hw watchdog threshold value,
+	 * in clock counts
+	 */
+	u32 watchdog_threshold;
 	struct mutex pin_mutex; /* guards pinning and associated on-gpuing */
 
 	/**
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 88cf0fc07623..d4ea07b70904 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -324,6 +324,8 @@ intel_engine_setup(struct drm_i915_private *dev_priv,
 	if (engine->context_size)
 		DRIVER_CAPS(dev_priv)->has_logical_contexts = true;
 
+	engine->watchdog_disable_id = get_watchdog_disable(engine);
+
 	/* Nothing to do here, execute in order of dependencies */
 	engine->schedule = NULL;
 
diff --git a/drivers/gpu/drm/i915/intel_engine_types.h b/drivers/gpu/drm/i915/intel_engine_types.h
index c4f66b774e7c..1f99b536471d 100644
--- a/drivers/gpu/drm/i915/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/intel_engine_types.h
@@ -260,6 +260,7 @@ struct intel_engine_cs {
 	unsigned int guc_id;
 	intel_engine_mask_t mask;
 
+	u32 watchdog_disable_id;
 	u8 uabi_class;
 
 	u8 class;
@@ -422,10 +423,12 @@ struct intel_engine_cs {
 
 	struct intel_engine_hangcheck hangcheck;
 
-#define I915_ENGINE_NEEDS_CMD_PARSER BIT(0)
-#define I915_ENGINE_SUPPORTS_STATS   BIT(1)
-#define I915_ENGINE_HAS_PREEMPTION   BIT(2)
-#define I915_ENGINE_HAS_SEMAPHORES   BIT(3)
+#define I915_ENGINE_NEEDS_CMD_PARSER  BIT(0)
+#define I915_ENGINE_SUPPORTS_STATS    BIT(1)
+#define I915_ENGINE_HAS_PREEMPTION    BIT(2)
+#define I915_ENGINE_HAS_SEMAPHORES    BIT(3)
+#define I915_ENGINE_SUPPORTS_WATCHDOG BIT(4)
+
 	unsigned int flags;
 
 	/*
@@ -509,6 +512,12 @@ intel_engine_has_semaphores(const struct intel_engine_cs *engine)
 	return engine->flags & I915_ENGINE_HAS_SEMAPHORES;
 }
 
+static inline bool
+intel_engine_supports_watchdog(const struct intel_engine_cs *engine)
+{
+	return engine->flags & I915_ENGINE_SUPPORTS_WATCHDOG;
+}
+
 #define instdone_slice_mask(dev_priv__) \
 	(IS_GEN(dev_priv__, 7) ? \
 	 1 : RUNTIME_INFO(dev_priv__)->sseu.slice_mask)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 85785a94f6ae..78ea54a5dbc3 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2036,16 +2036,75 @@ static void execlists_reset_finish(struct intel_engine_cs *engine)
 		  atomic_read(&execlists->tasklet.count));
 }
 
+static u32 *gen8_emit_start_watchdog(struct i915_request *rq, u32 *cs)
+{
+	struct intel_engine_cs *engine = rq->engine;
+	struct i915_gem_context *ctx = rq->gem_context;
+	struct intel_context *ce = intel_context_lookup(ctx, engine);
+
+	GEM_BUG_ON(!intel_engine_supports_watchdog(engine));
+
+	/*
+	 * watchdog register must never be programmed to zero. This would
+	 * cause the watchdog counter to exceed and not allow the engine to
+	 * go into IDLE state
+	 */
+	GEM_BUG_ON(ce->watchdog_threshold == 0);
+
+	/* Set counter period */
+	*cs++ = MI_LOAD_REGISTER_IMM(2);
+	*cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base));
+	*cs++ = ce->watchdog_threshold;
+	/* Start counter */
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = GEN8_WATCHDOG_ENABLE;
+
+	return cs;
+}
+
+static u32 *gen8_emit_stop_watchdog(struct i915_request *rq, u32 *cs)
+{
+	struct intel_engine_cs *engine = rq->engine;
+
+	GEM_BUG_ON(!intel_engine_supports_watchdog(engine));
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = engine->watchdog_disable_id;
+
+	return cs;
+}
+
 static int gen8_emit_bb_start(struct i915_request *rq,
 			      u64 offset, u32 len,
 			      const unsigned int flags)
 {
+	struct intel_engine_cs *engine = rq->engine;
+	struct i915_gem_context *ctx = rq->gem_context;
+	struct intel_context *ce = intel_context_lookup(ctx, engine);
 	u32 *cs;
+	u32 num_dwords;
+	bool enable_watchdog = false;
 
-	cs = intel_ring_begin(rq, 6);
+	/* bb_start only */
+	num_dwords = 6;
+
+	/* check if watchdog will be required */
+	if (ce->watchdog_threshold != 0) {
+		/* + start_watchdog (6) + stop_watchdog (4) */
+		num_dwords += 10;
+		enable_watchdog = true;
+	}
+
+	cs = intel_ring_begin(rq, num_dwords);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
+	if (enable_watchdog) {
+		/* Start watchdog timer */
+		cs = gen8_emit_start_watchdog(rq, cs);
+	}
+
 	/*
 	 * WaDisableCtxRestoreArbitration:bdw,chv
 	 *
@@ -2072,10 +2131,16 @@ static int gen8_emit_bb_start(struct i915_request *rq,
 	*cs++ = upper_32_bits(offset);
 
 	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
-	*cs++ = MI_NOOP;
 
-	intel_ring_advance(rq, cs);
+	if (enable_watchdog) {
+		/* Cancel watchdog timer */
+		cs = gen8_emit_stop_watchdog(rq, cs);
+	}
+
+	if (*cs%2 != 0)
+		*cs++ = MI_NOOP;
 
+	intel_ring_advance(rq, cs);
 	return 0;
 }
 
@@ -2195,6 +2260,15 @@ static int gen8_emit_flush_render(struct i915_request *request,
 	return 0;
 }
 
+/* From GEN9 onwards, all engines use the same RING_CNTR format */
+u32 get_watchdog_disable(struct intel_engine_cs *engine)
+{
+	if (engine->id == RCS0 || INTEL_GEN(engine->i915) >= 9)
+		return GEN8_WATCHDOG_DISABLE;
+	else
+		return GEN8_XCS_WATCHDOG_DISABLE;
+}
+
 static void gen8_watchdog_tasklet(unsigned long data)
 {
 	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
@@ -2357,6 +2431,9 @@ void intel_execlists_set_default_submission(struct intel_engine_cs *engine)
 	engine->flags |= I915_ENGINE_SUPPORTS_STATS;
 	if (engine->preempt_context)
 		engine->flags |= I915_ENGINE_HAS_PREEMPTION;
+
+	if(engine->id != BCS0)
+		engine->flags |= I915_ENGINE_SUPPORTS_WATCHDOG;
 }
 
 static void
@@ -2531,6 +2608,9 @@ int logical_xcs_ring_init(struct intel_engine_cs *engine)
 	if (err)
 		return err;
 
+	/* BCS engine does not have a watchdog-expired irq */
+	GEM_BUG_ON(!intel_engine_supports_watchdog(engine));
+
 	return logical_ring_init(engine);
 }
 
@@ -2666,7 +2746,7 @@ u32 gen8_make_rpcs(struct drm_i915_private *i915, struct intel_sseu *req_sseu)
 	return rpcs;
 }
 
-static u32 intel_lr_indirect_ctx_offset(struct intel_engine_cs *engine)
+u32 intel_lr_indirect_ctx_offset(struct intel_engine_cs *engine)
 {
 	u32 indirect_ctx_offset;
 
@@ -2693,6 +2773,7 @@ static u32 intel_lr_indirect_ctx_offset(struct intel_engine_cs *engine)
 	}
 
 	return indirect_ctx_offset;
+
 }
 
 static void execlists_init_reg_state(u32 *regs,
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index f1aec8a6986f..4ce97fd5bb2e 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -114,4 +114,6 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine,
 
 u32 gen8_make_rpcs(struct drm_i915_private *i915, struct intel_sseu *ctx_sseu);
 
+u32 get_watchdog_disable(struct intel_engine_cs *engine);
+
 #endif /* _INTEL_LRC_H_ */
-- 
2.17.1


* [PATCH v5 4/5] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2019-03-22 23:41 [PATCH v5 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
                   ` (2 preceding siblings ...)
  2019-03-22 23:41 ` [PATCH v5 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
@ 2019-03-22 23:41 ` Carlos Santa
  2019-03-22 23:41 ` [PATCH v5 5/5] drm/i915: Watchdog timeout: Include threshold value in error state Carlos Santa
  2019-03-22 23:59 ` ✗ Fi.CI.BAT: failure for GEN8+ GPU Watchdog Reset Support Patchwork
  5 siblings, 0 replies; 14+ messages in thread
From: Carlos Santa @ 2019-03-22 23:41 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

Final enablement patch for GPU hang detection using watchdog timeout.
Using the gem_context_setparam ioctl, users can specify the desired
timeout value in microseconds, and the driver will do the conversion to
'timestamps'.
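
As a rough illustration of the interface described above, the sketch below
fills one threshold per engine class and applies the same "no timeout on the
blitter" rule the driver enforces. The struct layout is copied from the uAPI
addition in this patch; the class indices and the helper name
fill_watchdog_thresholds() are assumptions made for the example, and nothing
here issues the real ioctl.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Layout mirrors struct drm_i915_gem_watchdog_timeout from this patch. */
struct watchdog_timeout {
	union {
		struct {
			uint16_t engine_class;
			uint16_t engine_instance;
		};
		uint32_t index;
	};
	uint32_t timeout_us;
};

/* Hypothetical class numbering, chosen only for this example. */
enum {
	CLASS_RENDER,
	CLASS_COPY,
	CLASS_VIDEO,
	CLASS_VIDEO_ENHANCE,
	CLASS_COUNT
};

/*
 * Fill one entry per engine class from an array of timeouts in microseconds.
 * Returns 0 on success, or -EINVAL if a non-zero timeout is requested for the
 * copy (blitter) class, which has no watchdog.
 */
int fill_watchdog_thresholds(struct watchdog_timeout out[CLASS_COUNT],
			     const uint32_t timeout_us[CLASS_COUNT])
{
	int i;

	if (timeout_us[CLASS_COPY] > 0)
		return -EINVAL;

	memset(out, 0, CLASS_COUNT * sizeof(out[0]));
	for (i = 0; i < CLASS_COUNT; i++) {
		out[i].engine_class = (uint16_t)i;
		out[i].timeout_us = timeout_us[i];
	}

	return 0;
}
```

A caller would pass the filled array as the parameter value and its size in
bytes as the parameter size; a size of 0 is the shortcut that disables the
watchdog on every engine.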

The recommended default watchdog threshold for video engines is 60000 us,
since this has been _empirically determined_ to be a good compromise for
low-latency requirements and low rate of false positives. The default
register value is ~106000us and the theoretical max value (all 1s) is
353 seconds.
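
The microsecond-to-counts conversion described above can be sketched in a few
lines of standalone C. This is an illustration rather than the driver code:
the helper and macro names are made up for the example, and the
counts-per-microsecond values are the truncated integers used in this patch
(12 for BDW/CHV/SKL+, 19 for BXT/GLK).

```c
#include <errno.h>
#include <stdint.h>

#define CNTS_PER_USEC_GEN8    12u /* 12.5 MHz timestamp clock, truncated */
#define CNTS_PER_USEC_GEN9_LP 19u /* 19.2 MHz timestamp clock, truncated */

/*
 * Convert a timeout in microseconds to timestamp clock counts.
 * The multiplication is done in 64 bits so that a result too large for the
 * 32-bit threshold is detected instead of silently wrapping.
 * Returns 0 on success, -E2BIG if the result does not fit in 32 bits.
 */
int us_to_clock_counts(uint32_t cnts_per_usec, uint32_t us, uint32_t *counts)
{
	uint64_t wide = (uint64_t)us * cnts_per_usec;

	if (wide > UINT32_MAX)
		return -E2BIG;

	*counts = (uint32_t)wide;
	return 0;
}

/* Convert clock counts back to microseconds (integer division, rounds down). */
uint32_t clock_counts_to_us(uint32_t cnts_per_usec, uint32_t counts)
{
	return counts / cnts_per_usec;
}
```

With 12 counts per microsecond, the recommended 60000 us becomes 720000
counts, and the largest timeout that still fits in a 32-bit threshold is
357913941 us (roughly 358 seconds at this truncated rate).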

v2: Fixed get api to return values in microseconds. Threshold updated to
be per context engine. Check for u32 overflow. Capture ctx threshold
value in error state.

v3: Add a way to get array size, short-cut to disable all thresholds,
return EFAULT / EINVAL as needed. Move the capture of the threshold
value in the error state into a new patch. BXT has a different
timestamp base (because why not?).

v4: Checking if watchdog is available should be the first thing to
do, instead of giving false hopes to abi users; remove unnecessary & in
set_watchdog; ignore args->size in getparam.

v5: GEN9-LP platforms have a different crystal clock frequency, use the
right timestamp base for them (magic 8-ball predicts this will change
again later on, so future-proof it). (Daniele)

v6: Rebase, no more mutex BLK in getparam_ioctl.

v7: use to_intel_context instead of ctx->engine.

v8: Rebase, remove extra mutex from i915_gem_context_set_watchdog (Tvrtko),
Update UAPI to use engine class while keeping thresholds per
engine class (Michel).

v9: Rebase,
    Remove outdated comment from the commit message (Tvrtko)
    Use the engine->flag to verify for gpu watchdog support (Tvrtko)
    Use the standard copy_to_user() instead (Tvrtko)
    Use the correct type when declaring engine class iterator (Tvrtko)
    Remove yet another unnecessary mutex_lock (Tvrtko)

v10: Rebase,
    Document uAPI struct drm_i915_watchdog_timeout and use it (Tvrtko)
    Let the compiler take care of inlines (Tvrtko)
    Make watchdog_to_clock_counts more robust (Tvrtko)

Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         |   3 +
 drivers/gpu/drm/i915/i915_gem_context.c | 150 ++++++++++++++++++++++++
 include/uapi/drm/i915_drm.h             |  17 +++
 3 files changed, 170 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index c65c2e6649df..5324397c3801 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1598,6 +1598,9 @@ struct drm_i915_private {
 	struct drm_i915_fence_reg fence_regs[I915_MAX_NUM_FENCES]; /* assume 965 */
 	int num_fence_regs; /* 8 on pre-965, 16 otherwise */
 
+	/* Command stream timestamp base - helps define watchdog threshold */
+	u32 cs_timestamp_base;
+
 	unsigned int fsb_freq, mem_freq, is_ddr3;
 	unsigned int skl_preferred_vco_freq;
 	unsigned int max_cdclk_freq;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 9625b5f7faf7..cfd33ca5c13f 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -878,6 +878,149 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 	return 0;
 }
 
+/*
+ * BDW, CHV & SKL+ Timestamp timer resolution = 0.080 uSec,
+ * or 12500000 counts per second, or ~12 counts per microsecond.
+ *
+ * But BXT/GLK Timestamp timer resolution is different, 0.052 uSec,
+ * or 19200000 counts per second, or ~19 counts per microsecond.
+ *
+ * Future-proofing, some day it won't be as simple as just GEN & IS_LP.
+ */
+#define GEN8_TIMESTAMP_CNTS_PER_USEC 12
+#define GEN9_LP_TIMESTAMP_CNTS_PER_USEC 19
+u32 cs_timestamp_in_us(struct drm_i915_private *i915)
+{
+	u32 cs_timestamp_base = i915->cs_timestamp_base;
+
+	if (cs_timestamp_base)
+		return cs_timestamp_base;
+
+	switch (INTEL_GEN(i915)) {
+	default:
+		MISSING_CASE(INTEL_GEN(i915));
+		/* fall through */
+	case 9:
+		cs_timestamp_base = IS_GEN9_LP(i915) ?
+					GEN9_LP_TIMESTAMP_CNTS_PER_USEC :
+					GEN8_TIMESTAMP_CNTS_PER_USEC;
+		break;
+	case 8:
+		cs_timestamp_base = GEN8_TIMESTAMP_CNTS_PER_USEC;
+		break;
+	}
+
+	i915->cs_timestamp_base = cs_timestamp_base;
+	return cs_timestamp_base;
+}
+
+u32 watchdog_to_us(struct drm_i915_private *i915, u32 value_in_clock_counts)
+{
+	return value_in_clock_counts / cs_timestamp_in_us(i915);
+}
+
+int watchdog_to_clock_counts(struct drm_i915_private *i915, u32 *value_in_us)
+{
+	u64 threshold = (u64)*value_in_us * cs_timestamp_in_us(i915);
+	int err = 0;
+
+	if (overflows_type(threshold, u32))
+		return -E2BIG;
+
+	*value_in_us = threshold;
+
+	return err;
+}
+
+/* On success copies to userspace the threshold value for the
+ * watchdog timer calculated in terms of clock_counts / timestamp (us)
+ */
+int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *i915 = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	struct drm_i915_gem_watchdog_timeout threshold_in_us[OTHER_CLASS];
+
+	for_each_engine(engine, i915, id) {
+		/* not supported in blitter engine */
+		if (id != BCS0 && !intel_engine_supports_watchdog(i915->engine[id]))
+			return -ENODEV;
+	}
+
+	for_each_engine(engine, i915, id) {
+		struct intel_context *ce = intel_context_lookup(ctx, engine);
+
+		threshold_in_us[engine->class].timeout_us = watchdog_to_us(i915,
+								ce->watchdog_threshold);
+	}
+
+	if (copy_to_user(u64_to_user_ptr(args->value),
+			   &threshold_in_us,
+			   sizeof(threshold_in_us))) {
+		return -EFAULT;
+	}
+
+	args->size = sizeof(threshold_in_us);
+
+	return 0;
+}
+
+/*
+ * Based on time out value in microseconds (us) calculate
+ * timer count thresholds needed based on core frequency.
+ * Watchdog can be disabled by setting it to 0.
+ */
+int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *i915 = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	int i, err = 0;
+	struct drm_i915_gem_watchdog_timeout threshold[OTHER_CLASS];
+
+	for_each_engine(engine, i915, id) {
+		if (id != BCS0 && !intel_engine_supports_watchdog(i915->engine[id]))
+			return -ENODEV;
+	}
+
+	memset(threshold, 0, sizeof(threshold));
+
+	/* shortcut to disable in all engines */
+	if (args->size == 0)
+		goto set_watchdog;
+
+	if (args->size < sizeof(threshold))
+		return -EFAULT;
+
+	if (copy_from_user(threshold,
+			   u64_to_user_ptr(args->value),
+			   sizeof(threshold))) {
+		return -EFAULT;
+	}
+
+	/* not supported in blitter engine */
+	if (threshold[COPY_ENGINE_CLASS].timeout_us > 0)
+		return -EINVAL;
+
+	for (i = RENDER_CLASS; i < OTHER_CLASS; i++) {
+		err = watchdog_to_clock_counts(i915, &threshold[i].timeout_us);
+		if (err)
+			return -EINVAL;
+	}
+
+set_watchdog:
+	for_each_engine(engine, i915, id) {
+		struct intel_context *ce = intel_context_lookup(ctx, engine);
+
+		ce->watchdog_threshold = threshold[engine->class].timeout_us;
+	}
+
+	return 0;
+}
+
 static int get_sseu(struct i915_gem_context *ctx,
 		    struct drm_i915_gem_context_param *args)
 {
@@ -970,6 +1113,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 		args->size = 0;
 		args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
 		break;
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_get_watchdog(ctx, args);
+		break;
+
 	case I915_CONTEXT_PARAM_SSEU:
 		ret = get_sseu(ctx, args);
 		break;
@@ -1335,6 +1482,9 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_SSEU:
 		ret = set_sseu(ctx, args);
 		break;
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_set_watchdog(ctx, args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 5e7bc6412880..3b1bfb9996ea 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1440,6 +1440,7 @@ struct drm_i915_reg_read {
 	 */
 	__u64 offset;
 #define I915_REG_READ_8B_WA (1ul << 0)
+#define I915_CONTEXT_PARAM_WATCHDOG	0x10
 
 	__u64 val; /* Return value */
 };
@@ -1472,6 +1473,22 @@ struct drm_i915_reset_stats {
 
 };
 
+struct drm_i915_gem_watchdog_timeout {
+	union {
+		struct {
+			/*
+			 * Engine class & instance to be configured or queried.
+			 */
+			__u16 engine_class;
+			__u16 engine_instance;
+		};
+		/* Index based addressing mode */
+		__u32 index;
+	};
+	/* GPU Engine watchdog reset timeout in us */
+	__u32 timeout_us;
+};
+
 struct drm_i915_gem_userptr {
 	__u64 user_ptr;
 	__u64 user_size;
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v5 5/5] drm/i915: Watchdog timeout: Include threshold value in error state
  2019-03-22 23:41 [PATCH v5 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
                   ` (3 preceding siblings ...)
  2019-03-22 23:41 ` [PATCH v5 4/5] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Carlos Santa
@ 2019-03-22 23:41 ` Carlos Santa
  2019-03-22 23:59 ` ✗ Fi.CI.BAT: failure for GEN8+ GPU Watchdog Reset Support Patchwork
  5 siblings, 0 replies; 14+ messages in thread
From: Carlos Santa @ 2019-03-22 23:41 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

Save the watchdog threshold (in us) as part of the engine state.

v2: Only do it for gen8+ (and prevent a missing-case warn).
v3: use ctx->__engine.
v4: Rebase.
v5: Rebase.
v6: Rebase, use intel_context_lookup()

Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h       |  2 ++
 drivers/gpu/drm/i915/i915_gpu_error.c | 14 ++++++++++----
 drivers/gpu/drm/i915/i915_gpu_error.h |  1 +
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 5324397c3801..5dbb3938e159 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3118,6 +3118,8 @@ i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id)
 	return ctx;
 }
 
+u32 watchdog_to_us(struct drm_i915_private *i915, u32 value_in_clock_counts);
+
 int i915_perf_open_ioctl(struct drm_device *dev, void *data,
 			 struct drm_file *file);
 int i915_perf_add_config_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 26bac517e383..1f8d29bf00d0 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -454,9 +454,11 @@ static void error_print_context(struct drm_i915_error_state_buf *m,
 				const char *header,
 				const struct drm_i915_error_context *ctx)
 {
-	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, guilty %d active %d\n",
+	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, guilty %d active %d, watchdog %dus\n",
 		   header, ctx->comm, ctx->pid, ctx->handle, ctx->hw_id,
-		   ctx->sched_attr.priority, ctx->guilty, ctx->active);
+		   ctx->sched_attr.priority, ctx->guilty, ctx->active,
+		   INTEL_GEN(m->i915) >= 8 ?
+			watchdog_to_us(m->i915, ctx->watchdog_threshold) : 0);
 }
 
 static void error_print_engine(struct drm_i915_error_state_buf *m,
@@ -1316,8 +1318,11 @@ static void error_record_engine_execlists(struct intel_engine_cs *engine,
 }
 
 static void record_context(struct drm_i915_error_context *e,
-			   struct i915_gem_context *ctx)
+			   struct i915_gem_context *ctx,
+			   u32 engine_id)
 {
+	struct drm_i915_private *dev_priv = ctx->i915;
+
 	if (ctx->pid) {
 		struct task_struct *task;
 
@@ -1335,6 +1340,7 @@ static void record_context(struct drm_i915_error_context *e,
 	e->sched_attr = ctx->sched;
 	e->guilty = atomic_read(&ctx->guilty_count);
 	e->active = atomic_read(&ctx->active_count);
+	e->watchdog_threshold = intel_context_lookup(ctx, dev_priv->engine[engine_id])->watchdog_threshold;
 }
 
 static void request_record_user_bo(struct i915_request *request,
@@ -1418,7 +1424,7 @@ static void gem_record_rings(struct i915_gpu_state *error)
 
 			ee->vm = ctx->ppgtt ? &ctx->ppgtt->vm : &ggtt->vm;
 
-			record_context(&ee->context, ctx);
+			record_context(&ee->context, ctx, engine->id);
 
 			/* We need to copy these to an anonymous buffer
 			 * as the simplest method to avoid being overwritten
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index 6cf6a8679b26..439a31f5db3b 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -120,6 +120,7 @@ struct i915_gpu_state {
 			u32 hw_id;
 			int active;
 			int guilty;
+			int watchdog_threshold;
 			struct i915_sched_attr sched_attr;
 		} context;
 
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* ✗ Fi.CI.BAT: failure for GEN8+ GPU Watchdog Reset Support
  2019-03-22 23:41 [PATCH v5 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
                   ` (4 preceding siblings ...)
  2019-03-22 23:41 ` [PATCH v5 5/5] drm/i915: Watchdog timeout: Include threshold value in error state Carlos Santa
@ 2019-03-22 23:59 ` Patchwork
  5 siblings, 0 replies; 14+ messages in thread
From: Patchwork @ 2019-03-22 23:59 UTC (permalink / raw)
  To: Carlos Santa; +Cc: intel-gfx

== Series Details ==

Series: GEN8+ GPU Watchdog Reset Support
URL   : https://patchwork.freedesktop.org/series/58443/
State : failure

== Summary ==

Applying: drm/i915: Add engine reset count in get-reset-stats ioctl
Applying: drm/i915: Watchdog timeout: IRQ handler for gen8+
Applying: drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
Applying: drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
error: sha1 information is lacking or useless (drivers/gpu/drm/i915/i915_gem_context.c).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch' to see the failed patch
Patch failed at 0004 drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-03-22 23:41 ` [PATCH v5 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
@ 2019-03-25 10:00   ` Tvrtko Ursulin
  2019-03-27  1:58     ` Carlos Santa
  0 siblings, 1 reply; 14+ messages in thread
From: Tvrtko Ursulin @ 2019-03-25 10:00 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx


On 22/03/2019 23:41, Carlos Santa wrote:
> From: Michel Thierry <michel.thierry@intel.com>
> 
> *** General ***
> 
> Watchdog timeout (or "media engine reset") is a feature that allows
> userland applications to enable hang detection on individual batch buffers.
> The detection mechanism itself is mostly bound to the hardware and the only
> thing that the driver needs to do to support this form of hang detection
> is to implement the interrupt handling support as well as watchdog command
> emission before and after the emitted batch buffer start instruction in the
> ring buffer.
> 
> The principle of the hang detection mechanism is as follows:
> 
> 1. Once the decision has been made to enable watchdog timeout for a
> particular batch buffer and the driver is in the process of emitting the
> batch buffer start instruction into the ring buffer it also emits a
> watchdog timer start instruction before and a watchdog timer cancellation
> instruction after the batch buffer start instruction in the ring buffer.
> 
> 2. Once the GPU execution reaches the watchdog timer start instruction
> the hardware watchdog counter is started by the hardware. The counter
> keeps counting until either reaching a previously configured threshold
> value or the timer cancellation instruction is executed.
> 
> 2a. If the counter reaches the threshold value the hardware fires a
> watchdog interrupt that is picked up by the watchdog interrupt handler.
> This means that a hang has been detected and the driver needs to deal with
> it the same way it would deal with an engine hang detected by the periodic
> hang checker. The only difference between the two is that we already blamed
> the active request (to ensure an engine reset).
> 
> 2b. If the batch buffer completes and the execution reaches the watchdog
> cancellation instruction before the watchdog counter reaches its
> threshold value the watchdog is cancelled and nothing more comes of it.
> No hang is detected.
> 
> Note about future interaction with preemption: Preemption could happen
> in a command sequence prior to watchdog counter getting disabled,
> resulting in watchdog being triggered following preemption (e.g. when
> watchdog had been enabled in the low priority batch). The driver will
> need to explicitly disable the watchdog counter as part of the
> preemption sequence.
> 
> *** This patch introduces: ***
> 
> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
> based on hardware-driven hang detection, which then integrates directly
> with the hang recovery path. This is independent of having per-engine reset
> or just full gpu reset.
> 
> 2. Watchdog specific register information.
> 
> Currently the render engine and all available media engines support
> watchdog timeout (VECS is only supported in GEN9). The specifications allude
> to the BCS engine being supported but that is currently not supported by
> this commit.
> 
> Note that the value to stop the counter is different between render and
> non-render engines in GEN8; GEN9 onwards it's the same.
> 
> v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
> against false-positives.
> 
> v3: Don't use high priority tasklet, use engine_last_submit while
> checking for false-positives. From GEN9 onwards, the stop counter bit is
> the same for all engines.
> 
> v4: Remove unnecessary brackets, use current_seqno to mark the request
> as guilty in the hangcheck/capture code.
> 
> v5: Rebased after RESET_ENGINEs flag.
> 
> v6: Don't capture error state in case of watchdog timeout. The capture
> process is time consuming and this will align to what happens when we
> use GuC to handle the watchdog timeout. (Chris)
> 
> v7: Rebase.
> 
> v8: Rebase, use HZ to reschedule.
> 
> v9: Rebase, get forcewake domains in function (no longer in execlists
> struct).
> 
> v10: Rebase.
> 
> v11: Rebase,
>       remove extra braces (Tvrtko),
>       implement watchdog_to_clock_counts helper (Tvrtko),
>       Move tasklet_kill(watchdog_tasklet) inside intel_engines (Tvrtko),
>       Use a global heartbeat seqno instead of engine seqno (Chris)
>       Make all engines checks all class based checks (Tvrtko)
> 
> v12: Rebase,
>       Reset immediately upon entering the IRQ (Chris)
>       Make reset_engine_to_str a helper (Tvrtko)
>       Rename watchdog_irq_handler as watchdog_tasklet (Tvrtko)
>       Let the compiler itself do the inline (Tvrtko)
> 
> v13: Rebase
> v14: Rebase, skip checking for the guilty seqno in the tasklet (Tvrtko)

IIRC I only asked about it so I guess you tried it and it works well?

Can you also post the IGTs so we can see the test coverage?

Regards,

Tvrtko

> 
> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gpu_error.h     |  4 ++
>   drivers/gpu/drm/i915/i915_irq.c           | 14 ++++--
>   drivers/gpu/drm/i915/i915_reg.h           |  6 +++
>   drivers/gpu/drm/i915/i915_reset.c         | 20 +++++++++
>   drivers/gpu/drm/i915/i915_reset.h         |  6 +++
>   drivers/gpu/drm/i915/intel_engine_cs.c    |  1 +
>   drivers/gpu/drm/i915/intel_engine_types.h |  5 +++
>   drivers/gpu/drm/i915/intel_hangcheck.c    | 11 +----
>   drivers/gpu/drm/i915/intel_lrc.c          | 52 +++++++++++++++++++++++
>   9 files changed, 107 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
> index 99d6b7b270c2..6cf6a8679b26 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> @@ -203,6 +203,9 @@ struct i915_gpu_error {
>   	 * any global resources that may be clobber by the reset (such as
>   	 * FENCE registers).
>   	 *
> +	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
> +	 * I915_RESET_WATCHDOG to report the hang detection cause accurately.
> +	 *
>   	 * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't need to
>   	 * acquire the struct_mutex to reset an engine, we need an explicit
>   	 * flag to prevent two concurrent reset attempts in the same engine.
> @@ -218,6 +221,7 @@ struct i915_gpu_error {
>   #define I915_RESET_BACKOFF	0
>   #define I915_RESET_MODESET	1
>   #define I915_RESET_ENGINE	2
> +#define I915_RESET_WATCHDOG	3
>   #define I915_WEDGED		(BITS_PER_LONG - 1)
>   
>   	/** Number of times the device has been reset (global) */
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 82d487189a34..e64994be25c3 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1466,6 +1466,9 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
>   
>   	if (tasklet)
>   		tasklet_hi_schedule(&engine->execlists.tasklet);
> +
> +	if (iir & GT_GEN8_WATCHDOG_INTERRUPT)
> +		tasklet_schedule(&engine->execlists.watchdog_tasklet);
>   }
>   
>   static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> @@ -3892,20 +3895,25 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
>   	u32 gt_interrupts[] = {
>   		(GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
>   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> +		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
>   		 GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
>   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT),
> -
>   		(GT_RENDER_USER_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
>   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
> +		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
>   		 GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> -		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT),
> -
> +		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> +		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT),
>   		0,
>   
>   		(GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
>   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT)
>   	};
>   
> +	/* VECS watchdog is only available in skl+ */
> +	if (INTEL_GEN(dev_priv) >= 9)
> +		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
> +
>   	dev_priv->pm_ier = 0x0;
>   	dev_priv->pm_imr = ~dev_priv->pm_ier;
>   	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 9b69cec21f7b..ac8d984e16ae 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -2363,6 +2363,11 @@ enum i915_power_well_id {
>   #define RING_START(base)	_MMIO((base) + 0x38)
>   #define RING_CTL(base)		_MMIO((base) + 0x3c)
>   #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
> +#define RING_CNTR(base)		_MMIO((base) + 0x178)
> +#define GEN8_WATCHDOG_ENABLE		0
> +#define GEN8_WATCHDOG_DISABLE		1
> +#define GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 & non-render only */
> +#define RING_THRESH(base)	_MMIO((base) + 0x17C)
>   #define RING_SYNC_0(base)	_MMIO((base) + 0x40)
>   #define RING_SYNC_1(base)	_MMIO((base) + 0x44)
>   #define RING_SYNC_2(base)	_MMIO((base) + 0x48)
> @@ -2925,6 +2930,7 @@ enum i915_power_well_id {
>   #define GT_BSD_USER_INTERRUPT			(1 << 12)
>   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
>   #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
> +#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
>   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
>   #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
>   #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
> diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
> index 861fe083e383..739fa5ad1a8d 100644
> --- a/drivers/gpu/drm/i915/i915_reset.c
> +++ b/drivers/gpu/drm/i915/i915_reset.c
> @@ -1208,6 +1208,26 @@ void i915_clear_error_registers(struct drm_i915_private *dev_priv)
>   	}
>   }
>   
> +void engine_reset_error_to_str(struct drm_i915_private *i915,
> +	           char *msg,
> +	           size_t sz,
> +	           unsigned int hung,
> +	           unsigned int stuck,
> +	           unsigned int watchdog)
> +{
> +	int len;
> +	unsigned int tmp;
> +	struct intel_engine_cs *engine;
> +
> +	len = scnprintf(msg, sz,
> +			"%s on ", watchdog ? "watchdog timeout" :
> +				stuck == hung ? "no_progress" : "hang");
> +	for_each_engine_masked(engine, i915, hung, tmp)
> +		len += scnprintf(msg + len, sz - len,
> +				"%s, ", engine->name);
> +	msg[len-2] = '\0';
> +}
> +
>   /**
>    * i915_handle_error - handle a gpu error
>    * @i915: i915 device private
> diff --git a/drivers/gpu/drm/i915/i915_reset.h b/drivers/gpu/drm/i915/i915_reset.h
> index 16f2389f656f..8582d1242248 100644
> --- a/drivers/gpu/drm/i915/i915_reset.h
> +++ b/drivers/gpu/drm/i915/i915_reset.h
> @@ -20,6 +20,12 @@ void i915_handle_error(struct drm_i915_private *i915,
>   		       u32 engine_mask,
>   		       unsigned long flags,
>   		       const char *fmt, ...);
> +void engine_reset_error_to_str(struct drm_i915_private *i915,
> +               char *str,
> +               size_t sz,
> +               unsigned int hung,
> +               unsigned int stuck,
> +               unsigned int watchdog);
>   #define I915_ERROR_CAPTURE BIT(0)
>   
>   void i915_clear_error_registers(struct drm_i915_private *i915);
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index 652c1b3ba190..88cf0fc07623 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -1149,6 +1149,7 @@ void intel_engines_park(struct drm_i915_private *i915)
>   		/* Flush the residual irq tasklets first. */
>   		intel_engine_disarm_breadcrumbs(engine);
>   		tasklet_kill(&engine->execlists.tasklet);
> +		tasklet_kill(&engine->execlists.watchdog_tasklet);
>   
>   		/*
>   		 * We are committed now to parking the engines, make sure there
> diff --git a/drivers/gpu/drm/i915/intel_engine_types.h b/drivers/gpu/drm/i915/intel_engine_types.h
> index b0aa1f0d4e47..c4f66b774e7c 100644
> --- a/drivers/gpu/drm/i915/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/intel_engine_types.h
> @@ -124,6 +124,11 @@ struct intel_engine_execlists {
>   	 */
>   	struct tasklet_struct tasklet;
>   
> +	/*
> +	 * @watchdog_tasklet: stop counter and reschedule hangcheck_work asap
> +	 */
> +	struct tasklet_struct watchdog_tasklet;
> +
>   	/**
>   	 * @default_priolist: priority list for I915_PRIORITY_NORMAL
>   	 */
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index 57ed49dc19c4..4bf26863678c 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -220,22 +220,15 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
>   				   unsigned int hung,
>   				   unsigned int stuck)
>   {
> -	struct intel_engine_cs *engine;
>   	char msg[80];
> -	unsigned int tmp;
> -	int len;
> +	size_t len = sizeof(msg);
>   
>   	/* If some rings hung but others were still busy, only
>   	 * blame the hanging rings in the synopsis.
>   	 */
>   	if (stuck != hung)
>   		hung &= ~stuck;
> -	len = scnprintf(msg, sizeof(msg),
> -			"%s on ", stuck == hung ? "no progress" : "hang");
> -	for_each_engine_masked(engine, i915, hung, tmp)
> -		len += scnprintf(msg + len, sizeof(msg) - len,
> -				 "%s, ", engine->name);
> -	msg[len-2] = '\0';
> +	engine_reset_error_to_str(i915, msg, len, hung, stuck, 0);
>   
>   	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
>   }
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index e54e0064b2d6..85785a94f6ae 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -2195,6 +2195,40 @@ static int gen8_emit_flush_render(struct i915_request *request,
>   	return 0;
>   }
>   
> +static void gen8_watchdog_tasklet(unsigned long data)
> +{
> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> +	struct drm_i915_private *dev_priv = engine->i915;
> +	enum forcewake_domains fw_domains;
> +	char msg[80];
> +	size_t len = sizeof(msg);
> +	unsigned long *lock = &engine->i915->gpu_error.flags;
> +	unsigned int bit = I915_RESET_ENGINE + engine->id;
> +
> +	switch (engine->class) {
> +	default:
> +		MISSING_CASE(engine->id);
> +		/* fall through */
> +	case RENDER_CLASS:
> +		fw_domains = FORCEWAKE_RENDER;
> +		break;
> +	case VIDEO_DECODE_CLASS:
> +	case VIDEO_ENHANCEMENT_CLASS:
> +		fw_domains = FORCEWAKE_MEDIA;
> +		break;
> +	}
> +
> +	intel_uncore_forcewake_get(dev_priv, fw_domains);
> +
> +	if (!test_and_set_bit(bit, lock)) {
> +		unsigned int hung = engine->mask;
> +		engine_reset_error_to_str(dev_priv, msg, len, hung, 0, 1);
> +		i915_reset_engine(engine, msg);
> +		clear_bit(bit, lock);
> +		wake_up_bit(lock, bit);
> +	}
> +}
> +
>   /*
>    * Reserve space for 2 NOOPs at the end of each request to be
>    * used as a workaround for not being allowed to do lite
> @@ -2377,6 +2411,21 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
>   
>   	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
>   	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
> +
> +	switch (engine->class) {
> +	default:
> +		/* BCS engine does not support hw watchdog */
> +		break;
> +	case RENDER_CLASS:
> +	case VIDEO_DECODE_CLASS:
> +		engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT << shift;
> +		break;
> +	case VIDEO_ENHANCEMENT_CLASS:
> +		if (INTEL_GEN(engine->i915) >= 9)
> +			engine->irq_keep_mask |=
> +				GT_GEN8_WATCHDOG_INTERRUPT << shift;
> +		break;
> +	}
>   }
>   
>   static int
> @@ -2394,6 +2443,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
>   	tasklet_init(&engine->execlists.tasklet,
>   		     execlists_submission_tasklet, (unsigned long)engine);
>   
> +	tasklet_init(&engine->execlists.watchdog_tasklet,
> +		     gen8_watchdog_tasklet, (unsigned long)engine);
> +
>   	logical_ring_default_vfuncs(engine);
>   	logical_ring_default_irqs(engine);
>   
> 
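
For illustration, the reset guard in gen8_watchdog_tasklet() above is a
try-lock on a per-engine bit: only the caller that flips the bit from 0
to 1 performs the reset. A minimal userspace sketch, assuming C11 atomics
in place of the kernel's test_and_set_bit()/clear_bit(); reset_flags,
reset_try_begin() and reset_end() are hypothetical names, not i915
functions:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for gpu_error.flags: one bit per engine. */
static atomic_ulong reset_flags;

/*
 * Atomically set the bit and report whether this caller was the one
 * that set it. Returns false if a reset is already in flight.
 */
static bool reset_try_begin(unsigned int bit)
{
	unsigned long mask = 1UL << bit;
	unsigned long old = atomic_fetch_or(&reset_flags, mask);

	return !(old & mask);
}

/* Clear the bit so that a later reset attempt can proceed. */
static void reset_end(unsigned int bit)
{
	atomic_fetch_and(&reset_flags, ~(1UL << bit));
}
```

The sketch leaves out the wake_up_bit() step, which in the patch wakes
anyone sleeping on the bit once the reset has finished.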
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-03-25 10:00   ` Tvrtko Ursulin
@ 2019-03-27  1:58     ` Carlos Santa
  2019-03-27 10:40       ` Tvrtko Ursulin
  0 siblings, 1 reply; 14+ messages in thread
From: Carlos Santa @ 2019-03-27  1:58 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

On Mon, 2019-03-25 at 10:00 +0000, Tvrtko Ursulin wrote:
> On 22/03/2019 23:41, Carlos Santa wrote:
> > From: Michel Thierry <michel.thierry@intel.com>
> > 
> > *** General ***
> > 
> > Watchdog timeout (or "media engine reset") is a feature that allows
> > userland applications to enable hang detection on individual batch
> > buffers. The detection mechanism itself is mostly bound to the
> > hardware and the only thing that the driver needs to do to support
> > this form of hang detection is to implement the interrupt handling
> > support as well as watchdog command emission before and after the
> > emitted batch buffer start instruction in the ring buffer.
> > 
> > The principle of the hang detection mechanism is as follows:
> > 
> > 1. Once the decision has been made to enable watchdog timeout for a
> > particular batch buffer and the driver is in the process of emitting
> > the batch buffer start instruction into the ring buffer it also emits
> > a watchdog timer start instruction before and a watchdog timer
> > cancellation instruction after the batch buffer start instruction in
> > the ring buffer.
> > 
> > 2. Once the GPU execution reaches the watchdog timer start
> > instruction the hardware watchdog counter is started by the hardware.
> > The counter keeps counting until either reaching a previously
> > configured threshold value or the timer cancellation instruction is
> > executed.
> > 
> > 2a. If the counter reaches the threshold value the hardware fires a
> > watchdog interrupt that is picked up by the watchdog interrupt
> > handler. This means that a hang has been detected and the driver
> > needs to deal with it the same way it would deal with an engine hang
> > detected by the periodic hang checker. The only difference between
> > the two is that we already blamed the active request (to ensure an
> > engine reset).
> > 
> > 2b. If the batch buffer completes and the execution reaches the
> > watchdog cancellation instruction before the watchdog counter reaches
> > its threshold value the watchdog is cancelled and nothing more comes
> > of it. No hang is detected.
> > 
> > Note about future interaction with preemption: Preemption could
> > happen in a command sequence prior to watchdog counter getting
> > disabled, resulting in watchdog being triggered following preemption
> > (e.g. when watchdog had been enabled in the low priority batch). The
> > driver will need to explicitly disable the watchdog counter as part
> > of the preemption sequence.
> > 
> > *** This patch introduces: ***
> > 
> > 1. IRQ handler code for watchdog timeout allowing direct hang
> > recovery based on hardware-driven hang detection, which then
> > integrates directly with the hang recovery path. This is independent
> > of having per-engine reset or just full gpu reset.
> > 
> > 2. Watchdog specific register information.
> > 
> > Currently the render engine and all available media engines support
> > watchdog timeout (VECS is only supported from GEN9 onwards). The
> > specifications allude to the BCS engine being supported but that is
> > currently not supported by this commit.
> > 
> > Note that the value to stop the counter is different between render
> > and non-render engines in GEN8; GEN9 onwards it's the same.
> > 
> > v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
> > against false-positives.
> > 
> > v3: Don't use high priority tasklet, use engine_last_submit while
> > checking for false-positives. From GEN9 onwards, the stop counter bit
> > is the same for all engines.
> > 
> > v4: Remove unnecessary brackets, use current_seqno to mark the request
> > as guilty in the hangcheck/capture code.
> > 
> > v5: Rebased after RESET_ENGINEs flag.
> > 
> > v6: Don't capture error state in case of watchdog timeout. The capture
> > process is time consuming and this will align to what happens when we
> > use GuC to handle the watchdog timeout. (Chris)
> > 
> > v7: Rebase.
> > 
> > v8: Rebase, use HZ to reschedule.
> > 
> > v9: Rebase, get forcewake domains in function (no longer in execlists
> > struct).
> > 
> > v10: Rebase.
> > 
> > v11: Rebase,
> >       remove extra braces (Tvrtko),
> >       implement watchdog_to_clock_counts helper (Tvrtko),
> >       Move tasklet_kill(watchdog_tasklet) inside intel_engines (Tvrtko),
> >       Use a global heartbeat seqno instead of engine seqno (Chris)
> >       Make all engines checks all class based checks (Tvrtko)
> > 
> > v12: Rebase,
> >       Reset immediately upon entering the IRQ (Chris)
> >       Make reset_engine_to_str a helper (Tvrtko)
> >       Rename watchdog_irq_handler as watchdog_tasklet (Tvrtko)
> >       Let the compiler itself do the inline (Tvrtko)
> > 
> > v13: Rebase
> > v14: Rebase, skip checking for the guilty seqno in the tasklet (Tvrtko)
> 
> IIRC I only asked about it so I guess you tried it and it works well?
> 
> Can you also post the IGTs so we can see the test coverage?
> 
> Regards,
> 
> Tvrtko

Yeah, the unit test works but it's still the very simple one (updated
for the uAPI mods) I started with last August, I still need to add more
to it though...


https://lists.freedesktop.org/archives/igt-dev/2018-September/005834.html

Right now, I am on the hrtimer workaround to see how that works, will
get back to the test coverage after that...

Regards,
Carlos

> 
> > 
> > Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> > Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_gpu_error.h     |  4 ++
> >   drivers/gpu/drm/i915/i915_irq.c           | 14 ++++--
> >   drivers/gpu/drm/i915/i915_reg.h           |  6 +++
> >   drivers/gpu/drm/i915/i915_reset.c         | 20 +++++++++
> >   drivers/gpu/drm/i915/i915_reset.h         |  6 +++
> >   drivers/gpu/drm/i915/intel_engine_cs.c    |  1 +
> >   drivers/gpu/drm/i915/intel_engine_types.h |  5 +++
> >   drivers/gpu/drm/i915/intel_hangcheck.c    | 11 +----
> >   drivers/gpu/drm/i915/intel_lrc.c          | 52
> > +++++++++++++++++++++++
> >   9 files changed, 107 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h
> > b/drivers/gpu/drm/i915/i915_gpu_error.h
> > index 99d6b7b270c2..6cf6a8679b26 100644
> > --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> > +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> > @@ -203,6 +203,9 @@ struct i915_gpu_error {
> >   	 * any global resources that may be clobber by the reset (such
> > as
> >   	 * FENCE registers).
> >   	 *
> > +	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we
> > can use
> > +	 * I915_RESET_WATCHDOG to report the hang detection cause
> > accurately.
> > +	 *
> >   	 * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't
> > need to
> >   	 * acquire the struct_mutex to reset an engine, we need an
> > explicit
> >   	 * flag to prevent two concurrent reset attempts in the same
> > engine.
> > @@ -218,6 +221,7 @@ struct i915_gpu_error {
> >   #define I915_RESET_BACKOFF	0
> >   #define I915_RESET_MODESET	1
> >   #define I915_RESET_ENGINE	2
> > +#define I915_RESET_WATCHDOG	3
> >   #define I915_WEDGED		(BITS_PER_LONG - 1)
> >   
> >   	/** Number of times the device has been reset (global) */
> > diff --git a/drivers/gpu/drm/i915/i915_irq.c
> > b/drivers/gpu/drm/i915/i915_irq.c
> > index 82d487189a34..e64994be25c3 100644
> > --- a/drivers/gpu/drm/i915/i915_irq.c
> > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > @@ -1466,6 +1466,9 @@ gen8_cs_irq_handler(struct intel_engine_cs
> > *engine, u32 iir)
> >   
> >   	if (tasklet)
> >   		tasklet_hi_schedule(&engine->execlists.tasklet);
> > +
> > +	if (iir & GT_GEN8_WATCHDOG_INTERRUPT)
> > +		tasklet_schedule(&engine->execlists.watchdog_tasklet);
> >   }
> >   
> >   static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> > @@ -3892,20 +3895,25 @@ static void gen8_gt_irq_postinstall(struct
> > drm_i915_private *dev_priv)
> >   	u32 gt_interrupts[] = {
> >   		(GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> >   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> > +		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> >   		 GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
> >   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT),
> > -
> >   		(GT_RENDER_USER_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
> >   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
> > +		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
> >   		 GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> > -		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT),
> > -
> > +		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> > +		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT),
> >   		0,
> >   
> >   		(GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
> >   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT)
> >   	};
> >   
> > +	/* VECS watchdog is only available in skl+ */
> > +	if (INTEL_GEN(dev_priv) >= 9)
> > +		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
> > +
> >   	dev_priv->pm_ier = 0x0;
> >   	dev_priv->pm_imr = ~dev_priv->pm_ier;
> >   	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
> > diff --git a/drivers/gpu/drm/i915/i915_reg.h
> > b/drivers/gpu/drm/i915/i915_reg.h
> > index 9b69cec21f7b..ac8d984e16ae 100644
> > --- a/drivers/gpu/drm/i915/i915_reg.h
> > +++ b/drivers/gpu/drm/i915/i915_reg.h
> > @@ -2363,6 +2363,11 @@ enum i915_power_well_id {
> >   #define RING_START(base)	_MMIO((base) + 0x38)
> >   #define RING_CTL(base)		_MMIO((base) + 0x3c)
> >   #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in
> > bytes -> pages */
> > +#define RING_CNTR(base)		_MMIO((base) + 0x178)
> > +#define GEN8_WATCHDOG_ENABLE		0
> > +#define GEN8_WATCHDOG_DISABLE		1
> > +#define GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 & non-render 
> > only */
> > +#define RING_THRESH(base)	_MMIO((base) + 0x17C)
> >   #define RING_SYNC_0(base)	_MMIO((base) + 0x40)
> >   #define RING_SYNC_1(base)	_MMIO((base) + 0x44)
> >   #define RING_SYNC_2(base)	_MMIO((base) + 0x48)
> > @@ -2925,6 +2930,7 @@ enum i915_power_well_id {
> >   #define GT_BSD_USER_INTERRUPT			(1 << 12)
> >   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /*
> > hsw+; rsvd on snb, ivb, vlv */
> >   #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
> > +#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
> >   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /*
> > !snb */
> >   #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
> >   #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
> > diff --git a/drivers/gpu/drm/i915/i915_reset.c
> > b/drivers/gpu/drm/i915/i915_reset.c
> > index 861fe083e383..739fa5ad1a8d 100644
> > --- a/drivers/gpu/drm/i915/i915_reset.c
> > +++ b/drivers/gpu/drm/i915/i915_reset.c
> > @@ -1208,6 +1208,26 @@ void i915_clear_error_registers(struct
> > drm_i915_private *dev_priv)
> >   	}
> >   }
> >   
> > +void engine_reset_error_to_str(struct drm_i915_private *i915,
> > +	           char *msg,
> > +	           size_t sz,
> > +	           unsigned int hung,
> > +	           unsigned int stuck,
> > +	           unsigned int watchdog)
> > +{
> > +	int len;
> > +	unsigned int tmp;
> > +	struct intel_engine_cs *engine;
> > +
> > +	len = scnprintf(msg, sz,
> > +			"%s on ", watchdog ? "watchdog timeout" :
> > +				stuck == hung ? "no_progress" :
> > "hang");
> > +	for_each_engine_masked(engine, i915, hung, tmp)
> > +		len += scnprintf(msg + len, sz - len,
> > +				"%s, ", engine->name);
> > +	msg[len-2] = '\0';
> > +}
> > +
> >   /**
> >    * i915_handle_error - handle a gpu error
> >    * @i915: i915 device private
> > diff --git a/drivers/gpu/drm/i915/i915_reset.h
> > b/drivers/gpu/drm/i915/i915_reset.h
> > index 16f2389f656f..8582d1242248 100644
> > --- a/drivers/gpu/drm/i915/i915_reset.h
> > +++ b/drivers/gpu/drm/i915/i915_reset.h
> > @@ -20,6 +20,12 @@ void i915_handle_error(struct drm_i915_private
> > *i915,
> >   		       u32 engine_mask,
> >   		       unsigned long flags,
> >   		       const char *fmt, ...);
> > +void engine_reset_error_to_str(struct drm_i915_private *i915,
> > +               char *str,
> > +               size_t sz,
> > +               unsigned int hung,
> > +               unsigned int stuck,
> > +               unsigned int watchdog);
> >   #define I915_ERROR_CAPTURE BIT(0)
> >   
> >   void i915_clear_error_registers(struct drm_i915_private *i915);
> > diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c
> > b/drivers/gpu/drm/i915/intel_engine_cs.c
> > index 652c1b3ba190..88cf0fc07623 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> > @@ -1149,6 +1149,7 @@ void intel_engines_park(struct
> > drm_i915_private *i915)
> >   		/* Flush the residual irq tasklets first. */
> >   		intel_engine_disarm_breadcrumbs(engine);
> >   		tasklet_kill(&engine->execlists.tasklet);
> > +		tasklet_kill(&engine->execlists.watchdog_tasklet);
> >   
> >   		/*
> >   		 * We are committed now to parking the engines, make
> > sure there
> > diff --git a/drivers/gpu/drm/i915/intel_engine_types.h
> > b/drivers/gpu/drm/i915/intel_engine_types.h
> > index b0aa1f0d4e47..c4f66b774e7c 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_types.h
> > +++ b/drivers/gpu/drm/i915/intel_engine_types.h
> > @@ -124,6 +124,11 @@ struct intel_engine_execlists {
> >   	 */
> >   	struct tasklet_struct tasklet;
> >   
> > +	/*
> > +	 * @watchdog_tasklet: stop counter and reschedule
> > hangcheck_work asap
> > +	 */
> > +	struct tasklet_struct watchdog_tasklet;
> > +
> >   	/**
> >   	 * @default_priolist: priority list for I915_PRIORITY_NORMAL
> >   	 */
> > diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c
> > b/drivers/gpu/drm/i915/intel_hangcheck.c
> > index 57ed49dc19c4..4bf26863678c 100644
> > --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> > @@ -220,22 +220,15 @@ static void hangcheck_declare_hang(struct
> > drm_i915_private *i915,
> >   				   unsigned int hung,
> >   				   unsigned int stuck)
> >   {
> > -	struct intel_engine_cs *engine;
> >   	char msg[80];
> > -	unsigned int tmp;
> > -	int len;
> > +	size_t len = sizeof(msg);
> >   
> >   	/* If some rings hung but others were still busy, only
> >   	 * blame the hanging rings in the synopsis.
> >   	 */
> >   	if (stuck != hung)
> >   		hung &= ~stuck;
> > -	len = scnprintf(msg, sizeof(msg),
> > -			"%s on ", stuck == hung ? "no progress" :
> > "hang");
> > -	for_each_engine_masked(engine, i915, hung, tmp)
> > -		len += scnprintf(msg + len, sizeof(msg) - len,
> > -				 "%s, ", engine->name);
> > -	msg[len-2] = '\0';
> > +	engine_reset_error_to_str(i915, msg, len, hung, stuck, 0);
> >   
> >   	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s",
> > msg);
> >   }
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c
> > b/drivers/gpu/drm/i915/intel_lrc.c
> > index e54e0064b2d6..85785a94f6ae 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -2195,6 +2195,40 @@ static int gen8_emit_flush_render(struct
> > i915_request *request,
> >   	return 0;
> >   }
> >   
> > +static void gen8_watchdog_tasklet(unsigned long data)
> > +{
> > +	struct intel_engine_cs *engine = (struct intel_engine_cs
> > *)data;
> > +	struct drm_i915_private *dev_priv = engine->i915;
> > +	enum forcewake_domains fw_domains;
> > +	char msg[80];
> > +	size_t len = sizeof(msg);
> > +	unsigned long *lock = &engine->i915->gpu_error.flags;
> > +	unsigned int bit = I915_RESET_ENGINE + engine->id;
> > +
> > +	switch (engine->class) {
> > +	default:
> > +		MISSING_CASE(engine->id);
> > +		/* fall through */
> > +	case RENDER_CLASS:
> > +		fw_domains = FORCEWAKE_RENDER;
> > +		break;
> > +	case VIDEO_DECODE_CLASS:
> > +	case VIDEO_ENHANCEMENT_CLASS:
> > +		fw_domains = FORCEWAKE_MEDIA;
> > +		break;
> > +	}
> > +
> > +	intel_uncore_forcewake_get(dev_priv, fw_domains);
> > +
> > +	if (!test_and_set_bit(bit, lock)) {
> > +		unsigned int hung = engine->mask;
> > +		engine_reset_error_to_str(dev_priv, msg, len, hung, 0,
> > 1);
> > +		i915_reset_engine(engine, msg);
> > +		clear_bit(bit, lock);
> > +		wake_up_bit(lock, bit);
> > +	}
> > +}
> > +
> >   /*
> >    * Reserve space for 2 NOOPs at the end of each request to be
> >    * used as a workaround for not being allowed to do lite
> > @@ -2377,6 +2411,21 @@ logical_ring_default_irqs(struct
> > intel_engine_cs *engine)
> >   
> >   	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
> >   	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
> > +
> > +	switch (engine->class) {
> > +	default:
> > +		/* BCS engine does not support hw watchdog */
> > +		break;
> > +	case RENDER_CLASS:
> > +	case VIDEO_DECODE_CLASS:
> > +		engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT <<
> > shift;
> > +		break;
> > +	case VIDEO_ENHANCEMENT_CLASS:
> > +		if (INTEL_GEN(engine->i915) >= 9)
> > +			engine->irq_keep_mask |=
> > +				GT_GEN8_WATCHDOG_INTERRUPT << shift;
> > +		break;
> > +	}
> >   }
> >   
> >   static int
> > @@ -2394,6 +2443,9 @@ logical_ring_setup(struct intel_engine_cs
> > *engine)
> >   	tasklet_init(&engine->execlists.tasklet,
> >   		     execlists_submission_tasklet, (unsigned
> > long)engine);
> >   
> > +	tasklet_init(&engine->execlists.watchdog_tasklet,
> > +		     gen8_watchdog_tasklet, (unsigned long)engine);
> > +
> >   	logical_ring_default_vfuncs(engine);
> >   	logical_ring_default_irqs(engine);
> >   
> > 
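
A userspace sketch of the kind of message the engine_reset_error_to_str()
helper in the quoted diff builds, assuming snprintf() in place of the
kernel's scnprintf() and a made-up engine_names[] table indexed by mask
bit. Unlike the helper, it writes the separator before each entry after
the first, so nothing has to be trimmed afterwards. append() and
reset_error_to_str() are hypothetical names; this is illustrative only,
not the driver code:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical name table: bit i of the mask selects engine_names[i]. */
static const char *const engine_names[] = {
	"rcs0", "bcs0", "vcs0", "vcs1", "vecs0",
};

/* Append s at offset len without writing past sz bytes; returns new length. */
static size_t append(char *msg, size_t sz, size_t len, const char *s)
{
	int n;

	if (sz == 0 || len >= sz - 1)
		return len;
	n = snprintf(msg + len, sz - len, "%s", s);
	if (n < 0)
		return len;
	if ((size_t)n >= sz - len)
		return sz - 1; /* output was truncated */
	return len + (size_t)n;
}

/* Build "<reason> on <engine>[, <engine>...]" for the engines in hung. */
static void reset_error_to_str(char *msg, size_t sz, unsigned int hung,
			       unsigned int stuck, int watchdog)
{
	const char *reason = watchdog ? "watchdog timeout" :
			     stuck == hung ? "no progress" : "hang";
	size_t len, i, count = 0;

	if (sz == 0)
		return;
	msg[0] = '\0';
	len = append(msg, sz, 0, reason);
	len = append(msg, sz, len, " on ");
	for (i = 0; i < sizeof(engine_names) / sizeof(engine_names[0]); i++) {
		if (!(hung & (1u << i)))
			continue;
		if (count++)
			len = append(msg, sz, len, ", ");
		len = append(msg, sz, len, engine_names[i]);
	}
}
```

With the table above, reset_error_to_str(buf, sizeof(buf), 0x4, 0, 1) on
an 80-byte buffer gives "watchdog timeout on vcs0".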


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-03-27  1:58     ` Carlos Santa
@ 2019-03-27 10:40       ` Tvrtko Ursulin
  0 siblings, 0 replies; 14+ messages in thread
From: Tvrtko Ursulin @ 2019-03-27 10:40 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx


On 27/03/2019 01:58, Carlos Santa wrote:

[snip]

>>> v13: Rebase
>>> v14: Rebase, skip checking for the guilty seqno in the tasklet
>>> (Tvrtko)
>>
>> IIRC I only asked about it so I guess you tried it and it works well?
>>
>> Can you also post the IGTs so we can see the test coverage?
>>
>> Regards,
>>
>> Tvrtko
> 
> Yeah, the unit test works but it's still the very simple one (updated
> for the uAPI mods) I started with last August, I still need to add more
> to it though...
> 
> 
> https://lists.freedesktop.org/archives/igt-dev/2018-September/005834.html
> 
> Right now, I am on the hrtimer workaround to see how that works, will
> get back to the test coverage after that...

It may be tricky though so it may make sense to extend the test first 
until you can make a test which fails with the hw watchdog 
implementation. :)

What are the test cases I can think of..

Probably could use fence status to figure out which context was reset, 
and could use spin batches with userspace timers to control durations. 
And corking to control ordering. Then you need some preemption 
(priorities). A mix of batches on contexts with and without the watchdog 
and checking the expected one was reset.

ctx = create_context

long batch -> executed

ctx.set_watchdog

ctx.long_batch -> canceled

ctx2.long_batch
ctx.long_batch

  -> executed, canceled

ctx.long_batch
ctx2.long_batch

  -> canceled, executed

(Hm maybe you end up having to use nop calibrated batches to make the 
last one work.)

ctx.batch_just_below_threshold -> executed

(Try a few times in a loop, or a bunch of batches like this one 
submitted at once and check all executed.)

Then expand and submit one random one of them as over threshold and 
check only that one was canceled.

Flavour of the above with two contexts, one with watchdog, one without.

Flavour where watchdog is set only on some engines. Single context, 
batches on different engines = different watchdog behaviour.

Then preemption handling.. submit a long batch and after half of 
expected runtime submit a higher prio batch with half duration. Check 
the latter was executed. Then check the status of the former. What 
should it be? (Don't know, open to define the semantics..)

Submit a low prio long batch, wo/ watchdog, then a higher prio with 
watchdog and check it was canceled before the low prio one completed.

This is how much I can think of right now, but should be a good start.

Regards,

Tvrtko
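
The basic cases in the list above (long batch with and without the
watchdog, batch just below the threshold) can be written down as a tiny
expected-outcome model that a test could compare against. A sketch only,
under stated assumptions: a threshold of 0 means no watchdog is set on
the context, durations and thresholds share one arbitrary unit, and
preemption is ignored because its semantics are still open.
expected_outcome() and the enum are hypothetical names, not IGT or i915
API:

```c
#include <stdint.h>

enum batch_outcome {
	BATCH_EXECUTED,
	BATCH_CANCELED,
};

/*
 * Expected fate of a single batch that runs uninterrupted.
 * threshold == 0: watchdog not armed, the batch always completes.
 * Otherwise the batch is canceled once its runtime reaches the threshold.
 */
static enum batch_outcome
expected_outcome(uint32_t threshold, uint32_t duration)
{
	if (threshold != 0 && duration >= threshold)
		return BATCH_CANCELED;

	return BATCH_EXECUTED;
}
```

A test would submit the batches, read back which contexts were reset
(for example via fence status, as suggested above) and compare with this
model per batch.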

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl
  2019-03-22 23:41 ` [PATCH v5 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
@ 2019-03-30  8:45   ` Chris Wilson
  0 siblings, 0 replies; 14+ messages in thread
From: Chris Wilson @ 2019-03-30  8:45 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx; +Cc: Michel Thierry

Quoting Carlos Santa (2019-03-22 23:41:14)
> From: Michel Thierry <michel.thierry@intel.com>
> 
> Users/tests relying on the total reset count will start seeing a smaller
> number since most of the hangs can be handled by engine reset.
> Note that if engine x is reset, context a running on engine y will be
> unaware and unaffected.
> 
> To start the discussion, include just a total engine reset count. If it
> is deemed useful, it can be extended to report each engine separately.
> 
> Our igt's gem_reset_stats test will need changes to ignore the pad field,
> since it can now return reset_engine_count.
> 
> v2: s/engine_reset/reset_engine/, use union in uapi to not break compatibility.
> v3: Keep rejecting attempts to use pad as input (Antonio)
> v4: Rebased.
> v5: Rebased.
>     Get rid of the union to store pad/engine count (Chris)
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_context.c | 12 ++++++++++--
>  include/uapi/drm/i915_drm.h             |  4 ++++
>  2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 21208a865380..9625b5f7faf7 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -1350,6 +1350,8 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
>         struct drm_i915_private *dev_priv = to_i915(dev);
>         struct drm_i915_reset_stats *args = data;
>         struct i915_gem_context *ctx;
> +       struct intel_engine_cs *engine;
> +       enum intel_engine_id id;
>         int ret;
>  
>         if (args->flags || args->pad)
> @@ -1368,10 +1370,16 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
>          * we should wrap the hangstats with a seqlock.
>          */
>  
> -       if (capable(CAP_SYS_ADMIN))
> +       if (capable(CAP_SYS_ADMIN)) {
>                 args->reset_count = i915_reset_count(&dev_priv->gpu_error);
> -       else
> +               for_each_engine(engine, dev_priv, id)
> +                       args->reset_engine_count +=
> +                               i915_reset_engine_count(&dev_priv->gpu_error,
> +                                                       engine);

Do we really care about device-vs-engine here? Amalgamating all engine
resets into one variable is barely any more information than
amalgamating them with the device count.

> +       } else {
>                 args->reset_count = 0;
> +               args->reset_engine_count = 0;
> +       }
>  
>         args->batch_active = atomic_read(&ctx->guilty_count);
>         args->batch_pending = atomic_read(&ctx->active_count);
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index aa2d4c73a97d..5e7bc6412880 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1459,6 +1459,9 @@ struct drm_i915_reset_stats {
>         /* All resets since boot/module reload, for all contexts */
>         __u32 reset_count;
>  
> +       /* Engine resets since boot/module reload, for all contexts */
> +       __u32 reset_engine_count;

You cannot insert elements into the middle of an uABI struct. You can
only extend. To avoid problems with sw layout, always introduce a new
struct.

> +
>         /* Number of batches lost when active in GPU, for this context */
>         __u32 batch_active;
>  
> @@ -1466,6 +1469,7 @@ struct drm_i915_reset_stats {
>         __u32 batch_pending;
>  
>         __u32 pad;
> +
>  };
>  
>  struct drm_i915_gem_userptr {
> -- 
> 2.17.1
> 
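
To make the uABI layout point above concrete: inserting a field in the
middle of a struct moves every later field, so an old binary reads the
wrong offsets, while appending in a new struct leaves the existing
offsets alone. A minimal sketch using simplified mirrors of the reset
stats layout; the struct names are hypothetical and uint32_t stands in
for __u32:

```c
#include <stddef.h>
#include <stdint.h>

/* Layout existing userspace was compiled against (simplified mirror). */
struct stats_v1 {
	uint32_t ctx_id;
	uint32_t flags;
	uint32_t reset_count;
	uint32_t batch_active;
	uint32_t batch_pending;
	uint32_t pad;
};

/* Field inserted in the middle, as in the quoted hunk. */
struct stats_inserted {
	uint32_t ctx_id;
	uint32_t flags;
	uint32_t reset_count;
	uint32_t reset_engine_count;
	uint32_t batch_active;
	uint32_t batch_pending;
	uint32_t pad;
};

/* New struct that only extends: old fields keep their offsets. */
struct stats_extended {
	uint32_t ctx_id;
	uint32_t flags;
	uint32_t reset_count;
	uint32_t batch_active;
	uint32_t batch_pending;
	uint32_t pad;
	uint32_t reset_engine_count;
};

/* Inserting shifts batch_active; extending does not. */
_Static_assert(offsetof(struct stats_inserted, batch_active) !=
	       offsetof(struct stats_v1, batch_active),
	       "mid-struct insertion moves later fields");
_Static_assert(offsetof(struct stats_extended, batch_active) ==
	       offsetof(struct stats_v1, batch_active),
	       "extension keeps existing offsets");
```

An old binary that reads batch_active at its v1 offset would get
reset_engine_count from the inserted layout instead.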

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2019-03-22 23:41 ` [PATCH v5 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
@ 2019-03-30  8:49   ` Chris Wilson
  2019-03-30  9:01   ` Chris Wilson
  1 sibling, 0 replies; 14+ messages in thread
From: Chris Wilson @ 2019-03-30  8:49 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx; +Cc: Michel Thierry

Quoting Carlos Santa (2019-03-22 23:41:16)
>  static int gen8_emit_bb_start(struct i915_request *rq,
>                               u64 offset, u32 len,
>                               const unsigned int flags)
>  {
> +       struct intel_engine_cs *engine = rq->engine;
> +       struct i915_gem_context *ctx = rq->gem_context;
> +       struct intel_context *ce = intel_context_lookup(ctx, engine);
>         u32 *cs;
> +       u32 num_dwords;
> +       bool enable_watchdog = false;
>  
> -       cs = intel_ring_begin(rq, 6);
> +       /* bb_start only */
> +       num_dwords = 6;
> +
> +       /* check if watchdog will be required */
> +       if (ce->watchdog_threshold != 0) {
> +               /* + start_watchdog (6) + stop_watchdog (4) */
> +               num_dwords += 10;
> +               enable_watchdog = true;

What do you do about the recommendation not to enable the watchdog
across semaphores inside user batches? Is that caveat made clear in the
uAPI docs?
-Chris

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v5 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2019-03-22 23:41 ` [PATCH v5 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
  2019-03-30  8:49   ` Chris Wilson
@ 2019-03-30  9:01   ` Chris Wilson
  2019-04-02  0:57     ` Carlos Santa
  1 sibling, 1 reply; 14+ messages in thread
From: Chris Wilson @ 2019-03-30  9:01 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx; +Cc: Michel Thierry

Quoting Carlos Santa (2019-03-22 23:41:16)
> From: Michel Thierry <michel.thierry@intel.com>
> 
> Emit the required commands into the ring buffer for starting and
> stopping the watchdog timer before/after batch buffer start during
> batch buffer submission.

I'm expecting to see some discussion of how this is handled across
preemption here since you are inside an arbitration enabled section.
 
> v2: Support watchdog threshold per context engine, merge lri commands,
> and move watchdog commands emission to emit_bb_start. Request space of
> combined start_watchdog, bb_start and stop_watchdog to avoid any error
> after emitting bb_start.
> 
> v3: There were too many req->engine in emit_bb_start.
> Use GEM_BUG_ON instead of returning a very late EINVAL in the remote
> case of watchdog misprogramming; set correct LRI cmd size in
> emit_stop_watchdog. (Chris)
> 
> v4: Rebase.
> v5: use to_intel_context instead of ctx->engine.
> v6: Rebase.
> v7: Rebase,
>     Store gpu watchdog capability in engine flag (Tvrtko)
>     Store WATCHDOG_DISABLE magic # in engine (Tvrtko)
>     No need to declare emit_{start|stop}_watchdog as vfuncs (Tvrtko)
>     Replace flag watchdog_running with enable_watchdog (Tvrtko)
>     Emit a single MI_NOOP by conditionally checking whether the #
>     of emitted OPs is odd (Tvrtko)
> v8: Rebase
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_context_types.h |  4 +
>  drivers/gpu/drm/i915/intel_engine_cs.c     |  2 +
>  drivers/gpu/drm/i915/intel_engine_types.h  | 17 ++++-
>  drivers/gpu/drm/i915/intel_lrc.c           | 89 +++++++++++++++++++++-
>  drivers/gpu/drm/i915/intel_lrc.h           |  2 +
>  5 files changed, 106 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_context_types.h b/drivers/gpu/drm/i915/intel_context_types.h
> index 6dc9b4b9067b..e56fc263568e 100644
> --- a/drivers/gpu/drm/i915/intel_context_types.h
> +++ b/drivers/gpu/drm/i915/intel_context_types.h
> @@ -51,6 +51,10 @@ struct intel_context {
>         u64 lrc_desc;
>  
>         atomic_t pin_count;
> +       /** watchdog_threshold: hw watchdog threshold value,
> +        * in clock counts
> +        */

Gah. Why would you put it here? Between a tightly coupled count + mutex.

> +       u32 watchdog_threshold;
>         struct mutex pin_mutex; /* guards pinning and associated on-gpuing */
>  
>         /**
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index 88cf0fc07623..d4ea07b70904 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -324,6 +324,8 @@ intel_engine_setup(struct drm_i915_private *dev_priv,
>         if (engine->context_size)
>                 DRIVER_CAPS(dev_priv)->has_logical_contexts = true;
>  
> +       engine->watchdog_disable_id = get_watchdog_disable(engine);
> +
>         /* Nothing to do here, execute in order of dependencies */
>         engine->schedule = NULL;
>  
> diff --git a/drivers/gpu/drm/i915/intel_engine_types.h b/drivers/gpu/drm/i915/intel_engine_types.h
> index c4f66b774e7c..1f99b536471d 100644
> --- a/drivers/gpu/drm/i915/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/intel_engine_types.h
> @@ -260,6 +260,7 @@ struct intel_engine_cs {
>         unsigned int guc_id;
>         intel_engine_mask_t mask;
>  
> +       u32 watchdog_disable_id;

You've just put this between a pair of u8s.

>         u8 uabi_class;
>  
>         u8 class;
> @@ -422,10 +423,12 @@ struct intel_engine_cs {
>  
>         struct intel_engine_hangcheck hangcheck;
>  
> -#define I915_ENGINE_NEEDS_CMD_PARSER BIT(0)
> -#define I915_ENGINE_SUPPORTS_STATS   BIT(1)
> -#define I915_ENGINE_HAS_PREEMPTION   BIT(2)
> -#define I915_ENGINE_HAS_SEMAPHORES   BIT(3)
> +#define I915_ENGINE_NEEDS_CMD_PARSER  BIT(0)
> +#define I915_ENGINE_SUPPORTS_STATS    BIT(1)
> +#define I915_ENGINE_HAS_PREEMPTION    BIT(2)
> +#define I915_ENGINE_HAS_SEMAPHORES    BIT(3)
> +#define I915_ENGINE_SUPPORTS_WATCHDOG BIT(4)
> +
>         unsigned int flags;
>  
>         /*
> @@ -509,6 +512,12 @@ intel_engine_has_semaphores(const struct intel_engine_cs *engine)
>         return engine->flags & I915_ENGINE_HAS_SEMAPHORES;
>  }
>  
> +static inline bool
> +intel_engine_supports_watchdog(const struct intel_engine_cs *engine)
> +{
> +       return engine->flags & I915_ENGINE_SUPPORTS_WATCHDOG;
> +}
> +
>  #define instdone_slice_mask(dev_priv__) \
>         (IS_GEN(dev_priv__, 7) ? \
>          1 : RUNTIME_INFO(dev_priv__)->sseu.slice_mask)
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 85785a94f6ae..78ea54a5dbc3 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -2036,16 +2036,75 @@ static void execlists_reset_finish(struct intel_engine_cs *engine)
>                   atomic_read(&execlists->tasklet.count));
>  }
>  
> +static u32 *gen8_emit_start_watchdog(struct i915_request *rq, u32 *cs)
> +{
> +       struct intel_engine_cs *engine = rq->engine;
> +       struct i915_gem_context *ctx = rq->gem_context;
> +       struct intel_context *ce = intel_context_lookup(ctx, engine);
> +
> +       GEM_BUG_ON(!intel_engine_supports_watchdog(engine));
> +
> +       /*
> +        * watchdog register must never be programmed to zero. This would
> +        * cause the watchdog counter to exceed and not allow the engine to
> +        * go into IDLE state
> +        */
> +       GEM_BUG_ON(ce->watchdog_threshold == 0);
> +
> +       /* Set counter period */
> +       *cs++ = MI_LOAD_REGISTER_IMM(2);
> +       *cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base));
> +       *cs++ = ce->watchdog_threshold;
> +       /* Start counter */
> +       *cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
> +       *cs++ = GEN8_WATCHDOG_ENABLE;

Hmm, so no watchdog seqno.

> +       return cs;
> +}
> +
> +static u32 *gen8_emit_stop_watchdog(struct i915_request *rq, u32 *cs)
> +{
> +       struct intel_engine_cs *engine = rq->engine;
> +
> +       GEM_BUG_ON(!intel_engine_supports_watchdog(engine));
> +
> +       *cs++ = MI_LOAD_REGISTER_IMM(1);
> +       *cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
> +       *cs++ = engine->watchdog_disable_id;
> +
> +       return cs;
> +}
> +
>  static int gen8_emit_bb_start(struct i915_request *rq,
>                               u64 offset, u32 len,
>                               const unsigned int flags)
>  {
> +       struct intel_engine_cs *engine = rq->engine;
> +       struct i915_gem_context *ctx = rq->gem_context;
> +       struct intel_context *ce = intel_context_lookup(ctx, engine);

Ahem. This keeps on getting worse.

>         u32 *cs;
> +       u32 num_dwords;
> +       bool enable_watchdog = false;
>  
> -       cs = intel_ring_begin(rq, 6);
> +       /* bb_start only */
> +       num_dwords = 6;
> +
> +       /* check if watchdog will be required */
> +       if (ce->watchdog_threshold != 0) {
> +               /* + start_watchdog (6) + stop_watchdog (4) */
> +               num_dwords += 10;
> +               enable_watchdog = true;
> +       }
> +
> +       cs = intel_ring_begin(rq, num_dwords);
>         if (IS_ERR(cs))
>                 return PTR_ERR(cs);
>  
> +       if (enable_watchdog) {
> +               /* Start watchdog timer */

Please don't simply repeat code in comments.

> +               cs = gen8_emit_start_watchdog(rq, cs);
> +       }
> +
>         /*
>          * WaDisableCtxRestoreArbitration:bdw,chv
>          *
> @@ -2072,10 +2131,16 @@ static int gen8_emit_bb_start(struct i915_request *rq,
>         *cs++ = upper_32_bits(offset);
>  
>         *cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> -       *cs++ = MI_NOOP;
>  
> -       intel_ring_advance(rq, cs);
> +       if (enable_watchdog) {
> +               /* Cancel watchdog timer */
> +               cs = gen8_emit_stop_watchdog(rq, cs);
> +       }
> +
> +       if (*cs%2 != 0)
> +               *cs++ = MI_NOOP;

This is wrong. cs points into the unset portion of the ring. The
watchdog commands are even, so there is no reason to move the original
NOOP, or at least no reason to make it conditional.
-Chris
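
To illustrate the padding point above, here is a minimal standalone sketch, not the driver code. The `SKETCH_*` encodings and the `sketch_*` helpers are hypothetical placeholders, not the real i915 definitions. It assumes the only goal of the trailing NOOP is to keep the number of emitted dwords even. The parity is taken from how many dwords were written, never from `*cs`, which would read ring contents that have not been set yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder encodings for illustration only; NOT the real i915 values. */
#define SKETCH_MI_NOOP      0x00000000u
#define SKETCH_LRI(n)       (0x11000000u | (2u * (n) - 1u))
#define SKETCH_REG_THRESH   0x1000u
#define SKETCH_REG_CNTR     0x1004u
#define SKETCH_WDT_ENABLE   0x1u
#define SKETCH_WDT_DISABLE  0x2u

/* Emits 5 dwords: one LRI header plus two (register, value) pairs. */
static uint32_t *sketch_emit_start_watchdog(uint32_t *cs, uint32_t threshold)
{
	*cs++ = SKETCH_LRI(2);
	*cs++ = SKETCH_REG_THRESH;
	*cs++ = threshold;
	*cs++ = SKETCH_REG_CNTR;
	*cs++ = SKETCH_WDT_ENABLE;
	return cs;
}

/* Emits 3 dwords: one LRI header plus one (register, value) pair. */
static uint32_t *sketch_emit_stop_watchdog(uint32_t *cs)
{
	*cs++ = SKETCH_LRI(1);
	*cs++ = SKETCH_REG_CNTR;
	*cs++ = SKETCH_WDT_DISABLE;
	return cs;
}

/*
 * Pad to an even dword count. The decision uses the distance from the
 * start of the emission; it never dereferences cs, because cs points at
 * memory that has not been written yet.
 */
static uint32_t *sketch_pad_to_even(const uint32_t *start, uint32_t *cs)
{
	if ((cs - start) & 1)
		*cs++ = SKETCH_MI_NOOP;
	return cs;
}

/* Returns how many dwords the start + stop pair adds to the ring. */
static size_t sketch_watchdog_dwords(uint32_t *ring, uint32_t threshold)
{
	uint32_t *cs = ring;

	cs = sketch_emit_start_watchdog(cs, threshold);
	cs = sketch_emit_stop_watchdog(cs);
	return (size_t)(cs - ring);
}
```

In this sketch the start and stop sequences add 5 + 3 = 8 dwords in total, so they do not change the parity of whatever surrounds them, and the padding that was correct before the watchdog commands were added stays correct.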
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

* Re: [PATCH v5 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2019-03-30  9:01   ` Chris Wilson
@ 2019-04-02  0:57     ` Carlos Santa
  0 siblings, 0 replies; 14+ messages in thread
From: Carlos Santa @ 2019-04-02  0:57 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx; +Cc: Michel Thierry

On Sat, 2019-03-30 at 09:01 +0000, Chris Wilson wrote:
> Quoting Carlos Santa (2019-03-22 23:41:16)
> > From: Michel Thierry <michel.thierry@intel.com>
> > 
> > Emit the required commands into the ring buffer for starting and
> > stopping the watchdog timer before/after batch buffer start during
> > batch buffer submission.
> 
> I'm expecting to see some discussion of how this is handled across
> preemption here since you are inside an arbitration enabled section.
>  
> > v2: Support watchdog threshold per context engine, merge lri commands,
> > and move watchdog commands emission to emit_bb_start. Request space of
> > combined start_watchdog, bb_start and stop_watchdog to avoid any error
> > after emitting bb_start.
> > 
> > v3: There were too many req->engine in emit_bb_start.
> > Use GEM_BUG_ON instead of returning a very late EINVAL in the remote
> > case of watchdog misprogramming; set correct LRI cmd size in
> > emit_stop_watchdog. (Chris)
> > 
> > v4: Rebase.
> > v5: use to_intel_context instead of ctx->engine.
> > v6: Rebase.
> > v7: Rebase,
> >     Store gpu watchdog capability in engine flag (Tvrtko)
> >     Store WATCHDOG_DISABLE magic # in engine (Tvrtko)
> >     No need to declare emit_{start|stop}_watchdog as vfuncs (Tvrtko)
> >     Replace flag watchdog_running with enable_watchdog (Tvrtko)
> >     Emit a single MI_NOOP by conditionally checking whether the #
> >     of emitted OPs is odd (Tvrtko)
> > v8: Rebase
> > 
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> > Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> > ---
> >  drivers/gpu/drm/i915/intel_context_types.h |  4 +
> >  drivers/gpu/drm/i915/intel_engine_cs.c     |  2 +
> >  drivers/gpu/drm/i915/intel_engine_types.h  | 17 ++++-
> >  drivers/gpu/drm/i915/intel_lrc.c           | 89 +++++++++++++++++++++-
> >  drivers/gpu/drm/i915/intel_lrc.h           |  2 +
> >  5 files changed, 106 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_context_types.h b/drivers/gpu/drm/i915/intel_context_types.h
> > index 6dc9b4b9067b..e56fc263568e 100644
> > --- a/drivers/gpu/drm/i915/intel_context_types.h
> > +++ b/drivers/gpu/drm/i915/intel_context_types.h
> > @@ -51,6 +51,10 @@ struct intel_context {
> >         u64 lrc_desc;
> >  
> >         atomic_t pin_count;
> > +       /** watchdog_threshold: hw watchdog threshold value,
> > +        * in clock counts
> > +        */
> 
> Gah. Why would you put it here? Between a tightly coupled count + mutex.
> 
> > +       u32 watchdog_threshold;
> >         struct mutex pin_mutex; /* guards pinning and associated on-gpuing */
> >  
> >         /**
> > diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> > index 88cf0fc07623..d4ea07b70904 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> > +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> > @@ -324,6 +324,8 @@ intel_engine_setup(struct drm_i915_private *dev_priv,
> >         if (engine->context_size)
> >                 DRIVER_CAPS(dev_priv)->has_logical_contexts = true;
> >  
> > +       engine->watchdog_disable_id = get_watchdog_disable(engine);
> > +
> >         /* Nothing to do here, execute in order of dependencies */
> >         engine->schedule = NULL;
> >  
> > diff --git a/drivers/gpu/drm/i915/intel_engine_types.h b/drivers/gpu/drm/i915/intel_engine_types.h
> > index c4f66b774e7c..1f99b536471d 100644
> > --- a/drivers/gpu/drm/i915/intel_engine_types.h
> > +++ b/drivers/gpu/drm/i915/intel_engine_types.h
> > @@ -260,6 +260,7 @@ struct intel_engine_cs {
> >         unsigned int guc_id;
> >         intel_engine_mask_t mask;
> >  
> > +       u32 watchdog_disable_id;
> 
> You've just put this between a pair of u8s.
> 
> >         u8 uabi_class;
> >  
> >         u8 class;
> > @@ -422,10 +423,12 @@ struct intel_engine_cs {
> >  
> >         struct intel_engine_hangcheck hangcheck;
> >  
> > -#define I915_ENGINE_NEEDS_CMD_PARSER BIT(0)
> > -#define I915_ENGINE_SUPPORTS_STATS   BIT(1)
> > -#define I915_ENGINE_HAS_PREEMPTION   BIT(2)
> > -#define I915_ENGINE_HAS_SEMAPHORES   BIT(3)
> > +#define I915_ENGINE_NEEDS_CMD_PARSER  BIT(0)
> > +#define I915_ENGINE_SUPPORTS_STATS    BIT(1)
> > +#define I915_ENGINE_HAS_PREEMPTION    BIT(2)
> > +#define I915_ENGINE_HAS_SEMAPHORES    BIT(3)
> > +#define I915_ENGINE_SUPPORTS_WATCHDOG BIT(4)
> > +
> >         unsigned int flags;
> >  
> >         /*
> > @@ -509,6 +512,12 @@ intel_engine_has_semaphores(const struct intel_engine_cs *engine)
> >         return engine->flags & I915_ENGINE_HAS_SEMAPHORES;
> >  }
> >  
> > +static inline bool
> > +intel_engine_supports_watchdog(const struct intel_engine_cs *engine)
> > +{
> > +       return engine->flags & I915_ENGINE_SUPPORTS_WATCHDOG;
> > +}
> > +
> >  #define instdone_slice_mask(dev_priv__) \
> >         (IS_GEN(dev_priv__, 7) ? \
> >          1 : RUNTIME_INFO(dev_priv__)->sseu.slice_mask)
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 85785a94f6ae..78ea54a5dbc3 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -2036,16 +2036,75 @@ static void execlists_reset_finish(struct intel_engine_cs *engine)
> >                   atomic_read(&execlists->tasklet.count));
> >  }
> >  
> > +static u32 *gen8_emit_start_watchdog(struct i915_request *rq, u32 *cs)
> > +{
> > +       struct intel_engine_cs *engine = rq->engine;
> > +       struct i915_gem_context *ctx = rq->gem_context;
> > +       struct intel_context *ce = intel_context_lookup(ctx, engine);
> > +
> > +       GEM_BUG_ON(!intel_engine_supports_watchdog(engine));
> > +
> > +       /*
> > +        * watchdog register must never be programmed to zero. This would
> > +        * cause the watchdog counter to exceed and not allow the engine to
> > +        * go into IDLE state
> > +        */
> > +       GEM_BUG_ON(ce->watchdog_threshold == 0);
> > +
> > +       /* Set counter period */
> > +       *cs++ = MI_LOAD_REGISTER_IMM(2);
> > +       *cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base));
> > +       *cs++ = ce->watchdog_threshold;
> > +       /* Start counter */
> > +       *cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
> > +       *cs++ = GEN8_WATCHDOG_ENABLE;
> 
> Hmm, so no watchdog seqno.
> 
> > +       return cs;
> > +}
> > +
> > +static u32 *gen8_emit_stop_watchdog(struct i915_request *rq, u32 *cs)
> > +{
> > +       struct intel_engine_cs *engine = rq->engine;
> > +
> > +       GEM_BUG_ON(!intel_engine_supports_watchdog(engine));
> > +
> > +       *cs++ = MI_LOAD_REGISTER_IMM(1);
> > +       *cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
> > +       *cs++ = engine->watchdog_disable_id;
> > +
> > +       return cs;
> > +}
> > +
> >  static int gen8_emit_bb_start(struct i915_request *rq,
> >                               u64 offset, u32 len,
> >                               const unsigned int flags)
> >  {
> > +       struct intel_engine_cs *engine = rq->engine;
> > +       struct i915_gem_context *ctx = rq->gem_context;
> > +       struct intel_context *ce = intel_context_lookup(ctx, engine);
> 
> Ahem. This keeps on getting worse.

Can you explain a bit more why?

> 
> >         u32 *cs;
> > +       u32 num_dwords;
> > +       bool enable_watchdog = false;
> >  
> > -       cs = intel_ring_begin(rq, 6);
> > +       /* bb_start only */
> > +       num_dwords = 6;
> > +
> > +       /* check if watchdog will be required */
> > +       if (ce->watchdog_threshold != 0) {
> > +               /* + start_watchdog (6) + stop_watchdog (4) */
> > +               num_dwords += 10;
> > +               enable_watchdog = true;
> > +       }
> > +
> > +       cs = intel_ring_begin(rq, num_dwords);
> >         if (IS_ERR(cs))
> >                 return PTR_ERR(cs);
> >  
> > +       if (enable_watchdog) {
> > +               /* Start watchdog timer */
> 
> Please don't simply repeat code in comments.

Ack.

> 
> > +               cs = gen8_emit_start_watchdog(rq, cs);
> > +       }
> > +
> >         /*
> >          * WaDisableCtxRestoreArbitration:bdw,chv
> >          *
> > @@ -2072,10 +2131,16 @@ static int gen8_emit_bb_start(struct i915_request *rq,
> >         *cs++ = upper_32_bits(offset);
> >  
> >         *cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> > -       *cs++ = MI_NOOP;
> >  
> > -       intel_ring_advance(rq, cs);
> > +       if (enable_watchdog) {
> > +               /* Cancel watchdog timer */
> > +               cs = gen8_emit_stop_watchdog(rq, cs);
> > +       }
> > +
> > +       if (*cs%2 != 0)
> > +               *cs++ = MI_NOOP;
> 
> This is wrong. cs points into the unset portion of the ring. The
> watchdog commands are even, so there is no reason to move the original
> NOOP, or at least no reason to make it conditional.

Ok, I initially took the suggestion from Tvrtko from way back, where we
were trying to get rid of too many MI_NOOPs by emitting them
conditionally based on whether the cs pointer was odd... 

Carlos


> -Chris


end of thread, other threads:[~2019-04-02  0:57 UTC | newest]

Thread overview: 14+ messages
2019-03-22 23:41 [PATCH v5 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
2019-03-22 23:41 ` [PATCH v5 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
2019-03-30  8:45   ` Chris Wilson
2019-03-22 23:41 ` [PATCH v5 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
2019-03-25 10:00   ` Tvrtko Ursulin
2019-03-27  1:58     ` Carlos Santa
2019-03-27 10:40       ` Tvrtko Ursulin
2019-03-22 23:41 ` [PATCH v5 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
2019-03-30  8:49   ` Chris Wilson
2019-03-30  9:01   ` Chris Wilson
2019-04-02  0:57     ` Carlos Santa
2019-03-22 23:41 ` [PATCH v5 4/5] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Carlos Santa
2019-03-22 23:41 ` [PATCH v5 5/5] drm/i915: Watchdog timeout: Include threshold value in error state Carlos Santa
2019-03-22 23:59 ` ✗ Fi.CI.BAT: failure for GEN8+ GPU Watchdog Reset Support Patchwork
