All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+
@ 2017-02-23 19:44 Michel Thierry
  2017-02-23 19:44 ` [RFC 2/3] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Michel Thierry @ 2017-02-23 19:44 UTC (permalink / raw)
  To: intel-gfx

*** General ***

Watchdog timeout (or "media engine reset") is a feature that allows
userland applications to enable hang detection on individual batch buffers.
The detection mechanism itself is mostly bound to the hardware and the only
thing that the driver needs to do to support this form of hang detection
is to implement the interrupt handling support as well as watchdog command
emission before and after the emitted batch buffer start instruction in the
ring buffer.

The principle of the hang detection mechanism is as follows:

1. Once the decision has been made to enable watchdog timeout for a
particular batch buffer and the driver is in the process of emitting the
batch buffer start instruction into the ring buffer it also emits a
watchdog timer start instruction before and a watchdog timer cancellation
instruction after the batch buffer start instruction in the ring buffer.

2. Once the GPU execution reaches the watchdog timer start instruction
the hardware watchdog counter is started by the hardware. The counter
keeps counting until either reaching a previously configured threshold
value or the timer cancellation instruction is executed.

2a. If the counter reaches the threshold value the hardware fires a
watchdog interrupt that is picked up by the watchdog interrupt handler.
This means that a hang has been detected and the driver needs to deal with
it the same way it would deal with a engine hang detected by the periodic
hang checker. The only difference between the two is that we already blamed
the active request (to ensure an engine reset).

2b. If the batch buffer completes and the execution reaches the watchdog
cancellation instruction before the watchdog counter reaches its
threshold value the watchdog is cancelled and nothing more comes of it.
No hang is detected.

Note about future interaction with preemption: Preemption could happen
in a command sequence prior to watchdog counter getting disabled,
resulting in watchdog being triggered following preemption. The driver will
need to explicitly disable the watchdog counter as part of the
preemption sequence.

*** This patch introduces: ***

1. IRQ handler code for watchdog timeout allowing direct hang recovery
based on hardware-driven hang detection, which then integrates directly
with the hang recovery path. This is independent of having per-engine reset
or just full gpu reset.

2. Watchdog specific register information.

Currently the render engine and all available media engines support
watchdog timeout (VECS is only supported in GEN9). The specifications elude
to the BCS engine being supported but that is currently not supported by
this commit.

Note that the value to stop the counter is different between render and
non-render engines.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h        |  4 ++++
 drivers/gpu/drm/i915/i915_irq.c        | 31 ++++++++++++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_reg.h        |  6 ++++++
 drivers/gpu/drm/i915/intel_hangcheck.c | 13 +++++++++----
 drivers/gpu/drm/i915/intel_lrc.c       | 16 ++++++++++++++++
 5 files changed, 65 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index eed9ead1b592..0e4f4cc3c6de 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1568,6 +1568,9 @@ struct i915_gpu_error {
 	 * recovery. All waiters on the reset_queue will be woken when
 	 * that happens.
 	 *
+	 * When hw detects a hang before us, we can use I915_RESET_WATCHDOG to
+	 * report the hang detection cause accurately.
+	 *
 	 * This counter is used by the wait_seqno code to notice that reset
 	 * event happened and it needs to restart the entire ioctl (since most
 	 * likely the seqno it waited for won't ever signal anytime soon).
@@ -1580,6 +1583,7 @@ struct i915_gpu_error {
 
 	unsigned long flags;
 #define I915_RESET_IN_PROGRESS	0
+#define I915_RESET_WATCHDOG	2 /* looking at the future */
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
 	/**
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index bc70e2c451b2..4ef73363bbe9 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1352,6 +1352,28 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
 		set_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
 		tasklet_hi_schedule(&engine->irq_tasklet);
 	}
+
+	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
+		struct drm_i915_private *dev_priv = engine->i915;
+		u32 watchdog_disable;
+
+		if (engine->id == RCS)
+			watchdog_disable = GEN8_RCS_WATCHDOG_DISABLE;
+		else
+			watchdog_disable = GEN8_XCS_WATCHDOG_DISABLE;
+
+		/* Stop the counter to prevent further timeout interrupts */
+		I915_WRITE_FW(RING_CNTR(engine->mmio_base), watchdog_disable);
+
+		/* Make sure the active request will be marked as guilty */
+		engine->hangcheck.stalled = true;
+		engine->hangcheck.seqno = intel_engine_get_seqno(engine);
+
+		/* And try to run the hangcheck_work as soon as possible */
+		set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
+		queue_delayed_work(system_long_wq,
+				   &dev_priv->gpu_error.hangcheck_work, 0);
+	}
 }
 
 static irqreturn_t gen8_gt_irq_ack(struct drm_i915_private *dev_priv,
@@ -3433,12 +3455,15 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	uint32_t gt_interrupts[] = {
 		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT,
 		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
-			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
+			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
 		0,
 		GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT
@@ -3447,6 +3472,10 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	if (HAS_L3_DPF(dev_priv))
 		gt_interrupts[0] |= GT_RENDER_L3_PARITY_ERROR_INTERRUPT;
 
+	/* VECS watchdog is only available in skl+ */
+	if (INTEL_GEN(dev_priv) >= 9)
+		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
+
 	dev_priv->pm_ier = 0x0;
 	dev_priv->pm_imr = ~dev_priv->pm_ier;
 	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 141a5c1e3895..b28cd6eee2dd 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -1896,6 +1896,11 @@ enum skl_disp_power_wells {
 #define RING_START(base)	_MMIO((base)+0x38)
 #define RING_CTL(base)		_MMIO((base)+0x3c)
 #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
+#define RING_CNTR(base)        _MMIO((base) + 0x178)
+#define   GEN8_WATCHDOG_ENABLE		0
+#define   GEN8_RCS_WATCHDOG_DISABLE	1
+#define   GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF
+#define RING_THRESH(base)      _MMIO((base) + 0x17C)
 #define RING_SYNC_0(base)	_MMIO((base)+0x40)
 #define RING_SYNC_1(base)	_MMIO((base)+0x44)
 #define RING_SYNC_2(base)	_MMIO((base)+0x48)
@@ -2374,6 +2379,7 @@ enum skl_disp_power_wells {
 #define GT_BSD_USER_INTERRUPT			(1 << 12)
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
 #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
+#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
 #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
 #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index dce742243ba6..0e9272c97096 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -388,7 +388,8 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
 
 static void hangcheck_declare_hang(struct drm_i915_private *i915,
 				   unsigned int hung,
-				   unsigned int stuck)
+				   unsigned int stuck,
+				   unsigned int watchdog)
 {
 	struct intel_engine_cs *engine;
 	char msg[80];
@@ -401,7 +402,8 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
 	if (stuck != hung)
 		hung &= ~stuck;
 	len = scnprintf(msg, sizeof(msg),
-			"%s on ", stuck == hung ? "No progress" : "Hang");
+			"%s on ", watchdog ? "Watchdog timeout" :
+				  stuck == hung ? "No progress" : "Hang");
 	for_each_engine_masked(engine, i915, hung, tmp)
 		len += scnprintf(msg + len, sizeof(msg) - len,
 				 "%s, ", engine->name);
@@ -425,7 +427,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			     gpu_error.hangcheck_work.work);
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
-	unsigned int hung = 0, stuck = 0;
+	unsigned int hung = 0, stuck = 0, watchdog = 0;
 	int busy_count = 0;
 
 	if (!i915.enable_hangcheck)
@@ -437,6 +439,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	if (i915_terminally_wedged(&dev_priv->gpu_error))
 		return;
 
+	if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags))
+		watchdog = 1;
+
 	/* As enabling the GPU requires fairly extensive mmio access,
 	 * periodically arm the mmio checker to see if we are triggering
 	 * any invalid access.
@@ -463,7 +468,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	}
 
 	if (hung)
-		hangcheck_declare_hang(dev_priv, hung, stuck);
+		hangcheck_declare_hang(dev_priv, hung, stuck, watchdog);
 
 	/* Reset timer in case GPU hangs without another request being added */
 	if (busy_count)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 39329d40da46..8c9ebf0cebf7 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1629,6 +1629,22 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
 	unsigned shift = engine->irq_shift;
 	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
 	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
+
+	switch (engine->id) {
+	default:
+		/* BCS engine does not support hw watchdog */
+		break;
+	case RCS:
+	case VCS:
+	case VCS2:
+		engine->irq_keep_mask |= (GT_GEN8_WATCHDOG_INTERRUPT << shift);
+		break;
+	case VECS:
+		if (INTEL_GEN(engine->i915) >= 9)
+			engine->irq_keep_mask |=
+				(GT_GEN8_WATCHDOG_INTERRUPT << shift);
+		break;
+	}
 }
 
 static int
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC 2/3] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2017-02-23 19:44 [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
@ 2017-02-23 19:44 ` Michel Thierry
  2017-02-23 21:03   ` Chris Wilson
  2017-02-23 19:44 ` [RFC 3/3] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Michel Thierry @ 2017-02-23 19:44 UTC (permalink / raw)
  To: intel-gfx

Emit the required commands into the ring buffer for starting and
stopping the watchdog timer before/after batch buffer start during
batch buffer submission.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_execbuffer.c | 26 +++++++++++
 drivers/gpu/drm/i915/intel_lrc.c           | 74 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h    |  2 +
 3 files changed, 102 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 35d2cb979452..348d81c40e81 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1416,8 +1416,15 @@ execbuf_submit(struct i915_execbuffer_params *params,
 	u64 exec_start, exec_len;
 	int instp_mode;
 	u32 instp_mask, *cs;
+	bool watchdog_running = false;
 	int ret;
 
+	/*
+	 * NB: Place-holder until watchdog timeout is enabled through DRM
+	 * interface
+	 */
+	bool enable_watchdog = false;
+
 	ret = i915_gem_execbuffer_move_to_gpu(params->request, vmas);
 	if (ret)
 		return ret;
@@ -1480,6 +1487,18 @@ execbuf_submit(struct i915_execbuffer_params *params,
 			return ret;
 	}
 
+	/* Start watchdog timer */
+	if (enable_watchdog) {
+		if (!params->engine->emit_start_watchdog)
+			return -EINVAL;
+
+		ret = params->engine->emit_start_watchdog(params->request);
+		if (ret)
+			return ret;
+
+		watchdog_running = true;
+	}
+
 	exec_len   = args->batch_len;
 	exec_start = params->batch->node.start +
 		     params->args_batch_start_offset;
@@ -1493,6 +1512,13 @@ execbuf_submit(struct i915_execbuffer_params *params,
 	if (ret)
 		return ret;
 
+	/* Cancel watchdog timer */
+	if (watchdog_running && params->engine->emit_stop_watchdog) {
+		ret = params->engine->emit_stop_watchdog(params->request);
+		if (ret)
+			return ret;
+	}
+
 	i915_gem_execbuffer_move_to_active(vmas, params->request);
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 8c9ebf0cebf7..b4939d9f338a 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1474,6 +1474,72 @@ static int gen8_emit_flush_render(struct drm_i915_gem_request *request,
 	return 0;
 }
 
+static int gen8_emit_start_watchdog(struct drm_i915_gem_request *req)
+{
+	struct intel_engine_cs *engine = req->engine;
+	struct i915_gem_context *ctx = req->ctx;
+	u32 *cs;
+
+	cs = intel_ring_begin(req, 8);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	/*
+	 * watchdog register must never be programmed to zero. This would
+	 * cause the watchdog counter to exceed and not allow the engine to
+	 * go into IDLE state
+	 */
+	GEM_BUG_ON(ctx->watchdog_threshold == 0);
+
+	/* Set counter period */
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base));
+	*cs++ = ctx->watchdog_threshold;
+	*cs++ = MI_NOOP;
+
+	/* Start counter */
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = GEN8_WATCHDOG_ENABLE;
+	*cs++ = MI_NOOP;
+	intel_ring_advance(req, cs);
+
+	return 0;
+}
+
+static int gen8_emit_stop_watchdog(struct drm_i915_gem_request *req)
+{
+	struct intel_engine_cs *engine = req->engine;
+	u32 *cs;
+
+	cs = intel_ring_begin(req, 4);
+	if (IS_ERR(cs))
+		return PTR_ERR(cs);
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+
+	switch (engine->id) {
+	default:
+		WARN(1, "%s does not support watchdog timeout!\n",
+		     engine->name);
+	/* default to render engine */
+	case RCS:
+		*cs++ = GEN8_RCS_WATCHDOG_DISABLE;
+		break;
+	case VCS:
+	case VCS2:
+	case VECS:
+		*cs++ = GEN8_XCS_WATCHDOG_DISABLE;
+		break;
+	}
+
+	*cs++ = MI_NOOP;
+	intel_ring_advance(req, cs);
+
+	return 0;
+}
+
 /*
  * Reserve space for 2 NOOPs at the end of each request to be
  * used as a workaround for not being allowed to do lite
@@ -1740,6 +1806,8 @@ int logical_render_ring_init(struct intel_engine_cs *engine)
 	engine->emit_flush = gen8_emit_flush_render;
 	engine->emit_breadcrumb = gen8_emit_breadcrumb_render;
 	engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_render_sz;
+	engine->emit_start_watchdog = gen8_emit_start_watchdog;
+	engine->emit_stop_watchdog = gen8_emit_stop_watchdog;
 
 	ret = intel_engine_create_scratch(engine, PAGE_SIZE);
 	if (ret)
@@ -1763,6 +1831,12 @@ int logical_xcs_ring_init(struct intel_engine_cs *engine)
 {
 	logical_ring_setup(engine);
 
+	/* BCS engine does not have a watchdog-expired irq */
+	if (engine->id != BCS) {
+		engine->emit_start_watchdog = gen8_emit_start_watchdog;
+		engine->emit_stop_watchdog = gen8_emit_stop_watchdog;
+	}
+
 	return logical_ring_init(engine);
 }
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 0f29e07a9581..5a9764708186 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -282,6 +282,8 @@ struct intel_engine_cs {
 
 	int		(*emit_flush)(struct drm_i915_gem_request *request,
 				      u32 mode);
+	int		(*emit_start_watchdog)(struct drm_i915_gem_request *req);
+	int		(*emit_stop_watchdog)(struct drm_i915_gem_request *req);
 #define EMIT_INVALIDATE	BIT(0)
 #define EMIT_FLUSH	BIT(1)
 #define EMIT_BARRIER	(EMIT_INVALIDATE | EMIT_FLUSH)
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC 3/3] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-02-23 19:44 [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
  2017-02-23 19:44 ` [RFC 2/3] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
@ 2017-02-23 19:44 ` Michel Thierry
  2017-02-23 21:14   ` Chris Wilson
  2017-02-23 19:44 ` [PATCH i-g-t 1/2] lib/igt_gt: Add watchdog gem_ctx_set_param ioctl interface Michel Thierry
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Michel Thierry @ 2017-02-23 19:44 UTC (permalink / raw)
  To: intel-gfx

Final enablement patch for GPU hang detection using watchdog timeout.
Using the gem_context_setparam ioctl, users can specify the desired
timeout value in milliseconds, and the driver will do the conversion to
'timestamps'.

The _recommended_ default watchdog threshold for video engines is 60 ms,
since this has been _empirically determined_ to be a good compromise for
low-latency requirements and low rate of false positives. The default
register value is ~106ms and the theoretical max value (all 1s) is
353 seconds.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c    | 46 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_gem_context.h    |  3 ++
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  8 +-----
 include/uapi/drm/i915_drm.h                |  1 +
 4 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 99c46f4dbde6..c3748878e64c 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -440,6 +440,32 @@ i915_gem_context_create_gvt(struct drm_device *dev)
 	return ctx;
 }
 
+void i915_context_watchdog_setup(struct i915_gem_context *ctx, u32 value_in_ms)
+{
+	/*
+	 * Based on time out value (ms) calculate
+	 * timer count thresholds needed based on core frequency.
+	 */
+#define TIMER_MILLISECOND 1000
+
+	/*
+	 * Timestamp timer resolution = 0.080 uSec,
+	 * or 12500000 counts per second
+	 */
+#define TIMESTAMP_CNTS_PER_SEC_80NS 12500000
+
+	ctx->watchdog_threshold =
+		((value_in_ms) *
+		 ((TIMESTAMP_CNTS_PER_SEC_80NS) / (TIMER_MILLISECOND)));
+
+	/*
+	 * watchdog register must never be programmed to zero. This would
+	 * cause the watchdog counter to exceed and not allow the engine to
+	 * go into IDLE state
+	 */
+	GEM_BUG_ON(ctx->watchdog_threshold == 0);
+}
+
 int i915_gem_context_init(struct drm_i915_private *dev_priv)
 {
 	struct i915_gem_context *ctx;
@@ -1056,6 +1082,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
+	struct intel_engine_cs *engine;
 	int ret;
 
 	ret = i915_mutex_lock_interruptible(dev);
@@ -1090,6 +1117,15 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_BANNABLE:
 		args->value = i915_gem_context_is_bannable(ctx);
 		break;
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		engine = to_i915(dev)->engine[VCS];
+		if (!engine->emit_start_watchdog)
+			ret = -EINVAL;
+		else if (args->value)
+			ret = -EINVAL;
+		else
+			args->value = ctx->watchdog_threshold;
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -1105,6 +1141,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 	struct drm_i915_file_private *file_priv = file->driver_priv;
 	struct drm_i915_gem_context_param *args = data;
 	struct i915_gem_context *ctx;
+	struct intel_engine_cs *engine;
 	int ret;
 
 	ret = i915_mutex_lock_interruptible(dev);
@@ -1147,6 +1184,15 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		else
 			i915_gem_context_clear_bannable(ctx);
 		break;
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		engine = to_i915(dev)->engine[VCS];
+		if (!engine->emit_start_watchdog)
+			ret = -EINVAL;
+		else if (!args->value)
+			ret = -EINVAL;
+		else
+			i915_context_watchdog_setup(ctx, args->value);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index 0ac750b90f3d..133ed7b413aa 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -176,6 +176,9 @@ struct i915_gem_context {
 	/** ban_score: Accumulated score of all hangs caused by this context. */
 	int ban_score;
 
+	/** watchdog_threshold: hw watchdog threshold value, in clock counts */
+	u32 watchdog_threshold;
+
 	/** remap_slice: Bitmask of cache lines that need remapping */
 	u8 remap_slice;
 };
diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 348d81c40e81..26c50a6d6158 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1419,12 +1419,6 @@ execbuf_submit(struct i915_execbuffer_params *params,
 	bool watchdog_running = false;
 	int ret;
 
-	/*
-	 * NB: Place-holder until watchdog timeout is enabled through DRM
-	 * interface
-	 */
-	bool enable_watchdog = false;
-
 	ret = i915_gem_execbuffer_move_to_gpu(params->request, vmas);
 	if (ret)
 		return ret;
@@ -1488,7 +1482,7 @@ execbuf_submit(struct i915_execbuffer_params *params,
 	}
 
 	/* Start watchdog timer */
-	if (enable_watchdog) {
+	if (params->ctx->watchdog_threshold != 0) {
 		if (!params->engine->emit_start_watchdog)
 			return -EINVAL;
 
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 3554495bef13..e318c4f53a9e 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1289,6 +1289,7 @@ struct drm_i915_gem_context_param {
 #define I915_CONTEXT_PARAM_GTT_SIZE	0x3
 #define I915_CONTEXT_PARAM_NO_ERROR_CAPTURE	0x4
 #define I915_CONTEXT_PARAM_BANNABLE	0x5
+#define I915_CONTEXT_PARAM_WATCHDOG	0x6
 	__u64 value;
 };
 
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH i-g-t 1/2] lib/igt_gt: Add watchdog gem_ctx_set_param ioctl interface
  2017-02-23 19:44 [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
  2017-02-23 19:44 ` [RFC 2/3] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
  2017-02-23 19:44 ` [RFC 3/3] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
@ 2017-02-23 19:44 ` Michel Thierry
  2017-02-23 19:44 ` [PATCH i-g-t 2/2] tests/drv_hangman: watchdog tests Michel Thierry
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Michel Thierry @ 2017-02-23 19:44 UTC (permalink / raw)
  To: intel-gfx

Set a watchdog timeout value in a given context.

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 lib/igt_gt.c | 7 +++++++
 lib/igt_gt.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/lib/igt_gt.c b/lib/igt_gt.c
index 3bfaf2e4..e30385ec 100644
--- a/lib/igt_gt.c
+++ b/lib/igt_gt.c
@@ -40,6 +40,7 @@
 #include "intel_chipset.h"
 
 #define LOCAL_CONTEXT_PARAM_NO_ERROR_CAPTURE 0x4
+#define LOCAL_CONTEXT_PARAM_WATCHDOG	0x6
 
 /**
  * SECTION:igt_gt
@@ -252,6 +253,12 @@ igt_hang_t igt_hang_ctx(int fd,
 		__gem_context_set_param(fd, &param);
 	}
 
+	if (flags & HANG_USE_WATCHDOG) {
+		param.param = LOCAL_CONTEXT_PARAM_WATCHDOG;
+		param.value = 70; // in ms
+		__gem_context_set_param(fd, &param);
+	}
+
 	ban = context_get_ban(fd, ctx);
 
 	if ((flags & HANG_ALLOW_BAN) == 0)
diff --git a/lib/igt_gt.h b/lib/igt_gt.h
index c47d0c12..3b824a82 100644
--- a/lib/igt_gt.h
+++ b/lib/igt_gt.h
@@ -48,6 +48,7 @@ igt_hang_t igt_hang_ctx(int fd,
 			uint64_t *offset);
 #define HANG_ALLOW_BAN		BIT(0)
 #define HANG_ALLOW_CAPTURE	BIT(1)
+#define HANG_USE_WATCHDOG	BIT(2)
 
 igt_hang_t igt_hang_ring(int fd, int ring);
 void igt_post_hang_ring(int fd, igt_hang_t arg);
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH i-g-t 2/2] tests/drv_hangman: watchdog tests
  2017-02-23 19:44 [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
                   ` (2 preceding siblings ...)
  2017-02-23 19:44 ` [PATCH i-g-t 1/2] lib/igt_gt: Add watchdog gem_ctx_set_param ioctl interface Michel Thierry
@ 2017-02-23 19:44 ` Michel Thierry
  2017-02-23 22:26   ` Chris Wilson
  2017-02-23 20:52 ` ✓ Fi.CI.BAT: success for series starting with [RFC,1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+ Patchwork
  2017-02-23 20:57 ` [RFC 1/3] " Chris Wilson
  5 siblings, 1 reply; 16+ messages in thread
From: Michel Thierry @ 2017-02-23 19:44 UTC (permalink / raw)
  To: intel-gfx

Test watchdog triggered resets.

Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 tests/drv_hangman.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/tests/drv_hangman.c b/tests/drv_hangman.c
index 51bdbdaa..ba230f65 100644
--- a/tests/drv_hangman.c
+++ b/tests/drv_hangman.c
@@ -126,13 +126,21 @@ static bool uses_cmd_parser(void)
 
 static void check_error_state(const char *expected_ring_name,
 			      uint64_t expected_offset,
-			      const uint32_t *batch)
+			      const uint32_t *batch,
+			      bool watchdog)
 {
 	bool cmd_parser = uses_cmd_parser();
 	FILE *file = open_error();
 	char *line = NULL;
+	const char *watchdog_hang_reason = "reason: Watchdog timeout";
 	size_t line_size = 0;
 
+	if (watchdog) {
+		/* NOTE: expected engine name check happens later */
+		if (getline(&line, &line_size, file) > 0)
+			igt_assert(strstr(line, watchdog_hang_reason));
+	}
+
 	while (getline(&line, &line_size, file) > 0) {
 		char *dashes;
 		uint32_t gtt_offset_upper, gtt_offset_lower;
@@ -180,21 +188,29 @@ static void check_error_state(const char *expected_ring_name,
 }
 
 static void test_error_state_capture(unsigned ring_id,
-				     const char *ring_name)
+				     const char *ring_name,
+				     bool watchdog)
 {
 	uint32_t *batch;
 	igt_hang_t hang;
 	uint64_t offset;
+	unsigned flags = HANG_ALLOW_CAPTURE;
 
+	igt_skip_on_f(watchdog && ring_id == I915_EXEC_BLT,
+		      "no official watchdog support in BLT engine\n");
 	igt_require(gem_has_ring(device, ring_id));
 	clear_error_state();
+	assert_error_state_clear();
 
-	hang = igt_hang_ctx(device, 0, ring_id, HANG_ALLOW_CAPTURE, &offset);
+	if (watchdog)
+		flags |= HANG_USE_WATCHDOG;
+
+	hang = igt_hang_ctx(device, 0, ring_id, flags, &offset);
 	batch = gem_mmap__cpu(device, hang.handle, 0, 4096, PROT_READ);
 	gem_set_domain(device, hang.handle, I915_GEM_DOMAIN_CPU, 0);
 	igt_post_hang_ring(device, hang);
 
-	check_error_state(ring_name, offset, batch);
+	check_error_state(ring_name, offset, batch, watchdog);
 	munmap(batch, 4096);
 }
 
@@ -259,7 +275,11 @@ igt_main
 
 		igt_subtest_f("error-state-capture-%s", e->name)
 			test_error_state_capture(e->exec_id | e->flags,
-						 e->full_name);
+						 e->full_name, false);
+
+		igt_subtest_f("watchdog-%s", e->name)
+			test_error_state_capture(e->exec_id | e->flags,
+						 e->full_name, true);
 	}
 
 	igt_subtest("hangcheck-unterminated")
-- 
2.11.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* ✓ Fi.CI.BAT: success for series starting with [RFC,1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-02-23 19:44 [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
                   ` (3 preceding siblings ...)
  2017-02-23 19:44 ` [PATCH i-g-t 2/2] tests/drv_hangman: watchdog tests Michel Thierry
@ 2017-02-23 20:52 ` Patchwork
  2017-02-23 20:57 ` [RFC 1/3] " Chris Wilson
  5 siblings, 0 replies; 16+ messages in thread
From: Patchwork @ 2017-02-23 20:52 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

== Series Details ==

Series: series starting with [RFC,1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+
URL   : https://patchwork.freedesktop.org/series/20173/
State : success

== Summary ==

Series 20173v1 Series without cover letter
https://patchwork.freedesktop.org/api/1.0/series/20173/revisions/1/mbox/

fi-bdw-5557u     total:278  pass:267  dwarn:0   dfail:0   fail:0   skip:11 
fi-bsw-n3050     total:278  pass:239  dwarn:0   dfail:0   fail:0   skip:39 
fi-bxt-j4205     total:278  pass:259  dwarn:0   dfail:0   fail:0   skip:19 
fi-bxt-t5700     total:108  pass:95   dwarn:0   dfail:0   fail:0   skip:12 
fi-byt-j1900     total:278  pass:251  dwarn:0   dfail:0   fail:0   skip:27 
fi-byt-n2820     total:278  pass:247  dwarn:0   dfail:0   fail:0   skip:31 
fi-hsw-4770      total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16 
fi-hsw-4770r     total:278  pass:262  dwarn:0   dfail:0   fail:0   skip:16 
fi-ilk-650       total:278  pass:228  dwarn:0   dfail:0   fail:0   skip:50 
fi-ivb-3520m     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18 
fi-ivb-3770      total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18 
fi-kbl-7500u     total:278  pass:260  dwarn:0   dfail:0   fail:0   skip:18 
fi-skl-6260u     total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10 
fi-skl-6700hq    total:278  pass:261  dwarn:0   dfail:0   fail:0   skip:17 
fi-skl-6700k     total:278  pass:256  dwarn:4   dfail:0   fail:0   skip:18 
fi-skl-6770hq    total:278  pass:268  dwarn:0   dfail:0   fail:0   skip:10 
fi-snb-2520m     total:278  pass:250  dwarn:0   dfail:0   fail:0   skip:28 
fi-snb-2600      total:278  pass:249  dwarn:0   dfail:0   fail:0   skip:29 

5b28284e2cdc5f3154225255e081f1f2050820bb drm-tip: 2017y-02m-23d-18h-49m-02s UTC integration manifest
67bbbfd drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
fbedfad drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
a01004a drm/i915: Watchdog timeout: IRQ handler for gen8+

== Logs ==

For more details see: https://intel-gfx-ci.01.org/CI/Patchwork_3955/
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-02-23 19:44 [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
                   ` (4 preceding siblings ...)
  2017-02-23 20:52 ` ✓ Fi.CI.BAT: success for series starting with [RFC,1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+ Patchwork
@ 2017-02-23 20:57 ` Chris Wilson
  2017-02-23 21:21   ` Michel Thierry
  2017-02-23 23:38   ` Chris Wilson
  5 siblings, 2 replies; 16+ messages in thread
From: Chris Wilson @ 2017-02-23 20:57 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Thu, Feb 23, 2017 at 11:44:17AM -0800, Michel Thierry wrote:
> *** General ***
> 
> Watchdog timeout (or "media engine reset") is a feature that allows
> userland applications to enable hang detection on individual batch buffers.
> The detection mechanism itself is mostly bound to the hardware and the only
> thing that the driver needs to do to support this form of hang detection
> is to implement the interrupt handling support as well as watchdog command
> emission before and after the emitted batch buffer start instruction in the
> ring buffer.
> 
> The principle of the hang detection mechanism is as follows:
> 
> 1. Once the decision has been made to enable watchdog timeout for a
> particular batch buffer and the driver is in the process of emitting the
> batch buffer start instruction into the ring buffer it also emits a
> watchdog timer start instruction before and a watchdog timer cancellation
> instruction after the batch buffer start instruction in the ring buffer.
> 
> 2. Once the GPU execution reaches the watchdog timer start instruction
> the hardware watchdog counter is started by the hardware. The counter
> keeps counting until either reaching a previously configured threshold
> value or the timer cancellation instruction is executed.
> 
> 2a. If the counter reaches the threshold value the hardware fires a
> watchdog interrupt that is picked up by the watchdog interrupt handler.
> This means that a hang has been detected and the driver needs to deal with
> it the same way it would deal with a engine hang detected by the periodic
> hang checker. The only difference between the two is that we already blamed
> the active request (to ensure an engine reset).
> 
> 2b. If the batch buffer completes and the execution reaches the watchdog
> cancellation instruction before the watchdog counter reaches its
> threshold value the watchdog is cancelled and nothing more comes of it.
> No hang is detected.
> 
> Note about future interaction with preemption: Preemption could happen
> in a command sequence prior to watchdog counter getting disabled,
> resulting in watchdog being triggered following preemption. The driver will
> need to explicitly disable the watchdog counter as part of the
> preemption sequence.
> 
> *** This patch introduces: ***
> 
> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
> based on hardware-driven hang detection, which then integrates directly
> with the hang recovery path. This is independent of having per-engine reset
> or just full gpu reset.
> 
> 2. Watchdog specific register information.
> 
> Currently the render engine and all available media engines support
> watchdog timeout (VECS is only supported in GEN9). The specifications elude
> to the BCS engine being supported but that is currently not supported by
> this commit.
> 
> Note that the value to stop the counter is different between render and
> non-render engines.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Ian Lister <ian.lister@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h        |  4 ++++
>  drivers/gpu/drm/i915/i915_irq.c        | 31 ++++++++++++++++++++++++++++++-
>  drivers/gpu/drm/i915/i915_reg.h        |  6 ++++++
>  drivers/gpu/drm/i915/intel_hangcheck.c | 13 +++++++++----
>  drivers/gpu/drm/i915/intel_lrc.c       | 16 ++++++++++++++++
>  5 files changed, 65 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index eed9ead1b592..0e4f4cc3c6de 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1568,6 +1568,9 @@ struct i915_gpu_error {
>  	 * recovery. All waiters on the reset_queue will be woken when
>  	 * that happens.
>  	 *
> +	 * When hw detects a hang before us, we can use I915_RESET_WATCHDOG to
> +	 * report the hang detection cause accurately.
> +	 *
>  	 * This counter is used by the wait_seqno code to notice that reset
>  	 * event happened and it needs to restart the entire ioctl (since most
>  	 * likely the seqno it waited for won't ever signal anytime soon).
> @@ -1580,6 +1583,7 @@ struct i915_gpu_error {
>  
>  	unsigned long flags;
>  #define I915_RESET_IN_PROGRESS	0
> +#define I915_RESET_WATCHDOG	2 /* looking at the future */
>  #define I915_WEDGED		(BITS_PER_LONG - 1)
>  
>  	/**
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index bc70e2c451b2..4ef73363bbe9 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1352,6 +1352,28 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
>  		set_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
>  		tasklet_hi_schedule(&engine->irq_tasklet);
>  	}
> +
> +	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
> +		struct drm_i915_private *dev_priv = engine->i915;
> +		u32 watchdog_disable;
> +
> +		if (engine->id == RCS)
> +			watchdog_disable = GEN8_RCS_WATCHDOG_DISABLE;
> +		else
> +			watchdog_disable = GEN8_XCS_WATCHDOG_DISABLE;
> +
> +		/* Stop the counter to prevent further timeout interrupts */
> +		I915_WRITE_FW(RING_CNTR(engine->mmio_base), watchdog_disable);

There's no guarrantee you hold forcewake, you need to use I915_WRITE.
Better yet would be to avoid having to wait for forcewake within the
hardirq handler.

> +
> +		/* Make sure the active request will be marked as guilty */
> +		engine->hangcheck.stalled = true;
> +		engine->hangcheck.seqno = intel_engine_get_seqno(engine);

Just set a flag saying the engine->hangcheck.watchdog = true. Don't
confuse us. engine->hangcheck.seqno does not give the guilty seqno!

Also there is no guarrantee here that seqno is the guilty party. That's
a nasty bug. Servicing the interrupt will be running in parallel with
the GPU that may complete the request before we read the HWS.

Please tell me we can use a PID along with the watchdog timer...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 2/3] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2017-02-23 19:44 ` [RFC 2/3] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
@ 2017-02-23 21:03   ` Chris Wilson
  2017-02-24  9:15     ` Tvrtko Ursulin
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2017-02-23 21:03 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Thu, Feb 23, 2017 at 11:44:18AM -0800, Michel Thierry wrote:
> Emit the required commands into the ring buffer for starting and
> stopping the watchdog timer before/after batch buffer start during
> batch buffer submission.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Ian Lister <ian.lister@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c | 26 +++++++++++
>  drivers/gpu/drm/i915/intel_lrc.c           | 74 ++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/intel_ringbuffer.h    |  2 +
>  3 files changed, 102 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index 35d2cb979452..348d81c40e81 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -1416,8 +1416,15 @@ execbuf_submit(struct i915_execbuffer_params *params,
>  	u64 exec_start, exec_len;
>  	int instp_mode;
>  	u32 instp_mask, *cs;
> +	bool watchdog_running = false;
>  	int ret;
>  
> +	/*
> +	 * NB: Place-holder until watchdog timeout is enabled through DRM
> +	 * interface
> +	 */
> +	bool enable_watchdog = false;
> +
>  	ret = i915_gem_execbuffer_move_to_gpu(params->request, vmas);
>  	if (ret)
>  		return ret;
> @@ -1480,6 +1487,18 @@ execbuf_submit(struct i915_execbuffer_params *params,
>  			return ret;
>  	}
>  
> +	/* Start watchdog timer */
> +	if (enable_watchdog) {
> +		if (!params->engine->emit_start_watchdog)
> +			return -EINVAL;

In the future GEM API, this is checked much earlier!

> +
> +		ret = params->engine->emit_start_watchdog(params->request);
> +		if (ret)
> +			return ret;
> +
> +		watchdog_running = true;
> +	}
> +
>  	exec_len   = args->batch_len;
>  	exec_start = params->batch->node.start +
>  		     params->args_batch_start_offset;
> @@ -1493,6 +1512,13 @@ execbuf_submit(struct i915_execbuffer_params *params,
>  	if (ret)
>  		return ret;
>  
> +	/* Cancel watchdog timer */
> +	if (watchdog_running && params->engine->emit_stop_watchdog) {
> +		ret = params->engine->emit_stop_watchdog(params->request);
> +		if (ret)
> +			return ret;
> +	}

What happens *when* we hit the error path above? You need to reserve
space in the tail to disable the watchdog, or reserve space in the
bb_emit. Or will the watchdog firing randomly be fine?

>  	i915_gem_execbuffer_move_to_active(vmas, params->request);
>  
>  	return 0;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 8c9ebf0cebf7..b4939d9f338a 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1474,6 +1474,72 @@ static int gen8_emit_flush_render(struct drm_i915_gem_request *request,
>  	return 0;
>  }
>  
> +static int gen8_emit_start_watchdog(struct drm_i915_gem_request *req)
> +{
> +	struct intel_engine_cs *engine = req->engine;
> +	struct i915_gem_context *ctx = req->ctx;
> +	u32 *cs;
> +
> +	cs = intel_ring_begin(req, 8);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	/*
> +	 * watchdog register must never be programmed to zero. This would
> +	 * cause the watchdog counter to exceed and not allow the engine to
> +	 * go into IDLE state
> +	 */
> +	GEM_BUG_ON(ctx->watchdog_threshold == 0);
> +
> +	/* Set counter period */
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base));
> +	*cs++ = ctx->watchdog_threshold;
> +	*cs++ = MI_NOOP;

Why split into 2 instructions rather than LRI(2) ? Why the NOOPs?
Where's my PID?

> +
> +	/* Start counter */
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
> +	*cs++ = GEN8_WATCHDOG_ENABLE;
> +	*cs++ = MI_NOOP;
> +	intel_ring_advance(req, cs);
> +
> +	return 0;
> +}
> +
> +static int gen8_emit_stop_watchdog(struct drm_i915_gem_request *req)
> +{
> +	struct intel_engine_cs *engine = req->engine;
> +	u32 *cs;
> +
> +	cs = intel_ring_begin(req, 4);
> +	if (IS_ERR(cs))
> +		return PTR_ERR(cs);
> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
> +
> +	switch (engine->id) {
> +	default:
> +		WARN(1, "%s does not support watchdog timeout!\n",
> +		     engine->name);
> +	/* default to render engine */

Just
GEM_BUG_ON(engine->id == BCS);
if (engine->id == RCS)
	*cs++ = GEN8_RCS_WATCHDOG_DISABLE;
else
	*cs++ = GEN8_XCS_WATCHDOG_DISABLE;
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 3/3] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2017-02-23 19:44 ` [RFC 3/3] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
@ 2017-02-23 21:14   ` Chris Wilson
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Wilson @ 2017-02-23 21:14 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Thu, Feb 23, 2017 at 11:44:19AM -0800, Michel Thierry wrote:
> Final enablement patch for GPU hang detection using watchdog timeout.
> Using the gem_context_setparam ioctl, users can specify the desired
> timeout value in milliseconds, and the driver will do the conversion to
> 'timestamps'.
> 
> The _recommended_ default watchdog threshold for video engines is 60 ms,
> since this has been _empirically determined_ to be a good compromise for
> low-latency requirements and low rate of false positives. The default
> register value is ~106ms and the theoretical max value (all 1s) is
> 353 seconds.
> 
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_context.c    | 46 ++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_gem_context.h    |  3 ++
>  drivers/gpu/drm/i915/i915_gem_execbuffer.c |  8 +-----
>  include/uapi/drm/i915_drm.h                |  1 +
>  4 files changed, 51 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 99c46f4dbde6..c3748878e64c 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -440,6 +440,32 @@ i915_gem_context_create_gvt(struct drm_device *dev)
>  	return ctx;
>  }
>  
> +void i915_context_watchdog_setup(struct i915_gem_context *ctx, u32 value_in_ms)
Ahem.

> +{
> +	/*
> +	 * Based on time out value (ms) calculate
> +	 * timer count thresholds needed based on core frequency.
> +	 */
> +#define TIMER_MILLISECOND 1000
> +
> +	/*
> +	 * Timestamp timer resolution = 0.080 uSec,
> +	 * or 12500000 counts per second
> +	 */
> +#define TIMESTAMP_CNTS_PER_SEC_80NS 12500000
> +
> +	ctx->watchdog_threshold =
> +		((value_in_ms) *
> +		 ((TIMESTAMP_CNTS_PER_SEC_80NS) / (TIMER_MILLISECOND)));

Rescale to ns (u64 math), check for overflows before assigning to u32.

> +	/*
> +	 * watchdog register must never be programmed to zero. This would
> +	 * cause the watchdog counter to exceed and not allow the engine to
> +	 * go into IDLE state
> +	 */
> +	GEM_BUG_ON(ctx->watchdog_threshold == 0);

This throws a bug when the user tries to disable the watchdog
afterwards. [ctx->watchdog_threshold = 0 => disable watchdog around this
context, default].

> +}
> +
>  int i915_gem_context_init(struct drm_i915_private *dev_priv)
>  {
>  	struct i915_gem_context *ctx;
> @@ -1056,6 +1082,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  	struct drm_i915_file_private *file_priv = file->driver_priv;
>  	struct drm_i915_gem_context_param *args = data;
>  	struct i915_gem_context *ctx;
> +	struct intel_engine_cs *engine;
>  	int ret;
>  
>  	ret = i915_mutex_lock_interruptible(dev);
> @@ -1090,6 +1117,15 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>  	case I915_CONTEXT_PARAM_BANNABLE:
>  		args->value = i915_gem_context_is_bannable(ctx);
>  		break;
> +	case I915_CONTEXT_PARAM_WATCHDOG:
> +		engine = to_i915(dev)->engine[VCS];
> +		if (!engine->emit_start_watchdog)
> +			ret = -EINVAL;
> +		else if (args->value)
> +			ret = -EINVAL;
> +		else
> +			args->value = ctx->watchdog_threshold;

Just 
	args->value = watchdog_to_ns(ctx->watchdog_threshold);
will be 0 where unset. On setting we complain if not supported.

Do we really want the user API to be in clock ticks rather than say ns?
I think we want ns, and don't forget to include that information in the
error state. Do we want to capture error state? Do we want this to
contribute to banning?

> +		break;
>  	default:
>  		ret = -EINVAL;
>  		break;
> @@ -1105,6 +1141,7 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>  	struct drm_i915_file_private *file_priv = file->driver_priv;
>  	struct drm_i915_gem_context_param *args = data;
>  	struct i915_gem_context *ctx;
> +	struct intel_engine_cs *engine;
>  	int ret;
>  
>  	ret = i915_mutex_lock_interruptible(dev);
> @@ -1147,6 +1184,15 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>  		else
>  			i915_gem_context_clear_bannable(ctx);
>  		break;
> +	case I915_CONTEXT_PARAM_WATCHDOG:
> +		engine = to_i915(dev)->engine[VCS];
> +		if (!engine->emit_start_watchdog)
> +			ret = -EINVAL;

ret = -ENODEV;

> +		else if (!args->value)
> +			ret = -EINVAL;
> +		else
> +			i915_context_watchdog_setup(ctx, args->value);

I'm pushing for ns. And this should be
i915_gem_context_set_watchdog(ctx, args->value);

> +		break;
>  	default:
>  		ret = -EINVAL;
>  		break;
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index 0ac750b90f3d..133ed7b413aa 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -176,6 +176,9 @@ struct i915_gem_context {
>  	/** ban_score: Accumulated score of all hangs caused by this context. */
>  	int ban_score;
>  
> +	/** watchdog_threshold: hw watchdog threshold value, in clock counts */
> +	u32 watchdog_threshold;
> +
>  	/** remap_slice: Bitmask of cache lines that need remapping */
>  	u8 remap_slice;
>  };
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index 348d81c40e81..26c50a6d6158 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -1419,12 +1419,6 @@ execbuf_submit(struct i915_execbuffer_params *params,
>  	bool watchdog_running = false;
>  	int ret;
>  
> -	/*
> -	 * NB: Place-holder until watchdog timeout is enabled through DRM
> -	 * interface
> -	 */
> -	bool enable_watchdog = false;
> -
>  	ret = i915_gem_execbuffer_move_to_gpu(params->request, vmas);
>  	if (ret)
>  		return ret;
> @@ -1488,7 +1482,7 @@ execbuf_submit(struct i915_execbuffer_params *params,
>  	}
>  
>  	/* Start watchdog timer */
> -	if (enable_watchdog) {
> +	if (params->ctx->watchdog_threshold != 0) {
>  		if (!params->engine->emit_start_watchdog)
>  			return -EINVAL;

No. If we set a watchdog on a ctx, this then causes all use of BCS to
fail. We need saner API than that - either per-engine thresholds in the
ctx, or just document that the watchdog is only supported where
available by hw.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-02-23 20:57 ` [RFC 1/3] " Chris Wilson
@ 2017-02-23 21:21   ` Michel Thierry
  2017-02-23 21:49     ` Chris Wilson
  2017-02-23 23:38   ` Chris Wilson
  1 sibling, 1 reply; 16+ messages in thread
From: Michel Thierry @ 2017-02-23 21:21 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, mika.kuoppala



On 23/02/17 12:57, Chris Wilson wrote:
> On Thu, Feb 23, 2017 at 11:44:17AM -0800, Michel Thierry wrote:
>> *** General ***
>>
>> Watchdog timeout (or "media engine reset") is a feature that allows
>> userland applications to enable hang detection on individual batch buffers.
>> The detection mechanism itself is mostly bound to the hardware and the only
>> thing that the driver needs to do to support this form of hang detection
>> is to implement the interrupt handling support as well as watchdog command
>> emission before and after the emitted batch buffer start instruction in the
>> ring buffer.
>>
>> The principle of the hang detection mechanism is as follows:
>>
>> 1. Once the decision has been made to enable watchdog timeout for a
>> particular batch buffer and the driver is in the process of emitting the
>> batch buffer start instruction into the ring buffer it also emits a
>> watchdog timer start instruction before and a watchdog timer cancellation
>> instruction after the batch buffer start instruction in the ring buffer.
>>
>> 2. Once the GPU execution reaches the watchdog timer start instruction
>> the hardware watchdog counter is started by the hardware. The counter
>> keeps counting until either reaching a previously configured threshold
>> value or the timer cancellation instruction is executed.
>>
>> 2a. If the counter reaches the threshold value the hardware fires a
>> watchdog interrupt that is picked up by the watchdog interrupt handler.
>> This means that a hang has been detected and the driver needs to deal with
>> it the same way it would deal with a engine hang detected by the periodic
>> hang checker. The only difference between the two is that we already blamed
>> the active request (to ensure an engine reset).
>>
>> 2b. If the batch buffer completes and the execution reaches the watchdog
>> cancellation instruction before the watchdog counter reaches its
>> threshold value the watchdog is cancelled and nothing more comes of it.
>> No hang is detected.
>>
>> Note about future interaction with preemption: Preemption could happen
>> in a command sequence prior to watchdog counter getting disabled,
>> resulting in watchdog being triggered following preemption. The driver will
>> need to explicitly disable the watchdog counter as part of the
>> preemption sequence.
>>
>> *** This patch introduces: ***
>>
>> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
>> based on hardware-driven hang detection, which then integrates directly
>> with the hang recovery path. This is independent of having per-engine reset
>> or just full gpu reset.
>>
>> 2. Watchdog specific register information.
>>
>> Currently the render engine and all available media engines support
>> watchdog timeout (VECS is only supported in GEN9). The specifications elude
>> to the BCS engine being supported but that is currently not supported by
>> this commit.
>>
>> Note that the value to stop the counter is different between render and
>> non-render engines.
>>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Ian Lister <ian.lister@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>> ---
>>  drivers/gpu/drm/i915/i915_drv.h        |  4 ++++
>>  drivers/gpu/drm/i915/i915_irq.c        | 31 ++++++++++++++++++++++++++++++-
>>  drivers/gpu/drm/i915/i915_reg.h        |  6 ++++++
>>  drivers/gpu/drm/i915/intel_hangcheck.c | 13 +++++++++----
>>  drivers/gpu/drm/i915/intel_lrc.c       | 16 ++++++++++++++++
>>  5 files changed, 65 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index eed9ead1b592..0e4f4cc3c6de 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -1568,6 +1568,9 @@ struct i915_gpu_error {
>>  	 * recovery. All waiters on the reset_queue will be woken when
>>  	 * that happens.
>>  	 *
>> +	 * When hw detects a hang before us, we can use I915_RESET_WATCHDOG to
>> +	 * report the hang detection cause accurately.
>> +	 *
>>  	 * This counter is used by the wait_seqno code to notice that reset
>>  	 * event happened and it needs to restart the entire ioctl (since most
>>  	 * likely the seqno it waited for won't ever signal anytime soon).
>> @@ -1580,6 +1583,7 @@ struct i915_gpu_error {
>>
>>  	unsigned long flags;
>>  #define I915_RESET_IN_PROGRESS	0
>> +#define I915_RESET_WATCHDOG	2 /* looking at the future */
>>  #define I915_WEDGED		(BITS_PER_LONG - 1)
>>
>>  	/**
>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>> index bc70e2c451b2..4ef73363bbe9 100644
>> --- a/drivers/gpu/drm/i915/i915_irq.c
>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>> @@ -1352,6 +1352,28 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
>>  		set_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
>>  		tasklet_hi_schedule(&engine->irq_tasklet);
>>  	}
>> +
>> +	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
>> +		struct drm_i915_private *dev_priv = engine->i915;
>> +		u32 watchdog_disable;
>> +
>> +		if (engine->id == RCS)
>> +			watchdog_disable = GEN8_RCS_WATCHDOG_DISABLE;
>> +		else
>> +			watchdog_disable = GEN8_XCS_WATCHDOG_DISABLE;
>> +
>> +		/* Stop the counter to prevent further timeout interrupts */
>> +		I915_WRITE_FW(RING_CNTR(engine->mmio_base), watchdog_disable);
>
> There's no guarrantee you hold forcewake, you need to use I915_WRITE.
> Better yet would be to avoid having to wait for forcewake within the
> hardirq handler.
>
>> +
>> +		/* Make sure the active request will be marked as guilty */
>> +		engine->hangcheck.stalled = true;
>> +		engine->hangcheck.seqno = intel_engine_get_seqno(engine);
>
> Just set a flag saying the engine->hangcheck.watchdog = true. Don't
> confuse us. engine->hangcheck.seqno does not give the guilty seqno!
>
> Also there is no guarrantee here that seqno is the guilty party. That's
> a nasty bug. Servicing the interrupt will be running in parallel with
> the GPU that may complete the request before we read the HWS.
>
> Please tell me we can use a PID along with the watchdog timer...

A 'watchdog' PID and 'running' PID in the HWSP would sound ok?

There's also the question if we want different thresholds per engine.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-02-23 21:21   ` Michel Thierry
@ 2017-02-23 21:49     ` Chris Wilson
  2017-02-23 22:12       ` Michel Thierry
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2017-02-23 21:49 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Thu, Feb 23, 2017 at 01:21:03PM -0800, Michel Thierry wrote:
> 
> 
> On 23/02/17 12:57, Chris Wilson wrote:
> >On Thu, Feb 23, 2017 at 11:44:17AM -0800, Michel Thierry wrote:
> >>*** General ***
> >>
> >>Watchdog timeout (or "media engine reset") is a feature that allows
> >>userland applications to enable hang detection on individual batch buffers.
> >>The detection mechanism itself is mostly bound to the hardware and the only
> >>thing that the driver needs to do to support this form of hang detection
> >>is to implement the interrupt handling support as well as watchdog command
> >>emission before and after the emitted batch buffer start instruction in the
> >>ring buffer.
> >>
> >>The principle of the hang detection mechanism is as follows:
> >>
> >>1. Once the decision has been made to enable watchdog timeout for a
> >>particular batch buffer and the driver is in the process of emitting the
> >>batch buffer start instruction into the ring buffer it also emits a
> >>watchdog timer start instruction before and a watchdog timer cancellation
> >>instruction after the batch buffer start instruction in the ring buffer.
> >>
> >>2. Once the GPU execution reaches the watchdog timer start instruction
> >>the hardware watchdog counter is started by the hardware. The counter
> >>keeps counting until either reaching a previously configured threshold
> >>value or the timer cancellation instruction is executed.
> >>
> >>2a. If the counter reaches the threshold value the hardware fires a
> >>watchdog interrupt that is picked up by the watchdog interrupt handler.
> >>This means that a hang has been detected and the driver needs to deal with
> >>it the same way it would deal with a engine hang detected by the periodic
> >>hang checker. The only difference between the two is that we already blamed
> >>the active request (to ensure an engine reset).
> >>
> >>2b. If the batch buffer completes and the execution reaches the watchdog
> >>cancellation instruction before the watchdog counter reaches its
> >>threshold value the watchdog is cancelled and nothing more comes of it.
> >>No hang is detected.
> >>
> >>Note about future interaction with preemption: Preemption could happen
> >>in a command sequence prior to watchdog counter getting disabled,
> >>resulting in watchdog being triggered following preemption. The driver will
> >>need to explicitly disable the watchdog counter as part of the
> >>preemption sequence.
> >>
> >>*** This patch introduces: ***
> >>
> >>1. IRQ handler code for watchdog timeout allowing direct hang recovery
> >>based on hardware-driven hang detection, which then integrates directly
> >>with the hang recovery path. This is independent of having per-engine reset
> >>or just full gpu reset.
> >>
> >>2. Watchdog specific register information.
> >>
> >>Currently the render engine and all available media engines support
> >>watchdog timeout (VECS is only supported in GEN9). The specifications elude
> >>to the BCS engine being supported but that is currently not supported by
> >>this commit.
> >>
> >>Note that the value to stop the counter is different between render and
> >>non-render engines.
> >>
> >>Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> >>Signed-off-by: Ian Lister <ian.lister@intel.com>
> >>Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> >>Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> >>---
> >> drivers/gpu/drm/i915/i915_drv.h        |  4 ++++
> >> drivers/gpu/drm/i915/i915_irq.c        | 31 ++++++++++++++++++++++++++++++-
> >> drivers/gpu/drm/i915/i915_reg.h        |  6 ++++++
> >> drivers/gpu/drm/i915/intel_hangcheck.c | 13 +++++++++----
> >> drivers/gpu/drm/i915/intel_lrc.c       | 16 ++++++++++++++++
> >> 5 files changed, 65 insertions(+), 5 deletions(-)
> >>
> >>diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> >>index eed9ead1b592..0e4f4cc3c6de 100644
> >>--- a/drivers/gpu/drm/i915/i915_drv.h
> >>+++ b/drivers/gpu/drm/i915/i915_drv.h
> >>@@ -1568,6 +1568,9 @@ struct i915_gpu_error {
> >> 	 * recovery. All waiters on the reset_queue will be woken when
> >> 	 * that happens.
> >> 	 *
> >>+	 * When hw detects a hang before us, we can use I915_RESET_WATCHDOG to
> >>+	 * report the hang detection cause accurately.
> >>+	 *
> >> 	 * This counter is used by the wait_seqno code to notice that reset
> >> 	 * event happened and it needs to restart the entire ioctl (since most
> >> 	 * likely the seqno it waited for won't ever signal anytime soon).
> >>@@ -1580,6 +1583,7 @@ struct i915_gpu_error {
> >>
> >> 	unsigned long flags;
> >> #define I915_RESET_IN_PROGRESS	0
> >>+#define I915_RESET_WATCHDOG	2 /* looking at the future */
> >> #define I915_WEDGED		(BITS_PER_LONG - 1)
> >>
> >> 	/**
> >>diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> >>index bc70e2c451b2..4ef73363bbe9 100644
> >>--- a/drivers/gpu/drm/i915/i915_irq.c
> >>+++ b/drivers/gpu/drm/i915/i915_irq.c
> >>@@ -1352,6 +1352,28 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
> >> 		set_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
> >> 		tasklet_hi_schedule(&engine->irq_tasklet);
> >> 	}
> >>+
> >>+	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
> >>+		struct drm_i915_private *dev_priv = engine->i915;
> >>+		u32 watchdog_disable;
> >>+
> >>+		if (engine->id == RCS)
> >>+			watchdog_disable = GEN8_RCS_WATCHDOG_DISABLE;
> >>+		else
> >>+			watchdog_disable = GEN8_XCS_WATCHDOG_DISABLE;
> >>+
> >>+		/* Stop the counter to prevent further timeout interrupts */
> >>+		I915_WRITE_FW(RING_CNTR(engine->mmio_base), watchdog_disable);
> >
> >There's no guarrantee you hold forcewake, you need to use I915_WRITE.
> >Better yet would be to avoid having to wait for forcewake within the
> >hardirq handler.
> >
> >>+
> >>+		/* Make sure the active request will be marked as guilty */
> >>+		engine->hangcheck.stalled = true;
> >>+		engine->hangcheck.seqno = intel_engine_get_seqno(engine);
> >
> >Just set a flag saying the engine->hangcheck.watchdog = true. Don't
> >confuse us. engine->hangcheck.seqno does not give the guilty seqno!
> >
> >Also there is no guarrantee here that seqno is the guilty party. That's
> >a nasty bug. Servicing the interrupt will be running in parallel with
> >the GPU that may complete the request before we read the HWS.
> >
> >Please tell me we can use a PID along with the watchdog timer...
> 
> A 'watchdog' PID and 'running' PID in the HWSP would sound ok?

No, Another STORE_DWORD_IMM has the same asynchronicity issue as just
reading seqno. I take it there is no WATCHDOG_PID that is set when the
watchdog expires? Or we can't program the CS to stop when the watchdog
goes off?

The issue is that we may blame the following context (a completely
unrelated process) for the hang - dos ahoy.

Or we can do something like current hangcheck, program the watchdog to
fire twice before we declare a hang. And only reset if we see the same
seqno on both occasions.
 
> There's also the question if we want different thresholds per engine.

I suspect we do. But that can be extended through the same
context_set_param just by passing an array (size > 0) instead of a
single value.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-02-23 21:49     ` Chris Wilson
@ 2017-02-23 22:12       ` Michel Thierry
  0 siblings, 0 replies; 16+ messages in thread
From: Michel Thierry @ 2017-02-23 22:12 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx, mika.kuoppala



On 23/02/17 13:49, Chris Wilson wrote:
> On Thu, Feb 23, 2017 at 01:21:03PM -0800, Michel Thierry wrote:
>>
>>
>> On 23/02/17 12:57, Chris Wilson wrote:
>>> On Thu, Feb 23, 2017 at 11:44:17AM -0800, Michel Thierry wrote:
>>>> *** General ***
>>>>
>>>> Watchdog timeout (or "media engine reset") is a feature that allows
>>>> userland applications to enable hang detection on individual batch buffers.
>>>> The detection mechanism itself is mostly bound to the hardware and the only
>>>> thing that the driver needs to do to support this form of hang detection
>>>> is to implement the interrupt handling support as well as watchdog command
>>>> emission before and after the emitted batch buffer start instruction in the
>>>> ring buffer.
>>>>
>>>> The principle of the hang detection mechanism is as follows:
>>>>
>>>> 1. Once the decision has been made to enable watchdog timeout for a
>>>> particular batch buffer and the driver is in the process of emitting the
>>>> batch buffer start instruction into the ring buffer it also emits a
>>>> watchdog timer start instruction before and a watchdog timer cancellation
>>>> instruction after the batch buffer start instruction in the ring buffer.
>>>>
>>>> 2. Once the GPU execution reaches the watchdog timer start instruction
>>>> the hardware watchdog counter is started by the hardware. The counter
>>>> keeps counting until either reaching a previously configured threshold
>>>> value or the timer cancellation instruction is executed.
>>>>
>>>> 2a. If the counter reaches the threshold value the hardware fires a
>>>> watchdog interrupt that is picked up by the watchdog interrupt handler.
>>>> This means that a hang has been detected and the driver needs to deal with
>>>> it the same way it would deal with a engine hang detected by the periodic
>>>> hang checker. The only difference between the two is that we already blamed
>>>> the active request (to ensure an engine reset).
>>>>
>>>> 2b. If the batch buffer completes and the execution reaches the watchdog
>>>> cancellation instruction before the watchdog counter reaches its
>>>> threshold value the watchdog is cancelled and nothing more comes of it.
>>>> No hang is detected.
>>>>
>>>> Note about future interaction with preemption: Preemption could happen
>>>> in a command sequence prior to watchdog counter getting disabled,
>>>> resulting in watchdog being triggered following preemption. The driver will
>>>> need to explicitly disable the watchdog counter as part of the
>>>> preemption sequence.
>>>>
>>>> *** This patch introduces: ***
>>>>
>>>> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
>>>> based on hardware-driven hang detection, which then integrates directly
>>>> with the hang recovery path. This is independent of having per-engine reset
>>>> or just full gpu reset.
>>>>
>>>> 2. Watchdog specific register information.
>>>>
>>>> Currently the render engine and all available media engines support
>>>> watchdog timeout (VECS is only supported in GEN9). The specifications elude
>>>> to the BCS engine being supported but that is currently not supported by
>>>> this commit.
>>>>
>>>> Note that the value to stop the counter is different between render and
>>>> non-render engines.
>>>>
>>>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>>>> Signed-off-by: Ian Lister <ian.lister@intel.com>
>>>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>>>> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
>>>> ---
>>>> drivers/gpu/drm/i915/i915_drv.h        |  4 ++++
>>>> drivers/gpu/drm/i915/i915_irq.c        | 31 ++++++++++++++++++++++++++++++-
>>>> drivers/gpu/drm/i915/i915_reg.h        |  6 ++++++
>>>> drivers/gpu/drm/i915/intel_hangcheck.c | 13 +++++++++----
>>>> drivers/gpu/drm/i915/intel_lrc.c       | 16 ++++++++++++++++
>>>> 5 files changed, 65 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>>>> index eed9ead1b592..0e4f4cc3c6de 100644
>>>> --- a/drivers/gpu/drm/i915/i915_drv.h
>>>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>>>> @@ -1568,6 +1568,9 @@ struct i915_gpu_error {
>>>> 	 * recovery. All waiters on the reset_queue will be woken when
>>>> 	 * that happens.
>>>> 	 *
>>>> +	 * When hw detects a hang before us, we can use I915_RESET_WATCHDOG to
>>>> +	 * report the hang detection cause accurately.
>>>> +	 *
>>>> 	 * This counter is used by the wait_seqno code to notice that reset
>>>> 	 * event happened and it needs to restart the entire ioctl (since most
>>>> 	 * likely the seqno it waited for won't ever signal anytime soon).
>>>> @@ -1580,6 +1583,7 @@ struct i915_gpu_error {
>>>>
>>>> 	unsigned long flags;
>>>> #define I915_RESET_IN_PROGRESS	0
>>>> +#define I915_RESET_WATCHDOG	2 /* looking at the future */
>>>> #define I915_WEDGED		(BITS_PER_LONG - 1)
>>>>
>>>> 	/**
>>>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>>>> index bc70e2c451b2..4ef73363bbe9 100644
>>>> --- a/drivers/gpu/drm/i915/i915_irq.c
>>>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>>>> @@ -1352,6 +1352,28 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir, int test_shift)
>>>> 		set_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted);
>>>> 		tasklet_hi_schedule(&engine->irq_tasklet);
>>>> 	}
>>>> +
>>>> +	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift)) {
>>>> +		struct drm_i915_private *dev_priv = engine->i915;
>>>> +		u32 watchdog_disable;
>>>> +
>>>> +		if (engine->id == RCS)
>>>> +			watchdog_disable = GEN8_RCS_WATCHDOG_DISABLE;
>>>> +		else
>>>> +			watchdog_disable = GEN8_XCS_WATCHDOG_DISABLE;
>>>> +
>>>> +		/* Stop the counter to prevent further timeout interrupts */
>>>> +		I915_WRITE_FW(RING_CNTR(engine->mmio_base), watchdog_disable);
>>>
>>> There's no guarrantee you hold forcewake, you need to use I915_WRITE.
>>> Better yet would be to avoid having to wait for forcewake within the
>>> hardirq handler.
>>>
>>>> +
>>>> +		/* Make sure the active request will be marked as guilty */
>>>> +		engine->hangcheck.stalled = true;
>>>> +		engine->hangcheck.seqno = intel_engine_get_seqno(engine);
>>>
>>> Just set a flag saying the engine->hangcheck.watchdog = true. Don't
>>> confuse us. engine->hangcheck.seqno does not give the guilty seqno!
>>>
>>> Also there is no guarrantee here that seqno is the guilty party. That's
>>> a nasty bug. Servicing the interrupt will be running in parallel with
>>> the GPU that may complete the request before we read the HWS.
>>>
>>> Please tell me we can use a PID along with the watchdog timer...
>>
>> A 'watchdog' PID and 'running' PID in the HWSP would sound ok?
>
> No, Another STORE_DWORD_IMM has the same asynchronicity issue as just
> reading seqno. I take it there is no WATCHDOG_PID that is set when the
> watchdog expires? Or we can't program the CS to stop when the watchdog
> goes off?
>

If you were asking about the HW, no, there's no such thing, just a 
register for the counter and one for on/off control.

It is also not aware of what else is happening. For example, if it is 
pre-empted, we need to write to the reg to disable the timer, or it will 
expire while the new bb is running.

> The issue is that we may blame the following context (a completely
> unrelated process) for the hang - dos ahoy.
>
> Or we can do something like current hangcheck, program the watchdog to
> fire twice before we declare a hang. And only reset if we see the same
> seqno on both occasions.
>

I'll check how it works with the fire twice before doing something.

>> There's also the question if we want different thresholds per engine.
>
> I suspect we do. But that can be extended through the same
> context_set_param just by passing an array (size > 0) instead of a
> single value.
> -Chris
>
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH i-g-t 2/2] tests/drv_hangman: watchdog tests
  2017-02-23 19:44 ` [PATCH i-g-t 2/2] tests/drv_hangman: watchdog tests Michel Thierry
@ 2017-02-23 22:26   ` Chris Wilson
  2017-02-23 23:06     ` Chris Wilson
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Wilson @ 2017-02-23 22:26 UTC (permalink / raw)
  To: Michel Thierry; +Cc: intel-gfx

On Thu, Feb 23, 2017 at 11:44:21AM -0800, Michel Thierry wrote:
> Test watchdog triggered resets.

How about setting a watchdog N with a dummy load set to expire in 2N
(where 2N is less 1s to avoid hangcheck). Then check fence status to
ensure watchdog kicked the batch.  See gem_exec_fence.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH i-g-t 2/2] tests/drv_hangman: watchdog tests
  2017-02-23 22:26   ` Chris Wilson
@ 2017-02-23 23:06     ` Chris Wilson
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Wilson @ 2017-02-23 23:06 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx, mika.kuoppala

On Thu, Feb 23, 2017 at 10:26:29PM +0000, Chris Wilson wrote:
> On Thu, Feb 23, 2017 at 11:44:21AM -0800, Michel Thierry wrote:
> > Test watchdog triggered resets.
> 
> How about setting a watchdog N with a dummy load set to expire in 2N
> (where 2N is less 1s to avoid hangcheck). Then check fence status to
> ensure watchdog kicked the batch.  See gem_exec_fence.

Or better yet, in gem_exec_watchdog set i915.enable_hangcheck=0

Hmm, which raises a good question is the user watchdog independent of
the hangcheck module_option. My vote is for yes, but that will require
a little rejigging.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+
  2017-02-23 20:57 ` [RFC 1/3] " Chris Wilson
  2017-02-23 21:21   ` Michel Thierry
@ 2017-02-23 23:38   ` Chris Wilson
  1 sibling, 0 replies; 16+ messages in thread
From: Chris Wilson @ 2017-02-23 23:38 UTC (permalink / raw)
  To: Michel Thierry, intel-gfx, mika.kuoppala

On Thu, Feb 23, 2017 at 08:57:54PM +0000, Chris Wilson wrote:
> On Thu, Feb 23, 2017 at 11:44:17AM -0800, Michel Thierry wrote:
> > +
> > +		/* Make sure the active request will be marked as guilty */
> > +		engine->hangcheck.stalled = true;
> > +		engine->hangcheck.seqno = intel_engine_get_seqno(engine);
> 
> Just set a flag saying the engine->hangcheck.watchdog = true. Don't
> confuse us. engine->hangcheck.seqno does not give the guilty seqno!

Hmm. So I was expecting a little more work on hangcheck. Once we kick
hangcheck from watchdog, we need to confirm that the active seqno hasn't
advance and ideally we would stop the ring before it does. As it stands
hangcheck.seqno will at least detect when we fail to reset in time
(still has the smell of dos as both the watchdog timeout and the request
duration are under user control), but I'm still uncomfortable with it
being set outside of the timer based hangcheck - at least not unless we
split the different seqno:

engine->hangcheck.watchdog_seqno
engine->hangcheck.timer_seqno
engine->hangcheck.stalled // now a seqno from either path

We also need to teach the standard hangcheck/i915_reset to only reset
selected engines.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC 2/3] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2017-02-23 21:03   ` Chris Wilson
@ 2017-02-24  9:15     ` Tvrtko Ursulin
  0 siblings, 0 replies; 16+ messages in thread
From: Tvrtko Ursulin @ 2017-02-24  9:15 UTC (permalink / raw)
  To: Chris Wilson, Michel Thierry, intel-gfx, mika.kuoppala



On 23/02/2017 21:03, Chris Wilson wrote:
> On Thu, Feb 23, 2017 at 11:44:18AM -0800, Michel Thierry wrote:

[snip]

>> +static int gen8_emit_stop_watchdog(struct drm_i915_gem_request *req)
>> +{
>> +	struct intel_engine_cs *engine = req->engine;
>> +	u32 *cs;
>> +
>> +	cs = intel_ring_begin(req, 4);
>> +	if (IS_ERR(cs))
>> +		return PTR_ERR(cs);
>> +
>> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
>> +	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
>> +
>> +	switch (engine->id) {
>> +	default:
>> +		WARN(1, "%s does not support watchdog timeout!\n",
>> +		     engine->name);
>> +	/* default to render engine */
>
> Just
> GEM_BUG_ON(engine->id == BCS);
> if (engine->id == RCS)
> 	*cs++ = GEN8_RCS_WATCHDOG_DISABLE;
> else
> 	*cs++ = GEN8_XCS_WATCHDOG_DISABLE;

I'd go further and store it in the engine eg. "*cs++ = 
engine->watchdog_disable_cmd" or something. I think that's better than 
engines with identity crisis. :)

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-02-24  9:15 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-23 19:44 [RFC 1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+ Michel Thierry
2017-02-23 19:44 ` [RFC 2/3] drm/i915: Watchdog timeout: Ringbuffer command emission " Michel Thierry
2017-02-23 21:03   ` Chris Wilson
2017-02-24  9:15     ` Tvrtko Ursulin
2017-02-23 19:44 ` [RFC 3/3] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Michel Thierry
2017-02-23 21:14   ` Chris Wilson
2017-02-23 19:44 ` [PATCH i-g-t 1/2] lib/igt_gt: Add watchdog gem_ctx_set_param ioctl interface Michel Thierry
2017-02-23 19:44 ` [PATCH i-g-t 2/2] tests/drv_hangman: watchdog tests Michel Thierry
2017-02-23 22:26   ` Chris Wilson
2017-02-23 23:06     ` Chris Wilson
2017-02-23 20:52 ` ✓ Fi.CI.BAT: success for series starting with [RFC,1/3] drm/i915: Watchdog timeout: IRQ handler for gen8+ Patchwork
2017-02-23 20:57 ` [RFC 1/3] " Chris Wilson
2017-02-23 21:21   ` Michel Thierry
2017-02-23 21:49     ` Chris Wilson
2017-02-23 22:12       ` Michel Thierry
2017-02-23 23:38   ` Chris Wilson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.