* Gen8+ engine-reset
@ 2019-01-05  2:39 Carlos Santa
  2019-01-05  2:39 ` drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
                   ` (11 more replies)
  0 siblings, 12 replies; 48+ messages in thread
From: Carlos Santa @ 2019-01-05  2:39 UTC (permalink / raw)
  To: intel-gfx

This is a rebase of the original patch series from Michel Thierry,
which can be found here:

https://patchwork.freedesktop.org/series/21868

Note that this series is limited to the GPU watchdog timeout for
execlists submission; GuC-based submission support is left for a
later time.

The series was also successfully tested from userspace through the
Intel i965 media driver, which is readily available on several
Linux-based OSes, including Ubuntu and Chromium OS. The corresponding
changes to the i965 media userspace driver are currently under review at

https://github.com/intel/intel-vaapi-driver/pull/429/files

The testbed used for this series included a SKL-based NUC with
2 BSD rings as well as a KBL-based Chromebook with 1 BSD ring.

Carlos Santa (1):
  drm/i915: Only process VCS2 only when supported

Michel Thierry (7):
  drm/i915: Add engine reset count in get-reset-stats ioctl
  drm/i915: Watchdog timeout: IRQ handler for gen8+
  drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  drm/i915: Watchdog timeout: Include threshold value in error state
  drm/i915/watchdog: move emit_stop_watchdog until the very end of the
    ring commands
  drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset?

 drivers/gpu/drm/i915/i915_drv.h         |  56 +++++++
 drivers/gpu/drm/i915/i915_gem_context.c | 103 +++++++++++-
 drivers/gpu/drm/i915/i915_gem_context.h |   4 +
 drivers/gpu/drm/i915/i915_gpu_error.c   |  12 +-
 drivers/gpu/drm/i915/i915_gpu_error.h   |   5 +
 drivers/gpu/drm/i915/i915_irq.c         |  17 +-
 drivers/gpu/drm/i915/i915_reg.h         |   6 +
 drivers/gpu/drm/i915/intel_hangcheck.c  |  20 ++-
 drivers/gpu/drm/i915/intel_lrc.c        | 208 +++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  10 ++
 include/uapi/drm/i915_drm.h             |   7 +-
 11 files changed, 428 insertions(+), 20 deletions(-)

-- 
2.17.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* drm/i915: Add engine reset count in get-reset-stats ioctl
  2019-01-05  2:39 Gen8+ engine-reset Carlos Santa
@ 2019-01-05  2:39 ` Carlos Santa
  2019-01-05  2:39 ` drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 48+ messages in thread
From: Carlos Santa @ 2019-01-05  2:39 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

Users/tests relying on the total reset count will start seeing a smaller
number, since most hangs can now be handled by engine reset.
Note that if engine x is reset, a context running on engine y will be
unaware and unaffected.

To start the discussion, include just a total engine reset count. If it
is deemed useful, it can be extended to report each engine separately.

Our IGT gem_reset_stats test will need changes to ignore the pad field,
since it can now return reset_engine_count.

v2: s/engine_reset/reset_engine/, use union in uapi to not break compatibility.
v3: Keep rejecting attempts to use pad as input (Antonio)
v4: Rebased.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 12 ++++++++++--
 include/uapi/drm/i915_drm.h             |  6 +++++-
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 5905b6d8f291..40fefed8c92f 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -964,6 +964,8 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	struct drm_i915_private *dev_priv = to_i915(dev);
 	struct drm_i915_reset_stats *args = data;
 	struct i915_gem_context *ctx;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
 	int ret;
 
 	if (args->flags || args->pad)
@@ -982,10 +984,16 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev,
 	 * we should wrap the hangstats with a seqlock.
 	 */
 
-	if (capable(CAP_SYS_ADMIN))
+	if (capable(CAP_SYS_ADMIN)) {
 		args->reset_count = i915_reset_count(&dev_priv->gpu_error);
-	else
+		for_each_engine(engine, dev_priv, id)
+			args->reset_engine_count +=
+				i915_reset_engine_count(&dev_priv->gpu_error,
+							engine);
+	} else {
 		args->reset_count = 0;
+		args->reset_engine_count = 0;
+	}
 
 	args->batch_active = atomic_read(&ctx->guilty_count);
 	args->batch_pending = atomic_read(&ctx->active_count);
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 298b2e197744..0bc9e00e66ce 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1456,7 +1456,11 @@ struct drm_i915_reset_stats {
 	/* Number of batches lost pending for execution, for this context */
 	__u32 batch_pending;
 
-	__u32 pad;
+	union {
+		__u32 pad;
+		/* Engine resets since boot/module reload, for all contexts */
+		__u32 reset_engine_count;
+	};
 };
 
 struct drm_i915_gem_userptr {
-- 
2.17.1


* drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-05  2:39 Gen8+ engine-reset Carlos Santa
  2019-01-05  2:39 ` drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
@ 2019-01-05  2:39 ` Carlos Santa
  2019-01-07 11:58   ` Tvrtko Ursulin
  2019-01-05  2:39 ` drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 48+ messages in thread
From: Carlos Santa @ 2019-01-05  2:39 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

*** General ***

Watchdog timeout (or "media engine reset") is a feature that allows
userland applications to enable hang detection on individual batch buffers.
The detection mechanism itself is mostly bound to the hardware; the only
things the driver needs to do to support this form of hang detection are
to implement the interrupt handling and to emit the watchdog commands
before and after the batch buffer start instruction in the ring buffer.

The principle of the hang detection mechanism is as follows:

1. Once the decision has been made to enable watchdog timeout for a
particular batch buffer, the driver, while emitting the batch buffer
start instruction into the ring buffer, also emits a watchdog timer
start instruction before it and a watchdog timer cancellation
instruction after it.

2. Once GPU execution reaches the watchdog timer start instruction,
the hardware starts the watchdog counter. The counter keeps counting
until it either reaches a previously configured threshold value or the
timer cancellation instruction is executed.

2a. If the counter reaches the threshold value, the hardware fires a
watchdog interrupt that is picked up by the watchdog interrupt handler.
This means that a hang has been detected and the driver needs to deal with
it the same way it would deal with an engine hang detected by the periodic
hang checker. The only difference between the two is that we have already
blamed the active request (to ensure an engine reset).

2b. If the batch buffer completes and execution reaches the watchdog
cancellation instruction before the watchdog counter reaches its
threshold value, the watchdog is cancelled and nothing more comes of it.
No hang is detected.

Note about future interaction with preemption: preemption could happen
in a command sequence prior to the watchdog counter being disabled,
resulting in the watchdog being triggered following preemption (e.g. when
the watchdog had been enabled in the low-priority batch). The driver will
need to explicitly disable the watchdog counter as part of the
preemption sequence.

*** This patch introduces: ***

1. IRQ handler code for watchdog timeout allowing direct hang recovery
based on hardware-driven hang detection, which then integrates directly
with the hang recovery path. This is independent of having per-engine reset
or just full gpu reset.

2. Watchdog specific register information.

Currently the render engine and all available media engines support
watchdog timeout (VECS only from GEN9 onwards). The specifications allude
to the BCS engine also being supported, but that is not covered by this
commit.

Note that the value used to stop the counter differs between render and
non-render engines on GEN8; from GEN9 onwards it is the same for all engines.

v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
against false-positives.

v3: Don't use high priority tasklet, use engine_last_submit while
checking for false-positives. From GEN9 onwards, the stop counter bit is
the same for all engines.

v4: Remove unnecessary brackets, use current_seqno to mark the request
as guilty in the hangcheck/capture code.

v5: Rebased after RESET_ENGINEs flag.

v6: Don't capture error state in case of watchdog timeout. The capture
process is time consuming and this will align to what happens when we
use GuC to handle the watchdog timeout. (Chris)

v7: Rebase.

v8: Rebase, use HZ to reschedule.

v9: Rebase, get forcewake domains in function (no longer in execlists
struct).

v10: Rebase.

Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.h   |  4 ++
 drivers/gpu/drm/i915/i915_irq.c         | 14 +++-
 drivers/gpu/drm/i915/i915_reg.h         |  6 ++
 drivers/gpu/drm/i915/intel_hangcheck.c  | 17 +++--
 drivers/gpu/drm/i915/intel_lrc.c        | 86 +++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |  6 ++
 6 files changed, 126 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index 6d9f45468ac1..7130786aa5b4 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -256,6 +256,9 @@ struct i915_gpu_error {
 	 * inspect the bit and do the reset directly, otherwise the worker
 	 * waits for the struct_mutex.
 	 *
+	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
+	 * I915_RESET_WATCHDOG to report the hang detection cause accurately.
+	 *
 	 * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't need to
 	 * acquire the struct_mutex to reset an engine, we need an explicit
 	 * flag to prevent two concurrent reset attempts in the same engine.
@@ -271,6 +274,7 @@ struct i915_gpu_error {
 #define I915_RESET_BACKOFF	0
 #define I915_RESET_HANDOFF	1
 #define I915_RESET_MODESET	2
+#define I915_RESET_WATCHDOG	3
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 #define I915_RESET_ENGINE	(I915_WEDGED - I915_NUM_ENGINES)
 
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index fbb094ecf6c9..859bbadb752f 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1498,6 +1498,9 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
 
 	if (tasklet)
 		tasklet_hi_schedule(&engine->execlists.tasklet);
+
+	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT))
+		tasklet_schedule(&engine->execlists.watchdog_tasklet);
 }
 
 static void gen8_gt_irq_ack(struct drm_i915_private *i915,
@@ -3329,7 +3332,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 	if (intel_has_reset_engine(dev_priv) &&
 	    !i915_terminally_wedged(&dev_priv->gpu_error)) {
 		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
-			BUILD_BUG_ON(I915_RESET_MODESET >= I915_RESET_ENGINE);
+			BUILD_BUG_ON(I915_RESET_WATCHDOG >= I915_RESET_ENGINE);
 			if (test_and_set_bit(I915_RESET_ENGINE + engine->id,
 					     &dev_priv->gpu_error.flags))
 				continue;
@@ -4162,12 +4165,15 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	uint32_t gt_interrupts[] = {
 		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT,
 		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
 			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
-			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
+			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
+			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
 		0,
 		GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
 			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT
@@ -4176,6 +4182,10 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
 	if (HAS_L3_DPF(dev_priv))
 		gt_interrupts[0] |= GT_RENDER_L3_PARITY_ERROR_INTERRUPT;
 
+	/* VECS watchdog is only available in skl+ */
+	if (INTEL_GEN(dev_priv) >= 9)
+		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
+
 	dev_priv->pm_ier = 0x0;
 	dev_priv->pm_imr = ~dev_priv->pm_ier;
 	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index 44958d994bfa..fff330643090 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -2335,6 +2335,11 @@ enum i915_power_well_id {
 #define RING_START(base)	_MMIO((base) + 0x38)
 #define RING_CTL(base)		_MMIO((base) + 0x3c)
 #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
+#define RING_CNTR(base)		_MMIO((base) + 0x178)
+#define   GEN8_WATCHDOG_ENABLE		0
+#define   GEN8_WATCHDOG_DISABLE		1
+#define   GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 & non-render only */
+#define RING_THRESH(base)	_MMIO((base) + 0x17C)
 #define RING_SYNC_0(base)	_MMIO((base) + 0x40)
 #define RING_SYNC_1(base)	_MMIO((base) + 0x44)
 #define RING_SYNC_2(base)	_MMIO((base) + 0x48)
@@ -2894,6 +2899,7 @@ enum i915_power_well_id {
 #define GT_BSD_USER_INTERRUPT			(1 << 12)
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
 #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
+#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
 #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
 #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
 #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index 51e9efec5116..2906f0ef3d77 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -213,7 +213,8 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
 
 static void hangcheck_declare_hang(struct drm_i915_private *i915,
 				   unsigned int hung,
-				   unsigned int stuck)
+				   unsigned int stuck,
+				   unsigned int watchdog)
 {
 	struct intel_engine_cs *engine;
 	char msg[80];
@@ -226,13 +227,16 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
 	if (stuck != hung)
 		hung &= ~stuck;
 	len = scnprintf(msg, sizeof(msg),
-			"%s on ", stuck == hung ? "no progress" : "hang");
+			"%s on ", watchdog ? "watchdog timeout" :
+				  stuck == hung ? "no progress" : "hang");
 	for_each_engine_masked(engine, i915, hung, tmp)
 		len += scnprintf(msg + len, sizeof(msg) - len,
 				 "%s, ", engine->name);
 	msg[len-2] = '\0';
 
-	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
+	return i915_handle_error(i915, hung,
+				 watchdog ? 0 : I915_ERROR_CAPTURE,
+				 "%s", msg);
 }
 
 /*
@@ -250,7 +254,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			     gpu_error.hangcheck_work.work);
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
-	unsigned int hung = 0, stuck = 0, wedged = 0;
+	unsigned int hung = 0, stuck = 0, wedged = 0, watchdog = 0;
 
 	if (!i915_modparams.enable_hangcheck)
 		return;
@@ -261,6 +265,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	if (i915_terminally_wedged(&dev_priv->gpu_error))
 		return;
 
+	if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags))
+		watchdog = 1;
+
 	/* As enabling the GPU requires fairly extensive mmio access,
 	 * periodically arm the mmio checker to see if we are triggering
 	 * any invalid access.
@@ -293,7 +300,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 	}
 
 	if (hung)
-		hangcheck_declare_hang(dev_priv, hung, stuck);
+		hangcheck_declare_hang(dev_priv, hung, stuck, watchdog);
 
 	/* Reset timer in case GPU hangs without another request being added */
 	i915_queue_hangcheck(dev_priv);
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 6c98fb7cebf2..e1dcdf545bee 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2027,6 +2027,70 @@ static int gen8_emit_flush_render(struct i915_request *request,
 	return 0;
 }
 
+/* From GEN9 onwards, all engines use the same RING_CNTR format */
+static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
+{
+	if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9)
+		return GEN8_WATCHDOG_DISABLE;
+	else
+		return GEN8_XCS_WATCHDOG_DISABLE;
+}
+
+#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
+static void gen8_watchdog_irq_handler(unsigned long data)
+{
+	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
+	struct drm_i915_private *dev_priv = engine->i915;
+	enum forcewake_domains fw_domains;
+	u32 current_seqno;
+
+	switch (engine->id) {
+	default:
+		MISSING_CASE(engine->id);
+		/* fall through */
+	case RCS:
+		fw_domains = FORCEWAKE_RENDER;
+		break;
+	case VCS:
+	case VCS2:
+	case VECS:
+		fw_domains = FORCEWAKE_MEDIA;
+		break;
+	}
+
+	intel_uncore_forcewake_get(dev_priv, fw_domains);
+
+	/* Stop the counter to prevent further timeout interrupts */
+	I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));
+
+	current_seqno = intel_engine_get_seqno(engine);
+
+	/* did the request complete after the timer expired? */
+	if (intel_engine_last_submit(engine) == current_seqno)
+		goto fw_put;
+
+	if (engine->hangcheck.watchdog == current_seqno) {
+		/* Make sure the active request will be marked as guilty */
+		engine->hangcheck.stalled = true;
+		engine->hangcheck.acthd = intel_engine_get_active_head(engine);
+		engine->hangcheck.seqno = current_seqno;
+
+		/* And try to run the hangcheck_work as soon as possible */
+		set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
+		queue_delayed_work(system_long_wq,
+				   &dev_priv->gpu_error.hangcheck_work,
+				   round_jiffies_up_relative(HZ));
+	} else {
+		engine->hangcheck.watchdog = current_seqno;
+		/* Re-start the counter, if really hung, it will expire again */
+		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
+		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
+	}
+
+fw_put:
+	intel_uncore_forcewake_put(dev_priv, fw_domains);
+}
+
 /*
  * Reserve space for 2 NOOPs at the end of each request to be
  * used as a workaround for not being allowed to do lite
@@ -2115,6 +2179,9 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
 			     &engine->execlists.tasklet.state)))
 		tasklet_kill(&engine->execlists.tasklet);
 
+	if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->execlists.watchdog_tasklet.state)))
+		tasklet_kill(&engine->execlists.watchdog_tasklet);
+
 	dev_priv = engine->i915;
 
 	if (engine->buffer) {
@@ -2208,6 +2275,22 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
 
 	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
 	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
+
+	switch (engine->id) {
+	default:
+		/* BCS engine does not support hw watchdog */
+		break;
+	case RCS:
+	case VCS:
+	case VCS2:
+		engine->irq_keep_mask |= (GT_GEN8_WATCHDOG_INTERRUPT << shift);
+		break;
+	case VECS:
+		if (INTEL_GEN(engine->i915) >= 9)
+			engine->irq_keep_mask |=
+				(GT_GEN8_WATCHDOG_INTERRUPT << shift);
+		break;
+	}
 }
 
 static void
@@ -2221,6 +2304,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
 	tasklet_init(&engine->execlists.tasklet,
 		     execlists_submission_tasklet, (unsigned long)engine);
 
+	tasklet_init(&engine->execlists.watchdog_tasklet,
+		     gen8_watchdog_irq_handler, (unsigned long)engine);
+
 	logical_ring_default_vfuncs(engine);
 	logical_ring_default_irqs(engine);
 }
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 3c1366c58cf3..6cb8b4280035 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -120,6 +120,7 @@ struct intel_instdone {
 struct intel_engine_hangcheck {
 	u64 acthd;
 	u32 seqno;
+	u32 watchdog;
 	enum intel_engine_hangcheck_action action;
 	unsigned long action_timestamp;
 	int deadlock;
@@ -224,6 +225,11 @@ struct intel_engine_execlists {
 	 */
 	struct tasklet_struct tasklet;
 
+	/**
+	 * @watchdog_tasklet: stop counter and re-schedule hangcheck_work asap
+	 */
+	struct tasklet_struct watchdog_tasklet;
+
 	/**
 	 * @default_priolist: priority list for I915_PRIORITY_NORMAL
 	 */
-- 
2.17.1


* drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2019-01-05  2:39 Gen8+ engine-reset Carlos Santa
  2019-01-05  2:39 ` drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
  2019-01-05  2:39 ` drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
@ 2019-01-05  2:39 ` Carlos Santa
  2019-01-07 12:21   ` Tvrtko Ursulin
  2019-01-05  2:39 ` drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Carlos Santa
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 48+ messages in thread
From: Carlos Santa @ 2019-01-05  2:39 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

Emit the required commands into the ring buffer for starting and
stopping the watchdog timer before/after batch buffer start during
batch buffer submission.

v2: Support watchdog threshold per context engine, merge lri commands,
and move watchdog commands emission to emit_bb_start. Request space of
combined start_watchdog, bb_start and stop_watchdog to avoid any error
after emitting bb_start.

v3: There were too many req->engine in emit_bb_start.
Use GEM_BUG_ON instead of returning a very late EINVAL in the remote
case of watchdog misprogramming; set correct LRI cmd size in
emit_stop_watchdog. (Chris)

v4: Rebase.
v5: use to_intel_context instead of ctx->engine.
v6: Rebase.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.h |  4 ++
 drivers/gpu/drm/i915/intel_lrc.c        | 80 ++++++++++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  4 ++
 3 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
index f6d870b1f73e..62f4eb04985b 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.h
+++ b/drivers/gpu/drm/i915/i915_gem_context.h
@@ -169,6 +169,10 @@ struct i915_gem_context {
 		u32 *lrc_reg_state;
 		u64 lrc_desc;
 		int pin_count;
+		/** watchdog_threshold: hw watchdog threshold value,
+		 * in clock counts
+		 */
+		u32 watchdog_threshold;
 
 		const struct intel_context_ops *ops;
 	} __engine[I915_NUM_ENGINES];
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index e1dcdf545bee..0ea5a37c3357 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1872,12 +1872,33 @@ static int gen8_emit_bb_start(struct i915_request *rq,
 			      u64 offset, u32 len,
 			      const unsigned int flags)
 {
+	struct intel_engine_cs *engine = rq->engine;
 	u32 *cs;
+	u32 num_dwords;
+	bool watchdog_running = false;
 
-	cs = intel_ring_begin(rq, 6);
+	/* bb_start only */
+	num_dwords = 6;
+
+	/* check if watchdog will be required */
+	if (to_intel_context(rq->gem_context, engine)->watchdog_threshold != 0) {
+		GEM_BUG_ON(!engine->emit_start_watchdog ||
+			   !engine->emit_stop_watchdog);
+
+		/* + start_watchdog (6) + stop_watchdog (4) */
+		num_dwords += 10;
+		watchdog_running = true;
+	}
+
+	cs = intel_ring_begin(rq, num_dwords);
 	if (IS_ERR(cs))
 		return PTR_ERR(cs);
 
+	if (watchdog_running) {
+		/* Start watchdog timer */
+		cs = engine->emit_start_watchdog(rq, cs);
+	}
+
 	/*
 	 * WaDisableCtxRestoreArbitration:bdw,chv
 	 *
@@ -1906,8 +1927,12 @@ static int gen8_emit_bb_start(struct i915_request *rq,
 	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
 	*cs++ = MI_NOOP;
 
-	intel_ring_advance(rq, cs);
+	if (watchdog_running) {
+		/* Cancel watchdog timer */
+		cs = engine->emit_stop_watchdog(rq, cs);
+	}
 
+	intel_ring_advance(rq, cs);
 	return 0;
 }
 
@@ -2091,6 +2116,49 @@ static void gen8_watchdog_irq_handler(unsigned long data)
 	intel_uncore_forcewake_put(dev_priv, fw_domains);
 }
 
+static u32 *gen8_emit_start_watchdog(struct i915_request *rq, u32 *cs)
+{
+	struct intel_engine_cs *engine = rq->engine;
+	struct i915_gem_context *ctx = rq->gem_context;
+	struct intel_context *ce = to_intel_context(ctx, engine);
+
+	/* XXX: no watchdog support in BCS engine */
+	GEM_BUG_ON(engine->id == BCS);
+
+	/*
+	 * watchdog register must never be programmed to zero. This would
+	 * cause the watchdog counter to exceed and not allow the engine to
+	 * go into IDLE state
+	 */
+	GEM_BUG_ON(ce->watchdog_threshold == 0);
+
+	/* Set counter period */
+	*cs++ = MI_LOAD_REGISTER_IMM(2);
+	*cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base));
+	*cs++ = ce->watchdog_threshold;
+	/* Start counter */
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = GEN8_WATCHDOG_ENABLE;
+	*cs++ = MI_NOOP;
+
+	return cs;
+}
+
+static u32 *gen8_emit_stop_watchdog(struct i915_request *rq, u32 *cs)
+{
+	struct intel_engine_cs *engine = rq->engine;
+
+	/* XXX: no watchdog support in BCS engine */
+	GEM_BUG_ON(engine->id == BCS);
+
+	*cs++ = MI_LOAD_REGISTER_IMM(1);
+	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
+	*cs++ = get_watchdog_disable(engine);
+	*cs++ = MI_NOOP;
+
+	return cs;
+}
+
 /*
  * Reserve space for 2 NOOPs at the end of each request to be
  * used as a workaround for not being allowed to do lite
@@ -2366,6 +2434,8 @@ int logical_render_ring_init(struct intel_engine_cs *engine)
 	engine->emit_flush = gen8_emit_flush_render;
 	engine->emit_breadcrumb = gen8_emit_breadcrumb_rcs;
 	engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_rcs_sz;
+	engine->emit_start_watchdog = gen8_emit_start_watchdog;
+	engine->emit_stop_watchdog = gen8_emit_stop_watchdog;
 
 	ret = logical_ring_init(engine);
 	if (ret)
@@ -2392,6 +2462,12 @@ int logical_xcs_ring_init(struct intel_engine_cs *engine)
 {
 	logical_ring_setup(engine);
 
+	/* BCS engine does not have a watchdog-expired irq */
+	if (engine->id != BCS) {
+		engine->emit_start_watchdog = gen8_emit_start_watchdog;
+		engine->emit_stop_watchdog = gen8_emit_stop_watchdog;
+	}
+
 	return logical_ring_init(engine);
 }
 
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 6cb8b4280035..c7aa842a2db1 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -469,6 +469,10 @@ struct intel_engine_cs {
 	int		(*init_context)(struct i915_request *rq);
 
 	int		(*emit_flush)(struct i915_request *request, u32 mode);
+	u32 *		(*emit_start_watchdog)(struct i915_request *rq,
+					       u32 *cs);
+	u32 *		(*emit_stop_watchdog)(struct i915_request *rq,
+					      u32 *cs);
 #define EMIT_INVALIDATE	BIT(0)
 #define EMIT_FLUSH	BIT(1)
 #define EMIT_BARRIER	(EMIT_INVALIDATE | EMIT_FLUSH)
-- 
2.17.1


* drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2019-01-05  2:39 Gen8+ engine-reset Carlos Santa
                   ` (2 preceding siblings ...)
  2019-01-05  2:39 ` drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
@ 2019-01-05  2:39 ` Carlos Santa
  2019-01-07 12:38   ` Tvrtko Ursulin
  2019-01-05  2:39 ` drm/i915: Watchdog timeout: Include threshold value in error state Carlos Santa
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 48+ messages in thread
From: Carlos Santa @ 2019-01-05  2:39 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

Final enablement patch for GPU hang detection using watchdog timeout.
Using the gem_context_setparam ioctl, users can specify the desired
timeout value in microseconds, and the driver will do the conversion to
'timestamps'.

The recommended default watchdog threshold for video engines is 60000 us,
since this has been _empirically determined_ to be a good compromise for
low-latency requirements and low rate of false positives. The default
register value is ~106000us and the theoretical max value (all 1s) is
353 seconds.

Note, UABI engine ids and i915 engine ids are different, and this patch
uses the i915 ones. Some kind of mapping table [1] is required if we
decide to use the UABI engine ids.

[1] http://patchwork.freedesktop.org/patch/msgid/20170329135831.30254-2-chris@chris-wilson.co.uk

v2: Fixed get api to return values in microseconds. Threshold updated to
be per context engine. Check for u32 overflow. Capture ctx threshold
value in error state.

v3: Add a way to get array size, short-cut to disable all thresholds,
return EFAULT / EINVAL as needed. Move the capture of the threshold
value in the error state into a new patch. BXT has a different
timestamp base (because why not?).

v4: Checking if watchdog is available should be the first thing to
do, instead of giving false hopes to abi users; remove unnecessary & in
set_watchdog; ignore args->size in getparam.

v5: GEN9-LP platforms have a different crystal clock frequency, use the
right timestamp base for them (magic 8-ball predicts this will change
again later on, so future-proof it). (Daniele)

v6: Rebase, no more mutex BLK in getparam_ioctl.

v7: use to_intel_context instead of ctx->engine.

v8: Rebase, remove extra mutex from i915_gem_context_set_watchdog (Tvrtko),
Update UAPI to use engine class while keeping thresholds per
engine class (Michel).

Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h         | 56 +++++++++++++++
 drivers/gpu/drm/i915/i915_gem_context.c | 91 +++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.c        |  5 +-
 include/uapi/drm/i915_drm.h             |  1 +
 4 files changed, 151 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 7fa2a405c5fe..96d59c22e2ba 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1552,6 +1552,9 @@ struct drm_i915_private {
 	struct drm_i915_fence_reg fence_regs[I915_MAX_NUM_FENCES]; /* assume 965 */
 	int num_fence_regs; /* 8 on pre-965, 16 otherwise */
 
+	/* Command stream timestamp base - helps define watchdog threshold */
+	u32 cs_timestamp_base;
+
 	unsigned int fsb_freq, mem_freq, is_ddr3;
 	unsigned int skl_preferred_vco_freq;
 	unsigned int max_cdclk_freq;
@@ -3117,6 +3120,59 @@ i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id)
 	return ctx;
 }
 
+/*
+ * BDW, CHV & SKL+ Timestamp timer resolution = 0.080 uSec,
+ * or 12500000 counts per second, or ~12 counts per microsecond.
+ *
+ * But BXT/GLK Timestamp timer resolution is different, 0.052 uSec,
+ * or 19200000 counts per second, or ~19 counts per microsecond.
+ *
+ * Future-proofing, some day it won't be as simple as just GEN & IS_LP.
+ */
+#define GEN8_TIMESTAMP_CNTS_PER_USEC 12
+#define GEN9_LP_TIMESTAMP_CNTS_PER_USEC 19
+static inline u32 cs_timestamp_in_us(struct drm_i915_private *dev_priv)
+{
+	u32 cs_timestamp_base = dev_priv->cs_timestamp_base;
+
+	if (cs_timestamp_base)
+		return cs_timestamp_base;
+
+	switch (INTEL_GEN(dev_priv)) {
+	default:
+		MISSING_CASE(INTEL_GEN(dev_priv));
+		/* fall through */
+	case 9:
+		cs_timestamp_base = IS_GEN9_LP(dev_priv) ?
+					GEN9_LP_TIMESTAMP_CNTS_PER_USEC :
+					GEN8_TIMESTAMP_CNTS_PER_USEC;
+		break;
+	case 8:
+		cs_timestamp_base = GEN8_TIMESTAMP_CNTS_PER_USEC;
+		break;
+	}
+
+	dev_priv->cs_timestamp_base = cs_timestamp_base;
+	return cs_timestamp_base;
+}
+
+static inline u32
+watchdog_to_us(struct drm_i915_private *dev_priv, u32 value_in_clock_counts)
+{
+	return value_in_clock_counts / cs_timestamp_in_us(dev_priv);
+}
+
+static inline u32
+watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us)
+{
+	u64 threshold = value_in_us * cs_timestamp_in_us(dev_priv);
+
+	if (overflows_type(threshold, u32))
+		return -EINVAL;
+
+	return threshold;
+}
+
 int i915_perf_open_ioctl(struct drm_device *dev, void *data,
 			 struct drm_file *file);
 int i915_perf_add_config_ioctl(struct drm_device *dev, void *data,
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 40fefed8c92f..4e421b79ac07 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -840,6 +840,91 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
 	return 0;
 }
 
+/* Return the timer count threshold in microseconds. */
+int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *dev_priv = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	u32 threshold_in_us[OTHER_CLASS];
+
+	if (!dev_priv->engine[VCS]->emit_start_watchdog)
+		return -ENODEV;
+
+	for_each_engine(engine, dev_priv, id) {
+		struct intel_context *ce = to_intel_context(ctx, engine);
+
+		threshold_in_us[engine->class] = watchdog_to_us(dev_priv,
+								ce->watchdog_threshold);
+	}
+
+	if (__copy_to_user(u64_to_user_ptr(args->value),
+			   &threshold_in_us,
+			   sizeof(threshold_in_us))) {
+		return -EFAULT;
+	}
+
+	args->size = sizeof(threshold_in_us);
+
+	return 0;
+}
+
+/*
+ * Based on time out value in microseconds (us) calculate
+ * timer count thresholds needed based on core frequency.
+ * Watchdog can be disabled by setting it to 0.
+ */
+int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
+				  struct drm_i915_gem_context_param *args)
+{
+	struct drm_i915_private *dev_priv = ctx->i915;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	u32 i;
+	u32 threshold[OTHER_CLASS];
+
+	if (!dev_priv->engine[VCS]->emit_start_watchdog)
+		return -ENODEV;
+
+	memset(threshold, 0, sizeof(threshold));
+
+	/* shortcut to disable in all engines */
+	if (args->size == 0)
+		goto set_watchdog;
+
+	if (args->size < sizeof(threshold))
+		return -EFAULT;
+
+	if (copy_from_user(threshold,
+			   u64_to_user_ptr(args->value),
+			   sizeof(threshold)))
+		return -EFAULT;
+
+	/* not supported in blitter engine */
+	if (threshold[COPY_ENGINE_CLASS] != 0)
+		return -EINVAL;
+
+	for (i = RENDER_CLASS; i < OTHER_CLASS; i++) {
+		threshold[i] = watchdog_to_clock_counts(dev_priv, threshold[i]);
+
+		if (threshold[i] == -EINVAL)
+			return -EINVAL;
+	}
+
+set_watchdog:
+	for_each_engine(engine, dev_priv, id) {
+		struct intel_context *ce = to_intel_context(ctx, engine);
+
+		ce->watchdog_threshold = threshold[engine->class];
+	}
+
+	return 0;
+}
+
 int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 				    struct drm_file *file)
 {
@@ -877,6 +962,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
 	case I915_CONTEXT_PARAM_PRIORITY:
 		args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
 		break;
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_get_watchdog(ctx, args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -949,6 +1037,9 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
 		}
 		break;
 
+	case I915_CONTEXT_PARAM_WATCHDOG:
+		ret = i915_gem_context_set_watchdog(ctx, args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 0ea5a37c3357..0afcbeb18329 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -2061,7 +2061,7 @@ static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
 		return GEN8_XCS_WATCHDOG_DISABLE;
 }
 
-#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
+#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)
 static void gen8_watchdog_irq_handler(unsigned long data)
 {
 	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
@@ -2108,7 +2108,8 @@ static void gen8_watchdog_irq_handler(unsigned long data)
 	} else {
 		engine->hangcheck.watchdog = current_seqno;
 		/* Re-start the counter, if really hung, it will expire again */
-		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
+		I915_WRITE_FW(RING_THRESH(engine->mmio_base),
+			      GEN8_WATCHDOG_1000US(dev_priv));
 		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
 	}
 
diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
index 0bc9e00e66ce..da7efaf66b0e 100644
--- a/include/uapi/drm/i915_drm.h
+++ b/include/uapi/drm/i915_drm.h
@@ -1490,6 +1490,7 @@ struct drm_i915_gem_context_param {
 #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
 #define   I915_CONTEXT_DEFAULT_PRIORITY		0
 #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
+#define I915_CONTEXT_PARAM_WATCHDOG	0x7
 	__u64 value;
 };
 
-- 
2.17.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* drm/i915: Watchdog timeout: Include threshold value in error state
  2019-01-05  2:39 Gen8+ engine-reset Carlos Santa
                   ` (3 preceding siblings ...)
  2019-01-05  2:39 ` drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Carlos Santa
@ 2019-01-05  2:39 ` Carlos Santa
  2019-01-05  4:19   ` kbuild test robot
  2019-01-05  4:39   ` kbuild test robot
  2019-01-05  2:39 ` drm/i915: Only process VCS2 only when supported Carlos Santa
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 48+ messages in thread
From: Carlos Santa @ 2019-01-05  2:39 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

Save the watchdog threshold (in us) as part of the engine state.

v2: Only do it for gen8+ (and prevent a missing-case warn).
v3: use ctx->__engine.
v4: Rebase.

Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 12 ++++++++----
 drivers/gpu/drm/i915/i915_gpu_error.h |  1 +
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 5533a741abeb..f97379606b1b 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -458,10 +458,12 @@ static void error_print_context(struct drm_i915_error_state_buf *m,
 				const char *header,
 				const struct drm_i915_error_context *ctx)
 {
-	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, ban score %d%s guilty %d active %d\n",
+	err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, ban score %d%s guilty %d active %d, watchdog %dus\n",
 		   header, ctx->comm, ctx->pid, ctx->handle, ctx->hw_id,
 		   ctx->sched_attr.priority, ctx->ban_score, bannable(ctx),
-		   ctx->guilty, ctx->active);
+		   ctx->guilty, ctx->active,
+		   INTEL_GEN(m->i915) >= 8 ?
+			watchdog_to_us(m->i915, ctx->watchdog_threshold) : 0);
 }
 
 static void error_print_engine(struct drm_i915_error_state_buf *m,
@@ -1451,7 +1453,8 @@ static void error_record_engine_execlists(struct intel_engine_cs *engine,
 }
 
 static void record_context(struct drm_i915_error_context *e,
-			   struct i915_gem_context *ctx)
+			   struct i915_gem_context *ctx,
+			   u32 engine_id)
 {
 	if (ctx->pid) {
 		struct task_struct *task;
@@ -1472,6 +1475,7 @@ static void record_context(struct drm_i915_error_context *e,
 	e->bannable = i915_gem_context_is_bannable(ctx);
 	e->guilty = atomic_read(&ctx->guilty_count);
 	e->active = atomic_read(&ctx->active_count);
+	e->watchdog_threshold =	ctx->__engine[engine_id].watchdog_threshold;
 }
 
 static void request_record_user_bo(struct i915_request *request,
@@ -1556,7 +1560,7 @@ static void gem_record_rings(struct i915_gpu_state *error)
 
 			ee->vm = ctx->ppgtt ? &ctx->ppgtt->vm : &ggtt->vm;
 
-			record_context(&ee->context, ctx);
+			record_context(&ee->context, ctx, engine->id);
 
 			/* We need to copy these to an anonymous buffer
 			 * as the simplest method to avoid being overwritten
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index 7130786aa5b4..affd12e17f39 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -129,6 +129,7 @@ struct i915_gpu_state {
 			int ban_score;
 			int active;
 			int guilty;
+			int watchdog_threshold;
 			bool bannable;
 			struct i915_sched_attr sched_attr;
 		} context;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* drm/i915: Only process VCS2 only when supported
  2019-01-05  2:39 Gen8+ engine-reset Carlos Santa
                   ` (4 preceding siblings ...)
  2019-01-05  2:39 ` drm/i915: Watchdog timeout: Include threshold value in error state Carlos Santa
@ 2019-01-05  2:39 ` Carlos Santa
  2019-01-07 12:40   ` Tvrtko Ursulin
  2019-01-05  2:40 ` drm/i915/watchdog: move emit_stop_watchdog until the very end of the ring commands Carlos Santa
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 48+ messages in thread
From: Carlos Santa @ 2019-01-05  2:39 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

Not checking for BSD2 causes a segfault on GPU revs
with no h/w support for the extra media engines.

Patch shared by Michel Thierry on IRC.

Segfault on ULX GT2 (0x591e) follows:

[  468.625970] BUG: unable to handle kernel NULL pointer dereference at 00000000000002c0
[  468.625978] IP: gen8_cs_irq_handler+0x8d/0xcf
[  468.625979] PGD 0 P4D 0
[  468.625983] Oops: 0002 [#1] PREEMPT SMP PTI
[  468.625987] Dumping ftrace buffer:
[  468.625990]    (ftrace buffer empty)
[  468.627877] gsmi: Log Shutdown Reason 0x03
[  468.627880] Modules linked in: cmac rfcomm uinput joydev snd_soc_hdac_hdmi snd_soc_dmic snd_soc_skl snd_soc_skl_ipc lzo lzo_compress snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi snd_hda_ext_core snd_hda_core zram snd_soc_max98927 ipt_MASQUERADE nf_nat_masquerade_ipv4 hid_multitouch xt_mark fuse cdc_ether usbnet btusb btrtl btbcm btintel uvcvideo bluetooth videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core ecdh_generic iio_trig_sysfs cros_ec_light_prox cros_ec_sensors_ring cros_ec_sensors cros_ec_activity cros_ec_sensors_core industrialio_triggered_buffer kfifo_buf industrialio iwlmvm iwl7000_mac80211 r8152 mii iwlwifi cfg80211
[  468.627917] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.64 #38
[  468.627919] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.107.0 11/07/2017
[  468.627922] task: ffff96bf35a13c00 task.stack: ffffa79880070000
[  468.627925] RIP: 0010:gen8_cs_irq_handler+0x8d/0xcf
[  468.627928] RSP: 0018:ffff96bf3ec83df8 EFLAGS: 00010002
[  468.627931] RAX: 0000000001000000 RBX: 0000000000000000 RCX: 0000000000000010
[  468.627933] RDX: 0000000000000010 RSI: 0000000000000040 RDI: 0000000000000000
[  468.627936] RBP: ffff96bf3ec83e20 R08: 0000000000000000 R09: 0000000000000000
[  468.627938] R10: ffffa79880073dc8 R11: ffffffffa2a6453d R12: ffff96bf3ec83e70
[  468.627940] R13: 0000000000000079 R14: 0000000000000000 R15: 0000000000000040
[  468.627943] FS:  0000000000000000(0000) GS:ffff96bf3ec80000(0000) knlGS:0000000000000000
[  468.627945] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  468.627948] CR2: 00000000000002c0 CR3: 000000016fe12002 CR4: 00000000003606e0
[  468.627950] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  468.627953] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  468.627955] Call Trace:
[  468.627959]  <IRQ>
[  468.627963]  gen8_gt_irq_handler+0x5e/0xed
[  468.627968]  gen8_irq_handler+0x9f/0x5ce
[  468.627973]  __handle_irq_event_percpu+0xb8/0x1da
[  468.627977]  handle_irq_event_percpu+0x32/0x77
[  468.627980]  handle_irq_event+0x36/0x55
[  468.627984]  handle_edge_irq+0x7d/0xcd
[  468.627988]  handle_irq+0xd9/0x11e
[  468.627992]  do_IRQ+0x4b/0xd0
[  468.627996]  common_interrupt+0x7a/0x7a
[  468.627999]  </IRQ>
[  468.628003] RIP: 0010:cpuidle_enter_state+0xff/0x177
[  468.628005] RSP: 0018:ffffa79880073e78 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff5e
[  468.628008] RAX: ffff96bf3eca09c0 RBX: 000000000002b9c3 RCX: 0000006d1c48c3b5
[  468.628010] RDX: 0000000000000037 RSI: 0000000000000001 RDI: 0000000000000000
[  468.628013] RBP: ffffa79880073ec0 R08: 0000000000000002 R09: 0000000000005000
[  468.628015] R10: 071c71c71c71c71c R11: ffffffffa2e42687 R12: 0000000000000000
[  468.628017] R13: 0000000000000002 R14: 0000000000000002 R15: ffff96bf3eca7d00
[  468.628020]  ? cpu_idle_poll+0x8e/0x8e
[  468.628025]  ? cpuidle_enter_state+0xe3/0x177
[  468.628028]  do_idle+0x10c/0x19d
[  468.628033]  cpu_startup_entry+0x6d/0x6f
[  468.628036]  start_secondary+0x189/0x1a4
[  468.628041]  secondary_startup_64+0xa5/0xb0
[  468.628044] Code: c3 84 db 74 20 f0 41 0f ba ae 98 02 00 00 00 0f 92 45 df 80 7d df 00 75 0c 49 8d be 90 02 00 00 e8 3a 7a c6 ff 41 f6 c7 40 74 23 <f0> 41 0f ba ae c0 02 00 00 00 0f 92 45 de 80 7d de 00 75 0f 49
[  468.628083] RIP: gen8_cs_irq_handler+0x8d/0xcf RSP: ffff96bf3ec83df8
[  468.628085] CR2: 00000000000002c0
[  468.628088] ---[ end trace a7a497ddeb44bcf8 ]---

Tested-by: Carlos Santa <carlos.santa@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 859bbadb752f..953ebe5c85ce 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1554,7 +1554,8 @@ static void gen8_gt_irq_handler(struct drm_i915_private *i915,
 	if (master_ctl & (GEN8_GT_VCS1_IRQ | GEN8_GT_VCS2_IRQ)) {
 		gen8_cs_irq_handler(i915->engine[VCS],
 				    gt_iir[1] >> GEN8_VCS1_IRQ_SHIFT);
-		gen8_cs_irq_handler(i915->engine[VCS2],
+		if (HAS_BSD2(i915))
+			gen8_cs_irq_handler(i915->engine[VCS2],
 				    gt_iir[1] >> GEN8_VCS2_IRQ_SHIFT);
 	}
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* drm/i915/watchdog: move emit_stop_watchdog until the very end of the ring commands
  2019-01-05  2:39 Gen8+ engine-reset Carlos Santa
                   ` (5 preceding siblings ...)
  2019-01-05  2:39 ` drm/i915: Only process VCS2 only when supported Carlos Santa
@ 2019-01-05  2:40 ` Carlos Santa
  2019-01-07 12:50   ` Tvrtko Ursulin
  2019-01-05  2:40 ` drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset? Carlos Santa
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 48+ messages in thread
From: Carlos Santa @ 2019-01-05  2:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

For command streams that could potentially hang the GPU after the last
flush command, it is best not to cancel the watchdog until all commands
have executed.

Patch shared by Michel Thierry over IRC after I reproduced the issue on
my local setup.

Tested-by: Carlos Santa <carlos.santa@intel.com>
CC: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 53 +++++++++++++++++++++++++++-----
 1 file changed, 45 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 0afcbeb18329..25ba5fcc9466 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1885,8 +1885,8 @@ static int gen8_emit_bb_start(struct i915_request *rq,
 		GEM_BUG_ON(!engine->emit_start_watchdog ||
 			   !engine->emit_stop_watchdog);
 
-		/* + start_watchdog (6) + stop_watchdog (4) */
-		num_dwords += 10;
+		/* + start_watchdog (6) */
+		num_dwords += 6;
 		watchdog_running = true;
         }
 
@@ -1927,10 +1927,7 @@ static int gen8_emit_bb_start(struct i915_request *rq,
 	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
 	*cs++ = MI_NOOP;
 
-	if (watchdog_running) {
-		/* Cancel watchdog timer */
-		cs = engine->emit_stop_watchdog(rq, cs);
-	}
+	/* XXX: emit_stop_watchdog happens in gen8_emit_breadcrumb_vcs */
 
 	intel_ring_advance(rq, cs);
 	return 0;
@@ -2189,6 +2186,37 @@ static void gen8_emit_breadcrumb(struct i915_request *request, u32 *cs)
 }
 static const int gen8_emit_breadcrumb_sz = 6 + WA_TAIL_DWORDS;
 
+static void gen8_emit_breadcrumb_vcs(struct i915_request *request, u32 *cs)
+{
+	/* w/a: bit 5 needs to be zero for MI_FLUSH_DW address. */
+	BUILD_BUG_ON(I915_GEM_HWS_INDEX_ADDR & (1 << 5));
+
+	cs = gen8_emit_ggtt_write(cs, request->global_seqno,
+				  intel_hws_seqno_address(request->engine));
+	*cs++ = MI_USER_INTERRUPT;
+	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
+
+	/* stop_watchdog at the very end of the ring commands */
+	if (request->gem_context->__engine[VCS].watchdog_threshold != 0) {
+		/* Cancel watchdog timer */
+		GEM_BUG_ON(!request->engine->emit_stop_watchdog);
+		cs = request->engine->emit_stop_watchdog(request, cs);
+	} else {
+		*cs++ = MI_NOOP;
+		*cs++ = MI_NOOP;
+		*cs++ = MI_NOOP;
+		*cs++ = MI_NOOP;
+	}
+
+	request->tail = intel_ring_offset(request, cs);
+	assert_ring_tail_valid(request->ring, request->tail);
+	gen8_emit_wa_tail(request, cs);
+}
+static const int gen8_emit_breadcrumb_vcs_sz = 6 + WA_TAIL_DWORDS + 4; /* +4 for optional stop_watchdog */
+
 static void gen8_emit_breadcrumb_rcs(struct i915_request *request, u32 *cs)
 {
 	/* We're using qword write, seqno should be aligned to 8 bytes. */
@@ -2306,8 +2334,17 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
 	engine->request_alloc = execlists_request_alloc;
 
 	engine->emit_flush = gen8_emit_flush;
-	engine->emit_breadcrumb = gen8_emit_breadcrumb;
-	engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_sz;
+
+	if (engine->id == VCS || engine->id == VCS2) {
+		engine->emit_breadcrumb = gen8_emit_breadcrumb_vcs;
+		engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_vcs_sz;
+	} else {
+		engine->emit_breadcrumb = gen8_emit_breadcrumb;
+		engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_sz;
+	}
 
 	engine->set_default_submission = intel_execlists_set_default_submission;
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset?
  2019-01-05  2:39 Gen8+ engine-reset Carlos Santa
                   ` (6 preceding siblings ...)
  2019-01-05  2:40 ` drm/i915/watchdog: move emit_stop_watchdog until the very end of the ring commands Carlos Santa
@ 2019-01-05  2:40 ` Carlos Santa
  2019-01-05  4:15   ` kbuild test robot
  2019-01-05 13:32   ` kbuild test robot
  2019-01-05  2:57 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 48+ messages in thread
From: Carlos Santa @ 2019-01-05  2:40 UTC (permalink / raw)
  To: intel-gfx; +Cc: Michel Thierry

From: Michel Thierry <michel.thierry@intel.com>

XXX: What should we do when the watchdog irq has fired twice but our
hangcheck logic thinks the engine is not hung? For example, what if the
active head has moved since the irq handler ran?

One option is to just ignore the watchdog, if the engine is really hung,
then the driver will detect the hang by itself later on (I'm inclined to
this).

But the other option is to blindly trust the HW, which is what this patch
does...

CC: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/intel_hangcheck.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index 2906f0ef3d77..1947baa20022 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -281,7 +281,8 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 		hangcheck_accumulate_sample(engine, &hc);
 		hangcheck_store_sample(engine, &hc);
 
-		if (engine->hangcheck.stalled) {
+		if (engine->hangcheck.stalled ||
+		    engine->hangcheck.watchdog == intel_engine_get_seqno(engine)) {
 			hung |= intel_engine_flag(engine);
 			if (hc.action != ENGINE_DEAD)
 				stuck |= intel_engine_flag(engine);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset?
  2019-01-05  2:39 Gen8+ engine-reset Carlos Santa
                   ` (7 preceding siblings ...)
  2019-01-05  2:40 ` drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset? Carlos Santa
@ 2019-01-05  2:57 ` Patchwork
  2019-01-05  3:21 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 48+ messages in thread
From: Patchwork @ 2019-01-05  2:57 UTC (permalink / raw)
  To: Carlos Santa; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset?
URL   : https://patchwork.freedesktop.org/series/54768/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
285c27ee5aed drm/i915: Add engine reset count in get-reset-stats ioctl
-:17: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#17: 
v2: s/engine_reset/reset_engine/, use union in uapi to not break compatibility.

total: 0 errors, 1 warnings, 0 checks, 38 lines checked


^ permalink raw reply	[flat|nested] 48+ messages in thread

* ✓ Fi.CI.BAT: success for drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset?
  2019-01-05  2:39 Gen8+ engine-reset Carlos Santa
                   ` (8 preceding siblings ...)
  2019-01-05  2:57 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
@ 2019-01-05  3:21 ` Patchwork
  2019-01-05  4:41 ` ✓ Fi.CI.IGT: " Patchwork
  2019-01-07 10:11 ` Gen8+ engine-reset Tvrtko Ursulin
  11 siblings, 0 replies; 48+ messages in thread
From: Patchwork @ 2019-01-05  3:21 UTC (permalink / raw)
  To: Carlos Santa; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset?
URL   : https://patchwork.freedesktop.org/series/54768/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_5363 -> Patchwork_11193
====================================================

Summary
-------

  **WARNING**

  Minor unknown changes coming with Patchwork_11193 need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_11193, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/54768/revisions/1/mbox/

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_11193:

### IGT changes ###

#### Warnings ####

  * igt@pm_rpm@basic-pci-d3-state:
    - fi-bsw-kefka:       PASS -> SKIP

  
Known issues
------------

  Here are the changes found in Patchwork_11193 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_exec_suspend@basic-s3:
    - fi-blb-e6850:       PASS -> INCOMPLETE [fdo#107718]

  * igt@pm_rpm@basic-rte:
    - fi-bsw-kefka:       PASS -> FAIL [fdo#108800]

  * igt@prime_vgem@basic-fence-flip:
    - fi-gdg-551:         PASS -> FAIL [fdo#103182]

  
#### Possible fixes ####

  * igt@kms_chamelium@hdmi-hpd-fast:
    - fi-kbl-7500u:       FAIL [fdo#108767] -> PASS

  * igt@kms_frontbuffer_tracking@basic:
    - fi-hsw-peppy:       DMESG-WARN [fdo#102614] -> PASS

  * igt@kms_pipe_crc_basic@nonblocking-crc-pipe-a-frame-sequence:
    - fi-byt-clapper:     FAIL [fdo#103191] / [fdo#107362] -> PASS +1

  
  [fdo#102614]: https://bugs.freedesktop.org/show_bug.cgi?id=102614
  [fdo#103182]: https://bugs.freedesktop.org/show_bug.cgi?id=103182
  [fdo#103191]: https://bugs.freedesktop.org/show_bug.cgi?id=103191
  [fdo#107362]: https://bugs.freedesktop.org/show_bug.cgi?id=107362
  [fdo#107718]: https://bugs.freedesktop.org/show_bug.cgi?id=107718
  [fdo#108767]: https://bugs.freedesktop.org/show_bug.cgi?id=108767
  [fdo#108800]: https://bugs.freedesktop.org/show_bug.cgi?id=108800


Participating hosts (47 -> 41)
------------------------------

  Additional (2): fi-byt-n2820 fi-apl-guc 
  Missing    (8): fi-kbl-soraka fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-bsw-cyan fi-ctg-p8600 fi-kbl-7560u fi-skl-6600u 


Build changes
-------------

    * Linux: CI_DRM_5363 -> Patchwork_11193

  CI_DRM_5363: f141806c68e4cbd56e3a5a582eb1a6f5b7edfc84 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4756: 75081c6bfb9998bd7cbf35a7ac0578c683fe55a8 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_11193: 285c27ee5aed2b08ced1f906f3363c12f7231a61 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

285c27ee5aed drm/i915: Add engine reset count in get-reset-stats ioctl

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_11193/

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset?
  2019-01-05  2:40 ` drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset? Carlos Santa
@ 2019-01-05  4:15   ` kbuild test robot
  2019-01-05 13:32   ` kbuild test robot
  1 sibling, 0 replies; 48+ messages in thread
From: kbuild test robot @ 2019-01-05  4:15 UTC (permalink / raw)
  To: Carlos Santa; +Cc: Michel Thierry, intel-gfx, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3672 bytes --]

Hi Michel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on drm-intel/for-linux-next]
[also build test ERROR on v4.20 next-20190103]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Carlos-Santa/drm-i915-Watchdog-timeout-Blindly-trust-watchdog-timeout-for-reset/20190105-111445
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-randconfig-x012-201900 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/gpu/drm/i915/intel_hangcheck.c: In function 'i915_hangcheck_elapsed':
>> drivers/gpu/drm/i915/intel_hangcheck.c:443:24: error: 'struct intel_engine_hangcheck' has no member named 'watchdog'
          engine->hangcheck.watchdog == intel_engine_get_seqno(engine)) {
                           ^

vim +443 drivers/gpu/drm/i915/intel_hangcheck.c

   400	
   401	/*
   402	 * This is called when the chip hasn't reported back with completed
   403	 * batchbuffers in a long time. We keep track per ring seqno progress and
   404	 * if there are no progress, hangcheck score for that ring is increased.
   405	 * Further, acthd is inspected to see if the ring is stuck. On stuck case
   406	 * we kick the ring. If we see no progress on three subsequent calls
   407	 * we assume chip is wedged and try to fix it by resetting the chip.
   408	 */
   409	static void i915_hangcheck_elapsed(struct work_struct *work)
   410	{
   411		struct drm_i915_private *dev_priv =
   412			container_of(work, typeof(*dev_priv),
   413				     gpu_error.hangcheck_work.work);
   414		struct intel_engine_cs *engine;
   415		enum intel_engine_id id;
   416		unsigned int hung = 0, stuck = 0, wedged = 0;
   417	
   418		if (!i915_modparams.enable_hangcheck)
   419			return;
   420	
   421		if (!READ_ONCE(dev_priv->gt.awake))
   422			return;
   423	
   424		if (i915_terminally_wedged(&dev_priv->gpu_error))
   425			return;
   426	
   427		/* As enabling the GPU requires fairly extensive mmio access,
   428		 * periodically arm the mmio checker to see if we are triggering
   429		 * any invalid access.
   430		 */
   431		intel_uncore_arm_unclaimed_mmio_detection(dev_priv);
   432	
   433		for_each_engine(engine, dev_priv, id) {
   434			struct intel_engine_hangcheck hc;
   435	
   436			semaphore_clear_deadlocks(dev_priv);
   437	
   438			hangcheck_load_sample(engine, &hc);
   439			hangcheck_accumulate_sample(engine, &hc);
   440			hangcheck_store_sample(engine, &hc);
   441	
   442			if (engine->hangcheck.stalled ||
 > 443			    engine->hangcheck.watchdog == intel_engine_get_seqno(engine)) {
   444				hung |= intel_engine_flag(engine);
   445				if (hc.action != ENGINE_DEAD)
   446					stuck |= intel_engine_flag(engine);
   447			}
   448	
   449			if (engine->hangcheck.wedged)
   450				wedged |= intel_engine_flag(engine);
   451		}
   452	
   453		if (wedged) {
   454			dev_err(dev_priv->drm.dev,
   455				"GPU recovery timed out,"
   456				" cancelling all in-flight rendering.\n");
   457			GEM_TRACE_DUMP();
   458			i915_gem_set_wedged(dev_priv);
   459		}
   460	
   461		if (hung)
   462			hangcheck_declare_hang(dev_priv, hung, stuck);
   463	
   464		/* Reset timer in case GPU hangs without another request being added */
   465		i915_queue_hangcheck(dev_priv);
   466	}
   467	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 36670 bytes --]

[-- Attachment #3: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: drm/i915: Watchdog timeout: Include threshold value in error state
  2019-01-05  2:39 ` drm/i915: Watchdog timeout: Include threshold value in error state Carlos Santa
@ 2019-01-05  4:19   ` kbuild test robot
  2019-01-05  4:39   ` kbuild test robot
  1 sibling, 0 replies; 48+ messages in thread
From: kbuild test robot @ 2019-01-05  4:19 UTC (permalink / raw)
  To: Carlos Santa; +Cc: Michel Thierry, intel-gfx, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2415 bytes --]

Hi Michel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on drm-intel/for-linux-next]
[also build test ERROR on v4.20 next-20190103]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Carlos-Santa/drm-i915-Watchdog-timeout-Include-threshold-value-in-error-state/20190105-112649
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-randconfig-x019-201900 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/gpu/drm/i915/i915_gpu_error.c: In function 'error_print_context':
>> drivers/gpu/drm/i915/i915_gpu_error.c:466:4: error: implicit declaration of function 'watchdog_to_us'; did you mean 'wq_watchdog_touch'? [-Werror=implicit-function-declaration]
       watchdog_to_us(m->i915, ctx->watchdog_threshold) : 0);
       ^
   drivers/gpu/drm/i915/i915_gpu_error.c:189:49: note: in definition of macro 'err_printf'
    #define err_printf(e, ...) i915_error_printf(e, __VA_ARGS__)
                                                    ^~~~~~~~~~~
   drivers/gpu/drm/i915/i915_gpu_error.c: In function 'record_context':
>> drivers/gpu/drm/i915/i915_gpu_error.c:1481:50: error: 'struct intel_context' has no member named 'watchdog_threshold'
     e->watchdog_threshold = ctx->__engine[engine_id].watchdog_threshold;
                                                     ^
   cc1: all warnings being treated as errors

vim +466 drivers/gpu/drm/i915/i915_gpu_error.c

   456	
   457	static void error_print_context(struct drm_i915_error_state_buf *m,
   458					const char *header,
   459					const struct drm_i915_error_context *ctx)
   460	{
   461		err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, ban score %d%s guilty %d active %d, watchdog %dus\n",
   462			   header, ctx->comm, ctx->pid, ctx->handle, ctx->hw_id,
   463			   ctx->sched_attr.priority, ctx->ban_score, bannable(ctx),
   464			   ctx->guilty, ctx->active,
   465			   INTEL_GEN(m->i915) >= 8 ?
 > 466				watchdog_to_us(m->i915, ctx->watchdog_threshold) : 0);
   467	}
   468	


[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34013 bytes --]

[-- Attachment #3: Type: text/plain, Size: 160 bytes --]



* Re: drm/i915: Watchdog timeout: Include threshold value in error state
  2019-01-05  2:39 ` drm/i915: Watchdog timeout: Include threshold value in error state Carlos Santa
  2019-01-05  4:19   ` kbuild test robot
@ 2019-01-05  4:39   ` kbuild test robot
  1 sibling, 0 replies; 48+ messages in thread
From: kbuild test robot @ 2019-01-05  4:39 UTC (permalink / raw)
  To: Carlos Santa; +Cc: Michel Thierry, intel-gfx, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2407 bytes --]

Hi Michel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on drm-intel/for-linux-next]
[also build test ERROR on v4.20 next-20190103]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Carlos-Santa/drm-i915-Watchdog-timeout-Include-threshold-value-in-error-state/20190105-112649
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-randconfig-s1-201900 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/gpu//drm/i915/i915_gpu_error.c: In function 'error_print_context':
>> drivers/gpu//drm/i915/i915_gpu_error.c:466:4: error: implicit declaration of function 'watchdog_to_us' [-Werror=implicit-function-declaration]
       watchdog_to_us(m->i915, ctx->watchdog_threshold) : 0);
       ^
   drivers/gpu//drm/i915/i915_gpu_error.c:189:49: note: in definition of macro 'err_printf'
    #define err_printf(e, ...) i915_error_printf(e, __VA_ARGS__)
                                                    ^~~~~~~~~~~
   drivers/gpu//drm/i915/i915_gpu_error.c: In function 'record_context':
   drivers/gpu//drm/i915/i915_gpu_error.c:1481:50: error: 'struct intel_context' has no member named 'watchdog_threshold'
     e->watchdog_threshold = ctx->__engine[engine_id].watchdog_threshold;
                                                     ^
   cc1: some warnings being treated as errors

vim +/watchdog_to_us +466 drivers/gpu//drm/i915/i915_gpu_error.c

   456	
   457	static void error_print_context(struct drm_i915_error_state_buf *m,
   458					const char *header,
   459					const struct drm_i915_error_context *ctx)
   460	{
   461		err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, ban score %d%s guilty %d active %d, watchdog %dus\n",
   462			   header, ctx->comm, ctx->pid, ctx->handle, ctx->hw_id,
   463			   ctx->sched_attr.priority, ctx->ban_score, bannable(ctx),
   464			   ctx->guilty, ctx->active,
   465			   INTEL_GEN(m->i915) >= 8 ?
 > 466				watchdog_to_us(m->i915, ctx->watchdog_threshold) : 0);
   467	}
   468	


[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33577 bytes --]

[-- Attachment #3: Type: text/plain, Size: 160 bytes --]



* ✓ Fi.CI.IGT: success for drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset?
  2019-01-05  2:39 Gen8+ engine-reset Carlos Santa
                   ` (9 preceding siblings ...)
  2019-01-05  3:21 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2019-01-05  4:41 ` Patchwork
  2019-01-07 10:11 ` Gen8+ engine-reset Tvrtko Ursulin
  11 siblings, 0 replies; 48+ messages in thread
From: Patchwork @ 2019-01-05  4:41 UTC (permalink / raw)
  To: Carlos Santa; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset?
URL   : https://patchwork.freedesktop.org/series/54768/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_5363_full -> Patchwork_11193_full
====================================================

Summary
-------

  **WARNING**

  Minor unknown changes coming with Patchwork_11193_full need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_11193_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_11193_full:

### IGT changes ###

#### Warnings ####

  * igt@kms_rotation_crc@sprite-rotation-90-pos-100-0:
    - shard-apl:          PASS -> SKIP +32

  
Known issues
------------

  Here are the changes found in Patchwork_11193_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_ppgtt@blt-vs-render-ctx0:
    - shard-skl:          NOTRUN -> TIMEOUT [fdo#108039]

  * igt@gem_softpin@noreloc-s3:
    - shard-iclb:         PASS -> INCOMPLETE [fdo#107713]

  * igt@i915_suspend@fence-restore-tiled2untiled:
    - shard-apl:          PASS -> DMESG-WARN [fdo#103558] +2

  * igt@i915_suspend@fence-restore-untiled:
    - shard-skl:          NOTRUN -> INCOMPLETE [fdo#104108] / [fdo#107773]

  * igt@kms_atomic_transition@plane-all-modeset-transition-internal-panels:
    - shard-iclb:         PASS -> DMESG-WARN [fdo#107724] +3

  * igt@kms_busy@extended-modeset-hang-newfb-render-a:
    - shard-snb:          NOTRUN -> DMESG-WARN [fdo#107956]

  * igt@kms_busy@extended-modeset-hang-newfb-render-c:
    - shard-kbl:          PASS -> DMESG-WARN [fdo#107956]

  * igt@kms_busy@extended-modeset-hang-newfb-with-reset-render-b:
    - shard-skl:          NOTRUN -> DMESG-WARN [fdo#107956] +1

  * igt@kms_busy@extended-pageflip-modeset-hang-oldfb-render-b:
    - shard-iclb:         NOTRUN -> DMESG-WARN [fdo#107956]

  * igt@kms_ccs@pipe-b-crc-sprite-planes-basic:
    - shard-iclb:         NOTRUN -> FAIL [fdo#107725]

  * igt@kms_cursor_crc@cursor-256x256-onscreen:
    - shard-glk:          PASS -> FAIL [fdo#103232]

  * igt@kms_cursor_crc@cursor-256x256-random:
    - shard-apl:          PASS -> FAIL [fdo#103232] +2

  * igt@kms_cursor_crc@cursor-256x256-sliding:
    - shard-skl:          PASS -> FAIL [fdo#103232]
    - shard-iclb:         NOTRUN -> FAIL [fdo#103232] +3

  * igt@kms_fbcon_fbt@psr:
    - shard-skl:          NOTRUN -> FAIL [fdo#107882]

  * igt@kms_flip@flip-vs-expired-vblank:
    - shard-glk:          PASS -> FAIL [fdo#102887] / [fdo#105363]

  * igt@kms_flip@plain-flip-ts-check-interruptible:
    - shard-skl:          PASS -> FAIL [fdo#100368]

  * igt@kms_frontbuffer_tracking@fbc-1p-primscrn-spr-indfb-move:
    - shard-iclb:         PASS -> FAIL [fdo#103167] +7

  * igt@kms_frontbuffer_tracking@fbc-1p-primscrn-spr-indfb-onoff:
    - shard-apl:          PASS -> FAIL [fdo#103167]

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-spr-indfb-draw-pwrite:
    - shard-glk:          PASS -> FAIL [fdo#103167]

  * igt@kms_frontbuffer_tracking@fbcpsr-1p-primscrn-spr-indfb-draw-blt:
    - shard-iclb:         NOTRUN -> FAIL [fdo#103167]

  * igt@kms_pipe_crc_basic@nonblocking-crc-pipe-c:
    - shard-skl:          PASS -> FAIL [fdo#107362]

  * igt@kms_plane@pixel-format-pipe-a-planes-source-clamping:
    - shard-skl:          NOTRUN -> DMESG-WARN [fdo#106885] +1

  * igt@kms_plane@plane-position-covered-pipe-b-planes:
    - shard-glk:          PASS -> FAIL [fdo#103166]

  * igt@kms_plane_alpha_blend@pipe-b-alpha-7efc:
    - shard-iclb:         PASS -> DMESG-WARN [fdo#107724] / [fdo#108336] +1

  * igt@kms_plane_alpha_blend@pipe-b-alpha-basic:
    - shard-skl:          NOTRUN -> FAIL [fdo#107815] / [fdo#108145]

  * igt@kms_plane_alpha_blend@pipe-c-alpha-transparant-fb:
    - shard-iclb:         PASS -> DMESG-FAIL [fdo#107724]

  * igt@kms_plane_alpha_blend@pipe-c-coverage-7efc:
    - shard-skl:          PASS -> FAIL [fdo#107815]

  * igt@kms_plane_multiple@atomic-pipe-c-tiling-x:
    - shard-iclb:         PASS -> FAIL [fdo#103166]

  * igt@kms_plane_multiple@atomic-pipe-c-tiling-y:
    - shard-apl:          PASS -> FAIL [fdo#103166] +1

  * igt@kms_plane_multiple@atomic-pipe-c-tiling-yf:
    - shard-iclb:         NOTRUN -> FAIL [fdo#103166]

  * igt@kms_rotation_crc@multiplane-rotation-cropping-top:
    - shard-glk:          PASS -> DMESG-WARN [fdo#105763] / [fdo#106538]

  * igt@kms_vblank@pipe-a-ts-continuation-suspend:
    - shard-skl:          PASS -> INCOMPLETE [fdo#104108] / [fdo#107773]

  * igt@kms_vblank@pipe-b-ts-continuation-suspend:
    - shard-kbl:          PASS -> DMESG-WARN [fdo#108566]

  * igt@pm_rpm@gem-idle:
    - shard-iclb:         PASS -> INCOMPLETE [fdo#107713] / [fdo#108840]

  * igt@pm_rpm@gem-mmap-gtt:
    - shard-iclb:         NOTRUN -> DMESG-WARN [fdo#108654]

  * igt@pm_rpm@modeset-stress-extra-wait:
    - shard-iclb:         PASS -> DMESG-WARN [fdo#108654]

  * igt@pm_rps@min-max-config-loaded:
    - shard-apl:          PASS -> FAIL [fdo#102250]

  * igt@pm_rps@waitboost:
    - shard-iclb:         NOTRUN -> FAIL [fdo#102250] / [fdo#108059]

  
#### Possible fixes ####

  * igt@kms_available_modes_crc@available_mode_test_crc:
    - shard-apl:          FAIL [fdo#106641] -> PASS

  * igt@kms_busy@extended-modeset-hang-newfb-with-reset-render-a:
    - shard-apl:          DMESG-WARN [fdo#107956] -> SKIP

  * igt@kms_color@pipe-c-ctm-max:
    - shard-apl:          FAIL [fdo#108147] -> PASS

  * igt@kms_cursor_crc@cursor-256x256-suspend:
    - shard-skl:          INCOMPLETE [fdo#104108] -> PASS
    - shard-apl:          FAIL [fdo#103191] / [fdo#103232] -> PASS

  * igt@kms_cursor_crc@cursor-256x85-random:
    - shard-apl:          FAIL [fdo#103232] -> PASS +4

  * igt@kms_cursor_legacy@flip-vs-cursor-legacy:
    - shard-apl:          DMESG-WARN [fdo#103558] / [fdo#105602] -> PASS +16

  * igt@kms_draw_crc@draw-method-xrgb2101010-mmap-wc-untiled:
    - shard-iclb:         WARN [fdo#108336] -> PASS +1

  * igt@kms_flip@flip-vs-expired-vblank:
    - shard-skl:          FAIL [fdo#105363] -> PASS

  * igt@kms_frontbuffer_tracking@fbc-1p-pri-indfb-multidraw:
    - shard-iclb:         DMESG-FAIL [fdo#107720] / [fdo#107724] -> PASS

  * igt@kms_frontbuffer_tracking@fbc-1p-primscrn-pri-shrfb-draw-mmap-gtt:
    - shard-skl:          FAIL [fdo#105682] -> PASS

  * igt@kms_frontbuffer_tracking@fbc-1p-primscrn-spr-indfb-move:
    - shard-apl:          FAIL [fdo#103167] -> PASS

  * igt@kms_frontbuffer_tracking@fbc-rgb565-draw-mmap-gtt:
    - shard-kbl:          DMESG-WARN [fdo#103313] / [fdo#103558] / [fdo#105602] -> PASS +1

  * igt@kms_frontbuffer_tracking@fbc-stridechange:
    - shard-iclb:         DMESG-FAIL [fdo#107724] -> PASS +7

  * igt@kms_frontbuffer_tracking@psr-1p-primscrn-pri-indfb-draw-mmap-cpu:
    - shard-iclb:         DMESG-WARN [fdo#107724] / [fdo#108336] -> PASS +13

  * igt@kms_frontbuffer_tracking@psr-1p-primscrn-spr-indfb-draw-pwrite:
    - shard-iclb:         FAIL [fdo#103167] -> PASS +1

  * igt@kms_plane@pixel-format-pipe-a-planes-source-clamping:
    - shard-apl:          FAIL [fdo#108948] -> PASS

  * igt@kms_plane@plane-panning-bottom-right-suspend-pipe-b-planes:
    - shard-iclb:         INCOMPLETE [fdo#107713] -> PASS

  * igt@kms_plane_alpha_blend@pipe-c-alpha-opaque-fb:
    - shard-apl:          FAIL [fdo#108145] -> SKIP

  * igt@kms_plane_multiple@atomic-pipe-a-tiling-y:
    - shard-glk:          FAIL [fdo#103166] -> PASS
    - shard-iclb:         FAIL [fdo#103166] -> PASS

  * igt@kms_plane_multiple@atomic-pipe-b-tiling-yf:
    - shard-apl:          FAIL [fdo#103166] -> SKIP

  * igt@kms_sequence@get-busy:
    - shard-iclb:         DMESG-WARN [fdo#107724] -> PASS +26

  * igt@kms_setmode@basic:
    - shard-apl:          FAIL [fdo#99912] -> PASS

  * igt@kms_vblank@pipe-a-ts-continuation-modeset-hang:
    - shard-kbl:          DMESG-WARN [fdo#103558] / [fdo#105602] -> PASS +14

  * igt@pm_rpm@cursor-dpms:
    - shard-iclb:         DMESG-WARN [fdo#108654] -> PASS

  * igt@pm_rpm@modeset-stress-extra-wait:
    - shard-skl:          INCOMPLETE [fdo#107807] -> PASS

  * igt@pm_rpm@system-suspend-modeset:
    - shard-skl:          INCOMPLETE [fdo#104108] / [fdo#107807] -> PASS

  
#### Warnings ####

  * igt@kms_cursor_crc@cursor-64x64-onscreen:
    - shard-iclb:         DMESG-WARN [fdo#107724] / [fdo#108336] -> FAIL [fdo#103232]

  * igt@kms_plane_alpha_blend@pipe-a-alpha-opaque-fb:
    - shard-apl:          DMESG-FAIL [fdo#103558] / [fdo#105602] / [fdo#108145] -> FAIL [fdo#108145]
    - shard-kbl:          DMESG-FAIL [fdo#103558] / [fdo#105602] / [fdo#108145] -> FAIL [fdo#108145]

  * igt@kms_plane_multiple@atomic-pipe-c-tiling-yf:
    - shard-apl:          DMESG-FAIL [fdo#103166] / [fdo#103558] / [fdo#105602] -> FAIL [fdo#103166]
    - shard-kbl:          DMESG-FAIL [fdo#103166] / [fdo#103558] / [fdo#105602] -> FAIL [fdo#103166]

  
  [fdo#100368]: https://bugs.freedesktop.org/show_bug.cgi?id=100368
  [fdo#102250]: https://bugs.freedesktop.org/show_bug.cgi?id=102250
  [fdo#102887]: https://bugs.freedesktop.org/show_bug.cgi?id=102887
  [fdo#103166]: https://bugs.freedesktop.org/show_bug.cgi?id=103166
  [fdo#103167]: https://bugs.freedesktop.org/show_bug.cgi?id=103167
  [fdo#103191]: https://bugs.freedesktop.org/show_bug.cgi?id=103191
  [fdo#103232]: https://bugs.freedesktop.org/show_bug.cgi?id=103232
  [fdo#103313]: https://bugs.freedesktop.org/show_bug.cgi?id=103313
  [fdo#103558]: https://bugs.freedesktop.org/show_bug.cgi?id=103558
  [fdo#104108]: https://bugs.freedesktop.org/show_bug.cgi?id=104108
  [fdo#105363]: https://bugs.freedesktop.org/show_bug.cgi?id=105363
  [fdo#105602]: https://bugs.freedesktop.org/show_bug.cgi?id=105602
  [fdo#105682]: https://bugs.freedesktop.org/show_bug.cgi?id=105682
  [fdo#105763]: https://bugs.freedesktop.org/show_bug.cgi?id=105763
  [fdo#106538]: https://bugs.freedesktop.org/show_bug.cgi?id=106538
  [fdo#106641]: https://bugs.freedesktop.org/show_bug.cgi?id=106641
  [fdo#106885]: https://bugs.freedesktop.org/show_bug.cgi?id=106885
  [fdo#107362]: https://bugs.freedesktop.org/show_bug.cgi?id=107362
  [fdo#107713]: https://bugs.freedesktop.org/show_bug.cgi?id=107713
  [fdo#107720]: https://bugs.freedesktop.org/show_bug.cgi?id=107720
  [fdo#107724]: https://bugs.freedesktop.org/show_bug.cgi?id=107724
  [fdo#107725]: https://bugs.freedesktop.org/show_bug.cgi?id=107725
  [fdo#107773]: https://bugs.freedesktop.org/show_bug.cgi?id=107773
  [fdo#107807]: https://bugs.freedesktop.org/show_bug.cgi?id=107807
  [fdo#107815]: https://bugs.freedesktop.org/show_bug.cgi?id=107815
  [fdo#107882]: https://bugs.freedesktop.org/show_bug.cgi?id=107882
  [fdo#107956]: https://bugs.freedesktop.org/show_bug.cgi?id=107956
  [fdo#108039]: https://bugs.freedesktop.org/show_bug.cgi?id=108039
  [fdo#108059]: https://bugs.freedesktop.org/show_bug.cgi?id=108059
  [fdo#108145]: https://bugs.freedesktop.org/show_bug.cgi?id=108145
  [fdo#108147]: https://bugs.freedesktop.org/show_bug.cgi?id=108147
  [fdo#108336]: https://bugs.freedesktop.org/show_bug.cgi?id=108336
  [fdo#108566]: https://bugs.freedesktop.org/show_bug.cgi?id=108566
  [fdo#108654]: https://bugs.freedesktop.org/show_bug.cgi?id=108654
  [fdo#108840]: https://bugs.freedesktop.org/show_bug.cgi?id=108840
  [fdo#108948]: https://bugs.freedesktop.org/show_bug.cgi?id=108948
  [fdo#99912]: https://bugs.freedesktop.org/show_bug.cgi?id=99912


Participating hosts (7 -> 7)
------------------------------

  No changes in participating hosts


Build changes
-------------

    * Linux: CI_DRM_5363 -> Patchwork_11193

  CI_DRM_5363: f141806c68e4cbd56e3a5a582eb1a6f5b7edfc84 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4756: 75081c6bfb9998bd7cbf35a7ac0578c683fe55a8 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_11193: 285c27ee5aed2b08ced1f906f3363c12f7231a61 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_11193/


* Re: drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset?
  2019-01-05  2:40 ` drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset? Carlos Santa
  2019-01-05  4:15   ` kbuild test robot
@ 2019-01-05 13:32   ` kbuild test robot
  1 sibling, 0 replies; 48+ messages in thread
From: kbuild test robot @ 2019-01-05 13:32 UTC (permalink / raw)
  To: Carlos Santa; +Cc: Michel Thierry, intel-gfx, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 5507 bytes --]

Hi Michel,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on v4.20 next-20190103]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Carlos-Santa/drm-i915-Watchdog-timeout-Blindly-trust-watchdog-timeout-for-reset/20190105-111445
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-randconfig-s4-01052002 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/asm-generic/bug.h:5:0,
                    from arch/x86/include/asm/bug.h:47,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/gfp.h:5,
                    from include/linux/slab.h:15,
                    from include/linux/io-mapping.h:22,
                    from drivers/gpu/drm/i915/i915_drv.h:36,
                    from drivers/gpu/drm/i915/intel_hangcheck.c:25:
   drivers/gpu/drm/i915/intel_hangcheck.c: In function 'i915_hangcheck_elapsed':
   drivers/gpu/drm/i915/intel_hangcheck.c:443:24: error: 'struct intel_engine_hangcheck' has no member named 'watchdog'
          engine->hangcheck.watchdog == intel_engine_get_seqno(engine)) {
                           ^
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> drivers/gpu/drm/i915/intel_hangcheck.c:442:3: note: in expansion of macro 'if'
      if (engine->hangcheck.stalled ||
      ^~
   drivers/gpu/drm/i915/intel_hangcheck.c:443:24: error: 'struct intel_engine_hangcheck' has no member named 'watchdog'
          engine->hangcheck.watchdog == intel_engine_get_seqno(engine)) {
                           ^
   include/linux/compiler.h:58:42: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                             ^~~~
>> drivers/gpu/drm/i915/intel_hangcheck.c:442:3: note: in expansion of macro 'if'
      if (engine->hangcheck.stalled ||
      ^~
   drivers/gpu/drm/i915/intel_hangcheck.c:443:24: error: 'struct intel_engine_hangcheck' has no member named 'watchdog'
          engine->hangcheck.watchdog == intel_engine_get_seqno(engine)) {
                           ^
   include/linux/compiler.h:69:16: note: in definition of macro '__trace_if'
      ______r = !!(cond);     \
                   ^~~~
>> drivers/gpu/drm/i915/intel_hangcheck.c:442:3: note: in expansion of macro 'if'
      if (engine->hangcheck.stalled ||
      ^~

vim +/if +442 drivers/gpu/drm/i915/intel_hangcheck.c

   400	
   401	/*
   402	 * This is called when the chip hasn't reported back with completed
   403	 * batchbuffers in a long time. We keep track per ring seqno progress and
   404	 * if there are no progress, hangcheck score for that ring is increased.
   405	 * Further, acthd is inspected to see if the ring is stuck. On stuck case
   406	 * we kick the ring. If we see no progress on three subsequent calls
   407	 * we assume chip is wedged and try to fix it by resetting the chip.
   408	 */
   409	static void i915_hangcheck_elapsed(struct work_struct *work)
   410	{
   411		struct drm_i915_private *dev_priv =
   412			container_of(work, typeof(*dev_priv),
   413				     gpu_error.hangcheck_work.work);
   414		struct intel_engine_cs *engine;
   415		enum intel_engine_id id;
   416		unsigned int hung = 0, stuck = 0, wedged = 0;
   417	
   418		if (!i915_modparams.enable_hangcheck)
   419			return;
   420	
   421		if (!READ_ONCE(dev_priv->gt.awake))
   422			return;
   423	
   424		if (i915_terminally_wedged(&dev_priv->gpu_error))
   425			return;
   426	
   427		/* As enabling the GPU requires fairly extensive mmio access,
   428		 * periodically arm the mmio checker to see if we are triggering
   429		 * any invalid access.
   430		 */
   431		intel_uncore_arm_unclaimed_mmio_detection(dev_priv);
   432	
   433		for_each_engine(engine, dev_priv, id) {
   434			struct intel_engine_hangcheck hc;
   435	
   436			semaphore_clear_deadlocks(dev_priv);
   437	
   438			hangcheck_load_sample(engine, &hc);
   439			hangcheck_accumulate_sample(engine, &hc);
   440			hangcheck_store_sample(engine, &hc);
   441	
 > 442			if (engine->hangcheck.stalled ||
   443			    engine->hangcheck.watchdog == intel_engine_get_seqno(engine)) {
   444				hung |= intel_engine_flag(engine);
   445				if (hc.action != ENGINE_DEAD)
   446					stuck |= intel_engine_flag(engine);
   447			}
   448	
   449			if (engine->hangcheck.wedged)
   450				wedged |= intel_engine_flag(engine);
   451		}
   452	
   453		if (wedged) {
   454			dev_err(dev_priv->drm.dev,
   455				"GPU recovery timed out,"
   456				" cancelling all in-flight rendering.\n");
   457			GEM_TRACE_DUMP();
   458			i915_gem_set_wedged(dev_priv);
   459		}
   460	
   461		if (hung)
   462			hangcheck_declare_hang(dev_priv, hung, stuck);
   463	
   464		/* Reset timer in case GPU hangs without another request being added */
   465		i915_queue_hangcheck(dev_priv);
   466	}
   467	


[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 29663 bytes --]

[-- Attachment #3: Type: text/plain, Size: 160 bytes --]



* Re: Gen8+ engine-reset
  2019-01-05  2:39 Gen8+ engine-reset Carlos Santa
                   ` (10 preceding siblings ...)
  2019-01-05  4:41 ` ✓ Fi.CI.IGT: " Patchwork
@ 2019-01-07 10:11 ` Tvrtko Ursulin
  11 siblings, 0 replies; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 10:11 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx


Hi,

We have had engine reset support in i915 for some time now, so let's call 
this work "engine watchdog" or something similar, just not engine reset.

By calling it engine reset we have confused sites like Phoronix and some 
of our users.

Regards,

Tvrtko

On 05/01/2019 02:39, Carlos Santa wrote:
> This is a rebase of the original patch series from Michel Thierry
> that can be found here:
> 
> https://patchwork.freedesktop.org/series/21868
> 
> Note that this series is limited to the GPU watchdog timeout
> for execlists submission; support for GuC-based
> submission is left for a later time.
> 
> The series was also successfully tested from userspace through
> the Intel i965 media driver that is readily available on several
> Linux-based OSes, including Ubuntu as well as Chromium OS. The
> changes to the i965 media userspace driver are currently under review at
> 
> https://github.com/intel/intel-vaapi-driver/pull/429/files
> 
> The testbed used for this series included a SKL-based NUC with
> 2 BSD rings as well as a KBL-based Chromebook with 1 BSD ring.
> 
> Carlos Santa (1):
>    drm/i915: Only process VCS2 only when supported
> 
> Michel Thierry (7):
>    drm/i915: Add engine reset count in get-reset-stats ioctl
>    drm/i915: Watchdog timeout: IRQ handler for gen8+
>    drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
>    drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
>    drm/i915: Watchdog timeout: Include threshold value in error state
>    drm/i915/watchdog: move emit_stop_watchdog until the very end of the
>      ring commands
>    drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset?
> 
>   drivers/gpu/drm/i915/i915_drv.h         |  56 +++++++
>   drivers/gpu/drm/i915/i915_gem_context.c | 103 +++++++++++-
>   drivers/gpu/drm/i915/i915_gem_context.h |   4 +
>   drivers/gpu/drm/i915/i915_gpu_error.c   |  12 +-
>   drivers/gpu/drm/i915/i915_gpu_error.h   |   5 +
>   drivers/gpu/drm/i915/i915_irq.c         |  17 +-
>   drivers/gpu/drm/i915/i915_reg.h         |   6 +
>   drivers/gpu/drm/i915/intel_hangcheck.c  |  20 ++-
>   drivers/gpu/drm/i915/intel_lrc.c        | 208 +++++++++++++++++++++++-
>   drivers/gpu/drm/i915/intel_ringbuffer.h |  10 ++
>   include/uapi/drm/i915_drm.h             |   7 +-
>   11 files changed, 428 insertions(+), 20 deletions(-)
> 


* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-05  2:39 ` drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
@ 2019-01-07 11:58   ` Tvrtko Ursulin
  2019-01-07 12:16     ` Chris Wilson
                       ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 11:58 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx; +Cc: Michel Thierry


Hi,

This series has not been recognized by Patchwork as such, nor are the 
patches numbered. Have you used git format-patch -<N> --cover-letter and 
git send-email to send it out?

Rest inline.

On 05/01/2019 02:39, Carlos Santa wrote:
> From: Michel Thierry <michel.thierry@intel.com>
> 
> *** General ***
> 
> Watchdog timeout (or "media engine reset") is a feature that allows
> userland applications to enable hang detection on individual batch buffers.
> The detection mechanism itself is mostly bound to the hardware and the only
> thing that the driver needs to do to support this form of hang detection
> is to implement the interrupt handling support as well as watchdog command
> emission before and after the batch buffer start instruction in the
> ring buffer.
> 
> The principle of the hang detection mechanism is as follows:
> 
> 1. Once the decision has been made to enable watchdog timeout for a
> particular batch buffer and the driver is in the process of emitting the
> batch buffer start instruction into the ring buffer it also emits a
> watchdog timer start instruction before and a watchdog timer cancellation
> instruction after the batch buffer start instruction in the ring buffer.
> 
> 2. Once the GPU execution reaches the watchdog timer start instruction
> the hardware watchdog counter is started by the hardware. The counter
> keeps counting until either reaching a previously configured threshold
> value or the timer cancellation instruction is executed.
> 
> 2a. If the counter reaches the threshold value the hardware fires a
> watchdog interrupt that is picked up by the watchdog interrupt handler.
> This means that a hang has been detected and the driver needs to deal with
> it the same way it would deal with an engine hang detected by the periodic
> hang checker. The only difference between the two is that we already blamed
> the active request (to ensure an engine reset).

What happens if the watchdog fires but the "guilty" request completes 
before the interrupt has been delivered, or acted upon? Would that mean 
an innocent request could be blamed for the timeout? Maybe the answer 
comes later in the patch/series.

> 
> 2b. If the batch buffer completes and the execution reaches the watchdog
> cancellation instruction before the watchdog counter reaches its
> threshold value the watchdog is cancelled and nothing more comes of it.
> No hang is detected.
> 
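[Editor's note: the arm/cancel bracketing described in steps 1-2b above can be
sketched as the following standalone C fragment. The opcodes and the helper name
are placeholders for illustration only, not the real MI_* encodings or any
function in the patch.]

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder opcodes, not real MI_* command encodings. */
enum { WDT_START = 0x100, BB_START = 0x200, WDT_CANCEL = 0x300 };

/*
 * Emit a batch buffer start bracketed by watchdog start/cancel, in the
 * order the cover text describes: arm the counter, start the batch,
 * cancel the counter once the batch completes.
 */
static uint32_t *emit_watched_batch(uint32_t *cs, uint32_t batch_addr)
{
	*cs++ = WDT_START;	/* step 1: arm the hardware watchdog counter */
	*cs++ = BB_START;	/* stand-in for MI_BATCH_BUFFER_START */
	*cs++ = batch_addr;
	*cs++ = WDT_CANCEL;	/* step 2b: cancel if the batch completed in time */
	return cs;
}
```

If the counter expires before WDT_CANCEL executes, the hardware raises the
watchdog interrupt instead (step 2a).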
> Note about future interaction with preemption: Preemption could happen
> in a command sequence prior to watchdog counter getting disabled,
> resulting in watchdog being triggered following preemption (e.g. when
> watchdog had been enabled in the low priority batch). The driver will
> need to explicitly disable the watchdog counter as part of the
> preemption sequence.

Does the series take care of preemption?

> 
> *** This patch introduces: ***
> 
> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
> based on hardware-driven hang detection, which then integrates directly
> with the hang recovery path. This is independent of having per-engine reset
> or just full gpu reset.
> 
> 2. Watchdog specific register information.
> 
> Currently the render engine and all available media engines support
> watchdog timeout (VECS is only supported in GEN9). The specifications allude
> to the BCS engine being supported but that is currently not supported by
> this commit.
> 
> Note that the value to stop the counter is different between render and
> non-render engines in GEN8; GEN9 onwards it's the same.
> 
> v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
> against false-positives.
> 
> v3: Don't use high priority tasklet, use engine_last_submit while
> checking for false-positives. From GEN9 onwards, the stop counter bit is
> the same for all engines.
> 
> v4: Remove unnecessary brackets, use current_seqno to mark the request
> as guilty in the hangcheck/capture code.
> 
> v5: Rebased after RESET_ENGINEs flag.
> 
> v6: Don't capture error state in case of watchdog timeout. The capture
> process is time consuming and this will align to what happens when we
> use GuC to handle the watchdog timeout. (Chris)
> 
> v7: Rebase.
> 
> v8: Rebase, use HZ to reschedule.
> 
> v9: Rebase, get forcewake domains in function (no longer in execlists
> struct).
> 
> v10: Rebase.
> 
> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gpu_error.h   |  4 ++
>   drivers/gpu/drm/i915/i915_irq.c         | 14 +++-
>   drivers/gpu/drm/i915/i915_reg.h         |  6 ++
>   drivers/gpu/drm/i915/intel_hangcheck.c  | 17 +++--
>   drivers/gpu/drm/i915/intel_lrc.c        | 86 +++++++++++++++++++++++++
>   drivers/gpu/drm/i915/intel_ringbuffer.h |  6 ++
>   6 files changed, 126 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
> index 6d9f45468ac1..7130786aa5b4 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> @@ -256,6 +256,9 @@ struct i915_gpu_error {
>   	 * inspect the bit and do the reset directly, otherwise the worker
>   	 * waits for the struct_mutex.
>   	 *
> +	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
> +	 * I915_RESET_WATCHDOG to report the hang detection cause accurately.
> +	 *
>   	 * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't need to
>   	 * acquire the struct_mutex to reset an engine, we need an explicit
>   	 * flag to prevent two concurrent reset attempts in the same engine.
> @@ -271,6 +274,7 @@ struct i915_gpu_error {
>   #define I915_RESET_BACKOFF	0
>   #define I915_RESET_HANDOFF	1
>   #define I915_RESET_MODESET	2
> +#define I915_RESET_WATCHDOG	3
>   #define I915_WEDGED		(BITS_PER_LONG - 1)
>   #define I915_RESET_ENGINE	(I915_WEDGED - I915_NUM_ENGINES)
>   
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index fbb094ecf6c9..859bbadb752f 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1498,6 +1498,9 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
>   
>   	if (tasklet)
>   		tasklet_hi_schedule(&engine->execlists.tasklet);
> +
> +	if (iir & (GT_GEN8_WATCHDOG_INTERRUPT))

Braces are not needed.

> +		tasklet_schedule(&engine->execlists.watchdog_tasklet);
>   }
>   
>   static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> @@ -3329,7 +3332,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
>   	if (intel_has_reset_engine(dev_priv) &&
>   	    !i915_terminally_wedged(&dev_priv->gpu_error)) {
>   		for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
> -			BUILD_BUG_ON(I915_RESET_MODESET >= I915_RESET_ENGINE);
> +			BUILD_BUG_ON(I915_RESET_WATCHDOG >= I915_RESET_ENGINE);
>   			if (test_and_set_bit(I915_RESET_ENGINE + engine->id,
>   					     &dev_priv->gpu_error.flags))
>   				continue;
> @@ -4162,12 +4165,15 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
>   	uint32_t gt_interrupts[] = {
>   		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
>   			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> +			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
>   			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
>   			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT,
>   		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
>   			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> +			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
>   			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
> -			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
> +			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT |
> +			GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT,
>   		0,
>   		GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
>   			GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT
> @@ -4176,6 +4182,10 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
>   	if (HAS_L3_DPF(dev_priv))
>   		gt_interrupts[0] |= GT_RENDER_L3_PARITY_ERROR_INTERRUPT;
>   
> +	/* VECS watchdog is only available in skl+ */
> +	if (INTEL_GEN(dev_priv) >= 9)
> +		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;

Is the shift missing here?

> +
>   	dev_priv->pm_ier = 0x0;
>   	dev_priv->pm_imr = ~dev_priv->pm_ier;
>   	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 44958d994bfa..fff330643090 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -2335,6 +2335,11 @@ enum i915_power_well_id {
>   #define RING_START(base)	_MMIO((base) + 0x38)
>   #define RING_CTL(base)		_MMIO((base) + 0x3c)
>   #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
> +#define RING_CNTR(base)		_MMIO((base) + 0x178)
> +#define   GEN8_WATCHDOG_ENABLE		0
> +#define   GEN8_WATCHDOG_DISABLE		1
> +#define   GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 & non-render only */
> +#define RING_THRESH(base)	_MMIO((base) + 0x17C)
>   #define RING_SYNC_0(base)	_MMIO((base) + 0x40)
>   #define RING_SYNC_1(base)	_MMIO((base) + 0x44)
>   #define RING_SYNC_2(base)	_MMIO((base) + 0x48)
> @@ -2894,6 +2899,7 @@ enum i915_power_well_id {
>   #define GT_BSD_USER_INTERRUPT			(1 << 12)
>   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
>   #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
> +#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
>   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
>   #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
>   #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index 51e9efec5116..2906f0ef3d77 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -213,7 +213,8 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
>   
>   static void hangcheck_declare_hang(struct drm_i915_private *i915,
>   				   unsigned int hung,
> -				   unsigned int stuck)
> +				   unsigned int stuck,
> +				   unsigned int watchdog)
>   {
>   	struct intel_engine_cs *engine;
>   	char msg[80];
> @@ -226,13 +227,16 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
>   	if (stuck != hung)
>   		hung &= ~stuck;
>   	len = scnprintf(msg, sizeof(msg),
> -			"%s on ", stuck == hung ? "no progress" : "hang");
> +			"%s on ", watchdog ? "watchdog timeout" :
> +				  stuck == hung ? "no progress" : "hang");
>   	for_each_engine_masked(engine, i915, hung, tmp)
>   		len += scnprintf(msg + len, sizeof(msg) - len,
>   				 "%s, ", engine->name);
>   	msg[len-2] = '\0';
>   
> -	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
> +	return i915_handle_error(i915, hung,
> +				 watchdog ? 0 : I915_ERROR_CAPTURE,
> +				 "%s", msg);
>   }
>   
>   /*
> @@ -250,7 +254,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>   			     gpu_error.hangcheck_work.work);
>   	struct intel_engine_cs *engine;
>   	enum intel_engine_id id;
> -	unsigned int hung = 0, stuck = 0, wedged = 0;
> +	unsigned int hung = 0, stuck = 0, wedged = 0, watchdog = 0;
>   
>   	if (!i915_modparams.enable_hangcheck)
>   		return;
> @@ -261,6 +265,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>   	if (i915_terminally_wedged(&dev_priv->gpu_error))
>   		return;
>   
> +	if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags))
> +		watchdog = 1;
> +
>   	/* As enabling the GPU requires fairly extensive mmio access,
>   	 * periodically arm the mmio checker to see if we are triggering
>   	 * any invalid access.
> @@ -293,7 +300,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>   	}
>   
>   	if (hung)
> -		hangcheck_declare_hang(dev_priv, hung, stuck);
> +		hangcheck_declare_hang(dev_priv, hung, stuck, watchdog);
>   
>   	/* Reset timer in case GPU hangs without another request being added */
>   	i915_queue_hangcheck(dev_priv);
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 6c98fb7cebf2..e1dcdf545bee 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -2027,6 +2027,70 @@ static int gen8_emit_flush_render(struct i915_request *request,
>   	return 0;
>   }
>   
> +/* From GEN9 onwards, all engines use the same RING_CNTR format */
> +static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
> +{
> +	if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9)
> +		return GEN8_WATCHDOG_DISABLE;
> +	else
> +		return GEN8_XCS_WATCHDOG_DISABLE;
> +}
> +
> +#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function

Please do then. :)
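
[Editor's note: a minimal sketch of the helper being asked for, converting a
timeout in microseconds to watchdog counts. The 12 counts/us constant is the
approximation a later patch in this series uses for the ~12.5 MHz BDW/SKL
command-stream timestamp; the helper name is hypothetical.]

```c
#include <assert.h>
#include <stdint.h>

/* ~12.5 MHz CS timestamp on BDW/SKL, approximated as 12 counts per us. */
#define GEN8_TIMESTAMP_CNTS_PER_USEC 12

/* Convert a watchdog timeout in microseconds to hardware counter units. */
static inline uint32_t watchdog_us_to_counts(uint32_t timeout_us)
{
	return timeout_us * GEN8_TIMESTAMP_CNTS_PER_USEC;
}
```

With this helper, `watchdog_us_to_counts(1000)` yields 12000 (0x2ee0), the
magic GEN8_WATCHDOG_1000US value above.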

> +static void gen8_watchdog_irq_handler(unsigned long data)
> +{
> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> +	struct drm_i915_private *dev_priv = engine->i915;
> +	enum forcewake_domains fw_domains;
> +	u32 current_seqno;
> +
> +	switch (engine->id) {
> +	default:
> +		MISSING_CASE(engine->id);
> +		/* fall through */
> +	case RCS:
> +		fw_domains = FORCEWAKE_RENDER;
> +		break;
> +	case VCS:
> +	case VCS2:
> +	case VECS:
> +		fw_domains = FORCEWAKE_MEDIA;
> +		break;
> +	}
> +
> +	intel_uncore_forcewake_get(dev_priv, fw_domains);

I'd be tempted to drop this and just use I915_WRITE. It doesn't feel 
like there is any performance to be gained with it and it embeds too 
much knowledge here.

Alternatively, if you want to keep it, consider using 
intel_uncore_forcewake_for_reg to leave the fw domain knowledge out of 
here. See for instance how it is used in intel_engine_cs.c.

> +
> +	/* Stop the counter to prevent further timeout interrupts */
> +	I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine));

What if we disable the watchdog for a batch following the falsely 
accused one? I mean this:

1. Batch 1 runs
2. IRQ fires -> tasklet_schedule
3. Batch 2 runs (can be different context)
4. Tasklet runs
5. Watchdog gets disabled
6. Batch 2 hangs - but watchdog has been disabled

?

> +
> +	current_seqno = intel_engine_get_seqno(engine);
> +
> +	/* did the request complete after the timer expired? */
> +	if (intel_engine_last_submit(engine) == current_seqno)
> +		goto fw_put;
> +
> +	if (engine->hangcheck.watchdog == current_seqno) {
> +		/* Make sure the active request will be marked as guilty */
> +		engine->hangcheck.stalled = true;
> +		engine->hangcheck.acthd = intel_engine_get_active_head(engine);
> +		engine->hangcheck.seqno = current_seqno;
> +
> +		/* And try to run the hangcheck_work as soon as possible */
> +		set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
> +		queue_delayed_work(system_long_wq,
> +				   &dev_priv->gpu_error.hangcheck_work,
> +				   round_jiffies_up_relative(HZ));
> +	} else {
> +		engine->hangcheck.watchdog = current_seqno;

The logic above potentially handles my previous question? Could be if 
batch 2 hangs. But..

> +		/* Re-start the counter, if really hung, it will expire again */
> +		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
> +		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);

.. the timeout will be wrong ie. not respected from what the userspace 
set. So I don't think it will work. This code either needs to handle 
running with watchdog enabled, or here it somehow needs to fish out the 
correct timeout to set.

> +	}
> +
> +fw_put:
> +	intel_uncore_forcewake_put(dev_priv, fw_domains);
> +}
> +
>   /*
>    * Reserve space for 2 NOOPs at the end of each request to be
>    * used as a workaround for not being allowed to do lite
> @@ -2115,6 +2179,9 @@ void intel_logical_ring_cleanup(struct intel_engine_cs *engine)
>   			     &engine->execlists.tasklet.state)))
>   		tasklet_kill(&engine->execlists.tasklet);
>   
> +	if (WARN_ON(test_bit(TASKLET_STATE_SCHED, &engine->execlists.watchdog_tasklet.state)))
> +		tasklet_kill(&engine->execlists.watchdog_tasklet);
> +

I don't see any code ensuring this WARN can't fire if the tasklet gets 
delayed. A tasklet_kill in intel_engines_park might be enough.

>   	dev_priv = engine->i915;
>   
>   	if (engine->buffer) {
> @@ -2208,6 +2275,22 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
>   
>   	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
>   	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
> +
> +	switch (engine->id) {
> +	default:
> +		/* BCS engine does not support hw watchdog */
> +		break;
> +	case RCS:
> +	case VCS:
> +	case VCS2:

Change all of these to class-based checks please, or maintenance gets hard. 
Even more so: as it stands, ICL is broken.

> +		engine->irq_keep_mask |= (GT_GEN8_WATCHDOG_INTERRUPT << shift);

Braces not needed here and below.

> +		break;
> +	case VECS:
> +		if (INTEL_GEN(engine->i915) >= 9)
> +			engine->irq_keep_mask |=
> +				(GT_GEN8_WATCHDOG_INTERRUPT << shift);
> +		break;
> +	}
>   }
>   
>   static void
> @@ -2221,6 +2304,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
>   	tasklet_init(&engine->execlists.tasklet,
>   		     execlists_submission_tasklet, (unsigned long)engine);
>   
> +	tasklet_init(&engine->execlists.watchdog_tasklet,
> +		     gen8_watchdog_irq_handler, (unsigned long)engine);
> +
>   	logical_ring_default_vfuncs(engine);
>   	logical_ring_default_irqs(engine);
>   }
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 3c1366c58cf3..6cb8b4280035 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -120,6 +120,7 @@ struct intel_instdone {
>   struct intel_engine_hangcheck {
>   	u64 acthd;
>   	u32 seqno;
> +	u32 watchdog;
>   	enum intel_engine_hangcheck_action action;
>   	unsigned long action_timestamp;
>   	int deadlock;
> @@ -224,6 +225,11 @@ struct intel_engine_execlists {
>   	 */
>   	struct tasklet_struct tasklet;
>   
> +	/**
> +	 * @watchdog_tasklet: stop counter and re-schedule hangcheck_work asap
> +	 */
> +	struct tasklet_struct watchdog_tasklet;
> +
>   	/**
>   	 * @default_priolist: priority list for I915_PRIORITY_NORMAL
>   	 */
> 

Regards,

Tvrtko

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-07 11:58   ` Tvrtko Ursulin
@ 2019-01-07 12:16     ` Chris Wilson
  2019-01-07 12:58       ` Tvrtko Ursulin
  2019-01-07 13:43     ` Tvrtko Ursulin
  2019-01-24  0:13     ` Carlos Santa
  2 siblings, 1 reply; 48+ messages in thread
From: Chris Wilson @ 2019-01-07 12:16 UTC (permalink / raw)
  To: Carlos Santa, Tvrtko Ursulin, intel-gfx; +Cc: Michel Thierry

Quoting Tvrtko Ursulin (2019-01-07 11:58:13)
> 
> Hi,
> 
> This series has not been recognized by Patchwork as such, nor are the 
> patches numbered. Have you used git format-patch -<N> --cover-letter and 
> git send-email to send it out?
> 
> Rest inline.
> 
> On 05/01/2019 02:39, Carlos Santa wrote:
> > +static void gen8_watchdog_irq_handler(unsigned long data)
> > +{
> > +     struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> > +     struct drm_i915_private *dev_priv = engine->i915;
> > +     enum forcewake_domains fw_domains;
> > +     u32 current_seqno;
> > +
> > +     switch (engine->id) {
> > +     default:
> > +             MISSING_CASE(engine->id);
> > +             /* fall through */
> > +     case RCS:
> > +             fw_domains = FORCEWAKE_RENDER;
> > +             break;
> > +     case VCS:
> > +     case VCS2:
> > +     case VECS:
> > +             fw_domains = FORCEWAKE_MEDIA;
> > +             break;
> > +     }
> > +
> > +     intel_uncore_forcewake_get(dev_priv, fw_domains);
> 
> I'd be tempted to drop this and just use I915_WRITE. It doesn't feel 
> like there is any performance to be gained with it and it embeds too 
> much knowledge here.

No, no, no. Let's not reintroduce a forcewake grab inside irq context on a
frequent timer again.

Rule of thumb for fw_get:
gen6+: 10us to 50ms.
gen8+: 10us to 500us.

And then we don't release fw for 1ms after the fw_put. So we basically
prevent GT powersaving while the watchdog is active. That strikes me as
hopefully an unintended consequence.

The fw_get will be required if we actually hang, but for the timer
check, we should be able to do without.

And while on the topic of the timer irq, it should be forcibly cleared
along intel_engine_park, so that we ensure it is not raised while the
device/driver is supposed to be asleep. Or something to that effect.

> > +     current_seqno = intel_engine_get_seqno(engine);
> > +
> > +     /* did the request complete after the timer expired? */
> > +     if (intel_engine_last_submit(engine) == current_seqno)
> > +             goto fw_put;
> > +
> > +     if (engine->hangcheck.watchdog == current_seqno) {
> > +             /* Make sure the active request will be marked as guilty */
> > +             engine->hangcheck.stalled = true;
> > +             engine->hangcheck.acthd = intel_engine_get_active_head(engine);
> > +             engine->hangcheck.seqno = current_seqno;
> > +
> > +             /* And try to run the hangcheck_work as soon as possible */
> > +             set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
> > +             queue_delayed_work(system_long_wq,
> > +                                &dev_priv->gpu_error.hangcheck_work,
> > +                                round_jiffies_up_relative(HZ));
> > +     } else {
> > +             engine->hangcheck.watchdog = current_seqno;
> 
> The logic above potentially handles my previous question? Could be if 
> batch 2 hangs. But..

Also, DO NOT USE HANGCHECK for this. The whole design was to be able to
do the engine reset right away. (Now guc can't but that's known broken.)

Aside, we have to rewrite this entire logic anyway as the engine seqno
and global_seqno are obsolete.
-Chris

* Re: drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  2019-01-05  2:39 ` drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
@ 2019-01-07 12:21   ` Tvrtko Ursulin
  0 siblings, 0 replies; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 12:21 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 05/01/2019 02:39, Carlos Santa wrote:
> From: Michel Thierry <michel.thierry@intel.com>
> 
> Emit the required commands into the ring buffer for starting and
> stopping the watchdog timer before/after batch buffer start during
> batch buffer submission.
> 
> v2: Support watchdog threshold per context engine, merge lri commands,
> and move watchdog commands emission to emit_bb_start. Request space of
> combined start_watchdog, bb_start and stop_watchdog to avoid any error
> after emitting bb_start.
> 
> v3: There were too many req->engine in emit_bb_start.
> Use GEM_BUG_ON instead of returning a very late EINVAL in the remote
> case of watchdog misprogramming; set correct LRI cmd size in
> emit_stop_watchdog. (Chris)
> 
> v4: Rebase.
> v5: use to_intel_context instead of ctx->engine.
> v6: Rebase.
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gem_context.h |  4 ++
>   drivers/gpu/drm/i915/intel_lrc.c        | 80 ++++++++++++++++++++++++-
>   drivers/gpu/drm/i915/intel_ringbuffer.h |  4 ++
>   3 files changed, 86 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h
> index f6d870b1f73e..62f4eb04985b 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.h
> +++ b/drivers/gpu/drm/i915/i915_gem_context.h
> @@ -169,6 +169,10 @@ struct i915_gem_context {
>   		u32 *lrc_reg_state;
>   		u64 lrc_desc;
>   		int pin_count;
> +		/** watchdog_threshold: hw watchdog threshold value,
> +		 * in clock counts
> +		 */
> +		u32 watchdog_threshold;
>   
>   		const struct intel_context_ops *ops;
>   	} __engine[I915_NUM_ENGINES];
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index e1dcdf545bee..0ea5a37c3357 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1872,12 +1872,33 @@ static int gen8_emit_bb_start(struct i915_request *rq,
>   			      u64 offset, u32 len,
>   			      const unsigned int flags)
>   {
> +	struct intel_engine_cs *engine = rq->engine;
>   	u32 *cs;
> +	u32 num_dwords;
> +	bool watchdog_running = false;

Hm, it's not really running at this point, just about to be started, but ok. 
I'd just find something like 'run_watchdog' or 'enable_watchdog' clearer.

>   
> -	cs = intel_ring_begin(rq, 6);
> +	/* bb_start only */
> +	num_dwords = 6;
> +
> +	/* check if watchdog will be required */
> +	if (to_intel_context(rq->gem_context, engine)->watchdog_threshold != 0) {
> +		GEM_BUG_ON(!engine->emit_start_watchdog ||
> +			   !engine->emit_stop_watchdog);
> +
> +		/* + start_watchdog (6) + stop_watchdog (4) */
> +		num_dwords += 10;
> +		watchdog_running = true;
> +        }
> +
> +	cs = intel_ring_begin(rq, num_dwords);
>   	if (IS_ERR(cs))
>   		return PTR_ERR(cs);
>   
> +	if (watchdog_running) {
> +		/* Start watchdog timer */
> +		cs = engine->emit_start_watchdog(rq, cs);
> +	}
> +
>   	/*
>   	 * WaDisableCtxRestoreArbitration:bdw,chv
>   	 *
> @@ -1906,8 +1927,12 @@ static int gen8_emit_bb_start(struct i915_request *rq,
>   	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
>   	*cs++ = MI_NOOP;

It could be neater to avoid too many MI_NOOPs. I think if you made the emit 
start/stop helpers not emit one, and moved this line above to be last, a 
single MI_NOOP could be emitted conditionally if the total number of dwords 
emitted (based on the cs pointer) is odd.

>   
> -	intel_ring_advance(rq, cs);
> +	if (watchdog_running) {
> +		/* Cancel watchdog timer */
> +		cs = engine->emit_stop_watchdog(rq, cs);
> +	}
>   
> +	intel_ring_advance(rq, cs);
>   	return 0;
>   }
>   
> @@ -2091,6 +2116,49 @@ static void gen8_watchdog_irq_handler(unsigned long data)
>   	intel_uncore_forcewake_put(dev_priv, fw_domains);
>   }
>   
> +static u32 *gen8_emit_start_watchdog(struct i915_request *rq, u32 *cs)
> +{
> +	struct intel_engine_cs *engine = rq->engine;
> +	struct i915_gem_context *ctx = rq->gem_context;
> +	struct intel_context *ce = to_intel_context(ctx, engine);
> +
> +	/* XXX: no watchdog support in BCS engine */
> +	GEM_BUG_ON(engine->id == BCS);
> +
> +	/*
> +	 * watchdog register must never be programmed to zero. This would
> +	 * cause the watchdog counter to exceed and not allow the engine to

What does "exceed" mean here?

> +	 * go into IDLE state
> +	 */
> +	GEM_BUG_ON(ce->watchdog_threshold == 0);
> +
> +	/* Set counter period */
> +	*cs++ = MI_LOAD_REGISTER_IMM(2);
> +	*cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base));
> +	*cs++ = ce->watchdog_threshold;
> +	/* Start counter */
> +	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
> +	*cs++ = GEN8_WATCHDOG_ENABLE;
> +	*cs++ = MI_NOOP;
> +
> +	return cs;
> +}
> +
> +static u32 *gen8_emit_stop_watchdog(struct i915_request *rq, u32 *cs)
> +{
> +	struct intel_engine_cs *engine = rq->engine;
> +
> +	/* XXX: no watchdog support in BCS engine */
> +	GEM_BUG_ON(engine->id == BCS);

I think there will be enough of these checks (including the previous 
patch) to warrant engine->flags |= I915_ENGINE_SUPPORTS_WATCHDOG.
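
[Editor's note: a sketch of the suggested capability flag. The flag name
follows the existing I915_ENGINE_* convention, but the bit position and the
pared-down struct here are purely illustrative assumptions.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Hypothetical capability bit; the real bit position would be assigned
 * alongside the other I915_ENGINE_* flags. */
#define I915_ENGINE_SUPPORTS_WATCHDOG BIT(3)

struct intel_engine_cs {
	uint32_t flags;		/* reduced to the field this sketch needs */
};

/* Replaces scattered engine->id == BCS / GEN checks with one predicate. */
static inline bool
intel_engine_supports_watchdog(const struct intel_engine_cs *engine)
{
	return engine->flags & I915_ENGINE_SUPPORTS_WATCHDOG;
}
```

Engine setup would then set the flag once per engine (skipping BCS, and VECS
on gen8), and emit/irq paths would test the predicate instead.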

> +
> +	*cs++ = MI_LOAD_REGISTER_IMM(1);
> +	*cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base));
> +	*cs++ = get_watchdog_disable(engine);

Since this helper now potentially runs on every bb end perhaps it is 
worth storing the magic value in the engine instead of running the 
conditionals every time.

> +	*cs++ = MI_NOOP;
> +
> +	return cs;
> +}
> +
>   /*
>    * Reserve space for 2 NOOPs at the end of each request to be
>    * used as a workaround for not being allowed to do lite
> @@ -2366,6 +2434,8 @@ int logical_render_ring_init(struct intel_engine_cs *engine)
>   	engine->emit_flush = gen8_emit_flush_render;
>   	engine->emit_breadcrumb = gen8_emit_breadcrumb_rcs;
>   	engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_rcs_sz;
> +	engine->emit_start_watchdog = gen8_emit_start_watchdog;
> +	engine->emit_stop_watchdog = gen8_emit_stop_watchdog;

I don't see that these two need to be vfuncs - am I missing something?

>   
>   	ret = logical_ring_init(engine);
>   	if (ret)
> @@ -2392,6 +2462,12 @@ int logical_xcs_ring_init(struct intel_engine_cs *engine)
>   {
>   	logical_ring_setup(engine);
>   
> +	/* BCS engine does not have a watchdog-expired irq */
> +	if (engine->id != BCS) {
> +		engine->emit_start_watchdog = gen8_emit_start_watchdog;
> +		engine->emit_stop_watchdog = gen8_emit_stop_watchdog;
> +	}
> +
>   	return logical_ring_init(engine);
>   }
>   
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 6cb8b4280035..c7aa842a2db1 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -469,6 +469,10 @@ struct intel_engine_cs {
>   	int		(*init_context)(struct i915_request *rq);
>   
>   	int		(*emit_flush)(struct i915_request *request, u32 mode);
> +	u32 *		(*emit_start_watchdog)(struct i915_request *rq,
> +					       u32 *cs);
> +	u32 *		(*emit_stop_watchdog)(struct i915_request *rq,
> +					      u32 *cs);
>   #define EMIT_INVALIDATE	BIT(0)
>   #define EMIT_FLUSH	BIT(1)
>   #define EMIT_BARRIER	(EMIT_INVALIDATE | EMIT_FLUSH)
> 

Regards,

Tvrtko

* Re: drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2019-01-05  2:39 ` drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Carlos Santa
@ 2019-01-07 12:38   ` Tvrtko Ursulin
  2019-01-07 12:50     ` Chris Wilson
  2019-01-07 17:00     ` Tvrtko Ursulin
  0 siblings, 2 replies; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 12:38 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 05/01/2019 02:39, Carlos Santa wrote:
> From: Michel Thierry <michel.thierry@intel.com>
> 
> Final enablement patch for GPU hang detection using watchdog timeout.
> Using the gem_context_setparam ioctl, users can specify the desired
> timeout value in microseconds, and the driver will do the conversion to
> 'timestamps'.
> 
> The recommended default watchdog threshold for video engines is 60000 us,
> since this has been _empirically determined_ to be a good compromise for
> low-latency requirements and low rate of false positives. The default
> register value is ~106000us and the theoretical max value (all 1s) is
> 353 seconds.
> 
> Note, UABI engine ids and i915 engine ids are different, and this patch
> uses the i915 ones. Some kind of mapping table [1] is required if we
> decide to use the UABI engine ids.

This paragraph is no longer true.

> 
> [1] http://patchwork.freedesktop.org/patch/msgid/20170329135831.30254-2-chris@chris-wilson.co.uk
> 
> v2: Fixed get api to return values in microseconds. Threshold updated to
> be per context engine. Check for u32 overflow. Capture ctx threshold
> value in error state.
> 
> v3: Add a way to get array size, short-cut to disable all thresholds,
> return EFAULT / EINVAL as needed. Move the capture of the threshold
> value in the error state into a new patch. BXT has a different
> timestamp base (because why not?).
> 
> v4: Checking if watchdog is available should be the first thing to
> do, instead of giving false hopes to abi users; remove unnecessary & in
> set_watchdog; ignore args->size in getparam.
> 
> v5: GEN9-LP platforms have a different crystal clock frequency, use the
> right timestamp base for them (magic 8-ball predicts this will change
> again later on, so future-proof it). (Daniele)
> 
> v6: Rebase, no more mutex BLK in getparam_ioctl.
> 
> v7: use to_intel_context instead of ctx->engine.
> 
> v8: Rebase, remove extra mutex from i915_gem_context_set_watchdog (Tvrtko),
> Update UAPI to use engine class while keeping thresholds per
> engine class (Michel).
> 
> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.h         | 56 +++++++++++++++
>   drivers/gpu/drm/i915/i915_gem_context.c | 91 +++++++++++++++++++++++++
>   drivers/gpu/drm/i915/intel_lrc.c        |  5 +-
>   include/uapi/drm/i915_drm.h             |  1 +
>   4 files changed, 151 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 7fa2a405c5fe..96d59c22e2ba 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1552,6 +1552,9 @@ struct drm_i915_private {
>   	struct drm_i915_fence_reg fence_regs[I915_MAX_NUM_FENCES]; /* assume 965 */
>   	int num_fence_regs; /* 8 on pre-965, 16 otherwise */
>   
> +	/* Command stream timestamp base - helps define watchdog threshold */
> +	u32 cs_timestamp_base;
> +
>   	unsigned int fsb_freq, mem_freq, is_ddr3;
>   	unsigned int skl_preferred_vco_freq;
>   	unsigned int max_cdclk_freq;
> @@ -3117,6 +3120,59 @@ i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id)
>   	return ctx;
>   }
>   
> +/*
> + * BDW, CHV & SKL+ Timestamp timer resolution = 0.080 uSec,
> + * or 12500000 counts per second, or ~12 counts per microsecond.
> + *
> + * But BXT/GLK Timestamp timer resolution is different, 0.052 uSec,
> + * or 19200000 counts per second, or ~19 counts per microsecond.
> + *
> + * Future-proofing, some day it won't be as simple as just GEN & IS_LP.
> + */
> +#define GEN8_TIMESTAMP_CNTS_PER_USEC 12
> +#define GEN9_LP_TIMESTAMP_CNTS_PER_USEC 19
> +static inline u32 cs_timestamp_in_us(struct drm_i915_private *dev_priv)
> +{
> +	u32 cs_timestamp_base = dev_priv->cs_timestamp_base;
> +
> +	if (cs_timestamp_base)
> +		return cs_timestamp_base;
> +
> +	switch (INTEL_GEN(dev_priv)) {
> +	default:
> +		MISSING_CASE(INTEL_GEN(dev_priv));
> +		/* fall through */
> +	case 9:
> +		cs_timestamp_base = IS_GEN9_LP(dev_priv) ?
> +					GEN9_LP_TIMESTAMP_CNTS_PER_USEC :
> +					GEN8_TIMESTAMP_CNTS_PER_USEC;
> +		break;
> +	case 8:
> +		cs_timestamp_base = GEN8_TIMESTAMP_CNTS_PER_USEC;
> +		break;
> +	}
> +
> +	dev_priv->cs_timestamp_base = cs_timestamp_base;
> +	return cs_timestamp_base;
> +}

We already have RUNTIME_INFO(i915)->cs_timestamp_frequency_khz and 
read_timestamp_frequency which sets it.

> +
> +static inline u32
> +watchdog_to_us(struct drm_i915_private *dev_priv, u32 value_in_clock_counts)
> +{
> +	return value_in_clock_counts / cs_timestamp_in_us(dev_priv);
> +}
> +
> +static inline u32
> +watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us)
> +{
> +	u64 threshold = value_in_us * cs_timestamp_in_us(dev_priv);
> +
> +	if (overflows_type(threshold, u32))
> +		return -EINVAL;
> +
> +	return threshold;
> +}
> +
>   int i915_perf_open_ioctl(struct drm_device *dev, void *data,
>   			 struct drm_file *file);
>   int i915_perf_add_config_ioctl(struct drm_device *dev, void *data,
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index 40fefed8c92f..4e421b79ac07 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -840,6 +840,91 @@ int i915_gem_context_destroy_ioctl(struct drm_device *dev, void *data,
>   	return 0;
>   }
>   
> +/* Return the timer count threshold in microseconds. */
> +int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
> +				  struct drm_i915_gem_context_param *args)
> +{
> +	struct drm_i915_private *dev_priv = ctx->i915;
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	u32 threshold_in_us[OTHER_CLASS];
> +
> +	if (!dev_priv->engine[VCS]->emit_start_watchdog)
> +		return -ENODEV;

This could use the engine->flags, but the question is which engine. The 
answer will depend on the uapi so we will come to it later.

> +
> +	for_each_engine(engine, dev_priv, id) {
> +		struct intel_context *ce = to_intel_context(ctx, engine);
> +
> +		threshold_in_us[engine->class] = watchdog_to_us(dev_priv,
> +								ce->watchdog_threshold);

Would there be some lossiness in the read-back due to the us ->
internal -> us conversion?

> +	}
> +
> +	if (__copy_to_user(u64_to_user_ptr(args->value),
> +			   &threshold_in_us,
> +			   sizeof(threshold_in_us))) {

I think the user pointer hasn't been verified for write access yet, so
the standard copy_to_user should be used.

> +		return -EFAULT;
> +	}
> +
> +	args->size = sizeof(threshold_in_us);
> +
> +	return 0;
> +}
> +
> +/*
> + * Based on time out value in microseconds (us) calculate
> + * timer count thresholds needed based on core frequency.
> + * Watchdog can be disabled by setting it to 0.
> + */
> +int i915_gem_context_set_watchdog(struct i915_gem_context *ctx,
> +				  struct drm_i915_gem_context_param *args)
> +{
> +	struct drm_i915_private *dev_priv = ctx->i915;
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	u32 i;

Incorrect type for class iterator.

> +	u32 threshold[OTHER_CLASS];
> +
> +	if (!dev_priv->engine[VCS]->emit_start_watchdog)
> +		return -ENODEV;
> +
> +	memset(threshold, 0, sizeof(threshold));
> +
> +	/* shortcut to disable in all engines */
> +	if (args->size == 0)
> +		goto set_watchdog;
> +
> +	if (args->size < sizeof(threshold))
> +		return -EFAULT;

This would make the uapi not backward compatible (old userspace on new
kernels). But the uapi will probably need other changes, as mentioned already.

> +
> +	if (copy_from_user(threshold,
> +			   u64_to_user_ptr(args->value),
> +			   sizeof(threshold))) {
> +		mutex_lock(&dev_priv->drm.struct_mutex);

Ooops!

> +		return -EFAULT;
> +	}
> +
> +	/* not supported in blitter engine */
> +	if (threshold[COPY_ENGINE_CLASS] != 0)
> +		return -EINVAL;
> +
> +	for (i = RENDER_CLASS; i < OTHER_CLASS; i++) {
> +		threshold[i] = watchdog_to_clock_counts(dev_priv, threshold[i]);
> +
> +		if (threshold[i] == -EINVAL)
> +			return -EINVAL;
> +	}
> +
> +set_watchdog:
> +	for_each_engine(engine, dev_priv, id) {
> +		struct intel_context *ce = to_intel_context(ctx, engine);
> +
> +		ce->watchdog_threshold = threshold[engine->class];
> +	}
> +
> +	return 0;
> +}
> +
> +
>   int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   				    struct drm_file *file)
>   {
> @@ -877,6 +962,9 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data,
>   	case I915_CONTEXT_PARAM_PRIORITY:
>   		args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT;
>   		break;
> +	case I915_CONTEXT_PARAM_WATCHDOG:
> +		ret = i915_gem_context_get_watchdog(ctx, args);
> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> @@ -949,6 +1037,9 @@ int i915_gem_context_setparam_ioctl(struct drm_device *dev, void *data,
>   		}
>   		break;
>   
> +	case I915_CONTEXT_PARAM_WATCHDOG:
> +		ret = i915_gem_context_set_watchdog(ctx, args);
> +		break;
>   	default:
>   		ret = -EINVAL;
>   		break;
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 0ea5a37c3357..0afcbeb18329 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -2061,7 +2061,7 @@ static inline u32 get_watchdog_disable(struct intel_engine_cs *engine)
>   		return GEN8_XCS_WATCHDOG_DISABLE;
>   }
>   
> -#define GEN8_WATCHDOG_1000US 0x2ee0 //XXX: Temp, replace with helper function
> +#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000)

Right, hm, I think you should introduce a helper in that earlier patch 
to avoid the ugly comment.

It doesn't have to do anything yet. However, as said in that review, I am
not sure this approach of restarting the timer is safe to begin with.

>   static void gen8_watchdog_irq_handler(unsigned long data)
>   {
>   	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> @@ -2108,7 +2108,8 @@ static void gen8_watchdog_irq_handler(unsigned long data)
>   	} else {
>   		engine->hangcheck.watchdog = current_seqno;
>   		/* Re-start the counter, if really hung, it will expire again */
> -		I915_WRITE_FW(RING_THRESH(engine->mmio_base), GEN8_WATCHDOG_1000US);
> +		I915_WRITE_FW(RING_THRESH(engine->mmio_base),
> +			      GEN8_WATCHDOG_1000US(dev_priv));
>   		I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE);
>   	}
>   
> diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h
> index 0bc9e00e66ce..da7efaf66b0e 100644
> --- a/include/uapi/drm/i915_drm.h
> +++ b/include/uapi/drm/i915_drm.h
> @@ -1490,6 +1490,7 @@ struct drm_i915_gem_context_param {
>   #define   I915_CONTEXT_MAX_USER_PRIORITY	1023 /* inclusive */
>   #define   I915_CONTEXT_DEFAULT_PRIORITY		0
>   #define   I915_CONTEXT_MIN_USER_PRIORITY	-1023 /* inclusive */
> +#define I915_CONTEXT_PARAM_WATCHDOG	0x7
>   	__u64 value;
>   };
>   
> 

So the UAPI is not really defined at all as far as userspace can see
(looking at i915_drm.h).

We will have to come up with something better, also considering the
Virtual Engine work, and work out the detailed semantics etc.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: drm/i915: Only process VCS2 only when supported
  2019-01-05  2:39 ` drm/i915: Only process VCS2 only when supported Carlos Santa
@ 2019-01-07 12:40   ` Tvrtko Ursulin
  2019-01-24  0:20     ` Carlos Santa
  0 siblings, 1 reply; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 12:40 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 05/01/2019 02:39, Carlos Santa wrote:
> Not checking for BSD2 causes a segfault on GPU revs
> with no h/w support for the extra media engines.
> 
> Segfault on ULX GT2 (0x591e) follows:
> 
> Patch shared by Michel Thierry on IRC.
> 
> [  468.625970] BUG: unable to handle kernel NULL pointer dereference at 00000000000002c0
> [  468.625978] IP: gen8_cs_irq_handler+0x8d/0xcf
> [  468.625979] PGD 0 P4D 0
> [  468.625983] Oops: 0002 [#1] PREEMPT SMP PTI
> [  468.625987] Dumping ftrace buffer:
> [  468.625990]    (ftrace buffer empty)
> [  468.627877] gsmi: Log Shutdown Reason 0x03
> [  468.627880] Modules linked in: cmac rfcomm uinput joydev snd_soc_hdac_hdmi snd_soc_dmic snd_soc_skl snd_soc_skl_ipc lzo lzo_compress snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi snd_hda_ext_core snd_hda_core zram snd_soc_max98927 ipt_MASQUERADE nf_nat_masquerade_ipv4 hid_multitouch xt_mark fuse cdc_ether usbnet btusb btrtl btbcm btintel uvcvideo bluetooth videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core ecdh_generic iio_trig_sysfs cros_ec_light_prox cros_ec_sensors_ring cros_ec_sensors cros_ec_activity cros_ec_sensors_core industrialio_triggered_buffer kfifo_buf industrialio iwlmvm iwl7000_mac80211 r8152 mii iwlwifi cfg80211
> [  468.627917] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.64 #38
> [  468.627919] Hardware name: Google Eve/Eve, BIOS Google_Eve.9584.107.0 11/07/2017
> [  468.627922] task: ffff96bf35a13c00 task.stack: ffffa79880070000
> [  468.627925] RIP: 0010:gen8_cs_irq_handler+0x8d/0xcf
> [  468.627928] RSP: 0018:ffff96bf3ec83df8 EFLAGS: 00010002
> [  468.627931] RAX: 0000000001000000 RBX: 0000000000000000 RCX: 0000000000000010
> [  468.627933] RDX: 0000000000000010 RSI: 0000000000000040 RDI: 0000000000000000
> [  468.627936] RBP: ffff96bf3ec83e20 R08: 0000000000000000 R09: 0000000000000000
> [  468.627938] R10: ffffa79880073dc8 R11: ffffffffa2a6453d R12: ffff96bf3ec83e70
> [  468.627940] R13: 0000000000000079 R14: 0000000000000000 R15: 0000000000000040
> [  468.627943] FS:  0000000000000000(0000) GS:ffff96bf3ec80000(0000) knlGS:0000000000000000
> [  468.627945] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  468.627948] CR2: 00000000000002c0 CR3: 000000016fe12002 CR4: 00000000003606e0
> [  468.627950] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  468.627953] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  468.627955] Call Trace:
> [  468.627959]  <IRQ>
> [  468.627963]  gen8_gt_irq_handler+0x5e/0xed
> [  468.627968]  gen8_irq_handler+0x9f/0x5ce
> [  468.627973]  __handle_irq_event_percpu+0xb8/0x1da
> [  468.627977]  handle_irq_event_percpu+0x32/0x77
> [  468.627980]  handle_irq_event+0x36/0x55
> [  468.627984]  handle_edge_irq+0x7d/0xcd
> [  468.627988]  handle_irq+0xd9/0x11e
> [  468.627992]  do_IRQ+0x4b/0xd0
> [  468.627996]  common_interrupt+0x7a/0x7a
> [  468.627999]  </IRQ>
> [  468.628003] RIP: 0010:cpuidle_enter_state+0xff/0x177
> [  468.628005] RSP: 0018:ffffa79880073e78 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff5e
> [  468.628008] RAX: ffff96bf3eca09c0 RBX: 000000000002b9c3 RCX: 0000006d1c48c3b5
> [  468.628010] RDX: 0000000000000037 RSI: 0000000000000001 RDI: 0000000000000000
> [  468.628013] RBP: ffffa79880073ec0 R08: 0000000000000002 R09: 0000000000005000
> [  468.628015] R10: 071c71c71c71c71c R11: ffffffffa2e42687 R12: 0000000000000000
> [  468.628017] R13: 0000000000000002 R14: 0000000000000002 R15: ffff96bf3eca7d00
> [  468.628020]  ? cpu_idle_poll+0x8e/0x8e
> [  468.628025]  ? cpuidle_enter_state+0xe3/0x177
> [  468.628028]  do_idle+0x10c/0x19d
> [  468.628033]  cpu_startup_entry+0x6d/0x6f
> [  468.628036]  start_secondary+0x189/0x1a4
> [  468.628041]  secondary_startup_64+0xa5/0xb0
> [  468.628044] Code: c3 84 db 74 20 f0 41 0f ba ae 98 02 00 00 00 0f 92 45 df 80 7d df 00 75 0c 49 8d be 90 02 00 00 e8 3a 7a c6 ff 41 f6 c7 40 74 23 <f0> 41 0f ba ae c0 02 00 00 00 0f 92 45 de 80 7d de 00 75 0f 49
> [  468.628083] RIP: gen8_cs_irq_handler+0x8d/0xcf RSP: ffff96bf3ec83df8
> [  468.628085] CR2: 00000000000002c0
> [  468.628088] ---[ end trace a7a497ddeb44bcf8 ]---
> 
> Tested-by: Carlos Santa <carlos.santa@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_irq.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 859bbadb752f..953ebe5c85ce 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1554,7 +1554,8 @@ static void gen8_gt_irq_handler(struct drm_i915_private *i915,
>   	if (master_ctl & (GEN8_GT_VCS1_IRQ | GEN8_GT_VCS2_IRQ)) {
>   		gen8_cs_irq_handler(i915->engine[VCS],
>   				    gt_iir[1] >> GEN8_VCS1_IRQ_SHIFT);
> -		gen8_cs_irq_handler(i915->engine[VCS2],
> +		if(HAS_BSD2(i915))
> +			gen8_cs_irq_handler(i915->engine[VCS2],
>   				    gt_iir[1] >> GEN8_VCS2_IRQ_SHIFT);
>   	}
>   
> 

Why would we be getting interrupts from engines which are not there?

Is this just a consequence of the missing shift when programming 
interrupts in an earlier patch?

Regards,

Tvrtko


* Re: drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2019-01-07 12:38   ` Tvrtko Ursulin
@ 2019-01-07 12:50     ` Chris Wilson
  2019-01-07 13:39       ` Tvrtko Ursulin
  2019-01-07 17:00     ` Tvrtko Ursulin
  1 sibling, 1 reply; 48+ messages in thread
From: Chris Wilson @ 2019-01-07 12:50 UTC (permalink / raw)
  To: Carlos Santa, Tvrtko Ursulin, intel-gfx; +Cc: Michel Thierry

Quoting Tvrtko Ursulin (2019-01-07 12:38:47)
> On 05/01/2019 02:39, Carlos Santa wrote:
> > +/* Return the timer count threshold in microseconds. */
> > +int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
> > +                               struct drm_i915_gem_context_param *args)
> > +     if (__copy_to_user(u64_to_user_ptr(args->value),
> > +                        &threshold_in_us,
> > +                        sizeof(threshold_in_us))) {
> 
> I think user pointer hasn't been verified for write access yet so 
> standard copy_to_user should be used.

The starting point here is that this bakes in OTHER_CLASS as uABI when
clearly it is not uABI. The array must be variable size and so the first
pass is to report the size to userspace (if they pass args->size = 0)
and then to copy up to the user requested size. Since it is a variable
array with uncertain indexing, the output should be along the lines of
[class, instance, watchdog]. But what about virtual engine? Good
question. (Remember set must follow the same pattern as get, so you can
always do a rmw cleanly.)

I guess we just have to accept class == -1 as meaning use instance as an
index into ctx->engines[]. Tvrtko? Virtual engine would be reported as
(-1, 0, watchdog), and settable similarly with the other engines being
addressable either by index or by class-instance.
-Chris

* Re: drm/i915/watchdog: move emit_stop_watchdog until the very end of the ring commands
  2019-01-05  2:40 ` drm/i915/watchdog: move emit_stop_watchdog until the very end of the ring commands Carlos Santa
@ 2019-01-07 12:50   ` Tvrtko Ursulin
  2019-01-07 12:54     ` Chris Wilson
  2019-01-11  2:25     ` Carlos Santa
  0 siblings, 2 replies; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 12:50 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 05/01/2019 02:40, Carlos Santa wrote:
> From: Michel Thierry <michel.thierry@intel.com>
> 
> On command streams that could potentially hang the GPU after a last
> flush command, it's best not to cancel the watchdog
> until after all commands have executed.
> 
> Patch shared by Michel Thierry through IIRC after reproduction on

Joonas pointed out on IRC that IRC is called IRC. :)

> my local setup.
> 
> Tested-by: Carlos Santa <carlos.santa@intel.com>
> CC: Antonio Argenziano <antonio.argenziano@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> ---
>   drivers/gpu/drm/i915/intel_lrc.c | 53 +++++++++++++++++++++++++++-----
>   1 file changed, 45 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 0afcbeb18329..25ba5fcc9466 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1885,8 +1885,8 @@ static int gen8_emit_bb_start(struct i915_request *rq,
>   		GEM_BUG_ON(!engine->emit_start_watchdog ||
>   			   !engine->emit_stop_watchdog);
>   
> -		/* + start_watchdog (6) + stop_watchdog (4) */
> -		num_dwords += 10;
> +		/* + start_watchdog (6) */
> +		num_dwords += 6;
>   		watchdog_running = true;
>           }
>   
> @@ -1927,10 +1927,7 @@ static int gen8_emit_bb_start(struct i915_request *rq,
>   	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
>   	*cs++ = MI_NOOP;
>   
> -	if (watchdog_running) {
> -		/* Cancel watchdog timer */
> -		cs = engine->emit_stop_watchdog(rq, cs);
> -	}
> +	// XXX: emit_stop_watchdog happens in gen8_emit_breadcrumb_vcs

No C++ comments please. And what does XXX mean? Doesn't feel like it 
belongs.

>   
>   	intel_ring_advance(rq, cs);
>   	return 0;
> @@ -2189,6 +2186,37 @@ static void gen8_emit_breadcrumb(struct i915_request *request, u32 *cs)
>   }
>   static const int gen8_emit_breadcrumb_sz = 6 + WA_TAIL_DWORDS;
>   
> +static void gen8_emit_breadcrumb_vcs(struct i915_request *request, u32 *cs)
> +{
> +	/* w/a: bit 5 needs to be zero for MI_FLUSH_DW address. */
> +	BUILD_BUG_ON(I915_GEM_HWS_INDEX_ADDR & (1 << 5));
> +
> +	cs = gen8_emit_ggtt_write(cs, request->global_seqno,
> +				  intel_hws_seqno_address(request->engine));
> +	*cs++ = MI_USER_INTERRUPT;
> +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> +
> +	// stop_watchdog at the very end of the ring commands
> +	if (request->gem_context->__engine[VCS].watchdog_threshold != 0)

VCS is wrong. Whole check needs to be to_intel_context(ctx, 
engine)->watchdog_threshold I think.

> +	{
> +		/* Cancel watchdog timer */
> +		GEM_BUG_ON(!request->engine->emit_stop_watchdog);
> +		cs = request->engine->emit_stop_watchdog(request, cs);
> +	}
> +	else
> +	{

Coding style is wrong (curly braces for if else).

> +		*cs++ = MI_NOOP;
> +		*cs++ = MI_NOOP;
> +		*cs++ = MI_NOOP;
> +		*cs++ = MI_NOOP;
> +	}
> +
> +	request->tail = intel_ring_offset(request, cs);
> +	assert_ring_tail_valid(request->ring, request->tail);
> +	gen8_emit_wa_tail(request, cs);
> +}
> +static const int gen8_emit_breadcrumb_vcs_sz = 6 + WA_TAIL_DWORDS + 4; //+4 for optional stop_watchdog
> +
>   static void gen8_emit_breadcrumb_rcs(struct i915_request *request, u32 *cs)
>   {
>   	/* We're using qword write, seqno should be aligned to 8 bytes. */
> @@ -2306,8 +2334,17 @@ logical_ring_default_vfuncs(struct intel_engine_cs *engine)
>   	engine->request_alloc = execlists_request_alloc;
>   
>   	engine->emit_flush = gen8_emit_flush;
> -	engine->emit_breadcrumb = gen8_emit_breadcrumb;
> -	engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_sz;
> +
> +	if (engine->id == VCS || engine->id == VCS2)

What about VCS3 or 4? Hint use engine class.

And what about RCS and VECS?

> +	{
> +		engine->emit_breadcrumb = gen8_emit_breadcrumb_vcs;
> +		engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_vcs_sz;
> +	}
> +	else
> +	{
> +		engine->emit_breadcrumb = gen8_emit_breadcrumb;
> +		engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_sz;
> +	}
>   
>   	engine->set_default_submission = intel_execlists_set_default_submission;
>   
> 

Looks like the patch should be squashed with the one which implements 
watchdog emit start/end? I mean if the whole setup has broken edge cases 
without this.

Regards,

Tvrtko

* Re: drm/i915/watchdog: move emit_stop_watchdog until the very end of the ring commands
  2019-01-07 12:50   ` Tvrtko Ursulin
@ 2019-01-07 12:54     ` Chris Wilson
  2019-01-07 13:01       ` Tvrtko Ursulin
  2019-01-11  2:25     ` Carlos Santa
  1 sibling, 1 reply; 48+ messages in thread
From: Chris Wilson @ 2019-01-07 12:54 UTC (permalink / raw)
  To: Carlos Santa, Tvrtko Ursulin, intel-gfx; +Cc: Michel Thierry

Quoting Tvrtko Ursulin (2019-01-07 12:50:24)
> 
> On 05/01/2019 02:40, Carlos Santa wrote:
> > +static void gen8_emit_breadcrumb_vcs(struct i915_request *request, u32 *cs)
> > +{
> > +     /* w/a: bit 5 needs to be zero for MI_FLUSH_DW address. */
> > +     BUILD_BUG_ON(I915_GEM_HWS_INDEX_ADDR & (1 << 5));
> > +
> > +     cs = gen8_emit_ggtt_write(cs, request->global_seqno,
> > +                               intel_hws_seqno_address(request->engine));
> > +     *cs++ = MI_USER_INTERRUPT;
> > +     *cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > +
> > +     // stop_watchdog at the very end of the ring commands
> > +     if (request->gem_context->__engine[VCS].watchdog_threshold != 0)
> 
> VCS is wrong. Whole check needs to be to_intel_context(ctx, 
> engine)->watchdog_threshold I think.

You too! rq->hw_context->watchdog_threshold. It's as if hw_context may
not even be part of gem_context...
-Chris

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-07 12:16     ` Chris Wilson
@ 2019-01-07 12:58       ` Tvrtko Ursulin
  2019-01-07 13:02         ` Chris Wilson
  0 siblings, 1 reply; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 12:58 UTC (permalink / raw)
  To: Chris Wilson, Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 07/01/2019 12:16, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-01-07 11:58:13)
>>
>> Hi,
>>
>> This series has not been recognized by Patchwork as such, nor are the
>> patches numbered. Have you used git format-patch -<N> --cover-letter and
>> git send-email to send it out?
>>
>> Rest inline.
>>
>> On 05/01/2019 02:39, Carlos Santa wrote:
>>> +static void gen8_watchdog_irq_handler(unsigned long data)
>>> +{
>>> +     struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
>>> +     struct drm_i915_private *dev_priv = engine->i915;
>>> +     enum forcewake_domains fw_domains;
>>> +     u32 current_seqno;
>>> +
>>> +     switch (engine->id) {
>>> +     default:
>>> +             MISSING_CASE(engine->id);
>>> +             /* fall through */
>>> +     case RCS:
>>> +             fw_domains = FORCEWAKE_RENDER;
>>> +             break;
>>> +     case VCS:
>>> +     case VCS2:
>>> +     case VECS:
>>> +             fw_domains = FORCEWAKE_MEDIA;
>>> +             break;
>>> +     }
>>> +
>>> +     intel_uncore_forcewake_get(dev_priv, fw_domains);
>>
>> I'd be tempted to drop this and just use I915_WRITE. It doesn't feel
>> like there is any performance to be gained with it and it embeds too
>> much knowledge here.
> 
> No, no, no. Let's not reintroduce a fw inside irq context on a frequent
> timer again.

Tasklet and hopefully watchdog timeouts are not frequent. :)

> Rule of thumb for fw_get:
> gen6+: 10us to 50ms.
> gen8+: 10us to 500us.
> 
> And then we don't release fw for 1ms after the fw_put. So we basically
> prevent GT powersaving while the watchdog is active. That strikes me as
> hopefully an unintended consequence.
> 
> The fw_get will be required if we actually hang, but for the timer
> check, we should be able to do without.

That would be nice, but it is needed by the watchdog disable/re-enable
logic, which I commented looks suspect to me, so maybe something can be
done about that.

But in general, I didn't quite get if you are opposed to my suggestion
not to open-code knowledge of fw domains here in favour of a simple
I915_WRITE, or to the whole concept of taking fw by any means here.

> And while on the topic of the timer irq, it should be forcibly cleared
> along intel_engine_park, so that we ensure it is not raised while the
> device/driver is supposed to be asleep. Or something to that effect.

I have raised the issue of syncing the potentially delayed tasklet, but 
yeah, could be that more is needed.

>>> +     current_seqno = intel_engine_get_seqno(engine);
>>> +
>>> +     /* did the request complete after the timer expired? */
>>> +     if (intel_engine_last_submit(engine) == current_seqno)
>>> +             goto fw_put;
>>> +
>>> +     if (engine->hangcheck.watchdog == current_seqno) {
>>> +             /* Make sure the active request will be marked as guilty */
>>> +             engine->hangcheck.stalled = true;
>>> +             engine->hangcheck.acthd = intel_engine_get_active_head(engine);
>>> +             engine->hangcheck.seqno = current_seqno;
>>> +
>>> +             /* And try to run the hangcheck_work as soon as possible */
>>> +             set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags);
>>> +             queue_delayed_work(system_long_wq,
>>> +                                &dev_priv->gpu_error.hangcheck_work,
>>> +                                round_jiffies_up_relative(HZ));
>>> +     } else {
>>> +             engine->hangcheck.watchdog = current_seqno;
>>
>> The logic above potentially handles my previous question? Could be if
>> batch 2 hangs. But..
> 
> Also, DO NOT USE HANGCHECK for this. The whole design was to be able to
> do the engine reset right away. (Now guc can't but that's known broken.)
> 
> Aside, we have to rewrite this entire logic anyway as the engine seqno
> and global_seqno are obsolete.

Btw, one thing I forgot to say: I did not focus on the hangcheck
interactions; I'll leave that to people more in the know.

Regards,

Tvrtko

* Re: drm/i915/watchdog: move emit_stop_watchdog until the very end of the ring commands
  2019-01-07 12:54     ` Chris Wilson
@ 2019-01-07 13:01       ` Tvrtko Ursulin
  0 siblings, 0 replies; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 13:01 UTC (permalink / raw)
  To: Chris Wilson, Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 07/01/2019 12:54, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-01-07 12:50:24)
>>
>> On 05/01/2019 02:40, Carlos Santa wrote:
>>> +static void gen8_emit_breadcrumb_vcs(struct i915_request *request, u32 *cs)
>>> +{
>>> +     /* w/a: bit 5 needs to be zero for MI_FLUSH_DW address. */
>>> +     BUILD_BUG_ON(I915_GEM_HWS_INDEX_ADDR & (1 << 5));
>>> +
>>> +     cs = gen8_emit_ggtt_write(cs, request->global_seqno,
>>> +                               intel_hws_seqno_address(request->engine));
>>> +     *cs++ = MI_USER_INTERRUPT;
>>> +     *cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
>>> +
>>> +     // stop_watchdog at the very end of the ring commands
>>> +     if (request->gem_context->__engine[VCS].watchdog_threshold != 0)
>>
>> VCS is wrong. Whole check needs to be to_intel_context(ctx,
>> engine)->watchdog_threshold I think.
> 
> You too! rq->hw_context->watchdog_threshold. It's as if hw_context may
> not even be part of gem_context...

Oops.. this is what happens when you just review and review - new stuff 
does not get ingrained in memory unless typing it enough. :)

Regards,

Tvrtko

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-07 12:58       ` Tvrtko Ursulin
@ 2019-01-07 13:02         ` Chris Wilson
  2019-01-07 13:12           ` Tvrtko Ursulin
  0 siblings, 1 reply; 48+ messages in thread
From: Chris Wilson @ 2019-01-07 13:02 UTC (permalink / raw)
  To: Carlos Santa, Tvrtko Ursulin, intel-gfx; +Cc: Michel Thierry

Quoting Tvrtko Ursulin (2019-01-07 12:58:39)
> 
> On 07/01/2019 12:16, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2019-01-07 11:58:13)
> >> On 05/01/2019 02:39, Carlos Santa wrote:
> >>> +static void gen8_watchdog_irq_handler(unsigned long data)
> >>> +{
> >>> +     struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> >>> +     struct drm_i915_private *dev_priv = engine->i915;
> >>> +     enum forcewake_domains fw_domains;
> >>> +     u32 current_seqno;
> >>> +
> >>> +     switch (engine->id) {
> >>> +     default:
> >>> +             MISSING_CASE(engine->id);
> >>> +             /* fall through */
> >>> +     case RCS:
> >>> +             fw_domains = FORCEWAKE_RENDER;
> >>> +             break;
> >>> +     case VCS:
> >>> +     case VCS2:
> >>> +     case VECS:
> >>> +             fw_domains = FORCEWAKE_MEDIA;
> >>> +             break;
> >>> +     }
> >>> +
> >>> +     intel_uncore_forcewake_get(dev_priv, fw_domains);
> >>
> >> I'd be tempted to drop this and just use I915_WRITE. It doesn't feel
> >> like there is any performance to be gained with it and it embeds too
> >> much knowledge here.
> > 
> > No, no, no. Let's not reintroduce a fw inside irq context on a frequent
> > timer again.
> 
> Tasklet and hopefully watchdog timeouts are not frequent. :)

I thought the typical value mentioned elsewhere was a 1ms watchdog. Some
might say why even use a watchdog for longer than that as a hrtimer will
be more efficient (coupling in with other timer activity) ;)
-Chris

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-07 13:02         ` Chris Wilson
@ 2019-01-07 13:12           ` Tvrtko Ursulin
  0 siblings, 0 replies; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 13:12 UTC (permalink / raw)
  To: Chris Wilson, Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 07/01/2019 13:02, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-01-07 12:58:39)
>>
>> On 07/01/2019 12:16, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2019-01-07 11:58:13)
>>>> On 05/01/2019 02:39, Carlos Santa wrote:
>>>>> +static void gen8_watchdog_irq_handler(unsigned long data)
>>>>> +{
>>>>> +     struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
>>>>> +     struct drm_i915_private *dev_priv = engine->i915;
>>>>> +     enum forcewake_domains fw_domains;
>>>>> +     u32 current_seqno;
>>>>> +
>>>>> +     switch (engine->id) {
>>>>> +     default:
>>>>> +             MISSING_CASE(engine->id);
>>>>> +             /* fall through */
>>>>> +     case RCS:
>>>>> +             fw_domains = FORCEWAKE_RENDER;
>>>>> +             break;
>>>>> +     case VCS:
>>>>> +     case VCS2:
>>>>> +     case VECS:
>>>>> +             fw_domains = FORCEWAKE_MEDIA;
>>>>> +             break;
>>>>> +     }
>>>>> +
>>>>> +     intel_uncore_forcewake_get(dev_priv, fw_domains);
>>>>
>>>> I'd be tempted to drop this and just use I915_WRITE. It doesn't feel
>>>> like there is any performance to be gained with it and it embeds too
>>>> much knowledge here.
>>>
>>> No, no, no. Let's not reintroduce a fw inside irq context on a frequent
>>> timer again.
>>
>> Tasklet and hopefully watchdog timeouts are not frequent. :)
> 
> I thought the typical value mentioned elsewhere was a 1ms watchdog. Some

Commit message to a patch from this series says 60ms is recommended.

> might say why even use a watchdog for longer than that as a hrtimer will
> be more efficient (coupling in with other timer activity) ;)

For the normal case (longish batches, very few timeouts) it feels more 
efficient to have ten or so extra dwords with each request than to 
fiddle with hrtimers, no?

But interactions with preemption and future time-slicing, yeah, I don't 
know. Maybe a single solution will be simpler at that point.

Regards,

Tvrtko

* Re: drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2019-01-07 12:50     ` Chris Wilson
@ 2019-01-07 13:39       ` Tvrtko Ursulin
  2019-01-07 13:51         ` Chris Wilson
  0 siblings, 1 reply; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 13:39 UTC (permalink / raw)
  To: Chris Wilson, Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 07/01/2019 12:50, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-01-07 12:38:47)
>> On 05/01/2019 02:39, Carlos Santa wrote:
>>> +/* Return the timer count threshold in microseconds. */
>>> +int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
>>> +                               struct drm_i915_gem_context_param *args)
>>> +     if (__copy_to_user(u64_to_user_ptr(args->value),
>>> +                        &threshold_in_us,
>>> +                        sizeof(threshold_in_us))) {
>>
>> I think user pointer hasn't been verified for write access yet so
>> standard copy_to_user should be used.
> 
> The starting point here is that this bakes in OTHER_CLASS as uABI when
> clearly it is not uABI. The array must be variable size and so the first
> pass is to report the size to userspace (if they pass args->size = 0)
> and then to copy up to the user requested size. Since it is a variable
> array with uncertain indexing, the output should be along the lines of
> [class, instance, watchdog]. But what about virtual engine? Good
> question. (Remember set must follow the same pattern as get, so you can
> always do a rmw cleanly.)
> 
> I guess we just have to accept class == -1 as meaning use instance as an
> index into ctx->engines[]. Tvrtko? Virtual engine would be reported as
> (-1, 0, watchdog), and settable similarly with the other engines being
> addressable either by index or by class-instance.
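
A userspace-side sketch of the read-modify-write step in the query-then-set
pattern described above. The ioctl plumbing is elided, and the
`watchdog_entry` triple layout (including the class == -1 convention) is a
hypothetical stand-in for illustration, not the final uABI:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical [class, instance, watchdog] triple; not the final uABI. */
struct watchdog_entry {
	int16_t class;       /* -1 means "instance indexes ctx->engines[]" */
	uint16_t instance;
	uint32_t timeout_us;
};

/*
 * Read-modify-write one engine's timeout in a snapshot of the array, as
 * GET followed by SET would allow.  Returns 0 on success, -1 if the
 * (class, instance) pair is not present.
 */
static int rmw_watchdog(struct watchdog_entry *entries, size_t count,
			int16_t class, uint16_t instance,
			uint32_t timeout_us)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (entries[i].class == class &&
		    entries[i].instance == instance) {
			entries[i].timeout_us = timeout_us;
			return 0;
		}
	}
	return -1;
}
```

In a real client, the array would first be sized with args->size = 0, read
in full, edited as above, and written back unchanged except for the one
entry.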

Or use an index-based addressing mode if an engine map is set? Index 0 
would only be valid if load balancing is enabled on top.

So an array of structs like:

struct drm_i915_watchdog_timeout {
	union {
		struct {
			u16 class;
			u16 instance;
		};
		u32 index;
	};
	u32 timeout_us;
};

?
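
The proposed layout can be sketched in plain C11, with fixed-width types
standing in for the kernel's u16/u32; everything here is illustrative, not
the final uABI:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the proposed per-engine entry.  C11 anonymous struct/union
 * gives the dual (class, instance) / index addressing. */
struct drm_i915_watchdog_timeout {
	union {
		struct {
			uint16_t class;
			uint16_t instance;
		};
		uint32_t index;   /* flat index into ctx->engines[] */
	};
	uint32_t timeout_us;
};

/* Build an entry addressed by (class, instance). */
static struct drm_i915_watchdog_timeout
wdt_by_class_instance(uint16_t class, uint16_t instance,
		      uint32_t timeout_us)
{
	struct drm_i915_watchdog_timeout wt;

	memset(&wt, 0, sizeof(wt));
	wt.class = class;
	wt.instance = instance;
	wt.timeout_us = timeout_us;
	return wt;
}
```

The union keeps the entry at 8 bytes whichever addressing mode userspace
picks.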

We can allow standalone set param and via extension as well.

Regards,

Tvrtko

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-07 11:58   ` Tvrtko Ursulin
  2019-01-07 12:16     ` Chris Wilson
@ 2019-01-07 13:43     ` Tvrtko Ursulin
  2019-01-07 13:57       ` Chris Wilson
  2019-01-24  0:13     ` Carlos Santa
  2 siblings, 1 reply; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 13:43 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 07/01/2019 11:58, Tvrtko Ursulin wrote:

[snip]

>> Note about future interaction with preemption: Preemption could happen
>> in a command sequence prior to watchdog counter getting disabled,
>> resulting in watchdog being triggered following preemption (e.g. when
>> watchdog had been enabled in the low priority batch). The driver will
>> need to explicitly disable the watchdog counter as part of the
>> preemption sequence.
> 
> Does the series take care of preemption?

I did not find that it does.

So this is something which definitely needs to be handled.

And it is not only disabling the watchdog on preemption, but there is 
also a question of restoring it when/before a preempted context 
continues execution.

(Can we ignore the thresholds changing in between? Probably yes. 
Otherwise we'll have to record it in the request structure.)

Regards,

Tvrtko

* Re: drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2019-01-07 13:39       ` Tvrtko Ursulin
@ 2019-01-07 13:51         ` Chris Wilson
  0 siblings, 0 replies; 48+ messages in thread
From: Chris Wilson @ 2019-01-07 13:51 UTC (permalink / raw)
  To: Carlos Santa, Tvrtko Ursulin, intel-gfx; +Cc: Michel Thierry

Quoting Tvrtko Ursulin (2019-01-07 13:39:24)
> 
> On 07/01/2019 12:50, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2019-01-07 12:38:47)
> >> On 05/01/2019 02:39, Carlos Santa wrote:
> >>> +/* Return the timer count threshold in microseconds. */
> >>> +int i915_gem_context_get_watchdog(struct i915_gem_context *ctx,
> >>> +                               struct drm_i915_gem_context_param *args)
> >>> +     if (__copy_to_user(u64_to_user_ptr(args->value),
> >>> +                        &threshold_in_us,
> >>> +                        sizeof(threshold_in_us))) {
> >>
> >> I think user pointer hasn't been verified for write access yet so
> >> standard copy_to_user should be used.
> > 
> > The starting point here is that this bakes in OTHER_CLASS as uABI when
> > clearly it is not uABI. The array must be variable size and so the first
> > pass is to report the size to userspace (if they pass args->size = 0)
> > and then to copy up to the user requested size. Since it is a variable
> > array with uncertain indexing, the output should be along the lines of
> > [class, instance, watchdog]. But what about virtual engine? Good
> > question. (Remember set must follow the same pattern as get, so you can
> > always do a rmw cleanly.)
> > 
> > I guess we just have to accept class == -1 as meaning use instance as an
> > index into ctx->engines[]. Tvrtko? Virtual engine would be reported as
> > (-1, 0, watchdog), and settable similarly with the other engines being
> > addressable either by index or by class-instance.
> 
> Or use index based addressing mode if engine map is set? Index 0 only 
> being valid if load balancing is enabled on top.

I was trying to avoid tying ctx->engines[] into such an extensive set of
API changes, but given that it affects execbuffer_ioctl, the die is
cast. With that approach, every time we use class-instance (with respect
to a context) we should also swap it over to using an index when
ctx->engines[] is set.

The advantage to using a flag (in this case instance=-1) is that we can
then write igt (or library routines) oblivious to the state of
ctx->engines[].

From an application pov, they can ignore it, as they will be doing the
SET_ENGINES as part of their initial context construction (one presumes)
and so will always have ctx->engines[] known to be active.
-Chris

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-07 13:43     ` Tvrtko Ursulin
@ 2019-01-07 13:57       ` Chris Wilson
  2019-01-07 16:58         ` Tvrtko Ursulin
  0 siblings, 1 reply; 48+ messages in thread
From: Chris Wilson @ 2019-01-07 13:57 UTC (permalink / raw)
  To: Carlos Santa, Tvrtko Ursulin, intel-gfx; +Cc: Michel Thierry

Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
> 
> On 07/01/2019 11:58, Tvrtko Ursulin wrote:
> 
> [snip]
> 
> >> Note about future interaction with preemption: Preemption could happen
> >> in a command sequence prior to watchdog counter getting disabled,
> >> resulting in watchdog being triggered following preemption (e.g. when
> >> watchdog had been enabled in the low priority batch). The driver will
> >> need to explicitly disable the watchdog counter as part of the
> >> preemption sequence.
> > 
> > Does the series take care of preemption?
> 
> I did not find that it does.

Oh. I hoped that the watchdog was saved as part of the context... Then
despite preemption, the timeout would resume from where we left off as
soon as it was back on the gpu.

If the timeout remaining was context saved it would be much simpler (at
least on first glance), please say it is.
-Chris

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-07 13:57       ` Chris Wilson
@ 2019-01-07 16:58         ` Tvrtko Ursulin
  2019-01-07 18:31           ` Chris Wilson
                             ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 16:58 UTC (permalink / raw)
  To: Chris Wilson, Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 07/01/2019 13:57, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
>>
>> On 07/01/2019 11:58, Tvrtko Ursulin wrote:
>>
>> [snip]
>>
>>>> Note about future interaction with preemption: Preemption could happen
>>>> in a command sequence prior to watchdog counter getting disabled,
>>>> resulting in watchdog being triggered following preemption (e.g. when
>>>> watchdog had been enabled in the low priority batch). The driver will
>>>> need to explicitly disable the watchdog counter as part of the
>>>> preemption sequence.
>>>
>>> Does the series take care of preemption?
>>
>> I did not find that it does.
> 
> Oh. I hoped that the watchdog was saved as part of the context... Then
> despite preemption, the timeout would resume from where we left off as
> soon as it was back on the gpu.
> 
> If the timeout remaining was context saved it would be much simpler (at
> least on first glance), please say it is.

I made my comments going only by the text from the commit message and 
the absence of any preemption special handling.

Having read the spec, the situation seems like this:

  * Watchdog control and threshold register are context saved and restored.

  * On a context switch watchdog counter is reset to zero and 
automatically disabled until enabled by a context restore or explicitly.

So it sounds like the commit message could be wrong that special handling 
is needed from this direction. But read till the end for the restriction 
listed.

  * Watchdog counter is reset to zero and is not accumulated across 
multiple submissions of the same context (due to preemption).

I read this as - after preemption a context gets a new full timeout 
allocation. Or in other words, if a context is preempted N times, its 
cumulative watchdog timeout will be N * the set value.

This could be theoretically exploitable to bypass the timeout. If a 
client sets up two contexts with prio -1 and -2, and keeps submitting 
periodic no-op batches against the prio -1 context, while the prio -2 
one is its own hog, then the prio -2 context defeats the watchdog timer. 
I think.. I would appreciate it if someone challenged this conclusion.
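
The cumulative-budget arithmetic, as a sketch. The restart-from-zero
behaviour is my reading of the bspec under discussion, not a verified
fact, and the numbers in the usage are illustrative:

```c
#include <stdint.h>

/*
 * Worst-case GPU time a context can consume before the watchdog fires,
 * assuming the counter restarts from zero on every resubmission after
 * preemption: each of the N preemptions grants a fresh allocation, plus
 * the final run that actually reaches the threshold.
 */
static uint64_t worst_case_budget_us(uint64_t threshold_us,
				     unsigned int preemptions)
{
	return threshold_us * (uint64_t)(preemptions + 1);
}
```

With the 60ms value from the commit message, nine preemptions would 
stretch the effective budget to 600ms.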

And finally there is one programming restriction which says:

  * SW must not preempt the workload which has watchdog enabled. Either 
it must:

a) disable preemption for that workload completely, or
b) disable the watchdog via mmio write before any write to ELSP

This seems in contradiction with the statement that the counter gets 
disabled on context switch and stays disabled.

I did not spot anything like this in the series. So it would seem the 
commit message is correct after all.

It would be good if someone could re-read the bspec text on register 
0x2178 to double check what I wrote.

Regards,

Tvrtko

* Re: drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2019-01-07 12:38   ` Tvrtko Ursulin
  2019-01-07 12:50     ` Chris Wilson
@ 2019-01-07 17:00     ` Tvrtko Ursulin
  2019-01-07 17:20       ` Tvrtko Ursulin
  1 sibling, 1 reply; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 17:00 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 07/01/2019 12:38, Tvrtko Ursulin wrote:

[snip]

>> +#define GEN8_TIMESTAMP_CNTS_PER_USEC 12
>> +#define GEN9_LP_TIMESTAMP_CNTS_PER_USEC 19
>> +static inline u32 cs_timestamp_in_us(struct drm_i915_private *dev_priv)
>> +{
>> +    u32 cs_timestamp_base = dev_priv->cs_timestamp_base;
>> +
>> +    if (cs_timestamp_base)
>> +        return cs_timestamp_base;
>> +
>> +    switch (INTEL_GEN(dev_priv)) {
>> +    default:
>> +        MISSING_CASE(INTEL_GEN(dev_priv));
>> +        /* fall through */
>> +    case 9:
>> +        cs_timestamp_base = IS_GEN9_LP(dev_priv) ?
>> +                    GEN9_LP_TIMESTAMP_CNTS_PER_USEC :
>> +                    GEN8_TIMESTAMP_CNTS_PER_USEC;
>> +        break;
>> +    case 8:
>> +        cs_timestamp_base = GEN8_TIMESTAMP_CNTS_PER_USEC;
>> +        break;
>> +    }
>> +
>> +    dev_priv->cs_timestamp_base = cs_timestamp_base;
>> +    return cs_timestamp_base;
>> +}
> 
> We already have RUNTIME_INFO(i915)->cs_timestamp_frequency_khz and 
> read_timestamp_frequency which sets it.

Here I missed the mark, please ignore.

Regards,

Tvrtko

* Re: drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  2019-01-07 17:00     ` Tvrtko Ursulin
@ 2019-01-07 17:20       ` Tvrtko Ursulin
  0 siblings, 0 replies; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-07 17:20 UTC (permalink / raw)
  To: Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 07/01/2019 17:00, Tvrtko Ursulin wrote:
> 
> On 07/01/2019 12:38, Tvrtko Ursulin wrote:
> 
> [snip]
> 
>>> +#define GEN8_TIMESTAMP_CNTS_PER_USEC 12
>>> +#define GEN9_LP_TIMESTAMP_CNTS_PER_USEC 19
>>> +static inline u32 cs_timestamp_in_us(struct drm_i915_private *dev_priv)
>>> +{
>>> +    u32 cs_timestamp_base = dev_priv->cs_timestamp_base;
>>> +
>>> +    if (cs_timestamp_base)
>>> +        return cs_timestamp_base;
>>> +
>>> +    switch (INTEL_GEN(dev_priv)) {
>>> +    default:
>>> +        MISSING_CASE(INTEL_GEN(dev_priv));
>>> +        /* fall through */
>>> +    case 9:
>>> +        cs_timestamp_base = IS_GEN9_LP(dev_priv) ?
>>> +                    GEN9_LP_TIMESTAMP_CNTS_PER_USEC :
>>> +                    GEN8_TIMESTAMP_CNTS_PER_USEC;
>>> +        break;
>>> +    case 8:
>>> +        cs_timestamp_base = GEN8_TIMESTAMP_CNTS_PER_USEC;
>>> +        break;
>>> +    }
>>> +
>>> +    dev_priv->cs_timestamp_base = cs_timestamp_base;
>>> +    return cs_timestamp_base;
>>> +}
>>
>> We already have RUNTIME_INFO(i915)->cs_timestamp_frequency_khz and 
>> read_timestamp_frequency which sets it.
> 
> Here I missed the mark, please ignore.

One thing I am not sure though is whether CTC_MODE register affects this 
clock. In other words should this code read it to get the correct 
effective tick duration.

Bspec chain of Watchdog Timers (14970) -> Reported Timestamp Count 
(12610) ->
Timestamp Bases (13569) does not answer that for me.
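
For reference, a sketch of the microsecond-to-tick conversion implied by
the quoted constants (12 counts/us on gen8 big-core, 19 on gen9 LP).
Whether CTC_MODE scales the effective tick is exactly the open question,
so treat the constants as assumptions:

```c
#include <stdint.h>

#define GEN8_TIMESTAMP_CNTS_PER_USEC	12
#define GEN9_LP_TIMESTAMP_CNTS_PER_USEC	19

/*
 * Convert a watchdog timeout in microseconds into timestamp-counter
 * ticks for the threshold register.  Ignores any CTC_MODE scaling,
 * which is the unresolved point above.
 */
static uint32_t watchdog_us_to_ticks(uint32_t timeout_us,
				     uint32_t ticks_per_us)
{
	return timeout_us * ticks_per_us;
}
```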

Regards,

Tvrtko

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-07 16:58         ` Tvrtko Ursulin
@ 2019-01-07 18:31           ` Chris Wilson
  2019-01-11  0:47           ` Antonio Argenziano
  2019-01-11  2:58           ` Carlos Santa
  2 siblings, 0 replies; 48+ messages in thread
From: Chris Wilson @ 2019-01-07 18:31 UTC (permalink / raw)
  To: Carlos Santa, Tvrtko Ursulin, intel-gfx; +Cc: Michel Thierry

Quoting Tvrtko Ursulin (2019-01-07 16:58:24)
> And finally there is one programming restriction which says:
> 
>   * SW must not preempt the workload which has watchdog enabled. Either 
> it must:
> 
> a) disable preemption for that workload completely, or
> b) disable the watchdog via mmio write before any write to ELSP

Oh dear. I'm at a loss for words. Any ELSP write, *shudders*.
-Chris

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-07 16:58         ` Tvrtko Ursulin
  2019-01-07 18:31           ` Chris Wilson
@ 2019-01-11  0:47           ` Antonio Argenziano
  2019-01-11  8:22             ` Tvrtko Ursulin
  2019-01-11  2:58           ` Carlos Santa
  2 siblings, 1 reply; 48+ messages in thread
From: Antonio Argenziano @ 2019-01-11  0:47 UTC (permalink / raw)
  To: Tvrtko Ursulin, Chris Wilson, Carlos Santa, intel-gfx; +Cc: Michel Thierry



On 07/01/19 08:58, Tvrtko Ursulin wrote:
> 
> On 07/01/2019 13:57, Chris Wilson wrote:
>> Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
>>>
>>> On 07/01/2019 11:58, Tvrtko Ursulin wrote:
>>>
>>> [snip]
>>>
>>>>> Note about future interaction with preemption: Preemption could happen
>>>>> in a command sequence prior to watchdog counter getting disabled,
>>>>> resulting in watchdog being triggered following preemption (e.g. when
>>>>> watchdog had been enabled in the low priority batch). The driver will
>>>>> need to explicitly disable the watchdog counter as part of the
>>>>> preemption sequence.
>>>>
>>>> Does the series take care of preemption?
>>>
>>> I did not find that it does.
>>
>> Oh. I hoped that the watchdog was saved as part of the context... Then
>> despite preemption, the timeout would resume from where we left off as
>> soon as it was back on the gpu.
>>
>> If the timeout remaining was context saved it would be much simpler (at
>> least on first glance), please say it is.
> 
> I made my comments going only by the text from the commit message and 
> the absence of any preemption special handling.
> 
> Having read the spec, the situation seems like this:
> 
>   * Watchdog control and threshold register are context saved and restored.
> 
>   * On a context switch watchdog counter is reset to zero and 
> automatically disabled until enabled by a context restore or explicitly.
> 
> So it sounds the commit message could be wrong that special handling is 
> needed from this direction. But read till the end on the restriction 
> listed.
> 
>   * Watchdog counter is reset to zero and is not accumulated across 
> multiple submissions of the same context (due to preemption).
> 
> I read this as - after preemption a context gets a new full timeout 
> allocation. Or in other words, if a context is preempted N times, its 
> cumulative watchdog timeout will be N * the set value.
> 
> This could be theoretically exploitable to bypass the timeout. If a 
> client sets up two contexts with prio -1 and -2, and keeps submitting 
> periodic no-op batches against the prio -1 context, while the prio -2 
> one is its own hog, then the prio -2 context defeats the watchdog timer. 
> I think.. I would appreciate it if someone challenged this conclusion.

I think you are right that it is a possibility, but is that a problem? 
The client can just not set the threshold to bypass the timeout. Also, 
because you need the hanging batch to be simply preemptible, you cannot 
disrupt any work from another client that is higher priority. This is 
pretty much the same behavior as hangcheck IIRC, so something we already 
accept.

> 
> And finally there is one programming restriction which says:
> 
>   * SW must not preempt the workload which has watchdog enabled. Either 
> it must:
> 
> a) disable preemption for that workload completely, or
> b) disable the watchdog via mmio write before any write to ELSP
> 
> This seems in contradiction with the statement that the counter gets 
> disabled on context switch and stays disabled.
> 
> I did not spot anything like this in the series. So it would seem the 
> commit message is correct after all.
> 
> It would be good if someone could re-read the bspec text on register 
> 0x2178 to double check what I wrote.

The way I read it is that the restriction applies only to some platforms 
where the 'normal' description doesn't apply.

Antonio

> 
> Regards,
> 
> Tvrtko

* Re: drm/i915/watchdog: move emit_stop_watchdog until the very end of the ring commands
  2019-01-07 12:50   ` Tvrtko Ursulin
  2019-01-07 12:54     ` Chris Wilson
@ 2019-01-11  2:25     ` Carlos Santa
  1 sibling, 0 replies; 48+ messages in thread
From: Carlos Santa @ 2019-01-11  2:25 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: Michel Thierry

On Mon, 2019-01-07 at 12:50 +0000, Tvrtko Ursulin wrote:
> On 05/01/2019 02:40, Carlos Santa wrote:
> > From: Michel Thierry <michel.thierry@intel.com>
> > 
> > On command streams that could potentially hang the GPU after a last
> > flush command, it's best not to cancel the watchdog
> > until after all commands have executed.
> > 
> > Patch shared by Michel Thierry through IIRC after reproduction on
> 
> Joonas pointed out on IRC that IRC is called IRC. :)
> 
> > my local setup.
> > 
> > Tested-by: Carlos Santa <carlos.santa@intel.com>
> > CC: Antonio Argenziano <antonio.argenziano@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> > Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> > ---
> >   drivers/gpu/drm/i915/intel_lrc.c | 53
> > +++++++++++++++++++++++++++-----
> >   1 file changed, 45 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c
> > b/drivers/gpu/drm/i915/intel_lrc.c
> > index 0afcbeb18329..25ba5fcc9466 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -1885,8 +1885,8 @@ static int gen8_emit_bb_start(struct
> > i915_request *rq,
> >   		GEM_BUG_ON(!engine->emit_start_watchdog ||
> >   			   !engine->emit_stop_watchdog);
> >   
> > -		/* + start_watchdog (6) + stop_watchdog (4) */
> > -		num_dwords += 10;
> > +		/* + start_watchdog (6) */
> > +		num_dwords += 6;
> >   		watchdog_running = true;
> >           }
> >   
> > @@ -1927,10 +1927,7 @@ static int gen8_emit_bb_start(struct
> > i915_request *rq,
> >   	*cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE;
> >   	*cs++ = MI_NOOP;
> >   
> > -	if (watchdog_running) {
> > -		/* Cancel watchdog timer */
> > -		cs = engine->emit_stop_watchdog(rq, cs);
> > -	}
> > +	// XXX: emit_stop_watchdog happens in gen8_emit_breadcrumb_vcs
> 
> No C++ comments please. And what does XXX mean? Doesn't feel like it 
> belongs.
> 
> >   
> >   	intel_ring_advance(rq, cs);
> >   	return 0;
> > @@ -2189,6 +2186,37 @@ static void gen8_emit_breadcrumb(struct
> > i915_request *request, u32 *cs)
> >   }
> >   static const int gen8_emit_breadcrumb_sz = 6 + WA_TAIL_DWORDS;
> >   
> > +static void gen8_emit_breadcrumb_vcs(struct i915_request *request,
> > u32 *cs)
> > +{
> > +	/* w/a: bit 5 needs to be zero for MI_FLUSH_DW address. */
> > +	BUILD_BUG_ON(I915_GEM_HWS_INDEX_ADDR & (1 << 5));
> > +
> > +	cs = gen8_emit_ggtt_write(cs, request->global_seqno,
> > +				  intel_hws_seqno_address(request-
> > >engine));
> > +	*cs++ = MI_USER_INTERRUPT;
> > +	*cs++ = MI_ARB_ON_OFF | MI_ARB_ENABLE;
> > +
> > +	// stop_watchdog at the very end of the ring commands
> > +	if (request->gem_context->__engine[VCS].watchdog_threshold !=
> > 0)
> 
> VCS is wrong. Whole check needs to be to_intel_context(ctx, 
> engine)->watchdog_threshold I think.
> 
> > +	{
> > +		/* Cancel watchdog timer */
> > +		GEM_BUG_ON(!request->engine->emit_stop_watchdog);
> > +		cs = request->engine->emit_stop_watchdog(request, cs);
> > +	}
> > +	else
> > +	{
> 
> Coding style is wrong (curly braces for if else).
> 
> > +		*cs++ = MI_NOOP;
> > +		*cs++ = MI_NOOP;
> > +		*cs++ = MI_NOOP;
> > +		*cs++ = MI_NOOP;
> > +	}
> > +
> > +	request->tail = intel_ring_offset(request, cs);
> > +	assert_ring_tail_valid(request->ring, request->tail);
> > +	gen8_emit_wa_tail(request, cs);
> > +}
> > +static const int gen8_emit_breadcrumb_vcs_sz = 6 + WA_TAIL_DWORDS
> > + 4; //+4 for optional stop_watchdog
> > +
> >   static void gen8_emit_breadcrumb_rcs(struct i915_request
> > *request, u32 *cs)
> >   {
> >   	/* We're using qword write, seqno should be aligned to 8 bytes.
> > */
> > @@ -2306,8 +2334,17 @@ logical_ring_default_vfuncs(struct
> > intel_engine_cs *engine)
> >   	engine->request_alloc = execlists_request_alloc;
> >   
> >   	engine->emit_flush = gen8_emit_flush;
> > -	engine->emit_breadcrumb = gen8_emit_breadcrumb;
> > -	engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_sz;
> > +
> > +	if (engine->id == VCS || engine->id == VCS2)
> 
> What about VCS3 or 4? Hint use engine class.
> 
> And what about RCS and VECS?
> 
> > +	{
> > +		engine->emit_breadcrumb = gen8_emit_breadcrumb_vcs;
> > +		engine->emit_breadcrumb_sz =
> > gen8_emit_breadcrumb_vcs_sz;
> > +	}
> > +	else
> > +	{
> > +		engine->emit_breadcrumb = gen8_emit_breadcrumb;
> > +		engine->emit_breadcrumb_sz = gen8_emit_breadcrumb_sz;
> > +	}
> >   
> >   	engine->set_default_submission =
> > intel_execlists_set_default_submission;
> >   
> > 
> 
> Looks like the patch should be squashed with the one which
> implements 
> watchdog emit start/end? I mean if the whole setup has broken edge
> cases 
> without this..

Ok, I'll rework the above and squash it with the watchdog emit/start
patch
thx, CS

> 
> Regards,
> 
> Tvrtko


* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-07 16:58         ` Tvrtko Ursulin
  2019-01-07 18:31           ` Chris Wilson
  2019-01-11  0:47           ` Antonio Argenziano
@ 2019-01-11  2:58           ` Carlos Santa
  2 siblings, 0 replies; 48+ messages in thread
From: Carlos Santa @ 2019-01-11  2:58 UTC (permalink / raw)
  To: Tvrtko Ursulin, Chris Wilson, intel-gfx; +Cc: Michel Thierry

On Mon, 2019-01-07 at 16:58 +0000, Tvrtko Ursulin wrote:
> On 07/01/2019 13:57, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
> > > 
> > > On 07/01/2019 11:58, Tvrtko Ursulin wrote:
> > > 
> > > [snip]
> > > 
> > > > > Note about future interaction with preemption: Preemption
> > > > > could happen
> > > > > in a command sequence prior to watchdog counter getting
> > > > > disabled,
> > > > > resulting in watchdog being triggered following preemption
> > > > > (e.g. when
> > > > > watchdog had been enabled in the low priority batch). The
> > > > > driver will
> > > > > need to explicitly disable the watchdog counter as part of
> > > > > the
> > > > > preemption sequence.
> > > > 
> > > > Does the series take care of preemption?
> > > 
> > > I did not find that it does.
> > 
> > Oh. I hoped that the watchdog was saved as part of the context...
> > Then
> > despite preemption, the timeout would resume from where we left off
> > as
> > soon as it was back on the gpu.
> > 
> > If the timeout remaining was context saved it would be much simpler
> > (at
> > least on first glance), please say it is.

The watchdog timeout gets saved as part of the register state context, 
so it will still be enabled after coming back from preemption, but the 
timeout value will be reset back to the original MAX value that it was 
programmed with. At least that's what I remember from a discussion with 
Michel, but I can check again...

Regards,
Carlos

> 
> I made my comments going only by the text from the commit message
> and 
> the absence of any preemption special handling.
> 
> Having read the spec, the situation seems like this:
> 
>   * Watchdog control and threshold register are context saved and
> restored.
> 
>   * On a context switch watchdog counter is reset to zero and 
> automatically disabled until enabled by a context restore or
> explicitly.
> 
> So it sounds the commit message could be wrong that special handling
> is 
> needed from this direction. But read till the end on the restriction
> listed.
> 
>   * Watchdog counter is reset to zero and is not accumulated across 
> multiple submissions of the same context (due to preemption).
> 
> I read this as - after preemption a context gets a new full timeout 
> allocation. Or in other words, if a context is preempted N times, its 
> cumulative watchdog timeout will be N * the set value.
> 
> This could be theoretically exploitable to bypass the timeout. If a 
> client sets up two contexts with prio -1 and -2, and keeps submitting 
> periodic no-op batches against the prio -1 context, while the prio -2 
> one is its own hog, then the prio -2 context defeats the watchdog 
> timer. I think.. I would appreciate it if someone challenged this 
> conclusion.
> 
> And finally there is one programming restriction which says:
> 
>   * SW must not preempt the workload which has watchdog enabled.
> Either 
> it must:
> 
> a) disable preemption for that workload completely, or
> b) disable the watchdog via mmio write before any write to ELSP
> 
> This seems it contradiction with the statement that the counter gets 
> disabled on context switch and stays disabled.
> 
> I did not spot anything like this in the series. So it would seem
> the 
> commit message is correct after all.
> 
> It would be good if someone could re-read the bspec text on register 
> 0x2178 to double check what I wrote.
> 
> Regards,
> 
> Tvrtko

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-11  0:47           ` Antonio Argenziano
@ 2019-01-11  8:22             ` Tvrtko Ursulin
  2019-01-11 17:31               ` Antonio Argenziano
  0 siblings, 1 reply; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-11  8:22 UTC (permalink / raw)
  To: Antonio Argenziano, Chris Wilson, Carlos Santa, intel-gfx; +Cc: Michel Thierry


On 11/01/2019 00:47, Antonio Argenziano wrote:
> On 07/01/19 08:58, Tvrtko Ursulin wrote:
>> On 07/01/2019 13:57, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
>>>>
>>>> On 07/01/2019 11:58, Tvrtko Ursulin wrote:
>>>>
>>>> [snip]
>>>>
>>>>>> Note about future interaction with preemption: Preemption could 
>>>>>> happen
>>>>>> in a command sequence prior to watchdog counter getting disabled,
>>>>>> resulting in watchdog being triggered following preemption (e.g. when
>>>>>> watchdog had been enabled in the low priority batch). The driver will
>>>>>> need to explicitly disable the watchdog counter as part of the
>>>>>> preemption sequence.
>>>>>
>>>>> Does the series take care of preemption?
>>>>
>>>> I did not find that it does.
>>>
>>> Oh. I hoped that the watchdog was saved as part of the context... Then
>>> despite preemption, the timeout would resume from where we left off as
>>> soon as it was back on the gpu.
>>>
>>> If the timeout remaining was context saved it would be much simpler (at
>>> least on first glance), please say it is.
>>
>> I made my comments going only by the text from the commit message and 
>> the absence of any preemption special handling.
>>
>> Having read the spec, the situation seems like this:
>>
>>   * Watchdog control and threshold register are context saved and 
>> restored.
>>
>>   * On a context switch watchdog counter is reset to zero and 
>> automatically disabled until enabled by a context restore or explicitly.
>>
>> So it sounds the commit message could be wrong that special handling 
>> is needed from this direction. But read till the end on the 
>> restriction listed.
>>
>>   * Watchdog counter is reset to zero and is not accumulated across 
>> multiple submission of the same context (due preemption).
>>
>> I read this as - after preemption contexts gets a new full timeout 
>> allocation. Or in other words, if a context is preempted N times, it's 
>> cumulative watchdog timeout will be N * set value.
>>
>> This could be theoretically exploitable to bypass the timeout. If a 
>> client sets up two contexts with prio -1 and -2, and keeps submitting 
>> periodical no-op batches against prio -1 context, while prio -2 is 
>> it's own hog, then prio -2 context defeats the watchdog timer. I 
>> think.. would appreciate is someone challenged this conclusion.
> 
> I think you are right that is a possibility but, is that a problem? The 
> client can just not set the threshold to bypass the timeout. Also 
> because you need the hanging batch to be simply preemptible, you cannot 
> disrupt any work from another client that is higher priority. This is 

But I think a higher priority client can have the same effect on the 
lower priority one purely by accident, no?

As a real world example, user kicks off a background transcoding job, 
which happens to use prio -2, and uses the watchdog timer.

At the same time user watches a video from a player of normal priority. 
This causes periodic, say 24Hz, preemption events, due to frame decoding 
activity on the same engine as the transcoding client.

The question is whether this defeats the watchdog timer for the former. 
Then there are the questions of whether we can do something about it, 
and whether it really isn't a problem.

Maybe it is not disrupting higher priority clients but it is causing a 
time-unbounded power drain.

> pretty much the same behavior of hangcheck IIRC so something we already 
> accept.

You mean today's hangcheck wouldn't notice a hanging batch in the same 
scenario as above? If so, it sounds like a huge gap we need to try and fix.

>>
>> And finally there is one programming restriction which says:
>>
>>   * SW must not preempt the workload which has watchdog enabled. 
>> Either it must:
>>
>> a) disable preemption for that workload completely, or
>> b) disable the watchdog via mmio write before any write to ELSP
>>
>> This seems it contradiction with the statement that the counter gets 
>> disabled on context switch and stays disabled.
>>
>> I did not spot anything like this in the series. So it would seem the 
>> commit message is correct after all.
>>
>> It would be good if someone could re-read the bspec text on register 
>> 0x2178 to double check what I wrote.
> 
> The way I read it is that the restriction applies only to some platforms 
> where the 'normal' description doesn't apply.

You are right. Are the listed parts out in the field, so the series 
would have to handle this, or can we ignore it?

Regards,

Tvrtko

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-11  8:22             ` Tvrtko Ursulin
@ 2019-01-11 17:31               ` Antonio Argenziano
  2019-01-11 21:28                 ` John Harrison
  0 siblings, 1 reply; 48+ messages in thread
From: Antonio Argenziano @ 2019-01-11 17:31 UTC (permalink / raw)
  To: Tvrtko Ursulin, Chris Wilson, Carlos Santa, intel-gfx; +Cc: Michel Thierry



On 11/01/19 00:22, Tvrtko Ursulin wrote:
> 
> On 11/01/2019 00:47, Antonio Argenziano wrote:
>> On 07/01/19 08:58, Tvrtko Ursulin wrote:
>>> On 07/01/2019 13:57, Chris Wilson wrote:
>>>> Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
>>>>>
>>>>> On 07/01/2019 11:58, Tvrtko Ursulin wrote:
>>>>>
>>>>> [snip]
>>>>>
>>>>>>> Note about future interaction with preemption: Preemption could 
>>>>>>> happen
>>>>>>> in a command sequence prior to watchdog counter getting disabled,
>>>>>>> resulting in watchdog being triggered following preemption (e.g. 
>>>>>>> when
>>>>>>> watchdog had been enabled in the low priority batch). The driver 
>>>>>>> will
>>>>>>> need to explicitly disable the watchdog counter as part of the
>>>>>>> preemption sequence.
>>>>>>
>>>>>> Does the series take care of preemption?
>>>>>
>>>>> I did not find that it does.
>>>>
>>>> Oh. I hoped that the watchdog was saved as part of the context... Then
>>>> despite preemption, the timeout would resume from where we left off as
>>>> soon as it was back on the gpu.
>>>>
>>>> If the timeout remaining was context saved it would be much simpler (at
>>>> least on first glance), please say it is.
>>>
>>> I made my comments going only by the text from the commit message and 
>>> the absence of any preemption special handling.
>>>
>>> Having read the spec, the situation seems like this:
>>>
>>>   * Watchdog control and threshold register are context saved and 
>>> restored.
>>>
>>>   * On a context switch watchdog counter is reset to zero and 
>>> automatically disabled until enabled by a context restore or explicitly.
>>>
>>> So it sounds the commit message could be wrong that special handling 
>>> is needed from this direction. But read till the end on the 
>>> restriction listed.
>>>
>>>   * Watchdog counter is reset to zero and is not accumulated across 
>>> multiple submission of the same context (due preemption).
>>>
>>> I read this as - after preemption contexts gets a new full timeout 
>>> allocation. Or in other words, if a context is preempted N times, 
>>> it's cumulative watchdog timeout will be N * set value.
>>>
>>> This could be theoretically exploitable to bypass the timeout. If a 
>>> client sets up two contexts with prio -1 and -2, and keeps submitting 
>>> periodical no-op batches against prio -1 context, while prio -2 is 
>>> it's own hog, then prio -2 context defeats the watchdog timer. I 
>>> think.. would appreciate is someone challenged this conclusion.
>>
>> I think you are right that is a possibility but, is that a problem? 
>> The client can just not set the threshold to bypass the timeout. Also 
>> because you need the hanging batch to be simply preemptible, you 
>> cannot disrupt any work from another client that is higher priority. 
>> This is 
> 
> But I think higher priority client can have the same effect on the lower 
> priority purely by accident, no?
> 
> As a real world example, user kicks off an background transcoding job, 
> which happens to use prio -2, and uses the watchdog timer.
> 
> At the same time user watches a video from a player of normal priority. 
> This causes periodic, say 24Hz, preemption events, due frame decoding 
> activity on the same engine as the transcoding client.
> 
> Does this defeat the watchdog timer for the former is the question? Then 
> the questions of can we do something about it and whether it really 
> isn't a problem?

I guess it depends on whether you consider that timeout to be the 
maximum lifespan a workload can have or the maximum contiguous active time.

> 
> Maybe it is not disrupting higher priority clients but it is causing an 
> time unbound power drain.
> 
>> pretty much the same behavior of hangcheck IIRC so something we 
>> already accept.
> 
> You mean today hangcheck wouldn't notice a hanging batch in the same 
> scenario as above? If so it sounds like a huge gap we need to try and fix.

My understanding of it is that we only keep a record of what was running 
the last time hangcheck was run, so it is possible to trick it into 
resetting when a preemption occurs, but I could be missing something.

> 
>>>
>>> And finally there is one programming restriction which says:
>>>
>>>   * SW must not preempt the workload which has watchdog enabled. 
>>> Either it must:
>>>
>>> a) disable preemption for that workload completely, or
>>> b) disable the watchdog via mmio write before any write to ELSP
>>>
>>> This seems it contradiction with the statement that the counter gets 
>>> disabled on context switch and stays disabled.
>>>
>>> I did not spot anything like this in the series. So it would seem the 
>>> commit message is correct after all.
>>>
>>> It would be good if someone could re-read the bspec text on register 
>>> 0x2178 to double check what I wrote.
>>
>> The way I read it is that the restriction applies only to some 
>> platforms where the 'normal' description doesn't apply.
> 
> You are right. Are the listed parts in the field so the series would 
> have to handle this or we can ignore it?

I think there is something we need to handle, e.g. BXT.

Antonio

> 
> Regards,
> 
> Tvrtko

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-11 17:31               ` Antonio Argenziano
@ 2019-01-11 21:28                 ` John Harrison
  2019-01-16 16:15                   ` Tvrtko Ursulin
  0 siblings, 1 reply; 48+ messages in thread
From: John Harrison @ 2019-01-11 21:28 UTC (permalink / raw)
  To: Antonio Argenziano, Tvrtko Ursulin, Chris Wilson, Carlos Santa,
	intel-gfx
  Cc: Michel Thierry


On 1/11/2019 09:31, Antonio Argenziano wrote:
>
> On 11/01/19 00:22, Tvrtko Ursulin wrote:
>>
>> On 11/01/2019 00:47, Antonio Argenziano wrote:
>>> On 07/01/19 08:58, Tvrtko Ursulin wrote:
>>>> On 07/01/2019 13:57, Chris Wilson wrote:
>>>>> Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
>>>>>>
>>>>>> On 07/01/2019 11:58, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> [snip]
>>>>>>
>>>>>>>> Note about future interaction with preemption: Preemption could 
>>>>>>>> happen
>>>>>>>> in a command sequence prior to watchdog counter getting disabled,
>>>>>>>> resulting in watchdog being triggered following preemption 
>>>>>>>> (e.g. when
>>>>>>>> watchdog had been enabled in the low priority batch). The 
>>>>>>>> driver will
>>>>>>>> need to explicitly disable the watchdog counter as part of the
>>>>>>>> preemption sequence.
>>>>>>>
>>>>>>> Does the series take care of preemption?
>>>>>>
>>>>>> I did not find that it does.
>>>>>
>>>>> Oh. I hoped that the watchdog was saved as part of the context... 
>>>>> Then
>>>>> despite preemption, the timeout would resume from where we left 
>>>>> off as
>>>>> soon as it was back on the gpu.
>>>>>
>>>>> If the timeout remaining was context saved it would be much 
>>>>> simpler (at
>>>>> least on first glance), please say it is.
>>>>
>>>> I made my comments going only by the text from the commit message 
>>>> and the absence of any preemption special handling.
>>>>
>>>> Having read the spec, the situation seems like this:
>>>>
>>>>   * Watchdog control and threshold register are context saved and 
>>>> restored.
>>>>
>>>>   * On a context switch watchdog counter is reset to zero and 
>>>> automatically disabled until enabled by a context restore or 
>>>> explicitly.
>>>>
>>>> So it sounds the commit message could be wrong that special 
>>>> handling is needed from this direction. But read till the end on 
>>>> the restriction listed.
>>>>
>>>>   * Watchdog counter is reset to zero and is not accumulated across 
>>>> multiple submission of the same context (due preemption).
>>>>
>>>> I read this as - after preemption contexts gets a new full timeout 
>>>> allocation. Or in other words, if a context is preempted N times, 
>>>> it's cumulative watchdog timeout will be N * set value.
>>>>
>>>> This could be theoretically exploitable to bypass the timeout. If a 
>>>> client sets up two contexts with prio -1 and -2, and keeps 
>>>> submitting periodical no-op batches against prio -1 context, while 
>>>> prio -2 is it's own hog, then prio -2 context defeats the watchdog 
>>>> timer. I think.. would appreciate is someone challenged this 
>>>> conclusion.
>>>
>>> I think you are right that is a possibility but, is that a problem? 
>>> The client can just not set the threshold to bypass the timeout. 
>>> Also because you need the hanging batch to be simply preemptible, 
>>> you cannot disrupt any work from another client that is higher 
>>> priority. This is 
>>
>> But I think higher priority client can have the same effect on the 
>> lower priority purely by accident, no?
>>
>> As a real world example, user kicks off an background transcoding 
>> job, which happens to use prio -2, and uses the watchdog timer.
>>
>> At the same time user watches a video from a player of normal 
>> priority. This causes periodic, say 24Hz, preemption events, due 
>> frame decoding activity on the same engine as the transcoding client.
>>
>> Does this defeat the watchdog timer for the former is the question? 
>> Then the questions of can we do something about it and whether it 
>> really isn't a problem?
>
> I guess it depends if you consider that timeout as the maximum 
> lifespan a workload can have or max contiguous active time.

I believe the intended purpose of the watchdog is to prevent broken 
bitstreams hanging the transcoder/player. That is, it is a form of error 
detection used by the media driver to handle bad user input. So if there 
is a way for the watchdog to be extended indefinitely under normal 
situations, that would be a problem. It means the transcoder will not 
detect the broken input data in a timely manner and effectively hang 
rather than skip over to the next packet. And note that broken input 
data can be caused by something as innocent as a dropped packet due to 
high network contention. No need for any malicious activity at all.

John.


* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-11 21:28                 ` John Harrison
@ 2019-01-16 16:15                   ` Tvrtko Ursulin
  2019-01-16 17:42                     ` Antonio Argenziano
  0 siblings, 1 reply; 48+ messages in thread
From: Tvrtko Ursulin @ 2019-01-16 16:15 UTC (permalink / raw)
  To: John Harrison, Antonio Argenziano, Chris Wilson, Carlos Santa, intel-gfx
  Cc: Michel Thierry


On 11/01/2019 21:28, John Harrison wrote:
> 
> On 1/11/2019 09:31, Antonio Argenziano wrote:
>>
>> On 11/01/19 00:22, Tvrtko Ursulin wrote:
>>>
>>> On 11/01/2019 00:47, Antonio Argenziano wrote:
>>>> On 07/01/19 08:58, Tvrtko Ursulin wrote:
>>>>> On 07/01/2019 13:57, Chris Wilson wrote:
>>>>>> Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
>>>>>>>
>>>>>>> On 07/01/2019 11:58, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> [snip]
>>>>>>>
>>>>>>>>> Note about future interaction with preemption: Preemption could 
>>>>>>>>> happen
>>>>>>>>> in a command sequence prior to watchdog counter getting disabled,
>>>>>>>>> resulting in watchdog being triggered following preemption 
>>>>>>>>> (e.g. when
>>>>>>>>> watchdog had been enabled in the low priority batch). The 
>>>>>>>>> driver will
>>>>>>>>> need to explicitly disable the watchdog counter as part of the
>>>>>>>>> preemption sequence.
>>>>>>>>
>>>>>>>> Does the series take care of preemption?
>>>>>>>
>>>>>>> I did not find that it does.
>>>>>>
>>>>>> Oh. I hoped that the watchdog was saved as part of the context... 
>>>>>> Then
>>>>>> despite preemption, the timeout would resume from where we left 
>>>>>> off as
>>>>>> soon as it was back on the gpu.
>>>>>>
>>>>>> If the timeout remaining was context saved it would be much 
>>>>>> simpler (at
>>>>>> least on first glance), please say it is.
>>>>>
>>>>> I made my comments going only by the text from the commit message 
>>>>> and the absence of any preemption special handling.
>>>>>
>>>>> Having read the spec, the situation seems like this:
>>>>>
>>>>>   * Watchdog control and threshold register are context saved and 
>>>>> restored.
>>>>>
>>>>>   * On a context switch watchdog counter is reset to zero and 
>>>>> automatically disabled until enabled by a context restore or 
>>>>> explicitly.
>>>>>
>>>>> So it sounds the commit message could be wrong that special 
>>>>> handling is needed from this direction. But read till the end on 
>>>>> the restriction listed.
>>>>>
>>>>>   * Watchdog counter is reset to zero and is not accumulated across 
>>>>> multiple submission of the same context (due preemption).
>>>>>
>>>>> I read this as - after preemption contexts gets a new full timeout 
>>>>> allocation. Or in other words, if a context is preempted N times, 
>>>>> it's cumulative watchdog timeout will be N * set value.
>>>>>
>>>>> This could be theoretically exploitable to bypass the timeout. If a 
>>>>> client sets up two contexts with prio -1 and -2, and keeps 
>>>>> submitting periodical no-op batches against prio -1 context, while 
>>>>> prio -2 is it's own hog, then prio -2 context defeats the watchdog 
>>>>> timer. I think.. would appreciate is someone challenged this 
>>>>> conclusion.
>>>>
>>>> I think you are right that is a possibility but, is that a problem? 
>>>> The client can just not set the threshold to bypass the timeout. 
>>>> Also because you need the hanging batch to be simply preemptible, 
>>>> you cannot disrupt any work from another client that is higher 
>>>> priority. This is 
>>>
>>> But I think higher priority client can have the same effect on the 
>>> lower priority purely by accident, no?
>>>
>>> As a real world example, user kicks off an background transcoding 
>>> job, which happens to use prio -2, and uses the watchdog timer.
>>>
>>> At the same time user watches a video from a player of normal 
>>> priority. This causes periodic, say 24Hz, preemption events, due 
>>> frame decoding activity on the same engine as the transcoding client.
>>>
>>> Does this defeat the watchdog timer for the former is the question? 
>>> Then the questions of can we do something about it and whether it 
>>> really isn't a problem?
>>
>> I guess it depends if you consider that timeout as the maximum 
>> lifespan a workload can have or max contiguous active time.
> 
> I believe the intended purpose of the watchdog is to prevent broken 
> bitstreams hanging the transcoder/player. That is, it is a form of error 
> detection used by the media driver to handle bad user input. So if there 
> is a way for the watchdog to be extended indefinitely under normal 
> situations, that would be a problem. It means the transcoder will not 
> detect the broken input data in a timely manner and effectively hang 
> rather than skip over to the next packet. And note that broken input 
> data can be caused by something as innocent as a dropped packet due to 
> high network contention. No need for any malicious activity at all.

My understanding of the intended purpose is the same. And it would be a 
very useful feature.

Chris mentioned the other day that, until the hardware is fixed to 
context save/restore the watchdog counter, this could simply be 
implemented using timers. And I have to say I agree. Shouldn't be too 
hard to prototype it using hrtimers: start on context-in, stop on 
context-out, and kick forward on user interrupts. More or less.

Then, if the cost of these hrtimer manipulations doesn't show up 
significantly in profiles, we would have a solution. At least in 
execlists mode. :) But in parallel we could file a feature request to 
fix the hardware implementation, and then just switch the timer 
"backend" from hrtimers to the GPU.

Regards,

Tvrtko

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-16 16:15                   ` Tvrtko Ursulin
@ 2019-01-16 17:42                     ` Antonio Argenziano
  2019-01-16 17:59                       ` Antonio Argenziano
  0 siblings, 1 reply; 48+ messages in thread
From: Antonio Argenziano @ 2019-01-16 17:42 UTC (permalink / raw)
  To: Tvrtko Ursulin, John Harrison, Chris Wilson, Carlos Santa, intel-gfx
  Cc: Michel Thierry



On 16/01/19 08:15, Tvrtko Ursulin wrote:
> 
> On 11/01/2019 21:28, John Harrison wrote:
>>
>> On 1/11/2019 09:31, Antonio Argenziano wrote:
>>>
>>> On 11/01/19 00:22, Tvrtko Ursulin wrote:
>>>>
>>>> On 11/01/2019 00:47, Antonio Argenziano wrote:
>>>>> On 07/01/19 08:58, Tvrtko Ursulin wrote:
>>>>>> On 07/01/2019 13:57, Chris Wilson wrote:
>>>>>>> Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
>>>>>>>>
>>>>>>>> On 07/01/2019 11:58, Tvrtko Ursulin wrote:
>>>>>>>>
>>>>>>>> [snip]
>>>>>>>>
>>>>>>>>>> Note about future interaction with preemption: Preemption 
>>>>>>>>>> could happen
>>>>>>>>>> in a command sequence prior to watchdog counter getting disabled,
>>>>>>>>>> resulting in watchdog being triggered following preemption 
>>>>>>>>>> (e.g. when
>>>>>>>>>> watchdog had been enabled in the low priority batch). The 
>>>>>>>>>> driver will
>>>>>>>>>> need to explicitly disable the watchdog counter as part of the
>>>>>>>>>> preemption sequence.
>>>>>>>>>
>>>>>>>>> Does the series take care of preemption?
>>>>>>>>
>>>>>>>> I did not find that it does.
>>>>>>>
>>>>>>> Oh. I hoped that the watchdog was saved as part of the context... 
>>>>>>> Then
>>>>>>> despite preemption, the timeout would resume from where we left 
>>>>>>> off as
>>>>>>> soon as it was back on the gpu.
>>>>>>>
>>>>>>> If the timeout remaining was context saved it would be much 
>>>>>>> simpler (at
>>>>>>> least on first glance), please say it is.
>>>>>>
>>>>>> I made my comments going only by the text from the commit message 
>>>>>> and the absence of any preemption special handling.
>>>>>>
>>>>>> Having read the spec, the situation seems like this:
>>>>>>
>>>>>>   * Watchdog control and threshold register are context saved and 
>>>>>> restored.
>>>>>>
>>>>>>   * On a context switch watchdog counter is reset to zero and 
>>>>>> automatically disabled until enabled by a context restore or 
>>>>>> explicitly.
>>>>>>
>>>>>> So it sounds the commit message could be wrong that special 
>>>>>> handling is needed from this direction. But read till the end on 
>>>>>> the restriction listed.
>>>>>>
>>>>>>   * Watchdog counter is reset to zero and is not accumulated 
>>>>>> across multiple submission of the same context (due preemption).
>>>>>>
>>>>>> I read this as - after preemption contexts gets a new full timeout 
>>>>>> allocation. Or in other words, if a context is preempted N times, 
>>>>>> it's cumulative watchdog timeout will be N * set value.
>>>>>>
>>>>>> This could be theoretically exploitable to bypass the timeout. If 
>>>>>> a client sets up two contexts with prio -1 and -2, and keeps 
>>>>>> submitting periodical no-op batches against prio -1 context, while 
>>>>>> prio -2 is it's own hog, then prio -2 context defeats the watchdog 
>>>>>> timer. I think.. would appreciate is someone challenged this 
>>>>>> conclusion.
>>>>>
>>>>> I think you are right that is a possibility but, is that a problem? 
>>>>> The client can just not set the threshold to bypass the timeout. 
>>>>> Also because you need the hanging batch to be simply preemptible, 
>>>>> you cannot disrupt any work from another client that is higher 
>>>>> priority. This is 
>>>>
>>>> But I think higher priority client can have the same effect on the 
>>>> lower priority purely by accident, no?
>>>>
>>>> As a real world example, user kicks off an background transcoding 
>>>> job, which happens to use prio -2, and uses the watchdog timer.
>>>>
>>>> At the same time user watches a video from a player of normal 
>>>> priority. This causes periodic, say 24Hz, preemption events, due 
>>>> frame decoding activity on the same engine as the transcoding client.
>>>>
>>>> Does this defeat the watchdog timer for the former is the question? 
>>>> Then the questions of can we do something about it and whether it 
>>>> really isn't a problem?
>>>
>>> I guess it depends if you consider that timeout as the maximum 
>>> lifespan a workload can have or max contiguous active time.
>>
>> I believe the intended purpose of the watchdog is to prevent broken 
>> bitstreams hanging the transcoder/player. That is, it is a form of 
>> error detection used by the media driver to handle bad user input. So 
>> if there is a way for the watchdog to be extended indefinitely under 
>> normal situations, that would be a problem. It means the transcoder 
>> will not detect the broken input data in a timely manner and 
>> effectively hang rather than skip over to the next packet. And note 
>> that broken input data can be caused by something as innocent as a 
>> dropped packet due to high network contention. No need for any 
>> malicious activity at all.
> 
> My understanding of the intended purpose is the same. And it would be a 
> very useful feature.

I'm not familiar enough with the application but, in the scenario above, 
what if the batch being preempted is not stuck, but is just "nice" 
enough to be preempted so many times that it wouldn't complete in the 
given wall clock time, even though it would be fast enough on its own?

> 
> Chris mentioned the other day that until hardware is fixed to context 
> save/restore the watchdog counter this could simply be implemented using 
> timers. And I have to say I agree. Shouldn't be too hard to prototype it 
> using  hrtimers - start on context in, stop on context out and kick 
> forward on user interrupts. More or less.

Would this implement the feature on the driver side just like the HW 
would? I mean, have the same IOCTL and silently discard workloads that 
hit the timeout. Also, would it discard batches while they are in the 
queue (not active)?

Antonio

> 
> Then if the cost of these hrtimer manipulations wouldn't show in 
> profiles significantly we would have a solution. At least in execlists 
> mode. :) But in parallel we could file a feature request to fix the 
> hardware implementation and then could just switch the timer "backend" 
> from hrtimers to GPU.
> 
> Regards,
> 
> Tvrtko

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-16 17:42                     ` Antonio Argenziano
@ 2019-01-16 17:59                       ` Antonio Argenziano
  0 siblings, 0 replies; 48+ messages in thread
From: Antonio Argenziano @ 2019-01-16 17:59 UTC (permalink / raw)
  To: Tvrtko Ursulin, John Harrison, Chris Wilson, Carlos Santa, intel-gfx
  Cc: Michel Thierry



On 16/01/19 09:42, Antonio Argenziano wrote:
> 
> 
> On 16/01/19 08:15, Tvrtko Ursulin wrote:
>>
>> On 11/01/2019 21:28, John Harrison wrote:
>>>
>>> On 1/11/2019 09:31, Antonio Argenziano wrote:
>>>>
>>>> On 11/01/19 00:22, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 11/01/2019 00:47, Antonio Argenziano wrote:
>>>>>> On 07/01/19 08:58, Tvrtko Ursulin wrote:
>>>>>>> On 07/01/2019 13:57, Chris Wilson wrote:
>>>>>>>> Quoting Tvrtko Ursulin (2019-01-07 13:43:29)
>>>>>>>>>
>>>>>>>>> On 07/01/2019 11:58, Tvrtko Ursulin wrote:
>>>>>>>>>
>>>>>>>>> [snip]
>>>>>>>>>
>>>>>>>>>>> Note about future interaction with preemption: Preemption 
>>>>>>>>>>> could happen
>>>>>>>>>>> in a command sequence prior to watchdog counter getting 
>>>>>>>>>>> disabled,
>>>>>>>>>>> resulting in watchdog being triggered following preemption 
>>>>>>>>>>> (e.g. when
>>>>>>>>>>> watchdog had been enabled in the low priority batch). The 
>>>>>>>>>>> driver will
>>>>>>>>>>> need to explicitly disable the watchdog counter as part of the
>>>>>>>>>>> preemption sequence.
>>>>>>>>>>
>>>>>>>>>> Does the series take care of preemption?
>>>>>>>>>
>>>>>>>>> I did not find that it does.
>>>>>>>>
>>>>>>>> Oh. I hoped that the watchdog was saved as part of the 
>>>>>>>> context... Then
>>>>>>>> despite preemption, the timeout would resume from where we left 
>>>>>>>> off as
>>>>>>>> soon as it was back on the gpu.
>>>>>>>>
>>>>>>>> If the timeout remaining was context saved it would be much 
>>>>>>>> simpler (at
>>>>>>>> least on first glance), please say it is.
>>>>>>>
>>>>>>> I made my comments going only by the text from the commit message 
>>>>>>> and the absence of any preemption special handling.
>>>>>>>
>>>>>>> Having read the spec, the situation seems like this:
>>>>>>>
>>>>>>>   * Watchdog control and threshold register are context saved and 
>>>>>>> restored.
>>>>>>>
>>>>>>>   * On a context switch watchdog counter is reset to zero and 
>>>>>>> automatically disabled until enabled by a context restore or 
>>>>>>> explicitly.
>>>>>>>
>>>>>>> So it sounds the commit message could be wrong that special 
>>>>>>> handling is needed from this direction. But read till the end on 
>>>>>>> the restriction listed.
>>>>>>>
>>>>>>>   * Watchdog counter is reset to zero and is not accumulated 
>>>>>>> across multiple submission of the same context (due preemption).
>>>>>>>
>>>>>>> I read this as - after preemption contexts gets a new full 
>>>>>>> timeout allocation. Or in other words, if a context is preempted 
>>>>>>> N times, it's cumulative watchdog timeout will be N * set value.
>>>>>>>
>>>>>>> This could theoretically be exploited to bypass the timeout. If 
>>>>>>> a client sets up two contexts with prio -1 and -2, and keeps 
>>>>>>> submitting periodic no-op batches against the prio -1 context, 
>>>>>>> while the prio -2 context is its own hog, then the prio -2 
>>>>>>> context defeats the watchdog timer. I think... I would appreciate 
>>>>>>> it if someone challenged this conclusion.
>>>>>>
>>>>>> I think you are right that it is a possibility, but is that a 
>>>>>> problem? The client can just not set the threshold to bypass the 
>>>>>> timeout. Also, because you need the hanging batch to be simply 
>>>>>> preemptible, you cannot disrupt any work from another client that 
>>>>>> is higher priority. This is 
>>>>>
>>>>> But I think a higher priority client can have the same effect on a 
>>>>> lower priority one purely by accident, no?
>>>>>
>>>>> As a real world example, a user kicks off a background transcoding 
>>>>> job, which happens to use prio -2 and uses the watchdog timer.
>>>>>
>>>>> At the same time the user watches a video in a player of normal 
>>>>> priority. This causes periodic, say 24Hz, preemption events, due to 
>>>>> frame decoding activity on the same engine as the transcoding client.
>>>>>
>>>>> The question is whether this defeats the watchdog timer for the 
>>>>> former. Then there are the questions of whether we can do something 
>>>>> about it and whether it really isn't a problem.
>>>>
>>>> I guess it depends on whether you consider that timeout the maximum 
>>>> lifespan a workload can have or the maximum contiguous active time.
>>>
>>> I believe the intended purpose of the watchdog is to prevent broken 
>>> bitstreams hanging the transcoder/player. That is, it is a form of 
>>> error detection used by the media driver to handle bad user input. So 
>>> if there is a way for the watchdog to be extended indefinitely under 
>>> normal situations, that would be a problem. It means the transcoder 
>>> will not detect the broken input data in a timely manner and 
>>> effectively hang rather than skip over to the next packet. And note 
>>> that broken input data can be caused by something as innocent as a 
>>> dropped packet due to high network contention. No need for any 
>>> malicious activity at all.
>>
>> My understanding of the intended purpose is the same. And it would be 
>> a very useful feature.
> 
> I'm not familiar enough with the application but, in the scenario above, 
> what if the batch being preempted is not stuck, just "nice" enough to be 
> preempted enough times that it wouldn't complete in the given wall clock 
> time, even though it would be fast enough by itself?

Ignore me; re-reading this, I now see you are advocating for an 
active-time timeout, not pure wall clock time.

> 
>>
>> Chris mentioned the other day that until the hardware is fixed to 
>> context save/restore the watchdog counter, this could simply be 
>> implemented using timers. And I have to say I agree. It shouldn't be 
>> too hard to prototype using hrtimers: start one on context-in, stop it 
>> on context-out and kick it forward on user interrupts. More or less.
> 
> Would this implement the feature on the driver side just like it would 
> for the HW? I mean, have the same IOCTL and silently discard workloads 
> that hit the timeout. Also, would it discard batches while they are 
> still in the queue (not active)?
> 
> Antonio
> 
>>
>> Then if the cost of these hrtimer manipulations wouldn't show in 
>> profiles significantly we would have a solution. At least in execlists 
>> mode. :) But in parallel we could file a feature request to fix the 
>> hardware implementation and then could just switch the timer "backend" 
>> from hrtimers to GPU.
>>
>> Regards,
>>
>> Tvrtko
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: drm/i915: Watchdog timeout: IRQ handler for gen8+
  2019-01-07 11:58   ` Tvrtko Ursulin
  2019-01-07 12:16     ` Chris Wilson
  2019-01-07 13:43     ` Tvrtko Ursulin
@ 2019-01-24  0:13     ` Carlos Santa
  2 siblings, 0 replies; 48+ messages in thread
From: Carlos Santa @ 2019-01-24  0:13 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: Michel Thierry

On Mon, 2019-01-07 at 11:58 +0000, Tvrtko Ursulin wrote:

[snip]

> > 
> >   
> >   static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> > @@ -3329,7 +3332,7 @@ void i915_handle_error(struct
> > drm_i915_private *dev_priv,
> >   	if (intel_has_reset_engine(dev_priv) &&
> >   	    !i915_terminally_wedged(&dev_priv->gpu_error)) {
> >   		for_each_engine_masked(engine, dev_priv, engine_mask,
> > tmp) {
> > -			BUILD_BUG_ON(I915_RESET_MODESET >=
> > I915_RESET_ENGINE);
> > +			BUILD_BUG_ON(I915_RESET_WATCHDOG >=
> > I915_RESET_ENGINE);
> >   			if (test_and_set_bit(I915_RESET_ENGINE +
> > engine->id,
> >   					     &dev_priv-
> > >gpu_error.flags))
> >   				continue;
> > @@ -4162,12 +4165,15 @@ static void gen8_gt_irq_postinstall(struct
> > drm_i915_private *dev_priv)
> >   	uint32_t gt_interrupts[] = {
> >   		GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> >   			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_RCS_IRQ_SHIFT |
> > +			GT_GEN8_WATCHDOG_INTERRUPT <<
> > GEN8_RCS_IRQ_SHIFT |
> >   			GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT
> > |
> >   			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_BCS_IRQ_SHIFT,
> >   		GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> >   			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_VCS1_IRQ_SHIFT |
> > +			GT_GEN8_WATCHDOG_INTERRUPT <<
> > GEN8_VCS1_IRQ_SHIFT |
> >   			GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT
> > |
> > -			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_VCS2_IRQ_SHIFT,
> > +			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_VCS2_IRQ_SHIFT |
> > +			GT_GEN8_WATCHDOG_INTERRUPT <<
> > GEN8_VCS2_IRQ_SHIFT,
> >   		0,
> >   		GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
> >   			GT_CONTEXT_SWITCH_INTERRUPT <<
> > GEN8_VECS_IRQ_SHIFT
> > @@ -4176,6 +4182,10 @@ static void gen8_gt_irq_postinstall(struct
> > drm_i915_private *dev_priv)
> >   	if (HAS_L3_DPF(dev_priv))
> >   		gt_interrupts[0] |=
> > GT_RENDER_L3_PARITY_ERROR_INTERRUPT;
> >   
> > +	/* VECS watchdog is only available in skl+ */
> > +	if (INTEL_GEN(dev_priv) >= 9)
> > +		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
> 
> Is the shift missing here?
> 

No, the above addresses the interrupts for the VECS watchdog only, and
the correct shift is applied in element 3 of the array.

Regards,
Carlos




^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: drm/i915: Only process VCS2 only when supported
  2019-01-07 12:40   ` Tvrtko Ursulin
@ 2019-01-24  0:20     ` Carlos Santa
  0 siblings, 0 replies; 48+ messages in thread
From: Carlos Santa @ 2019-01-24  0:20 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx; +Cc: Michel Thierry

On Mon, 2019-01-07 at 12:40 +0000, Tvrtko Ursulin wrote:
> On 05/01/2019 02:39, Carlos Santa wrote:
> > Not checking for BSD2 causes a segfault on GPU revs
> > with no h/w support for the extra media engines.
> > 
> > Segfault on ULX GT2 (0x591e) follows:
> > 
> > Patch shared by Michel Thierry on IRC.
> > 
> > [  468.625970] BUG: unable to handle kernel NULL pointer
> > dereference at 00000000000002c0
> > [  468.625978] IP: gen8_cs_irq_handler+0x8d/0xcf
> > [  468.625979] PGD 0 P4D 0
> > [  468.625983] Oops: 0002 [#1] PREEMPT SMP PTI
> > [  468.625987] Dumping ftrace buffer:
> > [  468.625990]    (ftrace buffer empty)
> > [  468.627877] gsmi: Log Shutdown Reason 0x03
> > [  468.627880] Modules linked in: cmac rfcomm uinput joydev
> > snd_soc_hdac_hdmi snd_soc_dmic snd_soc_skl snd_soc_skl_ipc lzo
> > lzo_compress snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_acpi
> > snd_hda_ext_core snd_hda_core zram snd_soc_max98927 ipt_MASQUERADE
> > nf_nat_masquerade_ipv4 hid_multitouch xt_mark fuse cdc_ether usbnet
> > btusb btrtl btbcm btintel uvcvideo bluetooth videobuf2_vmalloc
> > videobuf2_memops videobuf2_v4l2 videobuf2_core ecdh_generic
> > iio_trig_sysfs cros_ec_light_prox cros_ec_sensors_ring
> > cros_ec_sensors cros_ec_activity cros_ec_sensors_core
> > industrialio_triggered_buffer kfifo_buf industrialio iwlmvm
> > iwl7000_mac80211 r8152 mii iwlwifi cfg80211
> > [  468.627917] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.64
> > #38
> > [  468.627919] Hardware name: Google Eve/Eve, BIOS
> > Google_Eve.9584.107.0 11/07/2017
> > [  468.627922] task: ffff96bf35a13c00 task.stack: ffffa79880070000
> > [  468.627925] RIP: 0010:gen8_cs_irq_handler+0x8d/0xcf
> > [  468.627928] RSP: 0018:ffff96bf3ec83df8 EFLAGS: 00010002
> > [  468.627931] RAX: 0000000001000000 RBX: 0000000000000000 RCX:
> > 0000000000000010
> > [  468.627933] RDX: 0000000000000010 RSI: 0000000000000040 RDI:
> > 0000000000000000
> > [  468.627936] RBP: ffff96bf3ec83e20 R08: 0000000000000000 R09:
> > 0000000000000000
> > [  468.627938] R10: ffffa79880073dc8 R11: ffffffffa2a6453d R12:
> > ffff96bf3ec83e70
> > [  468.627940] R13: 0000000000000079 R14: 0000000000000000 R15:
> > 0000000000000040
> > [  468.627943] FS:  0000000000000000(0000)
> > GS:ffff96bf3ec80000(0000) knlGS:0000000000000000
> > [  468.627945] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  468.627948] CR2: 00000000000002c0 CR3: 000000016fe12002 CR4:
> > 00000000003606e0
> > [  468.627950] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > [  468.627953] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > 0000000000000400
> > [  468.627955] Call Trace:
> > [  468.627959]  <IRQ>
> > [  468.627963]  gen8_gt_irq_handler+0x5e/0xed
> > [  468.627968]  gen8_irq_handler+0x9f/0x5ce
> > [  468.627973]  __handle_irq_event_percpu+0xb8/0x1da
> > [  468.627977]  handle_irq_event_percpu+0x32/0x77
> > [  468.627980]  handle_irq_event+0x36/0x55
> > [  468.627984]  handle_edge_irq+0x7d/0xcd
> > [  468.627988]  handle_irq+0xd9/0x11e
> > [  468.627992]  do_IRQ+0x4b/0xd0
> > [  468.627996]  common_interrupt+0x7a/0x7a
> > [  468.627999]  </IRQ>
> > [  468.628003] RIP: 0010:cpuidle_enter_state+0xff/0x177
> > [  468.628005] RSP: 0018:ffffa79880073e78 EFLAGS: 00000206
> > ORIG_RAX: ffffffffffffff5e
> > [  468.628008] RAX: ffff96bf3eca09c0 RBX: 000000000002b9c3 RCX:
> > 0000006d1c48c3b5
> > [  468.628010] RDX: 0000000000000037 RSI: 0000000000000001 RDI:
> > 0000000000000000
> > [  468.628013] RBP: ffffa79880073ec0 R08: 0000000000000002 R09:
> > 0000000000005000
> > [  468.628015] R10: 071c71c71c71c71c R11: ffffffffa2e42687 R12:
> > 0000000000000000
> > [  468.628017] R13: 0000000000000002 R14: 0000000000000002 R15:
> > ffff96bf3eca7d00
> > [  468.628020]  ? cpu_idle_poll+0x8e/0x8e
> > [  468.628025]  ? cpuidle_enter_state+0xe3/0x177
> > [  468.628028]  do_idle+0x10c/0x19d
> > [  468.628033]  cpu_startup_entry+0x6d/0x6f
> > [  468.628036]  start_secondary+0x189/0x1a4
> > [  468.628041]  secondary_startup_64+0xa5/0xb0
> > [  468.628044] Code: c3 84 db 74 20 f0 41 0f ba ae 98 02 00 00 00
> > 0f 92 45 df 80 7d df 00 75 0c 49 8d be 90 02 00 00 e8 3a 7a c6 ff
> > 41 f6 c7 40 74 23 <f0> 41 0f ba ae c0 02 00 00 00 0f 92 45 de 80 7d
> > de 00 75 0f 49
> > [  468.628083] RIP: gen8_cs_irq_handler+0x8d/0xcf RSP:
> > ffff96bf3ec83df8
> > [  468.628085] CR2: 00000000000002c0
> > [  468.628088] ---[ end trace a7a497ddeb44bcf8 ]---
> > 
> > Tested-by: Carlos Santa <carlos.santa@intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> > Cc: Michel Thierry <michel.thierry@intel.com>
> > Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_irq.c | 3 ++-
> >   1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_irq.c
> > b/drivers/gpu/drm/i915/i915_irq.c
> > index 859bbadb752f..953ebe5c85ce 100644
> > --- a/drivers/gpu/drm/i915/i915_irq.c
> > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > @@ -1554,7 +1554,8 @@ static void gen8_gt_irq_handler(struct
> > drm_i915_private *i915,
> >   	if (master_ctl & (GEN8_GT_VCS1_IRQ | GEN8_GT_VCS2_IRQ)) {
> >   		gen8_cs_irq_handler(i915->engine[VCS],
> >   				    gt_iir[1] >> GEN8_VCS1_IRQ_SHIFT);
> > -		gen8_cs_irq_handler(i915->engine[VCS2],
> > +		if (HAS_BSD2(i915))
> > +			gen8_cs_irq_handler(i915->engine[VCS2],
> >   				    gt_iir[1] >> GEN8_VCS2_IRQ_SHIFT);
> >   	}
> >   
> > 
> 
> Why would we be getting interrupts from engines which are not there?
> 
> Is this just a consequence of the missing shift when programming 
> interrupts in an earlier patch?
> 
> Regards,
> 
> Tvrtko
> 

You were right, the correct shift was missing and that was causing this
bug. However, it only affected Chromium OS, as the i915 driver code base
in that kernel is behind drm-tip. Adding "<< test_shift" in
gen8_cs_irq_handler() fixes the crash in the older kernel:

if (iir & (GT_GEN8_WATCHDOG_INTERRUPT << test_shift))

Regards,
Carlos



^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2019-01-24  0:20 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-05  2:39 Gen8+ engine-reset Carlos Santa
2019-01-05  2:39 ` drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
2019-01-05  2:39 ` drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
2019-01-07 11:58   ` Tvrtko Ursulin
2019-01-07 12:16     ` Chris Wilson
2019-01-07 12:58       ` Tvrtko Ursulin
2019-01-07 13:02         ` Chris Wilson
2019-01-07 13:12           ` Tvrtko Ursulin
2019-01-07 13:43     ` Tvrtko Ursulin
2019-01-07 13:57       ` Chris Wilson
2019-01-07 16:58         ` Tvrtko Ursulin
2019-01-07 18:31           ` Chris Wilson
2019-01-11  0:47           ` Antonio Argenziano
2019-01-11  8:22             ` Tvrtko Ursulin
2019-01-11 17:31               ` Antonio Argenziano
2019-01-11 21:28                 ` John Harrison
2019-01-16 16:15                   ` Tvrtko Ursulin
2019-01-16 17:42                     ` Antonio Argenziano
2019-01-16 17:59                       ` Antonio Argenziano
2019-01-11  2:58           ` Carlos Santa
2019-01-24  0:13     ` Carlos Santa
2019-01-05  2:39 ` drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
2019-01-07 12:21   ` Tvrtko Ursulin
2019-01-05  2:39 ` drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Carlos Santa
2019-01-07 12:38   ` Tvrtko Ursulin
2019-01-07 12:50     ` Chris Wilson
2019-01-07 13:39       ` Tvrtko Ursulin
2019-01-07 13:51         ` Chris Wilson
2019-01-07 17:00     ` Tvrtko Ursulin
2019-01-07 17:20       ` Tvrtko Ursulin
2019-01-05  2:39 ` drm/i915: Watchdog timeout: Include threshold value in error state Carlos Santa
2019-01-05  4:19   ` kbuild test robot
2019-01-05  4:39   ` kbuild test robot
2019-01-05  2:39 ` drm/i915: Only process VCS2 only when supported Carlos Santa
2019-01-07 12:40   ` Tvrtko Ursulin
2019-01-24  0:20     ` Carlos Santa
2019-01-05  2:40 ` drm/i915/watchdog: move emit_stop_watchdog until the very end of the ring commands Carlos Santa
2019-01-07 12:50   ` Tvrtko Ursulin
2019-01-07 12:54     ` Chris Wilson
2019-01-07 13:01       ` Tvrtko Ursulin
2019-01-11  2:25     ` Carlos Santa
2019-01-05  2:40 ` drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset? Carlos Santa
2019-01-05  4:15   ` kbuild test robot
2019-01-05 13:32   ` kbuild test robot
2019-01-05  2:57 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
2019-01-05  3:21 ` ✓ Fi.CI.BAT: success " Patchwork
2019-01-05  4:41 ` ✓ Fi.CI.IGT: " Patchwork
2019-01-07 10:11 ` Gen8+ engine-reset Tvrtko Ursulin
