All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Carlos Santa <carlos.santa@intel.com>, intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH v5 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
Date: Mon, 25 Mar 2019 10:00:55 +0000	[thread overview]
Message-ID: <2543cdeb-beef-237b-34b5-73296ca41c46@linux.intel.com> (raw)
In-Reply-To: <20190322234118.65980-3-carlos.santa@intel.com>


On 22/03/2019 23:41, Carlos Santa wrote:
> From: Michel Thierry <michel.thierry@intel.com>
> 
> *** General ***
> 
> Watchdog timeout (or "media engine reset") is a feature that allows
> userland applications to enable hang detection on individual batch buffers.
> The detection mechanism itself is mostly bound to the hardware and the only
> thing that the driver needs to do to support this form of hang detection
> is to implement the interrupt handling support as well as watchdog command
> emission before and after the emitted batch buffer start instruction in the
> ring buffer.
> 
> The principle of the hang detection mechanism is as follows:
> 
> 1. Once the decision has been made to enable watchdog timeout for a
> particular batch buffer and the driver is in the process of emitting the
> batch buffer start instruction into the ring buffer it also emits a
> watchdog timer start instruction before and a watchdog timer cancellation
> instruction after the batch buffer start instruction in the ring buffer.
> 
> 2. Once the GPU execution reaches the watchdog timer start instruction
> the hardware watchdog counter is started by the hardware. The counter
> keeps counting until either reaching a previously configured threshold
> value or the timer cancellation instruction is executed.
> 
> 2a. If the counter reaches the threshold value the hardware fires a
> watchdog interrupt that is picked up by the watchdog interrupt handler.
> This means that a hang has been detected and the driver needs to deal with
> it the same way it would deal with a engine hang detected by the periodic
> hang checker. The only difference between the two is that we already blamed
> the active request (to ensure an engine reset).
> 
> 2b. If the batch buffer completes and the execution reaches the watchdog
> cancellation instruction before the watchdog counter reaches its
> threshold value the watchdog is cancelled and nothing more comes of it.
> No hang is detected.
> 
> Note about future interaction with preemption: Preemption could happen
> in a command sequence prior to watchdog counter getting disabled,
> resulting in watchdog being triggered following preemption (e.g. when
> watchdog had been enabled in the low priority batch). The driver will
> need to explicitly disable the watchdog counter as part of the
> preemption sequence.
> 
> *** This patch introduces: ***
> 
> 1. IRQ handler code for watchdog timeout allowing direct hang recovery
> based on hardware-driven hang detection, which then integrates directly
> with the hang recovery path. This is independent of having per-engine reset
> or just full gpu reset.
> 
> 2. Watchdog specific register information.
> 
> Currently the render engine and all available media engines support
> watchdog timeout (VECS is only supported in GEN9). The specifications elude
> to the BCS engine being supported but that is currently not supported by
> this commit.
> 
> Note that the value to stop the counter is different between render and
> non-render engines in GEN8; GEN9 onwards it's the same.
> 
> v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
> against false-positives.
> 
> v3: Don't use high priority tasklet, use engine_last_submit while
> checking for false-positives. From GEN9 onwards, the stop counter bit is
> the same for all engines.
> 
> v4: Remove unnecessary brackets, use current_seqno to mark the request
> as guilty in the hangcheck/capture code.
> 
> v5: Rebased after RESET_ENGINEs flag.
> 
> v6: Don't capture error state in case of watchdog timeout. The capture
> process is time consuming and this will align to what happens when we
> use GuC to handle the watchdog timeout. (Chris)
> 
> v7: Rebase.
> 
> v8: Rebase, use HZ to reschedule.
> 
> v9: Rebase, get forcewake domains in function (no longer in execlists
> struct).
> 
> v10: Rebase.
> 
> v11: Rebase,
>       remove extra braces (Tvrtko),
>       implement watchdog_to_clock_counts helper (Tvrtko),
>       Move tasklet_kill(watchdog_tasklet) inside intel_engines (Tvrtko),
>       Use a global heartbeat seqno instead of engine seqno (Chris)
>       Make all engines checks all class based checks (Tvrtko)
> 
> v12: Rebase,
>       Reset immediately upon entering the IRQ (Chris)
>       Make reset_engine_to_str a helper (Tvrtko)
>       Rename watchdog_irq_handler as watchdog_tasklet (Tvrtko)
>       Let the compiler itself do the inline (Tvrtko)
> 
> v13: Rebase
> v14: Rebase, skip checking for the guilty seqno in the tasklet (Tvrtko)

IIRC I only asked about it so I guess you tried it and it works well?

Can you also post the IGTs so we can see the test coverage?

Regards,

Tvrtko

> 
> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>
> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gpu_error.h     |  4 ++
>   drivers/gpu/drm/i915/i915_irq.c           | 14 ++++--
>   drivers/gpu/drm/i915/i915_reg.h           |  6 +++
>   drivers/gpu/drm/i915/i915_reset.c         | 20 +++++++++
>   drivers/gpu/drm/i915/i915_reset.h         |  6 +++
>   drivers/gpu/drm/i915/intel_engine_cs.c    |  1 +
>   drivers/gpu/drm/i915/intel_engine_types.h |  5 +++
>   drivers/gpu/drm/i915/intel_hangcheck.c    | 11 +----
>   drivers/gpu/drm/i915/intel_lrc.c          | 52 +++++++++++++++++++++++
>   9 files changed, 107 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
> index 99d6b7b270c2..6cf6a8679b26 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> @@ -203,6 +203,9 @@ struct i915_gpu_error {
>   	 * any global resources that may be clobber by the reset (such as
>   	 * FENCE registers).
>   	 *
> +	 * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use
> +	 * I915_RESET_WATCHDOG to report the hang detection cause accurately.
> +	 *
>   	 * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't need to
>   	 * acquire the struct_mutex to reset an engine, we need an explicit
>   	 * flag to prevent two concurrent reset attempts in the same engine.
> @@ -218,6 +221,7 @@ struct i915_gpu_error {
>   #define I915_RESET_BACKOFF	0
>   #define I915_RESET_MODESET	1
>   #define I915_RESET_ENGINE	2
> +#define I915_RESET_WATCHDOG	3
>   #define I915_WEDGED		(BITS_PER_LONG - 1)
>   
>   	/** Number of times the device has been reset (global) */
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index 82d487189a34..e64994be25c3 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1466,6 +1466,9 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir)
>   
>   	if (tasklet)
>   		tasklet_hi_schedule(&engine->execlists.tasklet);
> +
> +	if (iir & GT_GEN8_WATCHDOG_INTERRUPT)
> +		tasklet_schedule(&engine->execlists.watchdog_tasklet);
>   }
>   
>   static void gen8_gt_irq_ack(struct drm_i915_private *i915,
> @@ -3892,20 +3895,25 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv)
>   	u32 gt_interrupts[] = {
>   		(GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
>   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
> +		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT |
>   		 GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT |
>   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT),
> -
>   		(GT_RENDER_USER_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
>   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
> +		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS0_IRQ_SHIFT |
>   		 GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> -		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT),
> -
> +		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT |
> +		 GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT),
>   		0,
>   
>   		(GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT |
>   		 GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT)
>   	};
>   
> +	/* VECS watchdog is only available in skl+ */
> +	if (INTEL_GEN(dev_priv) >= 9)
> +		gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT;
> +
>   	dev_priv->pm_ier = 0x0;
>   	dev_priv->pm_imr = ~dev_priv->pm_ier;
>   	GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]);
> diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
> index 9b69cec21f7b..ac8d984e16ae 100644
> --- a/drivers/gpu/drm/i915/i915_reg.h
> +++ b/drivers/gpu/drm/i915/i915_reg.h
> @@ -2363,6 +2363,11 @@ enum i915_power_well_id {
>   #define RING_START(base)	_MMIO((base) + 0x38)
>   #define RING_CTL(base)		_MMIO((base) + 0x3c)
>   #define   RING_CTL_SIZE(size)	((size) - PAGE_SIZE) /* in bytes -> pages */
> +#define RING_CNTR(base)		_MMIO((base) + 0x178)
> +#define GEN8_WATCHDOG_ENABLE		0
> +#define GEN8_WATCHDOG_DISABLE		1
> +#define GEN8_XCS_WATCHDOG_DISABLE	0xFFFFFFFF /* GEN8 & non-render only */
> +#define RING_THRESH(base)	_MMIO((base) + 0x17C)
>   #define RING_SYNC_0(base)	_MMIO((base) + 0x40)
>   #define RING_SYNC_1(base)	_MMIO((base) + 0x44)
>   #define RING_SYNC_2(base)	_MMIO((base) + 0x48)
> @@ -2925,6 +2930,7 @@ enum i915_power_well_id {
>   #define GT_BSD_USER_INTERRUPT			(1 << 12)
>   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1	(1 << 11) /* hsw+; rsvd on snb, ivb, vlv */
>   #define GT_CONTEXT_SWITCH_INTERRUPT		(1 <<  8)
> +#define GT_GEN8_WATCHDOG_INTERRUPT		(1 <<  6) /* gen8+ */
>   #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT	(1 <<  5) /* !snb */
>   #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT	(1 <<  4)
>   #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT	(1 <<  3)
> diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
> index 861fe083e383..739fa5ad1a8d 100644
> --- a/drivers/gpu/drm/i915/i915_reset.c
> +++ b/drivers/gpu/drm/i915/i915_reset.c
> @@ -1208,6 +1208,26 @@ void i915_clear_error_registers(struct drm_i915_private *dev_priv)
>   	}
>   }
>   
> +void engine_reset_error_to_str(struct drm_i915_private *i915,
> +	           char *msg,
> +	           size_t sz,
> +	           unsigned int hung,
> +	           unsigned int stuck,
> +	           unsigned int watchdog)
> +{
> +	int len;
> +	unsigned int tmp;
> +	struct intel_engine_cs *engine;
> +
> +	len = scnprintf(msg, sz,
> +			"%s on ", watchdog ? "watchdog timeout" :
> +				stuck == hung ? "no_progress" : "hang");
> +	for_each_engine_masked(engine, i915, hung, tmp)
> +		len += scnprintf(msg + len, sz - len,
> +				"%s, ", engine->name);
> +	msg[len-2] = '\0';
> +}
> +
>   /**
>    * i915_handle_error - handle a gpu error
>    * @i915: i915 device private
> diff --git a/drivers/gpu/drm/i915/i915_reset.h b/drivers/gpu/drm/i915/i915_reset.h
> index 16f2389f656f..8582d1242248 100644
> --- a/drivers/gpu/drm/i915/i915_reset.h
> +++ b/drivers/gpu/drm/i915/i915_reset.h
> @@ -20,6 +20,12 @@ void i915_handle_error(struct drm_i915_private *i915,
>   		       u32 engine_mask,
>   		       unsigned long flags,
>   		       const char *fmt, ...);
> +void engine_reset_error_to_str(struct drm_i915_private *i915,
> +               char *str,
> +               size_t sz,
> +               unsigned int hung,
> +               unsigned int stuck,
> +               unsigned int watchdog);
>   #define I915_ERROR_CAPTURE BIT(0)
>   
>   void i915_clear_error_registers(struct drm_i915_private *i915);
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index 652c1b3ba190..88cf0fc07623 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -1149,6 +1149,7 @@ void intel_engines_park(struct drm_i915_private *i915)
>   		/* Flush the residual irq tasklets first. */
>   		intel_engine_disarm_breadcrumbs(engine);
>   		tasklet_kill(&engine->execlists.tasklet);
> +		tasklet_kill(&engine->execlists.watchdog_tasklet);
>   
>   		/*
>   		 * We are committed now to parking the engines, make sure there
> diff --git a/drivers/gpu/drm/i915/intel_engine_types.h b/drivers/gpu/drm/i915/intel_engine_types.h
> index b0aa1f0d4e47..c4f66b774e7c 100644
> --- a/drivers/gpu/drm/i915/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/intel_engine_types.h
> @@ -124,6 +124,11 @@ struct intel_engine_execlists {
>   	 */
>   	struct tasklet_struct tasklet;
>   
> +	/*
> +	 * @watchdog_tasklet: stop counter and reschedule hangcheck_work asap
> +	 */
> +	struct tasklet_struct watchdog_tasklet;
> +
>   	/**
>   	 * @default_priolist: priority list for I915_PRIORITY_NORMAL
>   	 */
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index 57ed49dc19c4..4bf26863678c 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -220,22 +220,15 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915,
>   				   unsigned int hung,
>   				   unsigned int stuck)
>   {
> -	struct intel_engine_cs *engine;
>   	char msg[80];
> -	unsigned int tmp;
> -	int len;
> +	size_t len = sizeof(msg);
>   
>   	/* If some rings hung but others were still busy, only
>   	 * blame the hanging rings in the synopsis.
>   	 */
>   	if (stuck != hung)
>   		hung &= ~stuck;
> -	len = scnprintf(msg, sizeof(msg),
> -			"%s on ", stuck == hung ? "no progress" : "hang");
> -	for_each_engine_masked(engine, i915, hung, tmp)
> -		len += scnprintf(msg + len, sizeof(msg) - len,
> -				 "%s, ", engine->name);
> -	msg[len-2] = '\0';
> +	engine_reset_error_to_str(i915, msg, len, hung, stuck, 0);
>   
>   	return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg);
>   }
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index e54e0064b2d6..85785a94f6ae 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -2195,6 +2195,40 @@ static int gen8_emit_flush_render(struct i915_request *request,
>   	return 0;
>   }
>   
> +static void gen8_watchdog_tasklet(unsigned long data)
> +{
> +	struct intel_engine_cs *engine = (struct intel_engine_cs *)data;
> +	struct drm_i915_private *dev_priv = engine->i915;
> +	enum forcewake_domains fw_domains;
> +	char msg[80];
> +	size_t len = sizeof(msg);
> +	unsigned long *lock = &engine->i915->gpu_error.flags;
> +	unsigned int bit = I915_RESET_ENGINE + engine->id;
> +
> +	switch (engine->class) {
> +	default:
> +		MISSING_CASE(engine->id);
> +		/* fall through */
> +	case RENDER_CLASS:
> +		fw_domains = FORCEWAKE_RENDER;
> +		break;
> +	case VIDEO_DECODE_CLASS:
> +	case VIDEO_ENHANCEMENT_CLASS:
> +		fw_domains = FORCEWAKE_MEDIA;
> +		break;
> +	}
> +
> +	intel_uncore_forcewake_get(dev_priv, fw_domains);
> +
> +	if (!test_and_set_bit(bit, lock)) {
> +		unsigned int hung = engine->mask;
> +		engine_reset_error_to_str(dev_priv, msg, len, hung, 0, 1);
> +		i915_reset_engine(engine, msg);
> +		clear_bit(bit, lock);
> +		wake_up_bit(lock, bit);
> +	}
> +}
> +
>   /*
>    * Reserve space for 2 NOOPs at the end of each request to be
>    * used as a workaround for not being allowed to do lite
> @@ -2377,6 +2411,21 @@ logical_ring_default_irqs(struct intel_engine_cs *engine)
>   
>   	engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift;
>   	engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift;
> +
> +	switch (engine->class) {
> +	default:
> +		/* BCS engine does not support hw watchdog */
> +		break;
> +	case RENDER_CLASS:
> +	case VIDEO_DECODE_CLASS:
> +		engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT << shift;
> +		break;
> +	case VIDEO_ENHANCEMENT_CLASS:
> +		if (INTEL_GEN(engine->i915) >= 9)
> +			engine->irq_keep_mask |=
> +				GT_GEN8_WATCHDOG_INTERRUPT << shift;
> +		break;
> +	}
>   }
>   
>   static int
> @@ -2394,6 +2443,9 @@ logical_ring_setup(struct intel_engine_cs *engine)
>   	tasklet_init(&engine->execlists.tasklet,
>   		     execlists_submission_tasklet, (unsigned long)engine);
>   
> +	tasklet_init(&engine->execlists.watchdog_tasklet,
> +		     gen8_watchdog_tasklet, (unsigned long)engine);
> +
>   	logical_ring_default_vfuncs(engine);
>   	logical_ring_default_irqs(engine);
>   
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2019-03-25 10:00 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-03-22 23:41 [PATCH v5 0/5] GEN8+ GPU Watchdog Reset Support Carlos Santa
2019-03-22 23:41 ` [PATCH v5 1/5] drm/i915: Add engine reset count in get-reset-stats ioctl Carlos Santa
2019-03-30  8:45   ` Chris Wilson
2019-03-22 23:41 ` [PATCH v5 2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+ Carlos Santa
2019-03-25 10:00   ` Tvrtko Ursulin [this message]
2019-03-27  1:58     ` Carlos Santa
2019-03-27 10:40       ` Tvrtko Ursulin
2019-03-22 23:41 ` [PATCH v5 3/5] drm/i915: Watchdog timeout: Ringbuffer command emission " Carlos Santa
2019-03-30  8:49   ` Chris Wilson
2019-03-30  9:01   ` Chris Wilson
2019-04-02  0:57     ` Carlos Santa
2019-03-22 23:41 ` [PATCH v5 4/5] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout Carlos Santa
2019-03-22 23:41 ` [PATCH v5 5/5] drm/i915: Watchdog timeout: Include threshold value in error state Carlos Santa
2019-03-22 23:59 ` ✗ Fi.CI.BAT: failure for GEN8+ GPU Watchdog Reset Support Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2543cdeb-beef-237b-34b5-73296ca41c46@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=carlos.santa@intel.com \
    --cc=intel-gfx@lists.freedesktop.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.