* [PATCH] drm/i915: Declare the driver wedged if hangcheck makes no progress
@ 2018-06-02 10:48 Chris Wilson
  2018-06-02 11:03 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
                   ` (4 more replies)
  0 siblings, 5 replies; 7+ messages in thread
From: Chris Wilson @ 2018-06-02 10:48 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika Kuoppala

Hangcheck is our backup in case the GPU or the driver gets stuck. It
detects when the GPU is not making any progress and issues a GPU reset.
However, if the driver is failing to make any progress, we can get
ourselves into a situation where we continually try resetting the GPU to
no avail. Employ a second timeout such that if we continue to see the
same seqno (the stalled engine has made no progress at all) over the
course of several hangchecks, declare the driver wedged and attempt to
start afresh.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c     |  5 +++--
 drivers/gpu/drm/i915/i915_drv.h         |  2 ++
 drivers/gpu/drm/i915/intel_hangcheck.c  | 17 ++++++++++++++++-
 drivers/gpu/drm/i915/intel_ringbuffer.h |  3 ++-
 4 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 82d06a03b22f..7e68684be1e4 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -1362,11 +1362,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
 		seq_printf(m, "\tseqno = %x [current %x, last %x]\n",
 			   engine->hangcheck.seqno, seqno[id],
 			   intel_engine_last_submit(engine));
-		seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s\n",
+		seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n",
 			   yesno(intel_engine_has_waiter(engine)),
 			   yesno(test_bit(engine->id,
 					  &dev_priv->gpu_error.missed_irq_rings)),
-			   yesno(engine->hangcheck.stalled));
+			   yesno(engine->hangcheck.stalled),
+			   yesno(engine->hangcheck.wedged));
 
 		spin_lock_irq(&b->rb_lock);
 		for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 6649962e991a..d254a29a59a0 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -909,6 +909,8 @@ struct i915_gem_mm {
 #define I915_ENGINE_DEAD_TIMEOUT  (4 * HZ)  /* Seqno, head and subunits dead */
 #define I915_SEQNO_DEAD_TIMEOUT   (12 * HZ) /* Seqno dead with active head */
 
+#define I915_ENGINE_WEDGED_TIMEOUT  (60 * HZ)  /* Reset but no recovery? */
+
 enum modeset_restore {
 	MODESET_ON_LID_OPEN,
 	MODESET_DONE,
diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
index d47e346bd49e..2fc7a0dd0df9 100644
--- a/drivers/gpu/drm/i915/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/intel_hangcheck.c
@@ -294,6 +294,7 @@ static void hangcheck_store_sample(struct intel_engine_cs *engine,
 	engine->hangcheck.seqno = hc->seqno;
 	engine->hangcheck.action = hc->action;
 	engine->hangcheck.stalled = hc->stalled;
+	engine->hangcheck.wedged = hc->wedged;
 }
 
 static enum intel_engine_hangcheck_action
@@ -368,6 +369,9 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
 
 	hc->stalled = time_after(jiffies,
 				 engine->hangcheck.action_timestamp + timeout);
+	hc->wedged = time_after(jiffies,
+				 engine->hangcheck.action_timestamp +
+				 I915_ENGINE_WEDGED_TIMEOUT);
 }
 
 static void hangcheck_declare_hang(struct drm_i915_private *i915,
@@ -409,7 +413,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			     gpu_error.hangcheck_work.work);
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
-	unsigned int hung = 0, stuck = 0;
+	unsigned int hung = 0, stuck = 0, wedged = 0;
 
 	if (!i915_modparams.enable_hangcheck)
 		return;
@@ -440,6 +444,17 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
 			if (hc.action != ENGINE_DEAD)
 				stuck |= intel_engine_flag(engine);
 		}
+
+		if (engine->hangcheck.wedged)
+			wedged |= intel_engine_flag(engine);
+	}
+
+	if (wedged) {
+		dev_err(dev_priv->drm.dev,
+			"GPU recovery timed out,"
+			" cancelling all in-flight rendering.\n");
+		GEM_TRACE_DUMP();
+		i915_gem_set_wedged(dev_priv);
 	}
 
 	if (hung)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index bed66500ca80..2c1b28a33df8 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -120,7 +120,8 @@ struct intel_engine_hangcheck {
 	int deadlock;
 	struct intel_instdone instdone;
 	struct i915_request *active_request;
-	bool stalled;
+	bool stalled:1;
+	bool wedged:1;
 };
 
 struct intel_ring {
-- 
2.17.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 7+ messages in thread
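
The two-timeout idea in the commit message can be sketched in plain user-space C. This is only a simplified model, not the driver code: the `sketch_*` names, the tick rate and the single stalled timeout are assumptions made for illustration, and the real hangcheck updates `action_timestamp` according to the engine's action rather than on a bare seqno change. The wrap-safe comparison mirrors the idea behind the kernel's `time_after()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for HZ and the driver timeouts (assumed values). */
#define SKETCH_HZ              1000UL
#define SKETCH_STALLED_TIMEOUT (4 * SKETCH_HZ)
#define SKETCH_WEDGED_TIMEOUT  (60 * SKETCH_HZ)

/* Wrap-safe "a is later than b", the same idea as the kernel's time_after(). */
static bool sketch_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

struct sketch_hangcheck {
	unsigned long action_timestamp; /* tick at which progress was last seen */
	unsigned int seqno;             /* last seqno sampled */
	bool stalled;
	bool wedged;
};

/* Take one hangcheck sample at tick 'now' with the engine's current seqno. */
static void sketch_sample(struct sketch_hangcheck *hc,
			  unsigned int seqno, unsigned long now)
{
	assert(hc);

	/* Simplification: any seqno change counts as progress. */
	if (seqno != hc->seqno) {
		hc->seqno = seqno;
		hc->action_timestamp = now;
	}

	/* Short timeout: the engine looks stuck, a reset is warranted. */
	hc->stalled = sketch_time_after(now, hc->action_timestamp +
					     SKETCH_STALLED_TIMEOUT);
	/* Long timeout: resets have not helped, give up and declare wedged. */
	hc->wedged = sketch_time_after(now, hc->action_timestamp +
					    SKETCH_WEDGED_TIMEOUT);
}
```

In this model an engine that keeps reporting the same seqno becomes stalled after the short timeout and wedged only after the much longer one, so the wedged flag acts as a backstop when repeated resets fail to restore progress.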

* ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Declare the driver wedged if hangcheck makes no progress
  2018-06-02 10:48 [PATCH] drm/i915: Declare the driver wedged if hangcheck makes no progress Chris Wilson
@ 2018-06-02 11:03 ` Patchwork
  2018-06-02 11:04 ` ✗ Fi.CI.SPARSE: " Patchwork
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Patchwork @ 2018-06-02 11:03 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Declare the driver wedged if hangcheck makes no progress
URL   : https://patchwork.freedesktop.org/series/44138/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
04c23f53fbe3 drm/i915: Declare the driver wedged if hangcheck makes no progress
-:68: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#68: FILE: drivers/gpu/drm/i915/intel_hangcheck.c:373:
+	hc->wedged = time_after(jiffies,
+				 engine->hangcheck.action_timestamp +

-:109: WARNING:BOOL_BITFIELD: Avoid using bool as bitfield.  Prefer bool bitfields as unsigned int or u<8|16|32>
#109: FILE: drivers/gpu/drm/i915/intel_ringbuffer.h:125:
+	bool stalled:1;

-:110: WARNING:BOOL_BITFIELD: Avoid using bool as bitfield.  Prefer bool bitfields as unsigned int or u<8|16|32>
#110: FILE: drivers/gpu/drm/i915/intel_ringbuffer.h:126:
+	bool wedged:1;

total: 0 errors, 2 warnings, 1 checks, 72 lines checked

^ permalink raw reply	[flat|nested] 7+ messages in thread
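
For reference, the alternative that the BOOL_BITFIELD warning suggests could look roughly like the sketch below. It is not part of the patch; the struct and helper names are hypothetical. One thing to watch for if such a change were made: a 1-bit `unsigned int` field keeps only the low bit of whatever is stored, whereas a `bool` field converts any non-zero value to 1, so values should be normalised before the store.

```c
#include <assert.h>

/* Hypothetical flags struct using unsigned int bitfields instead of bool. */
struct sketch_flags {
	unsigned int stalled:1;
	unsigned int wedged:1;
};

/* Store a raw value as a flag, normalising it to 0 or 1 first. */
static void sketch_store_wedged(struct sketch_flags *f, unsigned int raw)
{
	assert(f);
	f->wedged = !!raw;
}

/* Store without normalising: only bit 0 of 'raw' survives in the field. */
static unsigned int sketch_truncating_store(unsigned int raw)
{
	struct sketch_flags f = { 0 };

	f.stalled = raw;
	return f.stalled;
}
```

Results of comparisons such as `time_after()` are already 0 or 1, so stores of that kind would behave the same with either field type.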

* ✗ Fi.CI.SPARSE: warning for drm/i915: Declare the driver wedged if hangcheck makes no progress
  2018-06-02 10:48 [PATCH] drm/i915: Declare the driver wedged if hangcheck makes no progress Chris Wilson
  2018-06-02 11:03 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
@ 2018-06-02 11:04 ` Patchwork
  2018-06-02 11:24 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 7+ messages in thread
From: Patchwork @ 2018-06-02 11:04 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Declare the driver wedged if hangcheck makes no progress
URL   : https://patchwork.freedesktop.org/series/44138/
State : warning

== Summary ==

$ dim sparse origin/drm-tip
Commit: drm/i915: Declare the driver wedged if hangcheck makes no progress
-drivers/gpu/drm/i915/selftests/../i915_drv.h:3665:16: warning: expression using sizeof(void)
+drivers/gpu/drm/i915/selftests/../i915_drv.h:3667:16: warning: expression using sizeof(void)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* ✓ Fi.CI.BAT: success for drm/i915: Declare the driver wedged if hangcheck makes no progress
  2018-06-02 10:48 [PATCH] drm/i915: Declare the driver wedged if hangcheck makes no progress Chris Wilson
  2018-06-02 11:03 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
  2018-06-02 11:04 ` ✗ Fi.CI.SPARSE: " Patchwork
@ 2018-06-02 11:24 ` Patchwork
  2018-06-02 13:51 ` ✓ Fi.CI.IGT: " Patchwork
  2018-06-14 15:06 ` [PATCH] " Mika Kuoppala
  4 siblings, 0 replies; 7+ messages in thread
From: Patchwork @ 2018-06-02 11:24 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Declare the driver wedged if hangcheck makes no progress
URL   : https://patchwork.freedesktop.org/series/44138/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4275 -> Patchwork_9179 =

== Summary - SUCCESS ==

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/44138/revisions/1/mbox/

== Known issues ==

  Here are the changes found in Patchwork_9179 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_pipe_crc_basic@hang-read-crc-pipe-c:
      fi-skl-6700k2:      PASS -> FAIL (fdo#104724, fdo#103191)

    igt@prime_vgem@basic-fence-flip:
      fi-ilk-650:         PASS -> FAIL (fdo#104008)

    
    ==== Possible fixes ====

    igt@kms_flip@basic-flip-vs-modeset:
      fi-glk-j4005:       DMESG-WARN (fdo#106000) -> PASS

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
      fi-snb-2520m:       INCOMPLETE (fdo#103713) -> PASS
      fi-cnl-psr:         DMESG-WARN (fdo#104951) -> PASS

    
  fdo#103191 https://bugs.freedesktop.org/show_bug.cgi?id=103191
  fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713
  fdo#104008 https://bugs.freedesktop.org/show_bug.cgi?id=104008
  fdo#104724 https://bugs.freedesktop.org/show_bug.cgi?id=104724
  fdo#104951 https://bugs.freedesktop.org/show_bug.cgi?id=104951
  fdo#106000 https://bugs.freedesktop.org/show_bug.cgi?id=106000


== Participating hosts (40 -> 36) ==

  Missing    (4): fi-ctg-p8600 fi-ilk-m540 fi-byt-squawks fi-skl-6700hq 


== Build changes ==

    * Linux: CI_DRM_4275 -> Patchwork_9179

  CI_DRM_4275: 8fdb62e0511e81fa935059c274a2457361fdb679 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4505: 8a8f0271a71e2e0d2a2caa4d41f4ad1d9c89670e @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9179: 04c23f53fbe36224d6de9f70d5bc00bd3ddd545c @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

04c23f53fbe3 drm/i915: Declare the driver wedged if hangcheck makes no progress

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9179/issues.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* ✓ Fi.CI.IGT: success for drm/i915: Declare the driver wedged if hangcheck makes no progress
  2018-06-02 10:48 [PATCH] drm/i915: Declare the driver wedged if hangcheck makes no progress Chris Wilson
                   ` (2 preceding siblings ...)
  2018-06-02 11:24 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-06-02 13:51 ` Patchwork
  2018-06-14 15:06 ` [PATCH] " Mika Kuoppala
  4 siblings, 0 replies; 7+ messages in thread
From: Patchwork @ 2018-06-02 13:51 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Declare the driver wedged if hangcheck makes no progress
URL   : https://patchwork.freedesktop.org/series/44138/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4275_full -> Patchwork_9179_full =

== Summary - WARNING ==

  Minor unknown changes coming with Patchwork_9179_full need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9179_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/44138/revisions/1/mbox/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9179_full:

  === IGT changes ===

    ==== Warnings ====

    igt@gem_mocs_settings@mocs-rc6-vebox:
      shard-kbl:          PASS -> SKIP +1

    igt@pm_rc6_residency@rc6-accuracy:
      shard-kbl:          SKIP -> PASS

    
== Known issues ==

  Here are the changes found in Patchwork_9179_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_atomic_transition@1x-modeset-transitions-nonblocking:
      shard-glk:          PASS -> FAIL (fdo#105703)

    igt@kms_cursor_legacy@2x-nonblocking-modeset-vs-cursor-atomic:
      shard-glk:          PASS -> FAIL (fdo#105454, fdo#106509)

    igt@kms_flip@2x-flip-vs-expired-vblank:
      shard-glk:          PASS -> FAIL (fdo#105363)

    igt@kms_flip@2x-flip-vs-wf_vblank:
      shard-glk:          PASS -> FAIL (fdo#100368) +1

    igt@kms_flip_tiling@flip-x-tiled:
      shard-glk:          PASS -> FAIL (fdo#104724, fdo#103822) +1

    
    ==== Possible fixes ====

    igt@drv_selftest@live_gtt:
      shard-glk:          INCOMPLETE (k.org#198133, fdo#103359) -> PASS

    igt@gem_eio@hibernate:
      shard-snb:          INCOMPLETE (fdo#105411) -> PASS

    igt@kms_flip@2x-dpms-vs-vblank-race:
      shard-glk:          FAIL (fdo#103060) -> PASS

    igt@kms_flip@basic-flip-vs-wf_vblank:
      shard-hsw:          FAIL (fdo#103928) -> PASS

    igt@kms_flip@plain-flip-ts-check:
      shard-hsw:          FAIL (fdo#100368) -> PASS
      shard-glk:          FAIL (fdo#100368) -> PASS

    igt@kms_flip_tiling@flip-to-y-tiled:
      shard-glk:          FAIL (fdo#104724) -> PASS

    
    ==== Warnings ====

    igt@gem_eio@suspend:
      shard-snb:          DMESG-FAIL -> INCOMPLETE (fdo#105411)

    
  fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
  fdo#103060 https://bugs.freedesktop.org/show_bug.cgi?id=103060
  fdo#103359 https://bugs.freedesktop.org/show_bug.cgi?id=103359
  fdo#103822 https://bugs.freedesktop.org/show_bug.cgi?id=103822
  fdo#103928 https://bugs.freedesktop.org/show_bug.cgi?id=103928
  fdo#104724 https://bugs.freedesktop.org/show_bug.cgi?id=104724
  fdo#105363 https://bugs.freedesktop.org/show_bug.cgi?id=105363
  fdo#105411 https://bugs.freedesktop.org/show_bug.cgi?id=105411
  fdo#105454 https://bugs.freedesktop.org/show_bug.cgi?id=105454
  fdo#105703 https://bugs.freedesktop.org/show_bug.cgi?id=105703
  fdo#106509 https://bugs.freedesktop.org/show_bug.cgi?id=106509
  k.org#198133 https://bugzilla.kernel.org/show_bug.cgi?id=198133


== Participating hosts (5 -> 5) ==

  No changes in participating hosts


== Build changes ==

    * Linux: CI_DRM_4275 -> Patchwork_9179

  CI_DRM_4275: 8fdb62e0511e81fa935059c274a2457361fdb679 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4505: 8a8f0271a71e2e0d2a2caa4d41f4ad1d9c89670e @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9179: 04c23f53fbe36224d6de9f70d5bc00bd3ddd545c @ git://anongit.freedesktop.org/gfx-ci/linux

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9179/shards.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/i915: Declare the driver wedged if hangcheck makes no progress
  2018-06-02 10:48 [PATCH] drm/i915: Declare the driver wedged if hangcheck makes no progress Chris Wilson
                   ` (3 preceding siblings ...)
  2018-06-02 13:51 ` ✓ Fi.CI.IGT: " Patchwork
@ 2018-06-14 15:06 ` Mika Kuoppala
  2018-06-14 18:39   ` Chris Wilson
  4 siblings, 1 reply; 7+ messages in thread
From: Mika Kuoppala @ 2018-06-14 15:06 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Hangcheck is our backup in case the GPU or the driver gets stuck. It
> detects when the GPU is not making any progress and issues a GPU reset.
> However, if the driver is failing to make any progress, we can get
> ourselves into a situation where we continually try resetting the GPU to
> no avail. Employ a second timeout such that if we continue to see the
> same seqno (the stalled engine has made no progress at all) over the
> course of several hangchecks, declare the driver wedged and attempt to
> start afresh.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c     |  5 +++--
>  drivers/gpu/drm/i915/i915_drv.h         |  2 ++
>  drivers/gpu/drm/i915/intel_hangcheck.c  | 17 ++++++++++++++++-
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  3 ++-
>  4 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index 82d06a03b22f..7e68684be1e4 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -1362,11 +1362,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
>  		seq_printf(m, "\tseqno = %x [current %x, last %x]\n",
>  			   engine->hangcheck.seqno, seqno[id],
>  			   intel_engine_last_submit(engine));
> -		seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s\n",
> +		seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n",
>  			   yesno(intel_engine_has_waiter(engine)),
>  			   yesno(test_bit(engine->id,
>  					  &dev_priv->gpu_error.missed_irq_rings)),
> -			   yesno(engine->hangcheck.stalled));
> +			   yesno(engine->hangcheck.stalled),
> +			   yesno(engine->hangcheck.wedged));
>  
>  		spin_lock_irq(&b->rb_lock);
>  		for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 6649962e991a..d254a29a59a0 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -909,6 +909,8 @@ struct i915_gem_mm {
>  #define I915_ENGINE_DEAD_TIMEOUT  (4 * HZ)  /* Seqno, head and subunits dead */
>  #define I915_SEQNO_DEAD_TIMEOUT   (12 * HZ) /* Seqno dead with active head */
>  
> +#define I915_ENGINE_WEDGED_TIMEOUT  (60 * HZ)  /* Reset but no recovery? */
> +
>  enum modeset_restore {
>  	MODESET_ON_LID_OPEN,
>  	MODESET_DONE,
> diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> index d47e346bd49e..2fc7a0dd0df9 100644
> --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> @@ -294,6 +294,7 @@ static void hangcheck_store_sample(struct intel_engine_cs *engine,
>  	engine->hangcheck.seqno = hc->seqno;
>  	engine->hangcheck.action = hc->action;
>  	engine->hangcheck.stalled = hc->stalled;
> +	engine->hangcheck.wedged = hc->wedged;
>  }
>  
>  static enum intel_engine_hangcheck_action
> @@ -368,6 +369,9 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
>  
>  	hc->stalled = time_after(jiffies,
>  				 engine->hangcheck.action_timestamp + timeout);
> +	hc->wedged = time_after(jiffies,
> +				 engine->hangcheck.action_timestamp +
> +				 I915_ENGINE_WEDGED_TIMEOUT);

I was concerned that some callpath might end up zeroing
hangcheck->seqno through intel_engine_init_hangcheck.

It seems that seqno wrap and unparking are the
only ones that do this, with the exception of the
setup paths.

But as both wrap and unpark are done with an idle
precondition, this should at worst cost
one extra tick to reload the seqnos.

So I can't poke holes in this and it should
prevent our driver mistakes from ending up in
endless reset loops, like the tin says.

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

^ permalink raw reply	[flat|nested] 7+ messages in thread
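
The per-engine aggregation that the patch adds to i915_hangcheck_elapsed() can be modelled as below. This is an illustrative user-space sketch: the `sketch_*` types and helpers are hypothetical, and the flag helper is assumed to map each engine id to one bit in a mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of an engine: just an id and its wedged flag. */
struct sketch_engine {
	unsigned int id;
	bool wedged;
};

/* Assumed mapping: one mask bit per engine id. */
static unsigned int sketch_engine_flag(const struct sketch_engine *e)
{
	return 1u << e->id;
}

/* Collect a mask of every engine whose last sample was marked wedged. */
static unsigned int sketch_collect_wedged(const struct sketch_engine *engines,
					  size_t count)
{
	unsigned int wedged = 0;
	size_t i;

	assert(engines || count == 0);
	for (i = 0; i < count; i++) {
		assert(engines[i].id < 32); /* keep the shift in range */
		if (engines[i].wedged)
			wedged |= sketch_engine_flag(&engines[i]);
	}

	return wedged;
}
```

A non-zero mask corresponds to the point where the patch logs "GPU recovery timed out" and calls i915_gem_set_wedged(); a single wedged engine is enough to take that path.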

* Re: [PATCH] drm/i915: Declare the driver wedged if hangcheck makes no progress
  2018-06-14 15:06 ` [PATCH] " Mika Kuoppala
@ 2018-06-14 18:39   ` Chris Wilson
  0 siblings, 0 replies; 7+ messages in thread
From: Chris Wilson @ 2018-06-14 18:39 UTC (permalink / raw)
  To: Mika Kuoppala, intel-gfx

Quoting Mika Kuoppala (2018-06-14 16:06:39)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Hangcheck is our backup in case the GPU or the driver gets stuck. It
> > detects when the GPU is not making any progress and issues a GPU reset.
> > However, if the driver is failing to make any progress, we can get
> > ourselves into a situation where we continually try resetting the GPU to
> > no avail. Employ a second timeout such that if we continue to see the
> > same seqno (the stalled engine has made no progress at all) over the
> > course of several hangchecks, declare the driver wedged and attempt to
> > start afresh.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_debugfs.c     |  5 +++--
> >  drivers/gpu/drm/i915/i915_drv.h         |  2 ++
> >  drivers/gpu/drm/i915/intel_hangcheck.c  | 17 ++++++++++++++++-
> >  drivers/gpu/drm/i915/intel_ringbuffer.h |  3 ++-
> >  4 files changed, 23 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> > index 82d06a03b22f..7e68684be1e4 100644
> > --- a/drivers/gpu/drm/i915/i915_debugfs.c
> > +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> > @@ -1362,11 +1362,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused)
> >               seq_printf(m, "    seqno = %x [current %x, last %x]\n",
> >                          engine->hangcheck.seqno, seqno[id],
> >                          intel_engine_last_submit(engine));
> > -             seq_printf(m, "    waiters? %s, fake irq active? %s, stalled? %s\n",
> > +             seq_printf(m, "    waiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n",
> >                          yesno(intel_engine_has_waiter(engine)),
> >                          yesno(test_bit(engine->id,
> >                                         &dev_priv->gpu_error.missed_irq_rings)),
> > -                        yesno(engine->hangcheck.stalled));
> > +                        yesno(engine->hangcheck.stalled),
> > +                        yesno(engine->hangcheck.wedged));
> >  
> >               spin_lock_irq(&b->rb_lock);
> >               for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) {
> > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> > index 6649962e991a..d254a29a59a0 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.h
> > +++ b/drivers/gpu/drm/i915/i915_drv.h
> > @@ -909,6 +909,8 @@ struct i915_gem_mm {
> >  #define I915_ENGINE_DEAD_TIMEOUT  (4 * HZ)  /* Seqno, head and subunits dead */
> >  #define I915_SEQNO_DEAD_TIMEOUT   (12 * HZ) /* Seqno dead with active head */
> >  
> > +#define I915_ENGINE_WEDGED_TIMEOUT  (60 * HZ)  /* Reset but no recovery? */
> > +
> >  enum modeset_restore {
> >       MODESET_ON_LID_OPEN,
> >       MODESET_DONE,
> > diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c
> > index d47e346bd49e..2fc7a0dd0df9 100644
> > --- a/drivers/gpu/drm/i915/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/intel_hangcheck.c
> > @@ -294,6 +294,7 @@ static void hangcheck_store_sample(struct intel_engine_cs *engine,
> >       engine->hangcheck.seqno = hc->seqno;
> >       engine->hangcheck.action = hc->action;
> >       engine->hangcheck.stalled = hc->stalled;
> > +     engine->hangcheck.wedged = hc->wedged;
> >  }
> >  
> >  static enum intel_engine_hangcheck_action
> > @@ -368,6 +369,9 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine,
> >  
> >       hc->stalled = time_after(jiffies,
> >                                engine->hangcheck.action_timestamp + timeout);
> > +     hc->wedged = time_after(jiffies,
> > +                              engine->hangcheck.action_timestamp +
> > +                              I915_ENGINE_WEDGED_TIMEOUT);
> 
> I was concerned that some callpath might end up zeroing
> hangcheck->seqno through intel_engine_init_hangcheck.
> 
> It seems that seqno wrap and unparking are the
> only ones that do this, with the exception of the
> setup paths.
> 
> But as both wrap and unpark are done with an idle
> precondition, this should at worst cost
> one extra tick to reload the seqnos.
> 
> So I can't poke holes in this and it should
> prevent our driver mistakes from ending up in
> endless reset loops, like the tin says.
> 
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Thanks, it's just meant to be a safety net against our (driver) bugs, so
I don't mind it being coarse. Just loud and abrasive.

Pushed,
-Chris

^ permalink raw reply	[flat|nested] 7+ messages in thread

Thread overview: 7+ messages
2018-06-02 10:48 [PATCH] drm/i915: Declare the driver wedged if hangcheck makes no progress Chris Wilson
2018-06-02 11:03 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
2018-06-02 11:04 ` ✗ Fi.CI.SPARSE: " Patchwork
2018-06-02 11:24 ` ✓ Fi.CI.BAT: success " Patchwork
2018-06-02 13:51 ` ✓ Fi.CI.IGT: " Patchwork
2018-06-14 15:06 ` [PATCH] " Mika Kuoppala
2018-06-14 18:39   ` Chris Wilson
