All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
@ 2018-05-18 10:09 Chris Wilson
  2018-05-18 10:37 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Chris Wilson @ 2018-05-18 10:09 UTC (permalink / raw)
  To: intel-gfx

Inside the live_hangcheck (reset) selftests, we occasionally see
failures like

<7>[  239.094840] i915_gem_set_wedged rcs0
<7>[  239.094843] i915_gem_set_wedged 	current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
<7>[  239.094846] i915_gem_set_wedged 	Reset count: 6239 (global 1)
<7>[  239.094848] i915_gem_set_wedged 	Requests:
<7>[  239.095052] i915_gem_set_wedged 		first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
<7>[  239.095056] i915_gem_set_wedged 		last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
<7>[  239.095059] i915_gem_set_wedged 		active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
<7>[  239.095062] i915_gem_set_wedged 		[head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
<7>[  239.100050] i915_gem_set_wedged 		ring->start:  0x00283000
<7>[  239.100053] i915_gem_set_wedged 		ring->head:   0x000001f8
<7>[  239.100055] i915_gem_set_wedged 		ring->tail:   0x000002a8
<7>[  239.100057] i915_gem_set_wedged 		ring->emit:   0x000002a8
<7>[  239.100059] i915_gem_set_wedged 		ring->space:  0x00000f10
<7>[  239.100085] i915_gem_set_wedged 	RING_START: 0x00283000
<7>[  239.100088] i915_gem_set_wedged 	RING_HEAD:  0x00000260
<7>[  239.100091] i915_gem_set_wedged 	RING_TAIL:  0x000002a8
<7>[  239.100094] i915_gem_set_wedged 	RING_CTL:   0x00000001
<7>[  239.100097] i915_gem_set_wedged 	RING_MODE:  0x00000300 [idle]
<7>[  239.100100] i915_gem_set_wedged 	RING_IMR: fffffefe
<7>[  239.100104] i915_gem_set_wedged 	ACTHD:  0x00000000_0000609c
<7>[  239.100108] i915_gem_set_wedged 	BBADDR: 0x00000000_0000609d
<7>[  239.100111] i915_gem_set_wedged 	DMA_FADDR: 0x00000000_00283260
<7>[  239.100114] i915_gem_set_wedged 	IPEIR: 0x00000000
<7>[  239.100117] i915_gem_set_wedged 	IPEHR: 0x02800000
<7>[  239.100120] i915_gem_set_wedged 	Execlist status: 0x00044052 00000002
<7>[  239.100124] i915_gem_set_wedged 	Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[  239.100128] i915_gem_set_wedged 		ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
<7>[  239.100132] i915_gem_set_wedged 		ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
<7>[  239.100135] i915_gem_set_wedged 		HW active? 0x5
<7>[  239.100250] i915_gem_set_wedged 		E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
<7>[  239.100338] i915_gem_set_wedged 		E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
<7>[  239.100340] i915_gem_set_wedged 		Queue priority: 139
<7>[  239.100343] i915_gem_set_wedged 		Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
<7>[  239.100346] i915_gem_set_wedged 		Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
<7>[  239.100349] i915_gem_set_wedged 		Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
<7>[  239.100352] i915_gem_set_wedged 		Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
<7>[  239.100356] i915_gem_set_wedged 		Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
<7>[  239.100362] i915_gem_set_wedged 	drv_selftest [5894] waiting for 19a99

where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
The RING_MODE indicates that is idle and has the STOP_RING bit set, so
try clearing it.

v2: Only clear the bit on restarting the ring, as we want to be sure the
STOP_RING bit is kept if reset fails on wedging.
v3: Spot when the ring state doesn't make sense when re-initialising the
engine and dump it to the logs so that we don't have to wait for an
error later and try to guess what happened earlier.
v4: Prepare to print all the unexpected state, not just the first.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 3744f5750624..ba8411ba4abf 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
 		I915_WRITE(RING_MODE_GEN7(engine),
 			   _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
 
+	I915_WRITE(RING_MI_MODE(engine->mmio_base),
+		   _MASKED_BIT_DISABLE(STOP_RING));
+
 	I915_WRITE(RING_HWS_PGA(engine->mmio_base),
 		   engine->status_page.ggtt_offset);
 	POSTING_READ(RING_HWS_PGA(engine->mmio_base));
@@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
 	engine->execlists.csb_head = -1;
 }
 
+static bool unexpected_starting_state(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+	bool unexpected = false;
+
+	if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
+		DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
+		unexpected = true;
+	}
+
+	return unexpected;
+}
+
 static int gen8_init_common_ring(struct intel_engine_cs *engine)
 {
 	struct intel_engine_execlists * const execlists = &engine->execlists;
@@ -1801,6 +1817,12 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
 	intel_engine_reset_breadcrumbs(engine);
 	intel_engine_init_hangcheck(engine);
 
+	if (GEM_SHOW_DEBUG() && unexpected_starting_state(engine)) {
+		struct drm_printer p = drm_debug_printer(__func__);
+
+		intel_engine_dump(engine, &p, NULL);
+	}
+
 	enable_execlists(engine);
 
 	/* After a GPU reset, we may have requests to replay */
-- 
2.17.0

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 10:09 [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset Chris Wilson
@ 2018-05-18 10:37 ` Patchwork
  2018-05-18 10:54 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Patchwork @ 2018-05-18 10:37 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
URL   : https://patchwork.freedesktop.org/series/43404/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
e01c159ce05b drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
-:11: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#11: 
<7>[  239.094843] i915_gem_set_wedged 	current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]

total: 0 errors, 1 warnings, 0 checks, 40 lines checked

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* ✓ Fi.CI.BAT: success for drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 10:09 [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset Chris Wilson
  2018-05-18 10:37 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
@ 2018-05-18 10:54 ` Patchwork
  2018-05-18 11:05 ` [PATCH v4] " Tvrtko Ursulin
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Patchwork @ 2018-05-18 10:54 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
URL   : https://patchwork.freedesktop.org/series/43404/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4205 -> Patchwork_9047 =

== Summary - WARNING ==

  Minor unknown changes coming with Patchwork_9047 need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9047, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/43404/revisions/1/mbox/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9047:

  === IGT changes ===

    ==== Warnings ====

    igt@gem_exec_gttfill@basic:
      fi-pnv-d510:        SKIP -> PASS

    
== Known issues ==

  Here are the changes found in Patchwork_9047 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_frontbuffer_tracking@basic:
      fi-hsw-4200u:       PASS -> DMESG-FAIL (fdo#102614, fdo#106103)
      fi-hsw-peppy:       PASS -> DMESG-FAIL (fdo#102614, fdo#106103)

    
    ==== Possible fixes ====

    igt@kms_flip@basic-flip-vs-wf_vblank:
      {fi-cfl-guc}:       FAIL (fdo#103928) -> PASS

    
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  fdo#102614 https://bugs.freedesktop.org/show_bug.cgi?id=102614
  fdo#103928 https://bugs.freedesktop.org/show_bug.cgi?id=103928
  fdo#106103 https://bugs.freedesktop.org/show_bug.cgi?id=106103


== Participating hosts (43 -> 37) ==

  Missing    (6): fi-ilk-m540 fi-bxt-dsi fi-byt-j1900 fi-byt-squawks fi-bsw-cyan fi-skl-6700hq 


== Build changes ==

    * Linux: CI_DRM_4205 -> Patchwork_9047

  CI_DRM_4205: 921a7738e856643993ea0889dd5a936c22151649 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4487: eccae1360d6d01e73c6af2bd97122cef708207ef @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9047: e01c159ce05bb1471d058ca945d5d386a2fe9512 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4487: 6ab75f7eb5e1dccbb773e1739beeb2d7cbd6ad0d @ git://anongit.freedesktop.org/piglit


== Linux commits ==

e01c159ce05b drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9047/issues.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 10:09 [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset Chris Wilson
  2018-05-18 10:37 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
  2018-05-18 10:54 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-05-18 11:05 ` Tvrtko Ursulin
  2018-05-18 11:10   ` Chris Wilson
  2018-05-18 14:43 ` ✗ Fi.CI.IGT: failure for " Patchwork
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Tvrtko Ursulin @ 2018-05-18 11:05 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 18/05/2018 11:09, Chris Wilson wrote:
> Inside the live_hangcheck (reset) selftests, we occasionally see
> failures like
> 
> <7>[  239.094840] i915_gem_set_wedged rcs0
> <7>[  239.094843] i915_gem_set_wedged 	current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
> <7>[  239.094846] i915_gem_set_wedged 	Reset count: 6239 (global 1)
> <7>[  239.094848] i915_gem_set_wedged 	Requests:
> <7>[  239.095052] i915_gem_set_wedged 		first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> <7>[  239.095056] i915_gem_set_wedged 		last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
> <7>[  239.095059] i915_gem_set_wedged 		active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> <7>[  239.095062] i915_gem_set_wedged 		[head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
> <7>[  239.100050] i915_gem_set_wedged 		ring->start:  0x00283000
> <7>[  239.100053] i915_gem_set_wedged 		ring->head:   0x000001f8
> <7>[  239.100055] i915_gem_set_wedged 		ring->tail:   0x000002a8
> <7>[  239.100057] i915_gem_set_wedged 		ring->emit:   0x000002a8
> <7>[  239.100059] i915_gem_set_wedged 		ring->space:  0x00000f10
> <7>[  239.100085] i915_gem_set_wedged 	RING_START: 0x00283000
> <7>[  239.100088] i915_gem_set_wedged 	RING_HEAD:  0x00000260
> <7>[  239.100091] i915_gem_set_wedged 	RING_TAIL:  0x000002a8
> <7>[  239.100094] i915_gem_set_wedged 	RING_CTL:   0x00000001
> <7>[  239.100097] i915_gem_set_wedged 	RING_MODE:  0x00000300 [idle]
> <7>[  239.100100] i915_gem_set_wedged 	RING_IMR: fffffefe
> <7>[  239.100104] i915_gem_set_wedged 	ACTHD:  0x00000000_0000609c
> <7>[  239.100108] i915_gem_set_wedged 	BBADDR: 0x00000000_0000609d
> <7>[  239.100111] i915_gem_set_wedged 	DMA_FADDR: 0x00000000_00283260
> <7>[  239.100114] i915_gem_set_wedged 	IPEIR: 0x00000000
> <7>[  239.100117] i915_gem_set_wedged 	IPEHR: 0x02800000
> <7>[  239.100120] i915_gem_set_wedged 	Execlist status: 0x00044052 00000002
> <7>[  239.100124] i915_gem_set_wedged 	Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
> <7>[  239.100128] i915_gem_set_wedged 		ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> <7>[  239.100132] i915_gem_set_wedged 		ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> <7>[  239.100135] i915_gem_set_wedged 		HW active? 0x5
> <7>[  239.100250] i915_gem_set_wedged 		E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> <7>[  239.100338] i915_gem_set_wedged 		E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> <7>[  239.100340] i915_gem_set_wedged 		Queue priority: 139
> <7>[  239.100343] i915_gem_set_wedged 		Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
> <7>[  239.100346] i915_gem_set_wedged 		Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
> <7>[  239.100349] i915_gem_set_wedged 		Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
> <7>[  239.100352] i915_gem_set_wedged 		Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
> <7>[  239.100356] i915_gem_set_wedged 		Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
> <7>[  239.100362] i915_gem_set_wedged 	drv_selftest [5894] waiting for 19a99
> 
> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
> try clearing it.
> 
> v2: Only clear the bit on restarting the ring, as we want to be sure the
> STOP_RING bit is kept if reset fails on wedging.
> v3: Spot when the ring state doesn't make sense when re-initialising the
> engine and dump it to the logs so that we don't have to wait for an
> error later and try to guess what happened earlier.
> v4: Prepare to print all the unexpected state, not just the first.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
>   1 file changed, 22 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 3744f5750624..ba8411ba4abf 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
>   		I915_WRITE(RING_MODE_GEN7(engine),
>   			   _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
>   
> +	I915_WRITE(RING_MI_MODE(engine->mmio_base),
> +		   _MASKED_BIT_DISABLE(STOP_RING));

Worries me a bit to clear it unconditionally since documentation says 
nothing (that I can find) about this scenario.

> +
>   	I915_WRITE(RING_HWS_PGA(engine->mmio_base),
>   		   engine->status_page.ggtt_offset);
>   	POSTING_READ(RING_HWS_PGA(engine->mmio_base));
> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
>   	engine->execlists.csb_head = -1;
>   }
>   
> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
> +{
> +	struct drm_i915_private *dev_priv = engine->i915;
> +	bool unexpected = false;
> +
> +	if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
> +		DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");

Also worries me to bury this as a debug message unless we can find in 
documentation that this is known to happen occasionally.

If we had the check and logging immediately after reset, with a more 
visible log, we could build a larger data set of reported failures and 
then go back to hw people and ask them about it.

> +		unexpected = true;
> +	}
> +
> +	return unexpected;
> +}
> +
>   static int gen8_init_common_ring(struct intel_engine_cs *engine)
>   {
>   	struct intel_engine_execlists * const execlists = &engine->execlists;
> @@ -1801,6 +1817,12 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
>   	intel_engine_reset_breadcrumbs(engine);
>   	intel_engine_init_hangcheck(engine);
>   
> +	if (GEM_SHOW_DEBUG() && unexpected_starting_state(engine)) {
> +		struct drm_printer p = drm_debug_printer(__func__);
> +
> +		intel_engine_dump(engine, &p, NULL);
> +	}
> +
>   	enable_execlists(engine);
>   
>   	/* After a GPU reset, we may have requests to replay */
> 

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 11:05 ` [PATCH v4] " Tvrtko Ursulin
@ 2018-05-18 11:10   ` Chris Wilson
  2018-05-18 11:50     ` Tvrtko Ursulin
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Wilson @ 2018-05-18 11:10 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2018-05-18 12:05:17)
> 
> On 18/05/2018 11:09, Chris Wilson wrote:
> > Inside the live_hangcheck (reset) selftests, we occasionally see
> > failures like
> > 
> > <7>[  239.094840] i915_gem_set_wedged rcs0
> > <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
> > <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
> > <7>[  239.094848] i915_gem_set_wedged         Requests:
> > <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> > <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
> > <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> > <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
> > <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
> > <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
> > <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
> > <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
> > <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
> > <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
> > <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
> > <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
> > <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
> > <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
> > <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
> > <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
> > <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
> > <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
> > <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
> > <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
> > <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
> > <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
> > <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> > <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> > <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
> > <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> > <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> > <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
> > <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
> > <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
> > <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
> > <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
> > <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
> > <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
> > 
> > where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
> > The RING_MODE indicates that is idle and has the STOP_RING bit set, so
> > try clearing it.
> > 
> > v2: Only clear the bit on restarting the ring, as we want to be sure the
> > STOP_RING bit is kept if reset fails on wedging.
> > v3: Spot when the ring state doesn't make sense when re-initialising the
> > engine and dump it to the logs so that we don't have to wait for an
> > error later and try to guess what happened earlier.
> > v4: Prepare to print all the unexpected state, not just the first.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > ---
> >   drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
> >   1 file changed, 22 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 3744f5750624..ba8411ba4abf 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >               I915_WRITE(RING_MODE_GEN7(engine),
> >                          _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
> >   
> > +     I915_WRITE(RING_MI_MODE(engine->mmio_base),
> > +                _MASKED_BIT_DISABLE(STOP_RING));
> 
> Worries me a bit to clear it unconditionally since documentation says 
> nothing (that I can find) about this scenario.
> 
> > +
> >       I915_WRITE(RING_HWS_PGA(engine->mmio_base),
> >                  engine->status_page.ggtt_offset);
> >       POSTING_READ(RING_HWS_PGA(engine->mmio_base));
> > @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >       engine->execlists.csb_head = -1;
> >   }
> >   
> > +static bool unexpected_starting_state(struct intel_engine_cs *engine)
> > +{
> > +     struct drm_i915_private *dev_priv = engine->i915;
> > +     bool unexpected = false;
> > +
> > +     if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
> > +             DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
> 
> Also worries me to bury this as a debug message unless we can find in 
> documentation that this is known to happen occasionally.
> 
> If we had the check and logging immediately after reset, with a more 
> visible log, we could build a larger data set of reported failures and 
> then go back to hw people and ask them about it.

The problem with the reset side is trying to avoid mixing in the case
where we are faking it and expect it to still be set. On the ringbuffer
path, we handle STOP_RING as being part of the init sequence. The log
already says we did a reset and STOP_RING is set (since we have tracing
through the reset), I just need to find the right spot to handle it.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 11:10   ` Chris Wilson
@ 2018-05-18 11:50     ` Tvrtko Ursulin
  2018-05-18 12:28       ` Chris Wilson
  0 siblings, 1 reply; 20+ messages in thread
From: Tvrtko Ursulin @ 2018-05-18 11:50 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 18/05/2018 12:10, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-05-18 12:05:17)
>>
>> On 18/05/2018 11:09, Chris Wilson wrote:
>>> Inside the live_hangcheck (reset) selftests, we occasionally see
>>> failures like
>>>
>>> <7>[  239.094840] i915_gem_set_wedged rcs0
>>> <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
>>> <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
>>> <7>[  239.094848] i915_gem_set_wedged         Requests:
>>> <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>> <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
>>> <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>> <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
>>> <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
>>> <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
>>> <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
>>> <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
>>> <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
>>> <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
>>> <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
>>> <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
>>> <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
>>> <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
>>> <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
>>> <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
>>> <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
>>> <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
>>> <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
>>> <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
>>> <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
>>> <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
>>> <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>> <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>> <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
>>> <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>> <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>> <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
>>> <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
>>> <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
>>> <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
>>> <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
>>> <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
>>> <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
>>>
>>> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
>>> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
>>> try clearing it.
>>>
>>> v2: Only clear the bit on restarting the ring, as we want to be sure the
>>> STOP_RING bit is kept if reset fails on wedging.
>>> v3: Spot when the ring state doesn't make sense when re-initialising the
>>> engine and dump it to the logs so that we don't have to wait for an
>>> error later and try to guess what happened earlier.
>>> v4: Prepare to print all the unexpected state, not just the first.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
>>>    1 file changed, 22 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>>> index 3744f5750624..ba8411ba4abf 100644
>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
>>>                I915_WRITE(RING_MODE_GEN7(engine),
>>>                           _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
>>>    
>>> +     I915_WRITE(RING_MI_MODE(engine->mmio_base),
>>> +                _MASKED_BIT_DISABLE(STOP_RING));
>>
>> Worries me a bit to clear it unconditionally since documentation says
>> nothing (that I can find) about this scenario.
>>
>>> +
>>>        I915_WRITE(RING_HWS_PGA(engine->mmio_base),
>>>                   engine->status_page.ggtt_offset);
>>>        POSTING_READ(RING_HWS_PGA(engine->mmio_base));
>>> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
>>>        engine->execlists.csb_head = -1;
>>>    }
>>>    
>>> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
>>> +{
>>> +     struct drm_i915_private *dev_priv = engine->i915;
>>> +     bool unexpected = false;
>>> +
>>> +     if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
>>> +             DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
>>
>> Also worries me to bury this as a debug message unless we can find in
>> documentation that this is known to happen occasionally.
>>
>> If we had the check and logging immediately after reset, with a more
>> visible log, we could build a larger data set of reported failures and
>> then go back to hw people and ask them about it.
> 
> The problem with the reset side is trying to avoid mixing in the case
> where we are faking it and expect it to still be set. On the ringbuffer
> path, we handle STOP_RING as being part of the init sequence. The log
> already says we did a reset and STOP_RING is set (since we have tracing
> through the reset), I just need to find the right spot to handle it.

The "distance" in code paths between reset and init_hw is probably the 
smallest of my concerns. If we could make this message not debug only, 
and clear STOP_RING only conditionally I think I would be happy with 
giving this a go and see if it collects any data.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 11:50     ` Tvrtko Ursulin
@ 2018-05-18 12:28       ` Chris Wilson
  2018-05-18 12:36         ` Tvrtko Ursulin
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Wilson @ 2018-05-18 12:28 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2018-05-18 12:50:41)
> 
> On 18/05/2018 12:10, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-05-18 12:05:17)
> >>
> >> On 18/05/2018 11:09, Chris Wilson wrote:
> >>> Inside the live_hangcheck (reset) selftests, we occasionally see
> >>> failures like
> >>>
> >>> <7>[  239.094840] i915_gem_set_wedged rcs0
> >>> <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
> >>> <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
> >>> <7>[  239.094848] i915_gem_set_wedged         Requests:
> >>> <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>> <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
> >>> <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>> <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
> >>> <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
> >>> <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
> >>> <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
> >>> <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
> >>> <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
> >>> <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
> >>> <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
> >>> <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
> >>> <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
> >>> <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
> >>> <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
> >>> <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
> >>> <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
> >>> <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
> >>> <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
> >>> <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
> >>> <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
> >>> <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
> >>> <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>> <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>> <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
> >>> <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>> <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>> <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
> >>> <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
> >>> <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
> >>> <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
> >>> <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
> >>> <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
> >>> <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
> >>>
> >>> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
> >>> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
> >>> try clearing it.
> >>>
> >>> v2: Only clear the bit on restarting the ring, as we want to be sure the
> >>> STOP_RING bit is kept if reset fails on wedging.
> >>> v3: Spot when the ring state doesn't make sense when re-initialising the
> >>> engine and dump it to the logs so that we don't have to wait for an
> >>> error later and try to guess what happened earlier.
> >>> v4: Prepare to print all the unexpected state, not just the first.
> >>>
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >>> ---
> >>>    drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
> >>>    1 file changed, 22 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >>> index 3744f5750624..ba8411ba4abf 100644
> >>> --- a/drivers/gpu/drm/i915/intel_lrc.c
> >>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>                I915_WRITE(RING_MODE_GEN7(engine),
> >>>                           _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
> >>>    
> >>> +     I915_WRITE(RING_MI_MODE(engine->mmio_base),
> >>> +                _MASKED_BIT_DISABLE(STOP_RING));
> >>
> >> Worries me a bit to clear it unconditionally since documentation says
> >> nothing (that I can find) about this scenario.
> >>
> >>> +
> >>>        I915_WRITE(RING_HWS_PGA(engine->mmio_base),
> >>>                   engine->status_page.ggtt_offset);
> >>>        POSTING_READ(RING_HWS_PGA(engine->mmio_base));
> >>> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>        engine->execlists.csb_head = -1;
> >>>    }
> >>>    
> >>> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
> >>> +{
> >>> +     struct drm_i915_private *dev_priv = engine->i915;
> >>> +     bool unexpected = false;
> >>> +
> >>> +     if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
> >>> +             DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
> >>
> >> Also worries me to bury this as a debug message unless we can find in
> >> documentation that this is known to happen occasionally.
> >>
> >> If we had the check and logging immediately after reset, with a more
> >> visible log, we could build a larger data set of reported failures and
> >> then go back to hw people and ask them about it.
> > 
> > The problem with the reset side is trying to avoid mixing in the case
> > where we are faking it and expect it to still be set. On the ringbuffer
> > path, we handle STOP_RING as being part of the init sequence. The log
> > already says we did a reset and STOP_RING is set (since we have tracing
> > through the reset), I just need to find the right spot to handle it.
> 
> The "distance" in code paths between reset and init_hw is probably the 
> smallest of my concerns. If we could make this message not debug only, 
> and clear STOP_RING only conditionally I think I would be happy with 
> giving this a go and see if it collects any data.

Why? Who else but us are going to read it, and why bother if it is not
actionable. It is logged (was logged already, now highlighted) and for
the user none of it is relevant, the only thing that matters to them is
that they don't even notice the system suffered a GPU hang.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 12:28       ` Chris Wilson
@ 2018-05-18 12:36         ` Tvrtko Ursulin
  2018-05-18 14:13           ` Chris Wilson
  0 siblings, 1 reply; 20+ messages in thread
From: Tvrtko Ursulin @ 2018-05-18 12:36 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 18/05/2018 13:28, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-05-18 12:50:41)
>>
>> On 18/05/2018 12:10, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-05-18 12:05:17)
>>>>
>>>> On 18/05/2018 11:09, Chris Wilson wrote:
>>>>> Inside the live_hangcheck (reset) selftests, we occasionally see
>>>>> failures like
>>>>>
>>>>> <7>[  239.094840] i915_gem_set_wedged rcs0
>>>>> <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
>>>>> <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
>>>>> <7>[  239.094848] i915_gem_set_wedged         Requests:
>>>>> <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>>>> <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
>>>>> <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>>>> <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
>>>>> <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
>>>>> <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
>>>>> <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
>>>>> <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
>>>>> <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
>>>>> <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
>>>>> <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
>>>>> <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
>>>>> <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
>>>>> <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
>>>>> <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
>>>>> <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
>>>>> <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
>>>>> <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
>>>>> <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
>>>>> <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
>>>>> <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
>>>>> <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
>>>>> <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>>>> <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>>>> <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
>>>>> <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>>>> <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>>>> <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
>>>>> <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
>>>>> <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
>>>>> <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
>>>>> <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
>>>>> <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
>>>>> <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
>>>>>
>>>>> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
>>>>> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
>>>>> try clearing it.
>>>>>
>>>>> v2: Only clear the bit on restarting the ring, as we want to be sure the
>>>>> STOP_RING bit is kept if reset fails on wedging.
>>>>> v3: Spot when the ring state doesn't make sense when re-initialising the
>>>>> engine and dump it to the logs so that we don't have to wait for an
>>>>> error later and try to guess what happened earlier.
>>>>> v4: Prepare to print all the unexpected state, not just the first.
>>>>>
>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>>>> ---
>>>>>     drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
>>>>>     1 file changed, 22 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>>>>> index 3744f5750624..ba8411ba4abf 100644
>>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>>>> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
>>>>>                 I915_WRITE(RING_MODE_GEN7(engine),
>>>>>                            _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
>>>>>     
>>>>> +     I915_WRITE(RING_MI_MODE(engine->mmio_base),
>>>>> +                _MASKED_BIT_DISABLE(STOP_RING));
>>>>
>>>> Worries me a bit to clear it unconditionally since documentation says
>>>> nothing (that I can find) about this scenario.
>>>>
>>>>> +
>>>>>         I915_WRITE(RING_HWS_PGA(engine->mmio_base),
>>>>>                    engine->status_page.ggtt_offset);
>>>>>         POSTING_READ(RING_HWS_PGA(engine->mmio_base));
>>>>> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
>>>>>         engine->execlists.csb_head = -1;
>>>>>     }
>>>>>     
>>>>> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
>>>>> +{
>>>>> +     struct drm_i915_private *dev_priv = engine->i915;
>>>>> +     bool unexpected = false;
>>>>> +
>>>>> +     if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
>>>>> +             DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
>>>>
>>>> Also worries me to bury this as a debug message unless we can find in
>>>> documentation that this is known to happen occasionally.
>>>>
>>>> If we had the check and logging immediately after reset, with a more
>>>> visible log, we could build a larger data set of reported failures and
>>>> then go back to hw people and ask them about it.
>>>
>>> The problem with the reset side is trying to avoid mixing in the case
>>> where we are faking it and expect it to still be set. On the ringbuffer
>>> path, we handle STOP_RING as being part of the init sequence. The log
>>> already says we did a reset and STOP_RING is set (since we have tracing
>>> through the reset), I just need to find the right spot to handle it.
>>
>> The "distance" in code paths between reset and init_hw is probably the
>> smallest of my concerns. If we could make this message not debug only,
>> and clear STOP_RING only conditionally I think I would be happy with
>> giving this a go and see if it collects any data.
> 
> Why? Who else but us are going to read it, and why bother if it is not
> actionable. It is logged (was logged already, now highlighted) and for
> the user none of it is relevant, the only thing that matters to them is
> that they don't even notice the system suffered a GPU hang.

My reasoning is this - something has happened which is completely 
unexpected and may have other negative effect which are currently 
unknown to us. By logging it at a higher level we collect this now very 
specific signature on a wider collection of platforms and can then act 
on the collected data set. Also very specific log message makes bug 
triage easier and a lower chance it gets bunched up with other hang reports.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 12:36         ` Tvrtko Ursulin
@ 2018-05-18 14:13           ` Chris Wilson
  2018-05-18 14:42             ` Tvrtko Ursulin
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Wilson @ 2018-05-18 14:13 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2018-05-18 13:36:52)
> 
> On 18/05/2018 13:28, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-05-18 12:50:41)
> >>
> >> On 18/05/2018 12:10, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-05-18 12:05:17)
> >>>>
> >>>> On 18/05/2018 11:09, Chris Wilson wrote:
> >>>>> Inside the live_hangcheck (reset) selftests, we occasionally see
> >>>>> failures like
> >>>>>
> >>>>> <7>[  239.094840] i915_gem_set_wedged rcs0
> >>>>> <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
> >>>>> <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
> >>>>> <7>[  239.094848] i915_gem_set_wedged         Requests:
> >>>>> <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>>>> <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
> >>>>> <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>>>> <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
> >>>>> <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
> >>>>> <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
> >>>>> <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
> >>>>> <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
> >>>>> <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
> >>>>> <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
> >>>>> <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
> >>>>> <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
> >>>>> <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
> >>>>> <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
> >>>>> <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
> >>>>> <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
> >>>>> <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
> >>>>> <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
> >>>>> <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
> >>>>> <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
> >>>>> <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
> >>>>> <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
> >>>>> <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>>>> <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>>>> <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
> >>>>> <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>>>> <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>>>> <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
> >>>>> <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
> >>>>> <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
> >>>>> <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
> >>>>> <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
> >>>>> <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
> >>>>> <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
> >>>>>
> >>>>> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
> >>>>> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
> >>>>> try clearing it.
> >>>>>
> >>>>> v2: Only clear the bit on restarting the ring, as we want to be sure the
> >>>>> STOP_RING bit is kept if reset fails on wedging.
> >>>>> v3: Spot when the ring state doesn't make sense when re-initialising the
> >>>>> engine and dump it to the logs so that we don't have to wait for an
> >>>>> error later and try to guess what happened earlier.
> >>>>> v4: Prepare to print all the unexpected state, not just the first.
> >>>>>
> >>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >>>>> ---
> >>>>>     drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
> >>>>>     1 file changed, 22 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >>>>> index 3744f5750624..ba8411ba4abf 100644
> >>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
> >>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>>>> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>>>                 I915_WRITE(RING_MODE_GEN7(engine),
> >>>>>                            _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
> >>>>>     
> >>>>> +     I915_WRITE(RING_MI_MODE(engine->mmio_base),
> >>>>> +                _MASKED_BIT_DISABLE(STOP_RING));
> >>>>
> >>>> Worries me a bit to clear it unconditionally since documentation says
> >>>> nothing (that I can find) about this scenario.
> >>>>
> >>>>> +
> >>>>>         I915_WRITE(RING_HWS_PGA(engine->mmio_base),
> >>>>>                    engine->status_page.ggtt_offset);
> >>>>>         POSTING_READ(RING_HWS_PGA(engine->mmio_base));
> >>>>> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>>>         engine->execlists.csb_head = -1;
> >>>>>     }
> >>>>>     
> >>>>> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
> >>>>> +{
> >>>>> +     struct drm_i915_private *dev_priv = engine->i915;
> >>>>> +     bool unexpected = false;
> >>>>> +
> >>>>> +     if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
> >>>>> +             DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
> >>>>
> >>>> Also worries me to bury this as a debug message unless we can find in
> >>>> documentation that this is known to happen occasionally.
> >>>>
> >>>> If we had the check and logging immediately after reset, with a more
> >>>> visible log, we could build a larger data set of reported failures and
> >>>> then go back to hw people and ask them about it.
> >>>
> >>> The problem with the reset side is trying to avoid mixing in the case
> >>> where we are faking it and expect it to still be set. On the ringbuffer
> >>> path, we handle STOP_RING as being part of the init sequence. The log
> >>> already says we did a reset and STOP_RING is set (since we have tracing
> >>> through the reset), I just need to find the right spot to handle it.
> >>
> >> The "distance" in code paths between reset and init_hw is probably the
> >> smallest of my concerns. If we could make this message not debug only,
> >> and clear STOP_RING only conditionally I think I would be happy with
> >> giving this a go and see if it collects any data.
> > 
> > Why? Who else but us are going to read it, and why bother if it is not
> > actionable. It is logged (was logged already, now highlighted) and for
> > the user none of it is relevant, the only thing that matters to them is
> > that they don't even notice the system suffered a GPU hang.
> 
> My reasoning is this - something has happened which is completely 
> unexpected and may have other negative effect which are currently 
> unknown to us. By logging it at a higher level we collect this now very 
> specific signature on a wider collection of platforms and can then act 
> on the collected data set. Also very specific log message makes bug 
> triage easier and a lower chance it gets bunched up with other hang reports.

You do have a very specific debug message, but having a flip-flopping
test that has no impact on system stability (as evidenced by the lack of
errors, if there was a subsequent error it is imperative that the post
mortem capture does its best, everything else is lost) does not help us.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 14:13           ` Chris Wilson
@ 2018-05-18 14:42             ` Tvrtko Ursulin
  2018-05-19  9:04               ` Chris Wilson
  0 siblings, 1 reply; 20+ messages in thread
From: Tvrtko Ursulin @ 2018-05-18 14:42 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 18/05/2018 15:13, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-05-18 13:36:52)
>>
>> On 18/05/2018 13:28, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-05-18 12:50:41)
>>>>
>>>> On 18/05/2018 12:10, Chris Wilson wrote:
>>>>> Quoting Tvrtko Ursulin (2018-05-18 12:05:17)
>>>>>>
>>>>>> On 18/05/2018 11:09, Chris Wilson wrote:
>>>>>>> Inside the live_hangcheck (reset) selftests, we occasionally see
>>>>>>> failures like
>>>>>>>
>>>>>>> <7>[  239.094840] i915_gem_set_wedged rcs0
>>>>>>> <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
>>>>>>> <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
>>>>>>> <7>[  239.094848] i915_gem_set_wedged         Requests:
>>>>>>> <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>>>>>> <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
>>>>>>> <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>>>>>> <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
>>>>>>> <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
>>>>>>> <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
>>>>>>> <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
>>>>>>> <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
>>>>>>> <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
>>>>>>> <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
>>>>>>> <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
>>>>>>> <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
>>>>>>> <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
>>>>>>> <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
>>>>>>> <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
>>>>>>> <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
>>>>>>> <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
>>>>>>> <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
>>>>>>> <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
>>>>>>> <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
>>>>>>> <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
>>>>>>> <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
>>>>>>> <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>>>>>> <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>>>>>> <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
>>>>>>> <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>>>>>> <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>>>>>> <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
>>>>>>> <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
>>>>>>> <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
>>>>>>> <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
>>>>>>> <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
>>>>>>> <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
>>>>>>> <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
>>>>>>>
>>>>>>> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
>>>>>>> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
>>>>>>> try clearing it.
>>>>>>>
>>>>>>> v2: Only clear the bit on restarting the ring, as we want to be sure the
>>>>>>> STOP_RING bit is kept if reset fails on wedging.
>>>>>>> v3: Spot when the ring state doesn't make sense when re-initialising the
>>>>>>> engine and dump it to the logs so that we don't have to wait for an
>>>>>>> error later and try to guess what happened earlier.
>>>>>>> v4: Prepare to print all the unexpected state, not just the first.
>>>>>>>
>>>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>>>>>> ---
>>>>>>>      drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
>>>>>>>      1 file changed, 22 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>>>>>>> index 3744f5750624..ba8411ba4abf 100644
>>>>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>>>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>>>>>> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
>>>>>>>                  I915_WRITE(RING_MODE_GEN7(engine),
>>>>>>>                             _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
>>>>>>>      
>>>>>>> +     I915_WRITE(RING_MI_MODE(engine->mmio_base),
>>>>>>> +                _MASKED_BIT_DISABLE(STOP_RING));
>>>>>>
>>>>>> Worries me a bit to clear it unconditionally since documentation says
>>>>>> nothing (that I can find) about this scenario.
>>>>>>
>>>>>>> +
>>>>>>>          I915_WRITE(RING_HWS_PGA(engine->mmio_base),
>>>>>>>                     engine->status_page.ggtt_offset);
>>>>>>>          POSTING_READ(RING_HWS_PGA(engine->mmio_base));
>>>>>>> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
>>>>>>>          engine->execlists.csb_head = -1;
>>>>>>>      }
>>>>>>>      
>>>>>>> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
>>>>>>> +{
>>>>>>> +     struct drm_i915_private *dev_priv = engine->i915;
>>>>>>> +     bool unexpected = false;
>>>>>>> +
>>>>>>> +     if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
>>>>>>> +             DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
>>>>>>
>>>>>> Also worries me to bury this as a debug message unless we can find in
>>>>>> documentation that this is known to happen occasionally.
>>>>>>
>>>>>> If we had the check and logging immediately after reset, with a more
>>>>>> visible log, we could build a larger data set of reported failures and
>>>>>> then go back to hw people and ask them about it.
>>>>>
>>>>> The problem with the reset side is trying to avoid mixing in the case
>>>>> where we are faking it and expect it to still be set. On the ringbuffer
>>>>> path, we handle STOP_RING as being part of the init sequence. The log
>>>>> already says we did a reset and STOP_RING is set (since we have tracing
>>>>> through the reset), I just need to find the right spot to handle it.
>>>>
>>>> The "distance" in code paths between reset and init_hw is probably the
>>>> smallest of my concerns. If we could make this message not debug only,
>>>> and clear STOP_RING only conditionally I think I would be happy with
>>>> giving this a go and see if it collects any data.
>>>
>>> Why? Who else but us are going to read it, and why bother if it is not
>>> actionable. It is logged (was logged already, now highlighted) and for
>>> the user none of it is relevant, the only thing that matters to them is
>>> that they don't even notice the system suffered a GPU hang.
>>
>> My reasoning is this - something has happened which is completely
>> unexpected and may have other negative effect which are currently
>> unknown to us. By logging it at a higher level we collect this now very
>> specific signature on a wider collection of platforms and can then act
>> on the collected data set. Also very specific log message makes bug
>> triage easier and a lower chance it gets bunched up with other hang reports.
> 
> You do have a very specific debug message, but having a flip-flopping
> test that has no impact on system stability (as evidenced by the lack of
> errors, if there was a subsequent error it is imperative that the post
> mortem capture does its best, everything else is lost) does not help us.

What if we deploy this as DRM_NOTICE/INFO, get reports from the field 
and not just the selftest, and then realize the issue is actually 
interesting and could be fed back to HW design?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* ✗ Fi.CI.IGT: failure for drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 10:09 [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset Chris Wilson
                   ` (2 preceding siblings ...)
  2018-05-18 11:05 ` [PATCH v4] " Tvrtko Ursulin
@ 2018-05-18 14:43 ` Patchwork
  2018-05-18 16:14 ` ✗ Fi.CI.CHECKPATCH: warning " Patchwork
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Patchwork @ 2018-05-18 14:43 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
URL   : https://patchwork.freedesktop.org/series/43404/
State : failure

== Summary ==

= CI Bug Log - changes from CI_DRM_4205_full -> Patchwork_9047_full =

== Summary - FAILURE ==

  Serious unknown changes coming with Patchwork_9047_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9047_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/43404/revisions/1/mbox/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9047_full:

  === IGT changes ===

    ==== Possible regressions ====

    igt@gem_exec_await@wide-contexts:
      shard-apl:          PASS -> FAIL

    
    ==== Warnings ====

    igt@gem_exec_schedule@deep-bsd2:
      shard-kbl:          SKIP -> PASS +1

    igt@gem_exec_schedule@deep-vebox:
      shard-kbl:          PASS -> SKIP +1

    igt@kms_plane_lowres@pipe-c-tiling-x:
      shard-apl:          SKIP -> PASS

    
== Known issues ==

  Here are the changes found in Patchwork_9047_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_flip@dpms-vs-vblank-race:
      shard-hsw:          PASS -> FAIL (fdo#103060)

    igt@kms_flip@flip-vs-blocking-wf-vblank:
      shard-hsw:          PASS -> FAIL (fdo#100368)

    igt@kms_flip@flip-vs-expired-vblank-interruptible:
      shard-glk:          PASS -> FAIL (fdo#105363)

    igt@kms_flip@plain-flip-ts-check-interruptible:
      shard-glk:          PASS -> FAIL (fdo#100368) +1

    igt@kms_flip_tiling@flip-y-tiled:
      shard-glk:          PASS -> FAIL (fdo#103822, fdo#104724)

    igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-shrfb-draw-mmap-wc:
      shard-glk:          PASS -> FAIL (fdo#103167, fdo#104724)

    igt@kms_setmode@basic:
      shard-kbl:          PASS -> FAIL (fdo#99912)

    
    ==== Possible fixes ====

    igt@kms_flip@2x-plain-flip-fb-recreate-interruptible:
      shard-hsw:          FAIL (fdo#100368) -> PASS

    igt@kms_flip_tiling@flip-to-x-tiled:
      shard-glk:          FAIL (fdo#104724) -> PASS

    igt@kms_flip_tiling@flip-to-y-tiled:
      shard-glk:          FAIL (fdo#103822, fdo#104724) -> PASS

    igt@kms_frontbuffer_tracking@fbc-1p-primscrn-pri-shrfb-draw-mmap-gtt:
      shard-apl:          DMESG-FAIL (fdo#105602, fdo#103558) -> PASS

    igt@kms_pipe_crc_basic@hang-read-crc-pipe-c:
      shard-apl:          DMESG-WARN (fdo#105602, fdo#103558) -> PASS +9

    igt@kms_setmode@basic:
      shard-apl:          FAIL (fdo#99912) -> PASS

    
  fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
  fdo#103060 https://bugs.freedesktop.org/show_bug.cgi?id=103060
  fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
  fdo#103558 https://bugs.freedesktop.org/show_bug.cgi?id=103558
  fdo#103822 https://bugs.freedesktop.org/show_bug.cgi?id=103822
  fdo#104724 https://bugs.freedesktop.org/show_bug.cgi?id=104724
  fdo#105363 https://bugs.freedesktop.org/show_bug.cgi?id=105363
  fdo#105602 https://bugs.freedesktop.org/show_bug.cgi?id=105602
  fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912


== Participating hosts (9 -> 9) ==

  No changes in participating hosts


== Build changes ==

    * Linux: CI_DRM_4205 -> Patchwork_9047

  CI_DRM_4205: 921a7738e856643993ea0889dd5a936c22151649 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4487: eccae1360d6d01e73c6af2bd97122cef708207ef @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9047: e01c159ce05bb1471d058ca945d5d386a2fe9512 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4487: 6ab75f7eb5e1dccbb773e1739beeb2d7cbd6ad0d @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9047/shards.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* ✗ Fi.CI.CHECKPATCH: warning for drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 10:09 [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset Chris Wilson
                   ` (3 preceding siblings ...)
  2018-05-18 14:43 ` ✗ Fi.CI.IGT: failure for " Patchwork
@ 2018-05-18 16:14 ` Patchwork
  2018-05-18 16:29 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 20+ messages in thread
From: Patchwork @ 2018-05-18 16:14 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
URL   : https://patchwork.freedesktop.org/series/43404/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
100fe331b0d3 drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
-:11: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#11: 
<7>[  239.094843] i915_gem_set_wedged 	current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]

total: 0 errors, 1 warnings, 0 checks, 40 lines checked

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* ✓ Fi.CI.BAT: success for drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 10:09 [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset Chris Wilson
                   ` (4 preceding siblings ...)
  2018-05-18 16:14 ` ✗ Fi.CI.CHECKPATCH: warning " Patchwork
@ 2018-05-18 16:29 ` Patchwork
  2018-05-19  0:05 ` ✓ Fi.CI.IGT: " Patchwork
  2018-05-19  9:05 ` [PATCH v4] " Chris Wilson
  7 siblings, 0 replies; 20+ messages in thread
From: Patchwork @ 2018-05-18 16:29 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
URL   : https://patchwork.freedesktop.org/series/43404/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4209 -> Patchwork_9055 =

== Summary - WARNING ==

  Minor unknown changes coming with Patchwork_9055 need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9055, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/43404/revisions/1/mbox/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9055:

  === IGT changes ===

    ==== Warnings ====

    igt@gem_exec_gttfill@basic:
      fi-pnv-d510:        PASS -> SKIP

    
== Known issues ==

  Here are the changes found in Patchwork_9055 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_flip@basic-flip-vs-dpms:
      fi-hsw-4770r:       PASS -> DMESG-WARN (fdo#105602)

    igt@kms_frontbuffer_tracking@basic:
      fi-hsw-peppy:       PASS -> DMESG-FAIL (fdo#102614, fdo#106103)

    
    ==== Possible fixes ====

    igt@gem_mmap_gtt@basic-small-bo-tiledx:
      fi-gdg-551:         FAIL (fdo#102575) -> PASS

    igt@kms_pipe_crc_basic@read-crc-pipe-b-frame-sequence:
      fi-hsw-4200u:       FAIL (fdo#103481) -> PASS

    
  fdo#102575 https://bugs.freedesktop.org/show_bug.cgi?id=102575
  fdo#102614 https://bugs.freedesktop.org/show_bug.cgi?id=102614
  fdo#103481 https://bugs.freedesktop.org/show_bug.cgi?id=103481
  fdo#105602 https://bugs.freedesktop.org/show_bug.cgi?id=105602
  fdo#106103 https://bugs.freedesktop.org/show_bug.cgi?id=106103


== Participating hosts (43 -> 39) ==

  Missing    (4): fi-ilk-m540 fi-byt-squawks fi-bsw-cyan fi-skl-6700hq 


== Build changes ==

    * Linux: CI_DRM_4209 -> Patchwork_9055

  CI_DRM_4209: eecb2c1e793ed98c39876c92fc64cd18a7fe6412 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4487: eccae1360d6d01e73c6af2bd97122cef708207ef @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9055: 100fe331b0d389dd375e5408acbf49221c972372 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4487: 6ab75f7eb5e1dccbb773e1739beeb2d7cbd6ad0d @ git://anongit.freedesktop.org/piglit


== Linux commits ==

100fe331b0d3 drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9055/issues.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* ✓ Fi.CI.IGT: success for drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 10:09 [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset Chris Wilson
                   ` (5 preceding siblings ...)
  2018-05-18 16:29 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-05-19  0:05 ` Patchwork
  2018-05-19  9:05 ` [PATCH v4] " Chris Wilson
  7 siblings, 0 replies; 20+ messages in thread
From: Patchwork @ 2018-05-19  0:05 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
URL   : https://patchwork.freedesktop.org/series/43404/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4209_full -> Patchwork_9055_full =

== Summary - WARNING ==

  Minor unknown changes coming with Patchwork_9055_full need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9055_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/43404/revisions/1/mbox/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9055_full:

  === IGT changes ===

    ==== Warnings ====

    igt@gem_exec_schedule@deep-blt:
      shard-kbl:          PASS -> SKIP

    igt@kms_force_connector_basic@prune-stale-modes:
      shard-hsw:          SKIP -> PASS

    igt@kms_plane_lowres@pipe-c-tiling-x:
      shard-apl:          SKIP -> PASS

    
== Known issues ==

  Here are the changes found in Patchwork_9055_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_flip@flip-vs-wf_vblank-interruptible:
      shard-hsw:          PASS -> FAIL (fdo#103928)

    igt@kms_flip@plain-flip-fb-recreate:
      shard-hsw:          PASS -> FAIL (fdo#100368)

    igt@kms_flip@plain-flip-fb-recreate-interruptible:
      shard-glk:          PASS -> FAIL (fdo#103928)

    igt@kms_flip_tiling@flip-x-tiled:
      shard-glk:          PASS -> FAIL (fdo#103822, fdo#104724)

    igt@kms_frontbuffer_tracking@fbc-1p-primscrn-pri-shrfb-draw-mmap-gtt:
      shard-kbl:          PASS -> DMESG-WARN (fdo#106247)

    
    ==== Possible fixes ====

    igt@drv_selftest@live_hangcheck:
      shard-kbl:          DMESG-FAIL (fdo#106560) -> PASS

    igt@kms_cursor_legacy@2x-nonblocking-modeset-vs-cursor-atomic:
      shard-glk:          FAIL (fdo#106509, fdo#105454) -> PASS

    igt@kms_flip@2x-plain-flip-fb-recreate-interruptible:
      shard-glk:          FAIL (fdo#100368) -> PASS

    igt@kms_flip_tiling@flip-to-x-tiled:
      shard-glk:          FAIL (fdo#104724) -> PASS

    igt@kms_flip_tiling@flip-y-tiled:
      shard-glk:          FAIL (fdo#103822, fdo#104724) -> PASS

    igt@kms_frontbuffer_tracking@fbc-1p-primscrn-pri-shrfb-draw-mmap-gtt:
      shard-apl:          DMESG-FAIL (fdo#105602, fdo#103558) -> PASS

    igt@kms_frontbuffer_tracking@fbc-2p-primscrn-pri-shrfb-draw-mmap-wc:
      shard-glk:          FAIL (fdo#104724, fdo#103167) -> PASS +2

    igt@kms_pipe_crc_basic@hang-read-crc-pipe-c:
      shard-apl:          DMESG-WARN (fdo#105602, fdo#103558) -> PASS +9

    igt@kms_setmode@basic:
      shard-hsw:          FAIL (fdo#99912) -> PASS

    
  fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
  fdo#103167 https://bugs.freedesktop.org/show_bug.cgi?id=103167
  fdo#103558 https://bugs.freedesktop.org/show_bug.cgi?id=103558
  fdo#103822 https://bugs.freedesktop.org/show_bug.cgi?id=103822
  fdo#103928 https://bugs.freedesktop.org/show_bug.cgi?id=103928
  fdo#104724 https://bugs.freedesktop.org/show_bug.cgi?id=104724
  fdo#105454 https://bugs.freedesktop.org/show_bug.cgi?id=105454
  fdo#105602 https://bugs.freedesktop.org/show_bug.cgi?id=105602
  fdo#106247 https://bugs.freedesktop.org/show_bug.cgi?id=106247
  fdo#106509 https://bugs.freedesktop.org/show_bug.cgi?id=106509
  fdo#106560 https://bugs.freedesktop.org/show_bug.cgi?id=106560
  fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912


== Participating hosts (9 -> 9) ==

  No changes in participating hosts


== Build changes ==

    * Linux: CI_DRM_4209 -> Patchwork_9055

  CI_DRM_4209: eecb2c1e793ed98c39876c92fc64cd18a7fe6412 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4487: eccae1360d6d01e73c6af2bd97122cef708207ef @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9055: 100fe331b0d389dd375e5408acbf49221c972372 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4487: 6ab75f7eb5e1dccbb773e1739beeb2d7cbd6ad0d @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9055/shards.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 14:42             ` Tvrtko Ursulin
@ 2018-05-19  9:04               ` Chris Wilson
  2018-05-24 13:34                 ` Tvrtko Ursulin
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Wilson @ 2018-05-19  9:04 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2018-05-18 15:42:00)
> 
> On 18/05/2018 15:13, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-05-18 13:36:52)
> >>
> >> On 18/05/2018 13:28, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-05-18 12:50:41)
> >>>>
> >>>> On 18/05/2018 12:10, Chris Wilson wrote:
> >>>>> Quoting Tvrtko Ursulin (2018-05-18 12:05:17)
> >>>>>>
> >>>>>> On 18/05/2018 11:09, Chris Wilson wrote:
> >>>>>>> Inside the live_hangcheck (reset) selftests, we occasionally see
> >>>>>>> failures like
> >>>>>>>
> >>>>>>> <7>[  239.094840] i915_gem_set_wedged rcs0
> >>>>>>> <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
> >>>>>>> <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
> >>>>>>> <7>[  239.094848] i915_gem_set_wedged         Requests:
> >>>>>>> <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>>>>>> <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
> >>>>>>> <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>>>>>> <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
> >>>>>>> <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
> >>>>>>> <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
> >>>>>>> <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
> >>>>>>> <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
> >>>>>>> <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
> >>>>>>> <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
> >>>>>>> <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
> >>>>>>> <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
> >>>>>>> <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
> >>>>>>> <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
> >>>>>>> <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
> >>>>>>> <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
> >>>>>>> <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
> >>>>>>> <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
> >>>>>>> <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
> >>>>>>> <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
> >>>>>>> <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
> >>>>>>> <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
> >>>>>>> <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>>>>>> <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>>>>>> <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
> >>>>>>> <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>>>>>> <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>>>>>> <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
> >>>>>>> <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
> >>>>>>> <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
> >>>>>>> <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
> >>>>>>> <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
> >>>>>>> <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
> >>>>>>> <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
> >>>>>>>
> >>>>>>> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
> >>>>>>> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
> >>>>>>> try clearing it.
> >>>>>>>
> >>>>>>> v2: Only clear the bit on restarting the ring, as we want to be sure the
> >>>>>>> STOP_RING bit is kept if reset fails on wedging.
> >>>>>>> v3: Spot when the ring state doesn't make sense when re-initialising the
> >>>>>>> engine and dump it to the logs so that we don't have to wait for an
> >>>>>>> error later and try to guess what happened earlier.
> >>>>>>> v4: Prepare to print all the unexpected state, not just the first.
> >>>>>>>
> >>>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >>>>>>> ---
> >>>>>>>      drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
> >>>>>>>      1 file changed, 22 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>> index 3744f5750624..ba8411ba4abf 100644
> >>>>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>>>>>                  I915_WRITE(RING_MODE_GEN7(engine),
> >>>>>>>                             _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
> >>>>>>>      
> >>>>>>> +     I915_WRITE(RING_MI_MODE(engine->mmio_base),
> >>>>>>> +                _MASKED_BIT_DISABLE(STOP_RING));
> >>>>>>
> >>>>>> Worries me a bit to clear it unconditionally since documentation says
> >>>>>> nothing (that I can find) about this scenario.
> >>>>>>
> >>>>>>> +
> >>>>>>>          I915_WRITE(RING_HWS_PGA(engine->mmio_base),
> >>>>>>>                     engine->status_page.ggtt_offset);
> >>>>>>>          POSTING_READ(RING_HWS_PGA(engine->mmio_base));
> >>>>>>> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>>>>>          engine->execlists.csb_head = -1;
> >>>>>>>      }
> >>>>>>>      
> >>>>>>> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
> >>>>>>> +{
> >>>>>>> +     struct drm_i915_private *dev_priv = engine->i915;
> >>>>>>> +     bool unexpected = false;
> >>>>>>> +
> >>>>>>> +     if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
> >>>>>>> +             DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
> >>>>>>
> >>>>>> Also worries me to bury this as a debug message unless we can find in
> >>>>>> documentation that this is known to happen occasionally.
> >>>>>>
> >>>>>> If we had the check and logging immediately after reset, with a more
> >>>>>> visible log, we could build a larger data set of reported failures and
> >>>>>> then go back to hw people and ask them about it.
> >>>>>
> >>>>> The problem with the reset side is trying to avoid mixing in the case
> >>>>> where we are faking it and expect it to still be set. On the ringbuffer
> >>>>> path, we handle STOP_RING as being part of the init sequence. The log
> >>>>> already says we did a reset and STOP_RING is set (since we have tracing
> >>>>> through the reset), I just need to find the right spot to handle it.
> >>>>
> >>>> The "distance" in code paths between reset and init_hw is probably the
> >>>> smallest of my concerns. If we could make this message not debug only,
> >>>> and clear STOP_RING only conditionally I think I would be happy with
> >>>> giving this a go and see if it collects any data.
> >>>
> >>> Why? Who else but us are going to read it, and why bother if it is not
> >>> actionable. It is logged (was logged already, now highlighted) and for
> >>> the user none of it is relevant, the only thing that matters to them is
> >>> that they don't even notice the system suffered a GPU hang.
> >>
> >> My reasoning is this - something has happened which is completely
> >> unexpected and may have other negative effect which are currently
> >> unknown to us. By logging it at a higher level we collect this now very
> >> specific signature on a wider collection of platforms and can then act
> >> on the collected data set. Also very specific log message makes bug
> >> triage easier and a lower chance it gets bunched up with other hang reports.
> > 
> > You do have a very specific debug message, but having a flip-flopping
> > test that has no impact on system stability (as evidenced by the lack of
> > errors, if there was a subsequent error it is imperative that the post
> > mortem capture does its best, everything else is lost) does not help us.
> 
> What if we deploy this as DRM_NOTICE/INFO, get reports from the field 
> and not just the selftest, and then realize the issue is actually 
> interesting and could be fed back to HW design?

But we already have enough to point out that the HW is not behaving as
we^W I would expect.

For field analysis, I strongly believe that unless you can capture
everything you need without user intervention, you will never get the
debug information required. The instructions we have are "oops,
something went wrong, please file a new bug report here and attach this
file". That's already too complicated wants to be automated, e.g. abrtd.
But for whatever reason we don't get the reports from abrtd.

Elsewhere we use the method of light periodic checks and if they spot an
issue, switch on heavy debug mode. For something like this where we want
a trace of operations as they happen? I think we need to be able to turn
on GEM tracing on the fly and be able to capture the compressed buffers.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-18 10:09 [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset Chris Wilson
                   ` (6 preceding siblings ...)
  2018-05-19  0:05 ` ✓ Fi.CI.IGT: " Patchwork
@ 2018-05-19  9:05 ` Chris Wilson
  7 siblings, 0 replies; 20+ messages in thread
From: Chris Wilson @ 2018-05-19  9:05 UTC (permalink / raw)
  To: intel-gfx

Quoting Chris Wilson (2018-05-18 11:09:33)
> Inside the live_hangcheck (reset) selftests, we occasionally see
> failures like
> 
> <7>[  239.094840] i915_gem_set_wedged rcs0
> <7>[  239.094843] i915_gem_set_wedged   current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
> <7>[  239.094846] i915_gem_set_wedged   Reset count: 6239 (global 1)
> <7>[  239.094848] i915_gem_set_wedged   Requests:
> <7>[  239.095052] i915_gem_set_wedged           first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> <7>[  239.095056] i915_gem_set_wedged           last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
> <7>[  239.095059] i915_gem_set_wedged           active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> <7>[  239.095062] i915_gem_set_wedged           [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
> <7>[  239.100050] i915_gem_set_wedged           ring->start:  0x00283000
> <7>[  239.100053] i915_gem_set_wedged           ring->head:   0x000001f8
> <7>[  239.100055] i915_gem_set_wedged           ring->tail:   0x000002a8
> <7>[  239.100057] i915_gem_set_wedged           ring->emit:   0x000002a8
> <7>[  239.100059] i915_gem_set_wedged           ring->space:  0x00000f10
> <7>[  239.100085] i915_gem_set_wedged   RING_START: 0x00283000
> <7>[  239.100088] i915_gem_set_wedged   RING_HEAD:  0x00000260
> <7>[  239.100091] i915_gem_set_wedged   RING_TAIL:  0x000002a8
> <7>[  239.100094] i915_gem_set_wedged   RING_CTL:   0x00000001
> <7>[  239.100097] i915_gem_set_wedged   RING_MODE:  0x00000300 [idle]
> <7>[  239.100100] i915_gem_set_wedged   RING_IMR: fffffefe
> <7>[  239.100104] i915_gem_set_wedged   ACTHD:  0x00000000_0000609c
> <7>[  239.100108] i915_gem_set_wedged   BBADDR: 0x00000000_0000609d
> <7>[  239.100111] i915_gem_set_wedged   DMA_FADDR: 0x00000000_00283260
> <7>[  239.100114] i915_gem_set_wedged   IPEIR: 0x00000000
> <7>[  239.100117] i915_gem_set_wedged   IPEHR: 0x02800000
> <7>[  239.100120] i915_gem_set_wedged   Execlist status: 0x00044052 00000002
> <7>[  239.100124] i915_gem_set_wedged   Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
> <7>[  239.100128] i915_gem_set_wedged           ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> <7>[  239.100132] i915_gem_set_wedged           ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> <7>[  239.100135] i915_gem_set_wedged           HW active? 0x5
> <7>[  239.100250] i915_gem_set_wedged           E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> <7>[  239.100338] i915_gem_set_wedged           E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> <7>[  239.100340] i915_gem_set_wedged           Queue priority: 139
> <7>[  239.100343] i915_gem_set_wedged           Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
> <7>[  239.100346] i915_gem_set_wedged           Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
> <7>[  239.100349] i915_gem_set_wedged           Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
> <7>[  239.100352] i915_gem_set_wedged           Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
> <7>[  239.100356] i915_gem_set_wedged           Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
> <7>[  239.100362] i915_gem_set_wedged   drv_selftest [5894] waiting for 19a99
> 
> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
> try clearing it.
> 
> v2: Only clear the bit on restarting the ring, as we want to be sure the
> STOP_RING bit is kept if reset fails on wedging.
> v3: Spot when the ring state doesn't make sense when re-initialising the
> engine and dump it to the logs so that we don't have to wait for an
> error later and try to guess what happened earlier.
> v4: Prepare to print all the unexpected state, not just the first.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>

This has gone through about 10 times without a live_hangcheck fail,
which is significantly better than the current state of affairs.
-Chris

> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 3744f5750624..ba8411ba4abf 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
>                 I915_WRITE(RING_MODE_GEN7(engine),
>                            _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
>  
> +       I915_WRITE(RING_MI_MODE(engine->mmio_base),
> +                  _MASKED_BIT_DISABLE(STOP_RING));
> +
>         I915_WRITE(RING_HWS_PGA(engine->mmio_base),
>                    engine->status_page.ggtt_offset);
>         POSTING_READ(RING_HWS_PGA(engine->mmio_base));
> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
>         engine->execlists.csb_head = -1;
>  }
>  
> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
> +{
> +       struct drm_i915_private *dev_priv = engine->i915;
> +       bool unexpected = false;
> +
> +       if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
> +               DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
> +               unexpected = true;
> +       }
> +
> +       return unexpected;
> +}
> +
>  static int gen8_init_common_ring(struct intel_engine_cs *engine)
>  {
>         struct intel_engine_execlists * const execlists = &engine->execlists;
> @@ -1801,6 +1817,12 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
>         intel_engine_reset_breadcrumbs(engine);
>         intel_engine_init_hangcheck(engine);
>  
> +       if (GEM_SHOW_DEBUG() && unexpected_starting_state(engine)) {
> +               struct drm_printer p = drm_debug_printer(__func__);
> +
> +               intel_engine_dump(engine, &p, NULL);
> +       }
> +
>         enable_execlists(engine);
>  
>         /* After a GPU reset, we may have requests to replay */
> -- 
> 2.17.0
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-19  9:04               ` Chris Wilson
@ 2018-05-24 13:34                 ` Tvrtko Ursulin
  2018-05-24 13:40                   ` Chris Wilson
  0 siblings, 1 reply; 20+ messages in thread
From: Tvrtko Ursulin @ 2018-05-24 13:34 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 19/05/2018 10:04, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-05-18 15:42:00)
>>
>> On 18/05/2018 15:13, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-05-18 13:36:52)
>>>>
>>>> On 18/05/2018 13:28, Chris Wilson wrote:
>>>>> Quoting Tvrtko Ursulin (2018-05-18 12:50:41)
>>>>>>
>>>>>> On 18/05/2018 12:10, Chris Wilson wrote:
>>>>>>> Quoting Tvrtko Ursulin (2018-05-18 12:05:17)
>>>>>>>>
>>>>>>>> On 18/05/2018 11:09, Chris Wilson wrote:
>>>>>>>>> Inside the live_hangcheck (reset) selftests, we occasionally see
>>>>>>>>> failures like
>>>>>>>>>
>>>>>>>>> <7>[  239.094840] i915_gem_set_wedged rcs0
>>>>>>>>> <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
>>>>>>>>> <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
>>>>>>>>> <7>[  239.094848] i915_gem_set_wedged         Requests:
>>>>>>>>> <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>>>>>>>> <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
>>>>>>>>> <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>>>>>>>> <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
>>>>>>>>> <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
>>>>>>>>> <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
>>>>>>>>> <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
>>>>>>>>> <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
>>>>>>>>> <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
>>>>>>>>> <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
>>>>>>>>> <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
>>>>>>>>> <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
>>>>>>>>> <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
>>>>>>>>> <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
>>>>>>>>> <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
>>>>>>>>> <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
>>>>>>>>> <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
>>>>>>>>> <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
>>>>>>>>> <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
>>>>>>>>> <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
>>>>>>>>> <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
>>>>>>>>> <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
>>>>>>>>> <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>>>>>>>> <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>>>>>>>> <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
>>>>>>>>> <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>>>>>>>> <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>>>>>>>> <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
>>>>>>>>> <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
>>>>>>>>> <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
>>>>>>>>> <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
>>>>>>>>> <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
>>>>>>>>> <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
>>>>>>>>> <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
>>>>>>>>>
>>>>>>>>> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
>>>>>>>>> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
>>>>>>>>> try clearing it.
>>>>>>>>>
>>>>>>>>> v2: Only clear the bit on restarting the ring, as we want to be sure the
>>>>>>>>> STOP_RING bit is kept if reset fails on wedging.
>>>>>>>>> v3: Spot when the ring state doesn't make sense when re-initialising the
>>>>>>>>> engine and dump it to the logs so that we don't have to wait for an
>>>>>>>>> error later and try to guess what happened earlier.
>>>>>>>>> v4: Prepare to print all the unexpected state, not just the first.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>>>>>>>> ---
>>>>>>>>>       drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
>>>>>>>>>       1 file changed, 22 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>>>>>>>>> index 3744f5750624..ba8411ba4abf 100644
>>>>>>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>>>>>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>>>>>>>> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
>>>>>>>>>                   I915_WRITE(RING_MODE_GEN7(engine),
>>>>>>>>>                              _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
>>>>>>>>>       
>>>>>>>>> +     I915_WRITE(RING_MI_MODE(engine->mmio_base),
>>>>>>>>> +                _MASKED_BIT_DISABLE(STOP_RING));
>>>>>>>>
>>>>>>>> Worries me a bit to clear it unconditionally since documentation says
>>>>>>>> nothing (that I can find) about this scenario.
>>>>>>>>
>>>>>>>>> +
>>>>>>>>>           I915_WRITE(RING_HWS_PGA(engine->mmio_base),
>>>>>>>>>                      engine->status_page.ggtt_offset);
>>>>>>>>>           POSTING_READ(RING_HWS_PGA(engine->mmio_base));
>>>>>>>>> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
>>>>>>>>>           engine->execlists.csb_head = -1;
>>>>>>>>>       }
>>>>>>>>>       
>>>>>>>>> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
>>>>>>>>> +{
>>>>>>>>> +     struct drm_i915_private *dev_priv = engine->i915;
>>>>>>>>> +     bool unexpected = false;
>>>>>>>>> +
>>>>>>>>> +     if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
>>>>>>>>> +             DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
>>>>>>>>
>>>>>>>> Also worries me to bury this as a debug message unless we can find in
>>>>>>>> documentation that this is known to happen occasionally.
>>>>>>>>
>>>>>>>> If we had the check and logging immediately after reset, with a more
>>>>>>>> visible log, we could build a larger data set of reported failures and
>>>>>>>> then go back to hw people and ask them about it.
>>>>>>>
>>>>>>> The problem with the reset side is trying to avoid mixing in the case
>>>>>>> where we are faking it and expect it to still be set. On the ringbuffer
>>>>>>> path, we handle STOP_RING as being part of the init sequence. The log
>>>>>>> already says we did a reset and STOP_RING is set (since we have tracing
>>>>>>> through the reset), I just need to find the right spot to handle it.
>>>>>>
>>>>>> The "distance" in code paths between reset and init_hw is probably the
>>>>>> smallest of my concerns. If we could make this message not debug only,
>>>>>> and clear STOP_RING only conditionally I think I would be happy with
>>>>>> giving this a go and see if it collects any data.
>>>>>
>>>>> Why? Who else but us are going to read it, and why bother if it is not
>>>>> actionable. It is logged (was logged already, now highlighted) and for
>>>>> the user none of it is relevant, the only thing that matters to them is
>>>>> that they don't even notice the system suffered a GPU hang.
>>>>
>>>> My reasoning is this - something has happened which is completely
>>>> unexpected and may have other negative effect which are currently
>>>> unknown to us. By logging it at a higher level we collect this now very
>>>> specific signature on a wider collection of platforms and can then act
>>>> on the collected data set. Also very specific log message makes bug
>>>> triage easier and a lower chance it gets bunched up with other hang reports.
>>>
>>> You do have a very specific debug message, but having a flip-flopping
>>> test that has no impact on system stability (as evidenced by the lack of
>>> errors, if there was a subsequent error it is imperative that the post
>>> mortem capture does its best, everything else is lost) does not help us.
>>
>> What if we deploy this as DRM_NOTICE/INFO, get reports from the field
>> and not just the selftest, and then realize the issue is actually
>> interesting and could be fed back to HW design?
> 
> But we already have enough to point out that the HW is not behaving as
> we^W I would expect.
> 
> For field analysis, I strongly believe that unless you can capture
> everything you need without user intervention, you will never get the
> debug information required. The instructions we have are "oops,
> something went wrong, please file a new bug report here and attach this
> file". That's already too complicated wants to be automated, e.g. abrtd.
> But for whatever reason we don't get the reports from abrtd.
> 
> Elsewhere we use the method of light periodic checks and if they spot an
> issue, switch on heavy debug mode. For something like this where we want
> a trace of operations as they happen? I think we need to be able to turn
> on GEM tracing on the fly and be able to capture the compressed buffers.

Not sure that we need a trace of pre-events for this. It seems pretty 
straight-forward to me. Register is not in the state in which it should 
be immediately after reset.

It will help sort the hang reports into a new bucket.

And won't create any more bug reports since after a message always comes 
a hang.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-24 13:34                 ` Tvrtko Ursulin
@ 2018-05-24 13:40                   ` Chris Wilson
  2018-05-25  8:36                     ` Tvrtko Ursulin
  0 siblings, 1 reply; 20+ messages in thread
From: Chris Wilson @ 2018-05-24 13:40 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2018-05-24 14:34:41)
> 
> On 19/05/2018 10:04, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-05-18 15:42:00)
> >>
> >> On 18/05/2018 15:13, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-05-18 13:36:52)
> >>>>
> >>>> On 18/05/2018 13:28, Chris Wilson wrote:
> >>>>> Quoting Tvrtko Ursulin (2018-05-18 12:50:41)
> >>>>>>
> >>>>>> On 18/05/2018 12:10, Chris Wilson wrote:
> >>>>>>> Quoting Tvrtko Ursulin (2018-05-18 12:05:17)
> >>>>>>>>
> >>>>>>>> On 18/05/2018 11:09, Chris Wilson wrote:
> >>>>>>>>> Inside the live_hangcheck (reset) selftests, we occasionally see
> >>>>>>>>> failures like
> >>>>>>>>>
> >>>>>>>>> <7>[  239.094840] i915_gem_set_wedged rcs0
> >>>>>>>>> <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
> >>>>>>>>> <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
> >>>>>>>>> <7>[  239.094848] i915_gem_set_wedged         Requests:
> >>>>>>>>> <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>>>>>>>> <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
> >>>>>>>>> <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>>>>>>>> <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
> >>>>>>>>> <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
> >>>>>>>>> <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
> >>>>>>>>> <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
> >>>>>>>>> <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
> >>>>>>>>> <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
> >>>>>>>>> <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
> >>>>>>>>> <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
> >>>>>>>>> <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
> >>>>>>>>> <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
> >>>>>>>>> <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
> >>>>>>>>> <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
> >>>>>>>>> <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
> >>>>>>>>> <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
> >>>>>>>>> <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
> >>>>>>>>> <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
> >>>>>>>>> <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
> >>>>>>>>> <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
> >>>>>>>>> <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
> >>>>>>>>> <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>>>>>>>> <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>>>>>>>> <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
> >>>>>>>>> <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>>>>>>>> <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>>>>>>>> <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
> >>>>>>>>> <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
> >>>>>>>>> <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
> >>>>>>>>> <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
> >>>>>>>>> <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
> >>>>>>>>> <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
> >>>>>>>>> <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
> >>>>>>>>>
> >>>>>>>>> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
> >>>>>>>>> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
> >>>>>>>>> try clearing it.
> >>>>>>>>>
> >>>>>>>>> v2: Only clear the bit on restarting the ring, as we want to be sure the
> >>>>>>>>> STOP_RING bit is kept if reset fails on wedging.
> >>>>>>>>> v3: Spot when the ring state doesn't make sense when re-initialising the
> >>>>>>>>> engine and dump it to the logs so that we don't have to wait for an
> >>>>>>>>> error later and try to guess what happened earlier.
> >>>>>>>>> v4: Prepare to print all the unexpected state, not just the first.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >>>>>>>>> ---
> >>>>>>>>>       drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
> >>>>>>>>>       1 file changed, 22 insertions(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>>>> index 3744f5750624..ba8411ba4abf 100644
> >>>>>>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>>>> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>>>>>>>                   I915_WRITE(RING_MODE_GEN7(engine),
> >>>>>>>>>                              _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
> >>>>>>>>>       
> >>>>>>>>> +     I915_WRITE(RING_MI_MODE(engine->mmio_base),
> >>>>>>>>> +                _MASKED_BIT_DISABLE(STOP_RING));
> >>>>>>>>
> >>>>>>>> Worries me a bit to clear it unconditionally since documentation says
> >>>>>>>> nothing (that I can find) about this scenario.
> >>>>>>>>
> >>>>>>>>> +
> >>>>>>>>>           I915_WRITE(RING_HWS_PGA(engine->mmio_base),
> >>>>>>>>>                      engine->status_page.ggtt_offset);
> >>>>>>>>>           POSTING_READ(RING_HWS_PGA(engine->mmio_base));
> >>>>>>>>> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>>>>>>>           engine->execlists.csb_head = -1;
> >>>>>>>>>       }
> >>>>>>>>>       
> >>>>>>>>> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
> >>>>>>>>> +{
> >>>>>>>>> +     struct drm_i915_private *dev_priv = engine->i915;
> >>>>>>>>> +     bool unexpected = false;
> >>>>>>>>> +
> >>>>>>>>> +     if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
> >>>>>>>>> +             DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
> >>>>>>>>
> >>>>>>>> Also worries me to bury this as a debug message unless we can find in
> >>>>>>>> documentation that this is known to happen occasionally.
> >>>>>>>>
> >>>>>>>> If we had the check and logging immediately after reset, with a more
> >>>>>>>> visible log, we could build a larger data set of reported failures and
> >>>>>>>> then go back to hw people and ask them about it.
> >>>>>>>
> >>>>>>> The problem with the reset side is trying to avoid mixing in the case
> >>>>>>> where we are faking it and expect it to still be set. On the ringbuffer
> >>>>>>> path, we handle STOP_RING as being part of the init sequence. The log
> >>>>>>> already says we did a reset and STOP_RING is set (since we have tracing
> >>>>>>> through the reset), I just need to find the right spot to handle it.
> >>>>>>
> >>>>>> The "distance" in code paths between reset and init_hw is probably the
> >>>>>> smallest of my concerns. If we could make this message not debug only,
> >>>>>> and clear STOP_RING only conditionally I think I would be happy with
> >>>>>> giving this a go and see if it collects any data.
> >>>>>
> >>>>> Why? Who else but us are going to read it, and why bother if it is not
> >>>>> actionable. It is logged (was logged already, now highlighted) and for
> >>>>> the user none of it is relevant, the only thing that matters to them is
> >>>>> that they don't even notice the system suffered a GPU hang.
> >>>>
> >>>> My reasoning is this - something has happened which is completely
> >>>> unexpected and may have other negative effect which are currently
> >>>> unknown to us. By logging it at a higher level we collect this now very
> >>>> specific signature on a wider collection of platforms and can then act
> >>>> on the collected data set. Also very specific log message makes bug
> >>>> triage easier and a lower chance it gets bunched up with other hang reports.
> >>>
> >>> You do have a very specific debug message, but having a flip-flopping
> >>> test that has no impact on system stability (as evidenced by the lack of
> >>> errors, if there was a subsequent error it is imperative that the post
> >>> mortem capture does its best, everything else is lost) does not help us.
> >>
> >> What if we deploy this as DRM_NOTICE/INFO, get reports from the field
> >> and not just the selftest, and then realize the issue is actually
> >> interesting and could be fed back to HW design?
> > 
> > But we already have enough to point out that the HW is not behaving as
> > we^W I would expect.
> > 
> > For field analysis, I strongly believe that unless you can capture
> > everything you need without user intervention, you will never get the
> > debug information required. The instructions we have are "oops,
> > something went wrong, please file a new bug report here and attach this
> > file". That's already too complicated wants to be automated, e.g. abrtd.
> > But for whatever reason we don't get the reports from abrtd.
> > 
> > Elsewhere we use the method of light periodic checks and if they spot an
> > issue, switch on heavy debug mode. For something like this where we want
> > a trace of operations as they happen? I think we need to be able to turn
> > on GEM tracing on the fly and be able to capture the compressed buffers.
> 
> Not sure that we need a trace of pre-events for this. It seems pretty 
> straight-forward to me. Register is not in the state in which it should 
> be immediately after reset.

But you don't always have a reset, that makes throwing a tantrum here
difficult.
 
> It will help sort the hang reports into a new bucket.

The information is already in the hang report. What isn't which would be
useful is the trace of events that precede certain hangs (ones that I
suspect are driver bugs as opposed to userspace GPU programming issues).
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-24 13:40                   ` Chris Wilson
@ 2018-05-25  8:36                     ` Tvrtko Ursulin
  2018-05-25 12:54                       ` Chris Wilson
  0 siblings, 1 reply; 20+ messages in thread
From: Tvrtko Ursulin @ 2018-05-25  8:36 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 24/05/2018 14:40, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-05-24 14:34:41)
>>
>> On 19/05/2018 10:04, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-05-18 15:42:00)
>>>>
>>>> On 18/05/2018 15:13, Chris Wilson wrote:
>>>>> Quoting Tvrtko Ursulin (2018-05-18 13:36:52)
>>>>>>
>>>>>> On 18/05/2018 13:28, Chris Wilson wrote:
>>>>>>> Quoting Tvrtko Ursulin (2018-05-18 12:50:41)
>>>>>>>>
>>>>>>>> On 18/05/2018 12:10, Chris Wilson wrote:
>>>>>>>>> Quoting Tvrtko Ursulin (2018-05-18 12:05:17)
>>>>>>>>>>
>>>>>>>>>> On 18/05/2018 11:09, Chris Wilson wrote:
>>>>>>>>>>> Inside the live_hangcheck (reset) selftests, we occasionally see
>>>>>>>>>>> failures like
>>>>>>>>>>>
>>>>>>>>>>> <7>[  239.094840] i915_gem_set_wedged rcs0
>>>>>>>>>>> <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
>>>>>>>>>>> <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
>>>>>>>>>>> <7>[  239.094848] i915_gem_set_wedged         Requests:
>>>>>>>>>>> <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>>>>>>>>>> <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
>>>>>>>>>>> <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>>>>>>>>>> <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
>>>>>>>>>>> <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
>>>>>>>>>>> <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
>>>>>>>>>>> <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
>>>>>>>>>>> <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
>>>>>>>>>>> <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
>>>>>>>>>>> <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
>>>>>>>>>>> <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
>>>>>>>>>>> <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
>>>>>>>>>>> <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
>>>>>>>>>>> <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
>>>>>>>>>>> <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
>>>>>>>>>>> <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
>>>>>>>>>>> <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
>>>>>>>>>>> <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
>>>>>>>>>>> <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
>>>>>>>>>>> <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
>>>>>>>>>>> <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
>>>>>>>>>>> <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
>>>>>>>>>>> <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>>>>>>>>>> <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>>>>>>>>>> <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
>>>>>>>>>>> <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>>>>>>>>>> <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>>>>>>>>>> <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
>>>>>>>>>>> <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
>>>>>>>>>>> <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
>>>>>>>>>>> <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
>>>>>>>>>>> <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
>>>>>>>>>>> <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
>>>>>>>>>>> <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
>>>>>>>>>>>
>>>>>>>>>>> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
>>>>>>>>>>> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
>>>>>>>>>>> try clearing it.
>>>>>>>>>>>
>>>>>>>>>>> v2: Only clear the bit on restarting the ring, as we want to be sure the
>>>>>>>>>>> STOP_RING bit is kept if reset fails on wedging.
>>>>>>>>>>> v3: Spot when the ring state doesn't make sense when re-initialising the
>>>>>>>>>>> engine and dump it to the logs so that we don't have to wait for an
>>>>>>>>>>> error later and try to guess what happened earlier.
>>>>>>>>>>> v4: Prepare to print all the unexpected state, not just the first.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>>>>>>>>>> ---
>>>>>>>>>>>        drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
>>>>>>>>>>>        1 file changed, 22 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>>>>>>>>>>> index 3744f5750624..ba8411ba4abf 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>>>>>>>>>> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
>>>>>>>>>>>                    I915_WRITE(RING_MODE_GEN7(engine),
>>>>>>>>>>>                               _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
>>>>>>>>>>>        
>>>>>>>>>>> +     I915_WRITE(RING_MI_MODE(engine->mmio_base),
>>>>>>>>>>> +                _MASKED_BIT_DISABLE(STOP_RING));
>>>>>>>>>>
>>>>>>>>>> Worries me a bit to clear it unconditionally since documentation says
>>>>>>>>>> nothing (that I can find) about this scenario.
>>>>>>>>>>
>>>>>>>>>>> +
>>>>>>>>>>>            I915_WRITE(RING_HWS_PGA(engine->mmio_base),
>>>>>>>>>>>                       engine->status_page.ggtt_offset);
>>>>>>>>>>>            POSTING_READ(RING_HWS_PGA(engine->mmio_base));
>>>>>>>>>>> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
>>>>>>>>>>>            engine->execlists.csb_head = -1;
>>>>>>>>>>>        }
>>>>>>>>>>>        
>>>>>>>>>>> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
>>>>>>>>>>> +{
>>>>>>>>>>> +     struct drm_i915_private *dev_priv = engine->i915;
>>>>>>>>>>> +     bool unexpected = false;
>>>>>>>>>>> +
>>>>>>>>>>> +     if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
>>>>>>>>>>> +             DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
>>>>>>>>>>
>>>>>>>>>> Also worries me to bury this as a debug message unless we can find in
>>>>>>>>>> documentation that this is known to happen occasionally.
>>>>>>>>>>
>>>>>>>>>> If we had the check and logging immediately after reset, with a more
>>>>>>>>>> visible log, we could build a larger data set of reported failures and
>>>>>>>>>> then go back to hw people and ask them about it.
>>>>>>>>>
>>>>>>>>> The problem with the reset side is trying to avoid mixing in the case
>>>>>>>>> where we are faking it and expect it to still be set. On the ringbuffer
>>>>>>>>> path, we handle STOP_RING as being part of the init sequence. The log
>>>>>>>>> already says we did a reset and STOP_RING is set (since we have tracing
>>>>>>>>> through the reset), I just need to find the right spot to handle it.
>>>>>>>>
>>>>>>>> The "distance" in code paths between reset and init_hw is probably the
>>>>>>>> smallest of my concerns. If we could make this message not debug only,
>>>>>>>> and clear STOP_RING only conditionally I think I would be happy with
>>>>>>>> giving this a go and see if it collects any data.
>>>>>>>
>>>>>>> Why? Who else but us are going to read it, and why bother if it is not
>>>>>>> actionable. It is logged (was logged already, now highlighted) and for
>>>>>>> the user none of it is relevant, the only thing that matters to them is
>>>>>>> that they don't even notice the system suffered a GPU hang.
>>>>>>
>>>>>> My reasoning is this - something has happened which is completely
>>>>>> unexpected and may have other negative effect which are currently
>>>>>> unknown to us. By logging it at a higher level we collect this now very
>>>>>> specific signature on a wider collection of platforms and can then act
>>>>>> on the collected data set. Also very specific log message makes bug
>>>>>> triage easier and a lower chance it gets bunched up with other hang reports.
>>>>>
>>>>> You do have a very specific debug message, but having a flip-flopping
>>>>> test that has no impact on system stability (as evidenced by the lack of
>>>>> errors, if there was a subsequent error it is imperative that the post
>>>>> mortem capture does its best, everything else is lost) does not help us.
>>>>
>>>> What if we deploy this as DRM_NOTICE/INFO, get reports from the field
>>>> and not just the selftest, and then realize the issue is actually
>>>> interesting and could be fed back to HW design?
>>>
>>> But we already have enough to point out that the HW is not behaving as
>>> we^W I would expect.
>>>
>>> For field analysis, I strongly believe that unless you can capture
>>> everything you need without user intervention, you will never get the
>>> debug information required. The instructions we have are "oops,
>>> something went wrong, please file a new bug report here and attach this
>>> file". That's already too complicated wants to be automated, e.g. abrtd.
>>> But for whatever reason we don't get the reports from abrtd.
>>>
>>> Elsewhere we use the method of light periodic checks and if they spot an
>>> issue, switch on heavy debug mode. For something like this where we want
>>> a trace of operations as they happen? I think we need to be able to turn
>>> on GEM tracing on the fly and be able to capture the compressed buffers.
>>
>> Not sure that we need a trace of pre-events for this. It seems pretty
>> straight-forward to me. Register is not in the state in which it should
>> be immediately after reset.
> 
> But you don't always have a reset, that makes throwing a tantrum here
> difficult.
>   
>> It will help sort the hang reports into a new bucket.
> 
> The information is already in the hang report. What isn't which would be
> useful is the trace of events that precede certain hangs (ones that I
> suspect are driver bugs as opposed to userspace GPU programming issues).

It is there but takes more skill to triage.

Okay, given that the check is not only after reset, and given how you do 
a lion's share of bug triaging anyway:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset
  2018-05-25  8:36                     ` Tvrtko Ursulin
@ 2018-05-25 12:54                       ` Chris Wilson
  0 siblings, 0 replies; 20+ messages in thread
From: Chris Wilson @ 2018-05-25 12:54 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2018-05-25 09:36:18)
> 
> On 24/05/2018 14:40, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-05-24 14:34:41)
> >>
> >> On 19/05/2018 10:04, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-05-18 15:42:00)
> >>>>
> >>>> On 18/05/2018 15:13, Chris Wilson wrote:
> >>>>> Quoting Tvrtko Ursulin (2018-05-18 13:36:52)
> >>>>>>
> >>>>>> On 18/05/2018 13:28, Chris Wilson wrote:
> >>>>>>> Quoting Tvrtko Ursulin (2018-05-18 12:50:41)
> >>>>>>>>
> >>>>>>>> On 18/05/2018 12:10, Chris Wilson wrote:
> >>>>>>>>> Quoting Tvrtko Ursulin (2018-05-18 12:05:17)
> >>>>>>>>>>
> >>>>>>>>>> On 18/05/2018 11:09, Chris Wilson wrote:
> >>>>>>>>>>> Inside the live_hangcheck (reset) selftests, we occasionally see
> >>>>>>>>>>> failures like
> >>>>>>>>>>>
> >>>>>>>>>>> <7>[  239.094840] i915_gem_set_wedged rcs0
> >>>>>>>>>>> <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
> >>>>>>>>>>> <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
> >>>>>>>>>>> <7>[  239.094848] i915_gem_set_wedged         Requests:
> >>>>>>>>>>> <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>>>>>>>>>> <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
> >>>>>>>>>>> <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> >>>>>>>>>>> <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
> >>>>>>>>>>> <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
> >>>>>>>>>>> <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
> >>>>>>>>>>> <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
> >>>>>>>>>>> <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
> >>>>>>>>>>> <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
> >>>>>>>>>>> <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
> >>>>>>>>>>> <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
> >>>>>>>>>>> <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
> >>>>>>>>>>> <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
> >>>>>>>>>>> <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
> >>>>>>>>>>> <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
> >>>>>>>>>>> <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
> >>>>>>>>>>> <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
> >>>>>>>>>>> <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
> >>>>>>>>>>> <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
> >>>>>>>>>>> <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
> >>>>>>>>>>> <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
> >>>>>>>>>>> <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
> >>>>>>>>>>> <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>>>>>>>>>> <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>>>>>>>>>> <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
> >>>>>>>>>>> <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> >>>>>>>>>>> <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> >>>>>>>>>>> <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
> >>>>>>>>>>> <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
> >>>>>>>>>>> <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
> >>>>>>>>>>> <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
> >>>>>>>>>>> <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
> >>>>>>>>>>> <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
> >>>>>>>>>>> <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
> >>>>>>>>>>>
> >>>>>>>>>>> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
> >>>>>>>>>>> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
> >>>>>>>>>>> try clearing it.
> >>>>>>>>>>>
> >>>>>>>>>>> v2: Only clear the bit on restarting the ring, as we want to be sure the
> >>>>>>>>>>> STOP_RING bit is kept if reset fails on wedging.
> >>>>>>>>>>> v3: Spot when the ring state doesn't make sense when re-initialising the
> >>>>>>>>>>> engine and dump it to the logs so that we don't have to wait for an
> >>>>>>>>>>> error later and try to guess what happened earlier.
> >>>>>>>>>>> v4: Prepare to print all the unexpected state, not just the first.
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>>>>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >>>>>>>>>>> ---
> >>>>>>>>>>>        drivers/gpu/drm/i915/intel_lrc.c | 22 ++++++++++++++++++++++
> >>>>>>>>>>>        1 file changed, 22 insertions(+)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>>>>>> index 3744f5750624..ba8411ba4abf 100644
> >>>>>>>>>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>>>>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>>>>>>>>>> @@ -1781,6 +1781,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>>>>>>>>>                    I915_WRITE(RING_MODE_GEN7(engine),
> >>>>>>>>>>>                               _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
> >>>>>>>>>>>        
> >>>>>>>>>>> +     I915_WRITE(RING_MI_MODE(engine->mmio_base),
> >>>>>>>>>>> +                _MASKED_BIT_DISABLE(STOP_RING));
> >>>>>>>>>>
> >>>>>>>>>> Worries me a bit to clear it unconditionally since documentation says
> >>>>>>>>>> nothing (that I can find) about this scenario.
> >>>>>>>>>>
> >>>>>>>>>>> +
> >>>>>>>>>>>            I915_WRITE(RING_HWS_PGA(engine->mmio_base),
> >>>>>>>>>>>                       engine->status_page.ggtt_offset);
> >>>>>>>>>>>            POSTING_READ(RING_HWS_PGA(engine->mmio_base));
> >>>>>>>>>>> @@ -1789,6 +1792,19 @@ static void enable_execlists(struct intel_engine_cs *engine)
> >>>>>>>>>>>            engine->execlists.csb_head = -1;
> >>>>>>>>>>>        }
> >>>>>>>>>>>        
> >>>>>>>>>>> +static bool unexpected_starting_state(struct intel_engine_cs *engine)
> >>>>>>>>>>> +{
> >>>>>>>>>>> +     struct drm_i915_private *dev_priv = engine->i915;
> >>>>>>>>>>> +     bool unexpected = false;
> >>>>>>>>>>> +
> >>>>>>>>>>> +     if (I915_READ(RING_MI_MODE(engine->mmio_base)) & STOP_RING) {
> >>>>>>>>>>> +             DRM_DEBUG_DRIVER("STOP_RING still set in RING_MI_MODE\n");
> >>>>>>>>>>
> >>>>>>>>>> Also worries me to bury this as a debug message unless we can find in
> >>>>>>>>>> documentation that this is known to happen occasionally.
> >>>>>>>>>>
> >>>>>>>>>> If we had the check and logging immediately after reset, with a more
> >>>>>>>>>> visible log, we could build a larger data set of reported failures and
> >>>>>>>>>> then go back to hw people and ask them about it.
> >>>>>>>>>
> >>>>>>>>> The problem with the reset side is trying to avoid mixing in the case
> >>>>>>>>> where we are faking it and expect it to still be set. On the ringbuffer
> >>>>>>>>> path, we handle STOP_RING as being part of the init sequence. The log
> >>>>>>>>> already says we did a reset and STOP_RING is set (since we have tracing
> >>>>>>>>> through the reset), I just need to find the right spot to handle it.
> >>>>>>>>
> >>>>>>>> The "distance" in code paths between reset and init_hw is probably the
> >>>>>>>> smallest of my concerns. If we could make this message not debug only,
> >>>>>>>> and clear STOP_RING only conditionally I think I would be happy with
> >>>>>>>> giving this a go and see if it collects any data.
> >>>>>>>
> >>>>>>> Why? Who else but us are going to read it, and why bother if it is not
> >>>>>>> actionable. It is logged (was logged already, now highlighted) and for
> >>>>>>> the user none of it is relevant, the only thing that matters to them is
> >>>>>>> that they don't even notice the system suffered a GPU hang.
> >>>>>>
> >>>>>> My reasoning is this - something has happened which is completely
> >>>>>> unexpected and may have other negative effect which are currently
> >>>>>> unknown to us. By logging it at a higher level we collect this now very
> >>>>>> specific signature on a wider collection of platforms and can then act
> >>>>>> on the collected data set. Also very specific log message makes bug
> >>>>>> triage easier and a lower chance it gets bunched up with other hang reports.
> >>>>>
> >>>>> You do have a very specific debug message, but having a flip-flopping
> >>>>> test that has no impact on system stability (as evidenced by the lack of
> >>>>> errors, if there was a subsequent error it is imperative that the post
> >>>>> mortem capture does its best, everything else is lost) does not help us.
> >>>>
> >>>> What if we deploy this as DRM_NOTICE/INFO, get reports from the field
> >>>> and not just the selftest, and then realize the issue is actually
> >>>> interesting and could be fed back to HW design?
> >>>
> >>> But we already have enough to point out that the HW is not behaving as
> >>> we^W I would expect.
> >>>
> >>> For field analysis, I strongly believe that unless you can capture
> >>> everything you need without user intervention, you will never get the
> >>> debug information required. The instructions we have are "oops,
> >>> something went wrong, please file a new bug report here and attach this
> >>> file". That's already too complicated wants to be automated, e.g. abrtd.
> >>> But for whatever reason we don't get the reports from abrtd.
> >>>
> >>> Elsewhere we use the method of light periodic checks and if they spot an
> >>> issue, switch on heavy debug mode. For something like this where we want
> >>> a trace of operations as they happen? I think we need to be able to turn
> >>> on GEM tracing on the fly and be able to capture the compressed buffers.
> >>
> >> Not sure that we need a trace of pre-events for this. It seems pretty
> >> straight-forward to me. Register is not in the state in which it should
> >> be immediately after reset.
> > 
> > But you don't always have a reset, that makes throwing a tantrum here
> > difficult.
> >   
> >> It will help sort the hang reports into a new bucket.
> > 
> > The information is already in the hang report. What isn't which would be
> > useful is the trace of events that precede certain hangs (ones that I
> > suspect are driver bugs as opposed to userspace GPU programming issues).
> 
> It is there but takes more skill to triage.

It is largely pattern matching. I'm sure an AI could do a far better job
at finding discrepancies worth investigating. Add another item to the
wishlist.
 
> Okay, given that the check is not only after reset, and given how you do 
> a lion's share of bug triaging anyway:
> 
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Sorry for holding out on not trying to parcel too much into this, thanks
for the review. Pushed hoping that it truly does stop the flip-flops.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2018-05-25 12:54 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-18 10:09 [PATCH v4] drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset Chris Wilson
2018-05-18 10:37 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
2018-05-18 10:54 ` ✓ Fi.CI.BAT: success " Patchwork
2018-05-18 11:05 ` [PATCH v4] " Tvrtko Ursulin
2018-05-18 11:10   ` Chris Wilson
2018-05-18 11:50     ` Tvrtko Ursulin
2018-05-18 12:28       ` Chris Wilson
2018-05-18 12:36         ` Tvrtko Ursulin
2018-05-18 14:13           ` Chris Wilson
2018-05-18 14:42             ` Tvrtko Ursulin
2018-05-19  9:04               ` Chris Wilson
2018-05-24 13:34                 ` Tvrtko Ursulin
2018-05-24 13:40                   ` Chris Wilson
2018-05-25  8:36                     ` Tvrtko Ursulin
2018-05-25 12:54                       ` Chris Wilson
2018-05-18 14:43 ` ✗ Fi.CI.IGT: failure for " Patchwork
2018-05-18 16:14 ` ✗ Fi.CI.CHECKPATCH: warning " Patchwork
2018-05-18 16:29 ` ✓ Fi.CI.BAT: success " Patchwork
2018-05-19  0:05 ` ✓ Fi.CI.IGT: " Patchwork
2018-05-19  9:05 ` [PATCH v4] " Chris Wilson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.