[PATCH 1/2] drm/i915/selftests: Wait longer for the old active request

* [PATCH 1/2] drm/i915/selftests: Wait longer for the old active request
@ 2018-05-17 14:24 Chris Wilson
  2018-05-17 14:24 ` [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset Chris Wilson
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Chris Wilson @ 2018-05-17 14:24 UTC (permalink / raw)
  To: intel-gfx

When testing reset, we wait for 1s on the main thread for the hang to
start. Meanwhile, we continue submitting requests on all the background
threads, and we may have more threads than cores and so potentially
starve the waiter from being woken within the timeout. As the hang
timeout and the active timeouts are the same, it is hard to distinguish
which caused the timeout. Bump the active thread timeouts to 5s,
compared to the 1s timeout for the hang, so that we preferentially
report the hang timing out, while hopefully ensuring that we do at least
wake up the hang thread first before declaring the background active
timeout.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 .../gpu/drm/i915/selftests/intel_hangcheck.c  | 48 +++++++++++++------
 1 file changed, 34 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index 438e0b045a2c..f1dc42a171c8 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -560,6 +560,30 @@ struct active_engine {
 #define TEST_SELF	BIT(2)
 #define TEST_PRIORITY	BIT(3)
 
+static int active_request_put(struct i915_request *rq)
+{
+	int err = 0;
+
+	if (!rq)
+		return 0;
+
+	if (i915_request_wait(rq, 0, 5 * HZ) < 0) {
+		GEM_TRACE("%s timed out waiting for completion of fence %llx:%d, seqno %d.\n",
+			  rq->engine->name,
+			  rq->fence.context,
+			  rq->fence.seqno,
+			  i915_request_global_seqno(rq));
+		GEM_TRACE_DUMP();
+
+		i915_gem_set_wedged(rq->i915);
+		err = -EIO;
+	}
+
+	i915_request_put(rq);
+
+	return err;
+}
+
 static int active_engine(void *data)
 {
 	I915_RND_STATE(prng);
@@ -608,24 +632,20 @@ static int active_engine(void *data)
 		i915_request_add(new);
 		mutex_unlock(&engine->i915->drm.struct_mutex);
 
-		if (old) {
-			if (i915_request_wait(old, 0, HZ) < 0) {
-				GEM_TRACE("%s timed out.\n", engine->name);
-				GEM_TRACE_DUMP();
-
-				i915_gem_set_wedged(engine->i915);
-				i915_request_put(old);
-				err = -EIO;
-				break;
-			}
-			i915_request_put(old);
-		}
+		err = active_request_put(old);
+		if (err)
+			break;
 
 		cond_resched();
 	}
 
-	for (count = 0; count < ARRAY_SIZE(rq); count++)
-		i915_request_put(rq[count]);
+	for (count = 0; count < ARRAY_SIZE(rq); count++) {
+		int err__ = active_request_put(rq[count]);
+
+		/* Keep the first error */
+		if (!err)
+			err = err__;
+	}
 
 err_file:
 	mock_file_free(engine->i915, file);
-- 
2.17.0
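
For readers following the cleanup path in the patch above, here is a minimal
userspace sketch of the "keep the first error, but still release everything"
pattern. It is not the i915 code: struct fake_request, fake_request_put() and
fake_release_all() are hypothetical stand-ins, and the 5s wait is modelled by
a simple completed flag.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a request; not the i915 type. */
struct fake_request {
	int completed;	/* non-zero if the wait would succeed */
	int released;	/* set once the reference has been dropped */
};

/* Same shape as active_request_put(): NULL is fine, wait, always release. */
static int fake_request_put(struct fake_request *rq)
{
	int err = 0;

	if (!rq)
		return 0;

	if (!rq->completed)
		err = -EIO;	/* models the wait timing out */

	rq->released = 1;	/* the reference is dropped on both paths */
	return err;
}

/* err is whatever error the submission loop already recorded (0 if none). */
static int fake_release_all(struct fake_request **rq, size_t count, int err)
{
	size_t i;

	assert(rq != NULL);

	for (i = 0; i < count; i++) {
		int err__ = fake_request_put(rq[i]);

		/* Keep the first error, but keep releasing the rest */
		if (!err)
			err = err__;
	}

	return err;
}
```

The point of the design is that a timeout on one request never skips the put
on the remaining ones, and a later failure never overwrites the error that
was seen first.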

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset
  2018-05-17 14:24 [PATCH 1/2] drm/i915/selftests: Wait longer for the old active request Chris Wilson
@ 2018-05-17 14:24 ` Chris Wilson
  2018-05-18  9:33   ` Tvrtko Ursulin
  2018-05-17 15:04 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/2] drm/i915/selftests: Wait longer for the old active request Patchwork
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Chris Wilson @ 2018-05-17 14:24 UTC (permalink / raw)
  To: intel-gfx

Inside the live_hangcheck (reset) selftests, we occasionally see
failures like

<7>[  239.094840] i915_gem_set_wedged rcs0
<7>[  239.094843] i915_gem_set_wedged 	current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
<7>[  239.094846] i915_gem_set_wedged 	Reset count: 6239 (global 1)
<7>[  239.094848] i915_gem_set_wedged 	Requests:
<7>[  239.095052] i915_gem_set_wedged 		first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
<7>[  239.095056] i915_gem_set_wedged 		last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
<7>[  239.095059] i915_gem_set_wedged 		active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
<7>[  239.095062] i915_gem_set_wedged 		[head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
<7>[  239.100050] i915_gem_set_wedged 		ring->start:  0x00283000
<7>[  239.100053] i915_gem_set_wedged 		ring->head:   0x000001f8
<7>[  239.100055] i915_gem_set_wedged 		ring->tail:   0x000002a8
<7>[  239.100057] i915_gem_set_wedged 		ring->emit:   0x000002a8
<7>[  239.100059] i915_gem_set_wedged 		ring->space:  0x00000f10
<7>[  239.100085] i915_gem_set_wedged 	RING_START: 0x00283000
<7>[  239.100088] i915_gem_set_wedged 	RING_HEAD:  0x00000260
<7>[  239.100091] i915_gem_set_wedged 	RING_TAIL:  0x000002a8
<7>[  239.100094] i915_gem_set_wedged 	RING_CTL:   0x00000001
<7>[  239.100097] i915_gem_set_wedged 	RING_MODE:  0x00000300 [idle]
<7>[  239.100100] i915_gem_set_wedged 	RING_IMR: fffffefe
<7>[  239.100104] i915_gem_set_wedged 	ACTHD:  0x00000000_0000609c
<7>[  239.100108] i915_gem_set_wedged 	BBADDR: 0x00000000_0000609d
<7>[  239.100111] i915_gem_set_wedged 	DMA_FADDR: 0x00000000_00283260
<7>[  239.100114] i915_gem_set_wedged 	IPEIR: 0x00000000
<7>[  239.100117] i915_gem_set_wedged 	IPEHR: 0x02800000
<7>[  239.100120] i915_gem_set_wedged 	Execlist status: 0x00044052 00000002
<7>[  239.100124] i915_gem_set_wedged 	Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[  239.100128] i915_gem_set_wedged 		ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
<7>[  239.100132] i915_gem_set_wedged 		ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
<7>[  239.100135] i915_gem_set_wedged 		HW active? 0x5
<7>[  239.100250] i915_gem_set_wedged 		E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
<7>[  239.100338] i915_gem_set_wedged 		E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
<7>[  239.100340] i915_gem_set_wedged 		Queue priority: 139
<7>[  239.100343] i915_gem_set_wedged 		Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
<7>[  239.100346] i915_gem_set_wedged 		Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
<7>[  239.100349] i915_gem_set_wedged 		Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
<7>[  239.100352] i915_gem_set_wedged 		Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
<7>[  239.100356] i915_gem_set_wedged 		Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
<7>[  239.100362] i915_gem_set_wedged 	drv_selftest [5894] waiting for 19a99

where the GPU saw an arbitration point and idled; AND HAS NOT BEEN RESET!
RING_MODE indicates that the ring is idle and has the STOP_RING bit set, so
try clearing it.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_uncore.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index b36a3b5736a0..082b0045ac8c 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1720,6 +1720,8 @@ static void gen3_stop_engine(struct intel_engine_cs *engine)
 	if (I915_READ_FW(RING_HEAD(base)) != 0)
 		DRM_DEBUG_DRIVER("%s: ring head not parked\n",
 				 engine->name);
+
+	I915_WRITE_FW(RING_MI_MODE(base), _MASKED_BIT_DISABLE(STOP_RING));
 }
 
 static void i915_stop_engines(struct drm_i915_private *dev_priv,
-- 
2.17.0
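
As a rough illustration of the one-line fix above, the sketch below models a
masked register write in userspace and decodes the RING_MODE value from the
log (0x00000300). The bit positions for STOP_RING and MODE_IDLE and the macro
shapes are written from memory of i915_reg.h and should be treated as
assumptions; masked_write() is a hypothetical model of the hardware
behaviour, not a driver function.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as recalled from i915_reg.h; treat as assumptions. */
#define STOP_RING	(1u << 8)
#define MODE_IDLE	(1u << 9)

/* Same shape as the driver's masked-bit macros (names shortened here). */
#define MASKED_FIELD(mask, value)	(((uint32_t)(mask) << 16) | (value))
#define MASKED_BIT_ENABLE(a)		MASKED_FIELD((a), (a))
#define MASKED_BIT_DISABLE(a)		MASKED_FIELD((a), 0)

/*
 * Hypothetical model of a masked register: only bits whose write-enable
 * bit is set in the upper 16 bits of val are updated from the lower 16.
 */
static uint32_t masked_write(uint32_t reg, uint32_t val)
{
	uint32_t mask = val >> 16;	/* which bits may change */
	uint32_t bits = val & 0xffff;	/* new values for those bits */

	assert((reg >> 16) == 0);	/* the register holds 16 bits of state */

	return (reg & ~mask) | (bits & mask);
}
```

Under these assumptions 0x300 decodes as STOP_RING | MODE_IDLE, and
masked_write(0x300, MASKED_BIT_DISABLE(STOP_RING)) clears only STOP_RING
while leaving every other bit as it was, which is why the patch needs no
read-modify-write cycle.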


* ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/2] drm/i915/selftests: Wait longer for the old active request
  2018-05-17 14:24 [PATCH 1/2] drm/i915/selftests: Wait longer for the old active request Chris Wilson
  2018-05-17 14:24 ` [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset Chris Wilson
@ 2018-05-17 15:04 ` Patchwork
  2018-05-17 15:20 ` ✗ Fi.CI.BAT: failure " Patchwork
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2018-05-17 15:04 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/2] drm/i915/selftests: Wait longer for the old active request
URL   : https://patchwork.freedesktop.org/series/43334/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
ddf6fd832a10 drm/i915/selftests: Wait longer for the old active request
26d41bd6d7ab drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset
-:11: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#11: 
<7>[  239.094843] i915_gem_set_wedged 	current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]

total: 0 errors, 1 warnings, 0 checks, 8 lines checked


* ✗ Fi.CI.BAT: failure for series starting with [1/2] drm/i915/selftests: Wait longer for the old active request
  2018-05-17 14:24 [PATCH 1/2] drm/i915/selftests: Wait longer for the old active request Chris Wilson
  2018-05-17 14:24 ` [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset Chris Wilson
  2018-05-17 15:04 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/2] drm/i915/selftests: Wait longer for the old active request Patchwork
@ 2018-05-17 15:20 ` Patchwork
  2018-05-17 16:02 ` ✗ Fi.CI.CHECKPATCH: warning " Patchwork
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2018-05-17 15:20 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/2] drm/i915/selftests: Wait longer for the old active request
URL   : https://patchwork.freedesktop.org/series/43334/
State : failure

== Summary ==

= CI Bug Log - changes from CI_DRM_4197 -> Patchwork_9029 =

== Summary - FAILURE ==

  Serious unknown changes coming with Patchwork_9029 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9029, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/43334/revisions/1/mbox/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9029:

  === IGT changes ===

    ==== Possible regressions ====

    igt@gem_exec_fence@await-hang-default:
      fi-blb-e6850:       PASS -> INCOMPLETE

    
    ==== Warnings ====

    igt@gem_exec_gttfill@basic:
      fi-pnv-d510:        PASS -> SKIP

    
== Known issues ==

  Here are the changes found in Patchwork_9029 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_frontbuffer_tracking@basic:
      fi-hsw-4200u:       PASS -> DMESG-FAIL (fdo#102614, fdo#106103)

    
    ==== Possible fixes ====

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a:
      fi-cnl-psr:         DMESG-WARN (fdo#104951) -> PASS

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
      fi-snb-2520m:       INCOMPLETE (fdo#103713) -> PASS

    
  fdo#102614 https://bugs.freedesktop.org/show_bug.cgi?id=102614
  fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713
  fdo#104951 https://bugs.freedesktop.org/show_bug.cgi?id=104951
  fdo#106103 https://bugs.freedesktop.org/show_bug.cgi?id=106103


== Participating hosts (43 -> 39) ==

  Missing    (4): fi-ilk-m540 fi-byt-squawks fi-bsw-cyan fi-skl-6700hq 


== Build changes ==

    * Linux: CI_DRM_4197 -> Patchwork_9029

  CI_DRM_4197: 4079eb91298e7ef6b8c3569adc0232b7d2492d78 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4487: eccae1360d6d01e73c6af2bd97122cef708207ef @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9029: 26d41bd6d7ab93e8339977bb551ef6b4396b21bc @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4487: 6ab75f7eb5e1dccbb773e1739beeb2d7cbd6ad0d @ git://anongit.freedesktop.org/piglit


== Linux commits ==

26d41bd6d7ab drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset
ddf6fd832a10 drm/i915/selftests: Wait longer for the old active request

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9029/issues.html

* ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/2] drm/i915/selftests: Wait longer for the old active request
  2018-05-17 14:24 [PATCH 1/2] drm/i915/selftests: Wait longer for the old active request Chris Wilson
                   ` (2 preceding siblings ...)
  2018-05-17 15:20 ` ✗ Fi.CI.BAT: failure " Patchwork
@ 2018-05-17 16:02 ` Patchwork
  2018-05-17 16:18 ` ✗ Fi.CI.BAT: failure " Patchwork
  2018-05-18  9:22 ` [PATCH 1/2] " Tvrtko Ursulin
  5 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2018-05-17 16:02 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/2] drm/i915/selftests: Wait longer for the old active request
URL   : https://patchwork.freedesktop.org/series/43334/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
a809889d3010 drm/i915/selftests: Wait longer for the old active request
cbab86526526 drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset
-:11: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#11: 
<7>[  239.094843] i915_gem_set_wedged 	current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]

total: 0 errors, 1 warnings, 0 checks, 8 lines checked


* ✗ Fi.CI.BAT: failure for series starting with [1/2] drm/i915/selftests: Wait longer for the old active request
  2018-05-17 14:24 [PATCH 1/2] drm/i915/selftests: Wait longer for the old active request Chris Wilson
                   ` (3 preceding siblings ...)
  2018-05-17 16:02 ` ✗ Fi.CI.CHECKPATCH: warning " Patchwork
@ 2018-05-17 16:18 ` Patchwork
  2018-05-18  9:22 ` [PATCH 1/2] " Tvrtko Ursulin
  5 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2018-05-17 16:18 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: series starting with [1/2] drm/i915/selftests: Wait longer for the old active request
URL   : https://patchwork.freedesktop.org/series/43334/
State : failure

== Summary ==

= CI Bug Log - changes from CI_DRM_4197 -> Patchwork_9032 =

== Summary - FAILURE ==

  Serious unknown changes coming with Patchwork_9032 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9032, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/43334/revisions/1/mbox/

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9032:

  === IGT changes ===

    ==== Possible regressions ====

    igt@gem_exec_fence@await-hang-default:
      fi-blb-e6850:       PASS -> INCOMPLETE

    
== Known issues ==

  Here are the changes found in Patchwork_9032 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_frontbuffer_tracking@basic:
      fi-hsw-peppy:       PASS -> DMESG-FAIL (fdo#102614, fdo#106103)

    igt@prime_vgem@basic-fence-flip:
      fi-ilk-650:         PASS -> FAIL (fdo#104008)

    
    ==== Possible fixes ====

    igt@gem_mmap_gtt@basic-small-bo-tiledx:
      fi-gdg-551:         FAIL (fdo#102575) -> PASS

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a:
      fi-cnl-psr:         DMESG-WARN (fdo#104951) -> PASS

    igt@kms_pipe_crc_basic@suspend-read-crc-pipe-b:
      fi-snb-2520m:       INCOMPLETE (fdo#103713) -> PASS

    
  fdo#102575 https://bugs.freedesktop.org/show_bug.cgi?id=102575
  fdo#102614 https://bugs.freedesktop.org/show_bug.cgi?id=102614
  fdo#103713 https://bugs.freedesktop.org/show_bug.cgi?id=103713
  fdo#104008 https://bugs.freedesktop.org/show_bug.cgi?id=104008
  fdo#104951 https://bugs.freedesktop.org/show_bug.cgi?id=104951
  fdo#106103 https://bugs.freedesktop.org/show_bug.cgi?id=106103


== Participating hosts (43 -> 39) ==

  Missing    (4): fi-ilk-m540 fi-byt-squawks fi-bsw-cyan fi-skl-6700hq 


== Build changes ==

    * Linux: CI_DRM_4197 -> Patchwork_9032

  CI_DRM_4197: 4079eb91298e7ef6b8c3569adc0232b7d2492d78 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4487: eccae1360d6d01e73c6af2bd97122cef708207ef @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9032: cbab86526526d81869250c4c7c673c3d3a8dc051 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4487: 6ab75f7eb5e1dccbb773e1739beeb2d7cbd6ad0d @ git://anongit.freedesktop.org/piglit


== Linux commits ==

cbab86526526 drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset
a809889d3010 drm/i915/selftests: Wait longer for the old active request

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9032/issues.html

* Re: [PATCH 1/2] drm/i915/selftests: Wait longer for the old active request
  2018-05-17 14:24 [PATCH 1/2] drm/i915/selftests: Wait longer for the old active request Chris Wilson
                   ` (4 preceding siblings ...)
  2018-05-17 16:18 ` ✗ Fi.CI.BAT: failure " Patchwork
@ 2018-05-18  9:22 ` Tvrtko Ursulin
  5 siblings, 0 replies; 13+ messages in thread
From: Tvrtko Ursulin @ 2018-05-18  9:22 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 17/05/2018 15:24, Chris Wilson wrote:
> When testing reset, we wait for 1s on the main thread for the hang to
> start. Meanwhile, we continue submitting requests on all the background
> threads, and we may have more threads than cores and so potentially
> starve the waiter from being woken within the timeout. As the hang
> timeout and the active timeouts are the same, it is hard to distinguish
> which caused the timeout. Bump the active thread timeouts to 5s,
> compared to the 1s timeout for the hang, so that we preferentially
> report the hang timing out, while hopefully ensuring that we do at least
> wake up the hang thread first before declaring the background active
> timeout.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   .../gpu/drm/i915/selftests/intel_hangcheck.c  | 48 +++++++++++++------
>   1 file changed, 34 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 438e0b045a2c..f1dc42a171c8 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -560,6 +560,30 @@ struct active_engine {
>   #define TEST_SELF	BIT(2)
>   #define TEST_PRIORITY	BIT(3)
>   
> +static int active_request_put(struct i915_request *rq)
> +{
> +	int err = 0;
> +
> +	if (!rq)
> +		return 0;
> +
> +	if (i915_request_wait(rq, 0, 5 * HZ) < 0) {
> +		GEM_TRACE("%s timed out waiting for completion of fence %llx:%d, seqno %d.\n",
> +			  rq->engine->name,
> +			  rq->fence.context,
> +			  rq->fence.seqno,
> +			  i915_request_global_seqno(rq));
> +		GEM_TRACE_DUMP();
> +
> +		i915_gem_set_wedged(rq->i915);
> +		err = -EIO;
> +	}
> +
> +	i915_request_put(rq);
> +
> +	return err;
> +}
> +
>   static int active_engine(void *data)
>   {
>   	I915_RND_STATE(prng);
> @@ -608,24 +632,20 @@ static int active_engine(void *data)
>   		i915_request_add(new);
>   		mutex_unlock(&engine->i915->drm.struct_mutex);
>   
> -		if (old) {
> -			if (i915_request_wait(old, 0, HZ) < 0) {
> -				GEM_TRACE("%s timed out.\n", engine->name);
> -				GEM_TRACE_DUMP();
> -
> -				i915_gem_set_wedged(engine->i915);
> -				i915_request_put(old);
> -				err = -EIO;
> -				break;
> -			}
> -			i915_request_put(old);
> -		}
> +		err = active_request_put(old);
> +		if (err)
> +			break;
>   
>   		cond_resched();
>   	}
>   
> -	for (count = 0; count < ARRAY_SIZE(rq); count++)
> -		i915_request_put(rq[count]);
> +	for (count = 0; count < ARRAY_SIZE(rq); count++) {
> +		int err__ = active_request_put(rq[count]);
> +
> +		/* Keep the first error */
> +		if (!err)
> +			err = err__;
> +	}
>   
>   err_file:
>   	mock_file_free(engine->i915, file);
> 

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

* Re: [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset
  2018-05-17 14:24 ` [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset Chris Wilson
@ 2018-05-18  9:33   ` Tvrtko Ursulin
  2018-05-18  9:47     ` Chris Wilson
  0 siblings, 1 reply; 13+ messages in thread
From: Tvrtko Ursulin @ 2018-05-18  9:33 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 17/05/2018 15:24, Chris Wilson wrote:
> Inside the live_hangcheck (reset) selftests, we occasionally see
> failures like
> 
> <7>[  239.094840] i915_gem_set_wedged rcs0
> <7>[  239.094843] i915_gem_set_wedged 	current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
> <7>[  239.094846] i915_gem_set_wedged 	Reset count: 6239 (global 1)
> <7>[  239.094848] i915_gem_set_wedged 	Requests:
> <7>[  239.095052] i915_gem_set_wedged 		first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> <7>[  239.095056] i915_gem_set_wedged 		last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
> <7>[  239.095059] i915_gem_set_wedged 		active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> <7>[  239.095062] i915_gem_set_wedged 		[head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
> <7>[  239.100050] i915_gem_set_wedged 		ring->start:  0x00283000
> <7>[  239.100053] i915_gem_set_wedged 		ring->head:   0x000001f8
> <7>[  239.100055] i915_gem_set_wedged 		ring->tail:   0x000002a8
> <7>[  239.100057] i915_gem_set_wedged 		ring->emit:   0x000002a8
> <7>[  239.100059] i915_gem_set_wedged 		ring->space:  0x00000f10
> <7>[  239.100085] i915_gem_set_wedged 	RING_START: 0x00283000
> <7>[  239.100088] i915_gem_set_wedged 	RING_HEAD:  0x00000260
> <7>[  239.100091] i915_gem_set_wedged 	RING_TAIL:  0x000002a8
> <7>[  239.100094] i915_gem_set_wedged 	RING_CTL:   0x00000001
> <7>[  239.100097] i915_gem_set_wedged 	RING_MODE:  0x00000300 [idle]
> <7>[  239.100100] i915_gem_set_wedged 	RING_IMR: fffffefe
> <7>[  239.100104] i915_gem_set_wedged 	ACTHD:  0x00000000_0000609c
> <7>[  239.100108] i915_gem_set_wedged 	BBADDR: 0x00000000_0000609d
> <7>[  239.100111] i915_gem_set_wedged 	DMA_FADDR: 0x00000000_00283260
> <7>[  239.100114] i915_gem_set_wedged 	IPEIR: 0x00000000
> <7>[  239.100117] i915_gem_set_wedged 	IPEHR: 0x02800000
> <7>[  239.100120] i915_gem_set_wedged 	Execlist status: 0x00044052 00000002
> <7>[  239.100124] i915_gem_set_wedged 	Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
> <7>[  239.100128] i915_gem_set_wedged 		ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> <7>[  239.100132] i915_gem_set_wedged 		ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> <7>[  239.100135] i915_gem_set_wedged 		HW active? 0x5
> <7>[  239.100250] i915_gem_set_wedged 		E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> <7>[  239.100338] i915_gem_set_wedged 		E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> <7>[  239.100340] i915_gem_set_wedged 		Queue priority: 139
> <7>[  239.100343] i915_gem_set_wedged 		Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
> <7>[  239.100346] i915_gem_set_wedged 		Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
> <7>[  239.100349] i915_gem_set_wedged 		Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
> <7>[  239.100352] i915_gem_set_wedged 		Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
> <7>[  239.100356] i915_gem_set_wedged 		Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
> <7>[  239.100362] i915_gem_set_wedged 	drv_selftest [5894] waiting for 19a99
> 
> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
> try clearing it.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>   drivers/gpu/drm/i915/intel_uncore.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index b36a3b5736a0..082b0045ac8c 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -1720,6 +1720,8 @@ static void gen3_stop_engine(struct intel_engine_cs *engine)
>   	if (I915_READ_FW(RING_HEAD(base)) != 0)
>   		DRM_DEBUG_DRIVER("%s: ring head not parked\n",
>   				 engine->name);
> +
> +	I915_WRITE_FW(RING_MI_MODE(base), _MASKED_BIT_DISABLE(STOP_RING));
>   }
>   
>   static void i915_stop_engines(struct drm_i915_private *dev_priv,
> 

Right, so the expectation is that after reset STOP_RING will not be set,
but it sometimes is?

Should we also log a notice or info message in intel_gpu_reset if it is
still set after the reset has been called? We could add an
i915_check_engine_running(..) helper or something.

Regards,

Tvrtko

* Re: [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset
  2018-05-18  9:33   ` Tvrtko Ursulin
@ 2018-05-18  9:47     ` Chris Wilson
  2018-05-18  9:53       ` Tvrtko Ursulin
  2018-05-18 10:02       ` Tvrtko Ursulin
  0 siblings, 2 replies; 13+ messages in thread
From: Chris Wilson @ 2018-05-18  9:47 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2018-05-18 10:33:44)
> 
> On 17/05/2018 15:24, Chris Wilson wrote:
> > Inside the live_hangcheck (reset) selftests, we occasionally see
> > failures like
> > 
> > <7>[  239.094840] i915_gem_set_wedged rcs0
> > <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
> > <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
> > <7>[  239.094848] i915_gem_set_wedged         Requests:
> > <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> > <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
> > <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> > <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
> > <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
> > <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
> > <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
> > <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
> > <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
> > <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
> > <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
> > <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
> > <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
> > <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
> > <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
> > <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
> > <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
> > <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
> > <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
> > <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
> > <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
> > <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
> > <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> > <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> > <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
> > <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> > <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> > <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
> > <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
> > <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
> > <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
> > <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
> > <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
> > <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
> > 
> > where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
> > The RING_MODE indicates that is idle and has the STOP_RING bit set, so
> > try clearing it.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >   drivers/gpu/drm/i915/intel_uncore.c | 2 ++
> >   1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> > index b36a3b5736a0..082b0045ac8c 100644
> > --- a/drivers/gpu/drm/i915/intel_uncore.c
> > +++ b/drivers/gpu/drm/i915/intel_uncore.c
> > @@ -1720,6 +1720,8 @@ static void gen3_stop_engine(struct intel_engine_cs *engine)
> >       if (I915_READ_FW(RING_HEAD(base)) != 0)
> >               DRM_DEBUG_DRIVER("%s: ring head not parked\n",
> >                                engine->name);
> > +
> > +     I915_WRITE_FW(RING_MI_MODE(base), _MASKED_BIT_DISABLE(STOP_RING));
> >   }
> >   
> >   static void i915_stop_engines(struct drm_i915_private *dev_priv,
> > 
> 
> Right, so expectation is after reset STOP_RING will not be set, but it 
> sometimes is?

Yes.

> Should we also add a notice or info if it is set in intel_gpu_reset, 
> after the reset is called? Could add i915_check_engine_running(..) 
> helper or something.

Could do, but will it be any more useful than the dump we give above? ;)
Might be interesting to do the dump on unusual state on takeover just
for reference in later fails.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx


* Re: [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset
  2018-05-18  9:47     ` Chris Wilson
@ 2018-05-18  9:53       ` Tvrtko Ursulin
  2018-05-18 10:02       ` Tvrtko Ursulin
  1 sibling, 0 replies; 13+ messages in thread
From: Tvrtko Ursulin @ 2018-05-18  9:53 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 18/05/2018 10:47, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-05-18 10:33:44)
>>
>> On 17/05/2018 15:24, Chris Wilson wrote:
>>> Inside the live_hangcheck (reset) selftests, we occasionally see
>>> failures like
>>>
>>> <7>[  239.094840] i915_gem_set_wedged rcs0
>>> <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
>>> <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
>>> <7>[  239.094848] i915_gem_set_wedged         Requests:
>>> <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>> <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
>>> <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>> <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
>>> <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
>>> <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
>>> <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
>>> <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
>>> <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
>>> <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
>>> <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
>>> <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
>>> <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
>>> <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
>>> <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
>>> <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
>>> <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
>>> <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
>>> <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
>>> <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
>>> <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
>>> <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
>>> <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>> <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>> <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
>>> <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>> <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>> <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
>>> <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
>>> <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
>>> <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
>>> <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
>>> <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
>>> <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
>>>
>>> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
>>> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
>>> try clearing it.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> ---
>>>    drivers/gpu/drm/i915/intel_uncore.c | 2 ++
>>>    1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>>> index b36a3b5736a0..082b0045ac8c 100644
>>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>>> @@ -1720,6 +1720,8 @@ static void gen3_stop_engine(struct intel_engine_cs *engine)
>>>        if (I915_READ_FW(RING_HEAD(base)) != 0)
>>>                DRM_DEBUG_DRIVER("%s: ring head not parked\n",
>>>                                 engine->name);
>>> +
>>> +     I915_WRITE_FW(RING_MI_MODE(base), _MASKED_BIT_DISABLE(STOP_RING));
>>>    }
>>>    
>>>    static void i915_stop_engines(struct drm_i915_private *dev_priv,
>>>
>>
>> Right, so expectation is after reset STOP_RING will not be set, but it
>> sometimes is?
> 
> Yes.
> 
>> Should we also add a notice or info if it is set in intel_gpu_reset,
>> after the reset is called? Could add i915_check_engine_running(..)
>> helper or something.
> 
> Could do, will be any more useful than the dump we give above? ;)

I think so - it would immediately and clearly say the reset did not go to
plan and bad things could follow. (Whereas the hangcheck dump above
requires one to think and analyse.)

Also, is a stuck STOP_RING bit the only thing which goes wrong or could
there be more weirdness under the covers?

> Might be interesting to do the dump on unusual state on takeover just
> for reference in later fails.

Takeover as in when initializing the engines?

Regards,

Tvrtko

* Re: [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset
  2018-05-18  9:47     ` Chris Wilson
  2018-05-18  9:53       ` Tvrtko Ursulin
@ 2018-05-18 10:02       ` Tvrtko Ursulin
  1 sibling, 0 replies; 13+ messages in thread
From: Tvrtko Ursulin @ 2018-05-18 10:02 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 18/05/2018 10:47, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-05-18 10:33:44)
>>
>> On 17/05/2018 15:24, Chris Wilson wrote:
>>> Inside the live_hangcheck (reset) selftests, we occasionally see
>>> failures like
>>>
>>> <7>[  239.094840] i915_gem_set_wedged rcs0
>>> <7>[  239.094843] i915_gem_set_wedged         current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
>>> <7>[  239.094846] i915_gem_set_wedged         Reset count: 6239 (global 1)
>>> <7>[  239.094848] i915_gem_set_wedged         Requests:
>>> <7>[  239.095052] i915_gem_set_wedged                 first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>> <7>[  239.095056] i915_gem_set_wedged                 last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
>>> <7>[  239.095059] i915_gem_set_wedged                 active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
>>> <7>[  239.095062] i915_gem_set_wedged                 [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
>>> <7>[  239.100050] i915_gem_set_wedged                 ring->start:  0x00283000
>>> <7>[  239.100053] i915_gem_set_wedged                 ring->head:   0x000001f8
>>> <7>[  239.100055] i915_gem_set_wedged                 ring->tail:   0x000002a8
>>> <7>[  239.100057] i915_gem_set_wedged                 ring->emit:   0x000002a8
>>> <7>[  239.100059] i915_gem_set_wedged                 ring->space:  0x00000f10
>>> <7>[  239.100085] i915_gem_set_wedged         RING_START: 0x00283000
>>> <7>[  239.100088] i915_gem_set_wedged         RING_HEAD:  0x00000260
>>> <7>[  239.100091] i915_gem_set_wedged         RING_TAIL:  0x000002a8
>>> <7>[  239.100094] i915_gem_set_wedged         RING_CTL:   0x00000001
>>> <7>[  239.100097] i915_gem_set_wedged         RING_MODE:  0x00000300 [idle]
>>> <7>[  239.100100] i915_gem_set_wedged         RING_IMR: fffffefe
>>> <7>[  239.100104] i915_gem_set_wedged         ACTHD:  0x00000000_0000609c
>>> <7>[  239.100108] i915_gem_set_wedged         BBADDR: 0x00000000_0000609d
>>> <7>[  239.100111] i915_gem_set_wedged         DMA_FADDR: 0x00000000_00283260
>>> <7>[  239.100114] i915_gem_set_wedged         IPEIR: 0x00000000
>>> <7>[  239.100117] i915_gem_set_wedged         IPEHR: 0x02800000
>>> <7>[  239.100120] i915_gem_set_wedged         Execlist status: 0x00044052 00000002
>>> <7>[  239.100124] i915_gem_set_wedged         Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
>>> <7>[  239.100128] i915_gem_set_wedged                 ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>> <7>[  239.100132] i915_gem_set_wedged                 ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>> <7>[  239.100135] i915_gem_set_wedged                 HW active? 0x5
>>> <7>[  239.100250] i915_gem_set_wedged                 E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
>>> <7>[  239.100338] i915_gem_set_wedged                 E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
>>> <7>[  239.100340] i915_gem_set_wedged                 Queue priority: 139
>>> <7>[  239.100343] i915_gem_set_wedged                 Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
>>> <7>[  239.100346] i915_gem_set_wedged                 Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
>>> <7>[  239.100349] i915_gem_set_wedged                 Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
>>> <7>[  239.100352] i915_gem_set_wedged                 Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
>>> <7>[  239.100356] i915_gem_set_wedged                 Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
>>> <7>[  239.100362] i915_gem_set_wedged         drv_selftest [5894] waiting for 19a99
>>>
>>> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
>>> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
>>> try clearing it.
>>>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> ---
>>>    drivers/gpu/drm/i915/intel_uncore.c | 2 ++
>>>    1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
>>> index b36a3b5736a0..082b0045ac8c 100644
>>> --- a/drivers/gpu/drm/i915/intel_uncore.c
>>> +++ b/drivers/gpu/drm/i915/intel_uncore.c
>>> @@ -1720,6 +1720,8 @@ static void gen3_stop_engine(struct intel_engine_cs *engine)
>>>        if (I915_READ_FW(RING_HEAD(base)) != 0)
>>>                DRM_DEBUG_DRIVER("%s: ring head not parked\n",
>>>                                 engine->name);
>>> +
>>> +     I915_WRITE_FW(RING_MI_MODE(base), _MASKED_BIT_DISABLE(STOP_RING));
>>>    }
>>>    
>>>    static void i915_stop_engines(struct drm_i915_private *dev_priv,
>>>
>>
>> Right, so expectation is after reset STOP_RING will not be set, but it
>> sometimes is?
> 
> Yes.

Also, there is a comment in there which says the engine must be stopped
before reset on some platforms. So shouldn't the manual attempt to
unstick it go after the reset and not in gen3_stop_engine?

Like the suggested i915_check_engine_running:

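if (I915_READ_FW(RING_MI_MODE(base)) & STOP_RING) {
	DRM_NOTE("%s: manually starting engine after reset\n",
		 engine->name);
	I915_WRITE_FW(RING_MI_MODE(base),
		      _MASKED_BIT_DISABLE(STOP_RING));
}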

?
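
To make the proposed ordering concrete, here is a minimal userspace sketch of a check that runs after the reset, reports a stuck STOP_RING and clears it. It is only a model: struct fake_engine and fake_check_engine_running() are hypothetical names, the register is a plain variable rather than MMIO, and the STOP_RING bit position is assumed to be bit 8, which matches the RING_MODE: 0x00000300 value in the dump but should be checked against i915_reg.h.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define STOP_RING (1u << 8) /* assumed bit position, see lead-in */

/* Hypothetical stand-in for an engine; mi_mode models RING_MI_MODE. */
struct fake_engine {
	const char *name;
	uint32_t mi_mode;
};

/*
 * Model of the suggested post-reset check: if STOP_RING is still set,
 * say so and clear it. Returns true if the bit had to be cleared.
 */
static bool fake_check_engine_running(struct fake_engine *engine)
{
	if (!(engine->mi_mode & STOP_RING))
		return false;

	fprintf(stderr, "%s: manually starting engine after reset\n",
		engine->name);
	engine->mi_mode &= ~STOP_RING;
	return true;
}
```

Returning a bool lets the caller decide whether a stuck bit should merely be logged or should also count as a failed reset.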


>> Should we also add a notice or info if it is set in intel_gpu_reset,
>> after the reset is called? Could add i915_check_engine_running(..)
>> helper or something.
> 
> Could do, will be any more useful than the dump we give above? ;)
> Might be interesting to do the dump on unusual state on takeover just
> for reference in later fails.
> -Chris
> 

* Re: [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset
  2018-05-17 15:47 ` [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset Chris Wilson
@ 2018-05-18  7:29   ` Chris Wilson
  0 siblings, 0 replies; 13+ messages in thread
From: Chris Wilson @ 2018-05-18  7:29 UTC (permalink / raw)
  To: intel-gfx

Quoting Chris Wilson (2018-05-17 16:47:26)
> Inside the live_hangcheck (reset) selftests, we occasionally see
> failures like
> 
> <7>[  239.094840] i915_gem_set_wedged rcs0
> <7>[  239.094843] i915_gem_set_wedged   current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
> <7>[  239.094846] i915_gem_set_wedged   Reset count: 6239 (global 1)
> <7>[  239.094848] i915_gem_set_wedged   Requests:
> <7>[  239.095052] i915_gem_set_wedged           first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> <7>[  239.095056] i915_gem_set_wedged           last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
> <7>[  239.095059] i915_gem_set_wedged           active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
> <7>[  239.095062] i915_gem_set_wedged           [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
> <7>[  239.100050] i915_gem_set_wedged           ring->start:  0x00283000
> <7>[  239.100053] i915_gem_set_wedged           ring->head:   0x000001f8
> <7>[  239.100055] i915_gem_set_wedged           ring->tail:   0x000002a8
> <7>[  239.100057] i915_gem_set_wedged           ring->emit:   0x000002a8
> <7>[  239.100059] i915_gem_set_wedged           ring->space:  0x00000f10
> <7>[  239.100085] i915_gem_set_wedged   RING_START: 0x00283000
> <7>[  239.100088] i915_gem_set_wedged   RING_HEAD:  0x00000260
> <7>[  239.100091] i915_gem_set_wedged   RING_TAIL:  0x000002a8
> <7>[  239.100094] i915_gem_set_wedged   RING_CTL:   0x00000001
> <7>[  239.100097] i915_gem_set_wedged   RING_MODE:  0x00000300 [idle]
> <7>[  239.100100] i915_gem_set_wedged   RING_IMR: fffffefe
> <7>[  239.100104] i915_gem_set_wedged   ACTHD:  0x00000000_0000609c
> <7>[  239.100108] i915_gem_set_wedged   BBADDR: 0x00000000_0000609d
> <7>[  239.100111] i915_gem_set_wedged   DMA_FADDR: 0x00000000_00283260
> <7>[  239.100114] i915_gem_set_wedged   IPEIR: 0x00000000
> <7>[  239.100117] i915_gem_set_wedged   IPEHR: 0x02800000
> <7>[  239.100120] i915_gem_set_wedged   Execlist status: 0x00044052 00000002
> <7>[  239.100124] i915_gem_set_wedged   Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
> <7>[  239.100128] i915_gem_set_wedged           ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> <7>[  239.100132] i915_gem_set_wedged           ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> <7>[  239.100135] i915_gem_set_wedged           HW active? 0x5
> <7>[  239.100250] i915_gem_set_wedged           E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
> <7>[  239.100338] i915_gem_set_wedged           E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
> <7>[  239.100340] i915_gem_set_wedged           Queue priority: 139
> <7>[  239.100343] i915_gem_set_wedged           Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
> <7>[  239.100346] i915_gem_set_wedged           Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
> <7>[  239.100349] i915_gem_set_wedged           Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
> <7>[  239.100352] i915_gem_set_wedged           Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
> <7>[  239.100356] i915_gem_set_wedged           Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
> <7>[  239.100362] i915_gem_set_wedged   drv_selftest [5894] waiting for 19a99
> 
> where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
> The RING_MODE indicates that is idle and has the STOP_RING bit set, so
> try clearing it.
> 
> v2: Only clear the bit on restarting the ring, as we want to be sure the
> STOP_RING bit is kept if reset fails on wedging.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

2/2 passes, it might not just be a coincidence! Please kindly review,
-Chris

> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 646ecf267411..211585187d2f 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1773,6 +1773,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
>                 I915_WRITE(RING_MODE_GEN7(engine),
>                            _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
>  
> +       I915_WRITE(RING_MI_MODE(engine->mmio_base),
> +                  _MASKED_BIT_DISABLE(STOP_RING));
> +
>         I915_WRITE(RING_HWS_PGA(engine->mmio_base),
>                    engine->status_page.ggtt_offset);
>         POSTING_READ(RING_HWS_PGA(engine->mmio_base));
> -- 
> 2.17.0
> 

* [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset
  2018-05-17 15:47 Chris Wilson
@ 2018-05-17 15:47 ` Chris Wilson
  2018-05-18  7:29   ` Chris Wilson
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Wilson @ 2018-05-17 15:47 UTC (permalink / raw)
  To: intel-gfx

Inside the live_hangcheck (reset) selftests, we occasionally see
failures like

<7>[  239.094840] i915_gem_set_wedged rcs0
<7>[  239.094843] i915_gem_set_wedged 	current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
<7>[  239.094846] i915_gem_set_wedged 	Reset count: 6239 (global 1)
<7>[  239.094848] i915_gem_set_wedged 	Requests:
<7>[  239.095052] i915_gem_set_wedged 		first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
<7>[  239.095056] i915_gem_set_wedged 		last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
<7>[  239.095059] i915_gem_set_wedged 		active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
<7>[  239.095062] i915_gem_set_wedged 		[head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
<7>[  239.100050] i915_gem_set_wedged 		ring->start:  0x00283000
<7>[  239.100053] i915_gem_set_wedged 		ring->head:   0x000001f8
<7>[  239.100055] i915_gem_set_wedged 		ring->tail:   0x000002a8
<7>[  239.100057] i915_gem_set_wedged 		ring->emit:   0x000002a8
<7>[  239.100059] i915_gem_set_wedged 		ring->space:  0x00000f10
<7>[  239.100085] i915_gem_set_wedged 	RING_START: 0x00283000
<7>[  239.100088] i915_gem_set_wedged 	RING_HEAD:  0x00000260
<7>[  239.100091] i915_gem_set_wedged 	RING_TAIL:  0x000002a8
<7>[  239.100094] i915_gem_set_wedged 	RING_CTL:   0x00000001
<7>[  239.100097] i915_gem_set_wedged 	RING_MODE:  0x00000300 [idle]
<7>[  239.100100] i915_gem_set_wedged 	RING_IMR: fffffefe
<7>[  239.100104] i915_gem_set_wedged 	ACTHD:  0x00000000_0000609c
<7>[  239.100108] i915_gem_set_wedged 	BBADDR: 0x00000000_0000609d
<7>[  239.100111] i915_gem_set_wedged 	DMA_FADDR: 0x00000000_00283260
<7>[  239.100114] i915_gem_set_wedged 	IPEIR: 0x00000000
<7>[  239.100117] i915_gem_set_wedged 	IPEHR: 0x02800000
<7>[  239.100120] i915_gem_set_wedged 	Execlist status: 0x00044052 00000002
<7>[  239.100124] i915_gem_set_wedged 	Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
<7>[  239.100128] i915_gem_set_wedged 		ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
<7>[  239.100132] i915_gem_set_wedged 		ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
<7>[  239.100135] i915_gem_set_wedged 		HW active? 0x5
<7>[  239.100250] i915_gem_set_wedged 		E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
<7>[  239.100338] i915_gem_set_wedged 		E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
<7>[  239.100340] i915_gem_set_wedged 		Queue priority: 139
<7>[  239.100343] i915_gem_set_wedged 		Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
<7>[  239.100346] i915_gem_set_wedged 		Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
<7>[  239.100349] i915_gem_set_wedged 		Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
<7>[  239.100352] i915_gem_set_wedged 		Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
<7>[  239.100356] i915_gem_set_wedged 		Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
<7>[  239.100362] i915_gem_set_wedged 	drv_selftest [5894] waiting for 19a99

where the GPU saw an arbitration point and idled; AND HAS NOT BEEN RESET!
The RING_MODE indicates that it is idle and has the STOP_RING bit set, so
try clearing it.

v2: Only clear the bit on restarting the ring, as we want to be sure the
STOP_RING bit is kept if reset fails on wedging.
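
For reference, the RING_MODE: 0x00000300 value in the dump and the effect of the _MASKED_BIT_DISABLE(STOP_RING) write can be checked with a small standalone model. This is a sketch and not driver code: it assumes STOP_RING is bit 8 and MODE_IDLE is bit 9, and that the register follows the masked-write convention where the upper 16 bits of the written value select which of the lower 16 bits are updated. fake_mi_mode_write() is a hypothetical helper that stands in for the hardware.

```c
#include <stdint.h>

/* Assumed bit positions; 0x300 in the dump is both bits set. */
#define STOP_RING (1u << 8)
#define MODE_IDLE (1u << 9)

/* Masked-bit encodings: the high 16 bits name the bits to update. */
#define MASKED_BIT_ENABLE(a)  (((uint32_t)(a) << 16) | (a))
#define MASKED_BIT_DISABLE(a) ((uint32_t)(a) << 16)

/*
 * Hypothetical model of a masked register write: only the bits whose
 * mask bit is set in the upper half of val change, the rest are kept.
 */
static uint32_t fake_mi_mode_write(uint32_t reg, uint32_t val)
{
	uint32_t mask = val >> 16;

	return (reg & ~mask) | (val & mask);
}
```

Under these assumptions, writing MASKED_BIT_DISABLE(STOP_RING) to a register holding 0x300 leaves 0x200: the stop bit is cleared and the idle bit is untouched, so no read-modify-write is needed.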

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_lrc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 646ecf267411..211585187d2f 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1773,6 +1773,9 @@ static void enable_execlists(struct intel_engine_cs *engine)
 		I915_WRITE(RING_MODE_GEN7(engine),
 			   _MASKED_BIT_ENABLE(GFX_RUN_LIST_ENABLE));
 
+	I915_WRITE(RING_MI_MODE(engine->mmio_base),
+		   _MASKED_BIT_DISABLE(STOP_RING));
+
 	I915_WRITE(RING_HWS_PGA(engine->mmio_base),
 		   engine->status_page.ggtt_offset);
 	POSTING_READ(RING_HWS_PGA(engine->mmio_base));
-- 
2.17.0


end of thread, other threads:[~2018-05-18 10:02 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
2018-05-17 14:24 [PATCH 1/2] drm/i915/selftests: Wait longer for the old active request Chris Wilson
2018-05-17 14:24 ` [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset Chris Wilson
2018-05-18  9:33   ` Tvrtko Ursulin
2018-05-18  9:47     ` Chris Wilson
2018-05-18  9:53       ` Tvrtko Ursulin
2018-05-18 10:02       ` Tvrtko Ursulin
2018-05-17 15:04 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/2] drm/i915/selftests: Wait longer for the old active request Patchwork
2018-05-17 15:20 ` ✗ Fi.CI.BAT: failure " Patchwork
2018-05-17 16:02 ` ✗ Fi.CI.CHECKPATCH: warning " Patchwork
2018-05-17 16:18 ` ✗ Fi.CI.BAT: failure " Patchwork
2018-05-18  9:22 ` [PATCH 1/2] " Tvrtko Ursulin
2018-05-17 15:47 Chris Wilson
2018-05-17 15:47 ` [PATCH 2/2] drm/i915: Flush the RING stop bit after clearing RING_HEAD in reset Chris Wilson
2018-05-18  7:29   ` Chris Wilson
