* [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
@ 2017-09-15 13:09 Chris Wilson
2017-09-15 13:29 ` Chris Wilson
` (5 more replies)
0 siblings, 6 replies; 11+ messages in thread
From: Chris Wilson @ 2017-09-15 13:09 UTC
To: intel-gfx; +Cc: Jari Tahvanainen
If we see the seqno stop progressing, we abandon the test for fear that
the GPU died following the reset. However, during test teardown we still
wait for the GPU to idle before continuing, but we have already
confirmed that the GPU is dead. Furthermore, since we are inside a reset
test, we have disabled the hangchecker, and so there is no safety net and
we wait indefinitely. Detect the stuck GPU and declare it wedged as a
state of emergency so we can escape.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
1 file changed, 20 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index 02e52a146ed8..913fe752f6b4 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
 		*batch++ = lower_32_bits(vma->node.start);
 	}
 	*batch++ = MI_BATCH_BUFFER_END; /* not reached */
+	wmb();
 
 	flags = 0;
 	if (INTEL_GEN(vm->i915) <= 5)
@@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
 	__i915_add_request(rq, true);
 
 	if (!wait_for_hang(&h, rq)) {
-		pr_err("Failed to start request %x\n", rq->fence.seqno);
+		pr_err("Failed to start request %x, at %x\n",
+		       rq->fence.seqno, hws_seqno(&h, rq));
+
+		i915_reset(i915, 0);
+		i915_gem_set_wedged(i915);
+
 		err = -EIO;
 		goto out_rq;
 	}
@@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
 			__i915_add_request(rq, true);
 
 			if (!wait_for_hang(&h, prev)) {
-				pr_err("Failed to start request %x\n",
-				       prev->fence.seqno);
+				pr_err("Failed to start request %x, at %x\n",
+				       rq->fence.seqno, hws_seqno(&h, rq));
 				i915_gem_request_put(rq);
 				i915_gem_request_put(prev);
+
+				i915_reset(i915, 0);
+				i915_gem_set_wedged(i915);
+
 				err = -EIO;
 				goto fini;
 			}
@@ -806,7 +816,12 @@ static int igt_handle_error(void *arg)
 	__i915_add_request(rq, true);
 
 	if (!wait_for_hang(&h, rq)) {
-		pr_err("Failed to start request %x\n", rq->fence.seqno);
+		pr_err("Failed to start request %x, at %x\n",
+		       rq->fence.seqno, hws_seqno(&h, rq));
+
+		i915_reset(i915, 0);
+		i915_gem_set_wedged(i915);
+
 		err = -EIO;
 		goto err_request;
 	}
@@ -843,8 +858,8 @@ static int igt_handle_error(void *arg)
 int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
 {
 	static const struct i915_subtest tests[] = {
+		SUBTEST(igt_global_reset), /* attempt to recover GPU first */
 		SUBTEST(igt_hang_sanitycheck),
-		SUBTEST(igt_global_reset),
 		SUBTEST(igt_reset_engine),
 		SUBTEST(igt_reset_active_engines),
 		SUBTEST(igt_wait_reset),
--
2.14.1
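
The escape path described in the commit message can be modelled in user-space C. This is only a sketch under stated assumptions: `struct mock_gpu`, `mock_wait_for_start()`, `mock_wait_for_idle()` and `mock_reset_test()` are hypothetical names invented for illustration and are not i915 functions. The poll caps stand in for the real timeouts, and a return of -1 stands in for the indefinite wait that the patch is trying to avoid.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space stand-in for the device; not the real i915 structs. */
struct mock_gpu {
	unsigned int hw_seqno;	/* last seqno the mock hardware reported */
	bool advancing;		/* false models a GPU that died after the reset */
	bool wedged;		/* set once we give up on the hardware */
};

/* Bounded poll for the request to start (assumed analogue of wait_for_hang()). */
static bool mock_wait_for_start(struct mock_gpu *gpu, unsigned int seqno,
				int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (gpu->advancing)
			gpu->hw_seqno = seqno;
		if (gpu->hw_seqno == seqno)
			return true;
	}
	return false;
}

/*
 * Teardown: wait for idle, but stop waiting as soon as the device is wedged.
 * Returns the number of polls spent, or -1 when the cap is hit. The cap only
 * exists so the demo terminates; it models a wait with no safety net.
 */
static int mock_wait_for_idle(const struct mock_gpu *gpu, unsigned int seqno,
			      int cap)
{
	for (int polls = 0; polls < cap; polls++) {
		if (gpu->wedged || gpu->hw_seqno == seqno)
			return polls;
	}
	return -1;
}

/* One reset-test iteration; declare_wedged selects the behaviour of the patch. */
static int mock_reset_test(struct mock_gpu *gpu, unsigned int seqno,
			   bool declare_wedged)
{
	if (!mock_wait_for_start(gpu, seqno, 10)) {
		fprintf(stderr, "Failed to start request %x, at %x\n",
			seqno, gpu->hw_seqno);
		if (declare_wedged)
			gpu->wedged = true;	/* state of emergency */
	}
	return mock_wait_for_idle(gpu, seqno, 1000);
}
```

With a dead mock GPU, `mock_reset_test(&gpu, 2, false)` returns -1 because teardown never completes, whereas passing `true` marks the device wedged and teardown returns at once.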
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
@ 2017-09-15 13:29 ` Chris Wilson
2017-09-15 13:31 ` ✓ Fi.CI.BAT: success for " Patchwork
` (4 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2017-09-15 13:29 UTC
To: intel-gfx; +Cc: Jari Tahvanainen
Quoting Chris Wilson (2017-09-15 14:09:29)
> If we see the seqno stop progressing, we abandon the test for fear that
> the GPU died following the reset. However, during test teardown we still
> wait for the GPU to idle before continuing, but we have already
> confirmed that the GPU is dead. Furthermore, since we are inside a reset
> test, we have disabled the hangchecker, and so there is no safety net and
> we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> state of emergency so we can escape.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
> drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
> 1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 02e52a146ed8..913fe752f6b4 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
> *batch++ = lower_32_bits(vma->node.start);
> }
> *batch++ = MI_BATCH_BUFFER_END; /* not reached */
> + wmb();
>
> flags = 0;
> if (INTEL_GEN(vm->i915) <= 5)
> @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, rq)) {
> - pr_err("Failed to start request %x\n", rq->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto out_rq;
> }
> @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, prev)) {
> - pr_err("Failed to start request %x\n",
> - prev->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
Odd one out needs s/rq/prev/
-Chris
* ✓ Fi.CI.BAT: success for drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
2017-09-15 13:29 ` Chris Wilson
@ 2017-09-15 13:31 ` Patchwork
2017-09-15 15:04 ` ✓ Fi.CI.IGT: " Patchwork
` (3 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Patchwork @ 2017-09-15 13:31 UTC
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915/selftests: Try to recover from a wedged GPU during reset tests
URL : https://patchwork.freedesktop.org/series/30419/
State : success
== Summary ==
Series 30419v1 drm/i915/selftests: Try to recover from a wedged GPU during reset tests
https://patchwork.freedesktop.org/api/1.0/series/30419/revisions/1/mbox/
Test chamelium:
Subgroup dp-crc-fast:
pass -> FAIL (fi-kbl-7500u) fdo#102514
Test kms_cursor_legacy:
Subgroup basic-busy-flip-before-cursor-atomic:
fail -> PASS (fi-snb-2600) fdo#100215 +1
Test pm_rpm:
Subgroup basic-rte:
pass -> DMESG-WARN (fi-cfl-s) fdo#102294
fdo#102514 https://bugs.freedesktop.org/show_bug.cgi?id=102514
fdo#100215 https://bugs.freedesktop.org/show_bug.cgi?id=100215
fdo#102294 https://bugs.freedesktop.org/show_bug.cgi?id=102294
fi-bdw-5557u total:289 pass:268 dwarn:0 dfail:0 fail:0 skip:21 time:444s
fi-bdw-gvtdvm total:289 pass:265 dwarn:0 dfail:0 fail:0 skip:24 time:456s
fi-blb-e6850 total:289 pass:224 dwarn:1 dfail:0 fail:0 skip:64 time:379s
fi-bsw-n3050 total:289 pass:243 dwarn:0 dfail:0 fail:0 skip:46 time:528s
fi-bwr-2160 total:289 pass:184 dwarn:0 dfail:0 fail:0 skip:105 time:268s
fi-bxt-j4205 total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:504s
fi-byt-j1900 total:289 pass:254 dwarn:1 dfail:0 fail:0 skip:34 time:504s
fi-byt-n2820 total:289 pass:250 dwarn:1 dfail:0 fail:0 skip:38 time:493s
fi-cfl-s total:289 pass:222 dwarn:35 dfail:0 fail:0 skip:32 time:543s
fi-elk-e7500 total:289 pass:230 dwarn:0 dfail:0 fail:0 skip:59 time:414s
fi-glk-2a total:289 pass:260 dwarn:0 dfail:0 fail:0 skip:29 time:600s
fi-hsw-4770 total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:429s
fi-hsw-4770r total:289 pass:263 dwarn:0 dfail:0 fail:0 skip:26 time:408s
fi-ilk-650 total:289 pass:229 dwarn:0 dfail:0 fail:0 skip:60 time:436s
fi-ivb-3520m total:289 pass:261 dwarn:0 dfail:0 fail:0 skip:28 time:484s
fi-ivb-3770 total:289 pass:261 dwarn:0 dfail:0 fail:0 skip:28 time:468s
fi-kbl-7500u total:289 pass:263 dwarn:1 dfail:0 fail:1 skip:24 time:485s
fi-kbl-7560u total:289 pass:270 dwarn:0 dfail:0 fail:0 skip:19 time:586s
fi-kbl-r total:289 pass:262 dwarn:0 dfail:0 fail:0 skip:27 time:588s
fi-pnv-d510 total:289 pass:223 dwarn:1 dfail:0 fail:0 skip:65 time:553s
fi-skl-6260u total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:458s
fi-skl-6700k total:289 pass:265 dwarn:0 dfail:0 fail:0 skip:24 time:521s
fi-skl-6770hq total:289 pass:269 dwarn:0 dfail:0 fail:0 skip:20 time:493s
fi-skl-gvtdvm total:289 pass:266 dwarn:0 dfail:0 fail:0 skip:23 time:457s
fi-skl-x1585l total:289 pass:268 dwarn:0 dfail:0 fail:0 skip:21 time:474s
fi-snb-2520m total:289 pass:251 dwarn:0 dfail:0 fail:0 skip:38 time:570s
fi-snb-2600 total:289 pass:250 dwarn:0 dfail:0 fail:0 skip:39 time:433s
9adc9e93d6243c82bcefd175c2d11770802de194 drm-tip: 2017y-09m-15d-11h-44m-46s UTC integration manifest
d69539e5cad5 drm/i915/selftests: Try to recover from a wedged GPU during reset tests
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5712/
* ✓ Fi.CI.IGT: success for drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
2017-09-15 13:29 ` Chris Wilson
2017-09-15 13:31 ` ✓ Fi.CI.BAT: success for " Patchwork
@ 2017-09-15 15:04 ` Patchwork
2017-09-19 14:18 ` [PATCH] " Chris Wilson
` (2 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Patchwork @ 2017-09-15 15:04 UTC
To: Chris Wilson; +Cc: intel-gfx
== Series Details ==
Series: drm/i915/selftests: Try to recover from a wedged GPU during reset tests
URL : https://patchwork.freedesktop.org/series/30419/
State : success
== Summary ==
Test perf:
Subgroup polling:
pass -> FAIL (shard-hsw) fdo#102252 +1
fdo#102252 https://bugs.freedesktop.org/show_bug.cgi?id=102252
shard-hsw total:2313 pass:1245 dwarn:0 dfail:0 fail:13 skip:1055 time:9350s
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5712/shards.html
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
` (2 preceding siblings ...)
2017-09-15 15:04 ` ✓ Fi.CI.IGT: " Patchwork
@ 2017-09-19 14:18 ` Chris Wilson
2017-09-19 14:24 ` Tahvanainen, Jari
2017-09-25 12:01 ` Chris Wilson
2017-09-26 12:48 ` Mika Kuoppala
5 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2017-09-19 14:18 UTC
To: intel-gfx; +Cc: Jari Tahvanainen
Quoting Chris Wilson (2017-09-15 14:09:29)
> If we see the seqno stop progressing, we abandon the test for fear that
> the GPU died following the reset. However, during test teardown we still
> wait for the GPU to idle before continuing, but we have already
> confirmed that the GPU is dead. Furthermore, since we are inside a reset
> test, we have disabled the hangchecker, and so there is no safety net and
> we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> state of emergency so we can escape.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Ping?
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-19 14:18 ` [PATCH] " Chris Wilson
@ 2017-09-19 14:24 ` Tahvanainen, Jari
2017-09-19 14:33 ` Chris Wilson
0 siblings, 1 reply; 11+ messages in thread
From: Tahvanainen, Jari @ 2017-09-19 14:24 UTC
To: Chris Wilson, intel-gfx
-----Original Message-----
From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
Sent: Tuesday, September 19, 2017 5:19 PM
To: intel-gfx@lists.freedesktop.org
Cc: Tahvanainen, Jari <jari.tahvanainen@intel.com>; Mika Kuoppala <mika.kuoppala@linux.intel.com>
Subject: Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
Quoting Chris Wilson (2017-09-15 14:09:29)
> If we see the seqno stop progressing, we abandon the test for fear
> that the GPU died following the reset. However, during test teardown
> we still wait for the GPU to idle before continuing, but we have
> already confirmed that the GPU is dead. Furthermore, since we are
> inside a reset test, we have disabled the hangchecker, and so there is
> no safety net and we wait indefinitely. Detect the stuck GPU and
> declare it wedged as a state of emergency so we can escape.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>Ping?
Sorry for the late answer, Chris. I tried to get in touch with you earlier through IRC.
I merged the series on top of drm-tip and ran it on HSW: it no longer hangs, and the subtest now reports FAIL.
(drv_selftest:6304) igt-kmod-CRITICAL: Test assertion failure function igt_kselftest_execute, file igt_kmod.c:513:
(drv_selftest:6304) igt-kmod-CRITICAL: Failed assertion: err == 0
(drv_selftest:6304) igt-kmod-CRITICAL: kselftest "i915 igt__19__live_hangcheck=1 live_selftests=-1" failed: Input/output error [5]
(drv_selftest:6304) igt-core-INFO: Stack trace:
(drv_selftest:6304) igt-core-INFO: #0 [__igt_fail_assert+0x101]
(drv_selftest:6304) igt-core-INFO: #1 [igt_kselftest_execute+0x296]
(drv_selftest:6304) igt-core-INFO: #2 [igt_kselftests+0x295]
(drv_selftest:6304) igt-core-INFO: #3 [main+0x5f]
(drv_selftest:6304) igt-core-INFO: #4 [__libc_start_main+0xf1]
(drv_selftest:6304) igt-core-INFO: #5 [_start+0x2a]
(drv_selftest:6304) igt-core-INFO: #6 [<unknown>+0x2a]
**** END ****
Stack trace:
#0 [__igt_fail_assert+0x101]
#1 [igt_kselftest_execute+0x296]
#2 [igt_kselftests+0x295]
#3 [main+0x5f]
#4 [__libc_start_main+0xf1]
#5 [_start+0x2a]
#6 [<unknown>+0x2a]
Subtest live_hangcheck: FAIL (1.911s)
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-19 14:24 ` Tahvanainen, Jari
@ 2017-09-19 14:33 ` Chris Wilson
0 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2017-09-19 14:33 UTC
To: Tahvanainen, Jari, intel-gfx
Quoting Tahvanainen, Jari (2017-09-19 15:24:22)
> -----Original Message-----
> From: Chris Wilson [mailto:chris@chris-wilson.co.uk]
> Sent: Tuesday, September 19, 2017 5:19 PM
> To: intel-gfx@lists.freedesktop.org
> Cc: Tahvanainen, Jari <jari.tahvanainen@intel.com>; Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Subject: Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
>
> Quoting Chris Wilson (2017-09-15 14:09:29)
> > If we see the seqno stop progressing, we abandon the test for fear
> > that the GPU died following the reset. However, during test teardown
> > we still wait for the GPU to idle before continuing, but we have
> > already confirmed that the GPU is dead. Furthermore, since we are
> > inside a reset test, we have disabled the hangchecker, and so there is
> > no safety net and we wait indefinitely. Detect the stuck GPU and
> > declare it wedged as a state of emergency so we can escape.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>
> >Ping?
>
> Sorry for the late answer, Chris. I tried to get in touch with you earlier through IRC.
> I merged the series on top of drm-tip and ran it on HSW: it no longer hangs, and the subtest now reports FAIL.
>
> (drv_selftest:6304) igt-kmod-CRITICAL: Test assertion failure function igt_kselftest_execute, file igt_kmod.c:513:
> (drv_selftest:6304) igt-kmod-CRITICAL: Failed assertion: err == 0
> (drv_selftest:6304) igt-kmod-CRITICAL: kselftest "i915 igt__19__live_hangcheck=1 live_selftests=-1" failed: Input/output error [5]
> (drv_selftest:6304) igt-core-INFO: Stack trace:
> (drv_selftest:6304) igt-core-INFO: #0 [__igt_fail_assert+0x101]
> (drv_selftest:6304) igt-core-INFO: #1 [igt_kselftest_execute+0x296]
> (drv_selftest:6304) igt-core-INFO: #2 [igt_kselftests+0x295]
> (drv_selftest:6304) igt-core-INFO: #3 [main+0x5f]
> (drv_selftest:6304) igt-core-INFO: #4 [__libc_start_main+0xf1]
> (drv_selftest:6304) igt-core-INFO: #5 [_start+0x2a]
> (drv_selftest:6304) igt-core-INFO: #6 [<unknown>+0x2a]
> **** END ****
> Stack trace:
> #0 [__igt_fail_assert+0x101]
> #1 [igt_kselftest_execute+0x296]
> #2 [igt_kselftests+0x295]
> #3 [main+0x5f]
> #4 [__libc_start_main+0xf1]
> #5 [_start+0x2a]
> #6 [<unknown>+0x2a]
> Subtest live_hangcheck: FAIL (1.911s)
That's what it is meant to do: stop the failure from freezing the machine.
I'll take that as a
Tested-by: Jari Tahvanainen <jari.tahvanainen@intel.com>
-Chris
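
The `Input/output error [5]` in the log above appears to be the errno form of the `-EIO` that the subtest returns on this path. A small user-space sketch of that mapping, assuming Linux (where `EIO` is 5): `format_errno()` is a hypothetical helper, not an IGT function, and the exact message text comes from the C library's `strerror()`.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: render an errno value as "<message> [<number>]",
 * the same shape as the igt-kmod line quoted above.
 * Returns the number of characters snprintf() would have written.
 */
static int format_errno(int err, char *buf, size_t len)
{
	return snprintf(buf, len, "%s [%d]", strerror(err), err);
}
```

Calling `format_errno(EIO, buf, sizeof(buf))` fills `buf` with the library's text for `EIO` followed by `[5]`, so a FAIL with this error code is the expected outcome once the test bails out instead of hanging.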
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
` (3 preceding siblings ...)
2017-09-19 14:18 ` [PATCH] " Chris Wilson
@ 2017-09-25 12:01 ` Chris Wilson
2017-09-26 12:48 ` Mika Kuoppala
5 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2017-09-25 12:01 UTC
To: intel-gfx; +Cc: Jari Tahvanainen
Quoting Chris Wilson (2017-09-15 14:09:29)
> If we see the seqno stop progressing, we abandon the test for fear that
> the GPU died following the reset. However, during test teardown we still
> wait for the GPU to idle before continuing, but we have already
> confirmed that the GPU is dead. Furthermore, since we are inside a reset
> test, we have disabled the hangchecker, and so there is no safety net and
> we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> state of emergency so we can escape.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Ping? We now have CI coverage of kselftests!
-Chris
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
` (4 preceding siblings ...)
2017-09-25 12:01 ` Chris Wilson
@ 2017-09-26 12:48 ` Mika Kuoppala
2017-09-26 13:03 ` Chris Wilson
5 siblings, 1 reply; 11+ messages in thread
From: Mika Kuoppala @ 2017-09-26 12:48 UTC
To: Chris Wilson, intel-gfx; +Cc: Jari Tahvanainen
Chris Wilson <chris@chris-wilson.co.uk> writes:
> If we see the seqno stop progressing, we abandon the test for fear that
> the GPU died following the reset. However, during test teardown we still
> wait for the GPU to idle before continuing, but we have already
> confirmed that the GPU is dead. Furthermore, since we are inside a reset
> test, we have disabled the hangchecker, and so there is no safety net and
> we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> state of emergency so we can escape.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
> drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
> 1 file changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 02e52a146ed8..913fe752f6b4 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
> *batch++ = lower_32_bits(vma->node.start);
> }
> *batch++ = MI_BATCH_BUFFER_END; /* not reached */
> + wmb();
>
Why not the big hammer with i915_gem_chipset_flush() here?
> flags = 0;
> if (INTEL_GEN(vm->i915) <= 5)
> @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, rq)) {
> - pr_err("Failed to start request %x\n", rq->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto out_rq;
> }
> @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, prev)) {
> - pr_err("Failed to start request %x\n",
> - prev->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
As you pointed out, the debug output here is for the wrong request.
Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> i915_gem_request_put(rq);
> i915_gem_request_put(prev);
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto fini;
> }
> @@ -806,7 +816,12 @@ static int igt_handle_error(void *arg)
> __i915_add_request(rq, true);
>
> if (!wait_for_hang(&h, rq)) {
> - pr_err("Failed to start request %x\n", rq->fence.seqno);
> + pr_err("Failed to start request %x, at %x\n",
> + rq->fence.seqno, hws_seqno(&h, rq));
> +
> + i915_reset(i915, 0);
> + i915_gem_set_wedged(i915);
> +
> err = -EIO;
> goto err_request;
> }
> @@ -843,8 +858,8 @@ static int igt_handle_error(void *arg)
> int intel_hangcheck_live_selftests(struct drm_i915_private *i915)
> {
> static const struct i915_subtest tests[] = {
> + SUBTEST(igt_global_reset), /* attempt to recover GPU first */
> SUBTEST(igt_hang_sanitycheck),
> - SUBTEST(igt_global_reset),
> SUBTEST(igt_reset_engine),
> SUBTEST(igt_reset_active_engines),
> SUBTEST(igt_wait_reset),
> --
> 2.14.1
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-26 12:48 ` Mika Kuoppala
@ 2017-09-26 13:03 ` Chris Wilson
2017-09-26 13:36 ` Mika Kuoppala
0 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2017-09-26 13:03 UTC
To: Mika Kuoppala, intel-gfx; +Cc: Jari Tahvanainen
Quoting Mika Kuoppala (2017-09-26 13:48:17)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
>
> > If we see the seqno stop progressing, we abandon the test for fear that
> > the GPU died following the reset. However, during test teardown we still
> > wait for the GPU to idle before continuing, but we have already
> > confirmed that the GPU is dead. Furthermore, since we are inside a reset
> > test, we have disabled the hangchecker, and so there is no safety net and
> > we wait indefinitely. Detect the stuck GPU and declare it wedged as a
> > state of emergency so we can escape.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > ---
> > drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
> > 1 file changed, 20 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > index 02e52a146ed8..913fe752f6b4 100644
> > --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> > @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
> > *batch++ = lower_32_bits(vma->node.start);
> > }
> > *batch++ = MI_BATCH_BUFFER_END; /* not reached */
> > + wmb();
> >
>
> Why not the big hammer with i915_gem_chipset_flush() here?
It didn't cross my mind, I was just doodling :)
>
> > flags = 0;
> > if (INTEL_GEN(vm->i915) <= 5)
> > @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
> > __i915_add_request(rq, true);
> >
> > if (!wait_for_hang(&h, rq)) {
> > - pr_err("Failed to start request %x\n", rq->fence.seqno);
> > + pr_err("Failed to start request %x, at %x\n",
> > + rq->fence.seqno, hws_seqno(&h, rq));
> > +
> > + i915_reset(i915, 0);
> > + i915_gem_set_wedged(i915);
> > +
> > err = -EIO;
> > goto out_rq;
> > }
> > @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
> > __i915_add_request(rq, true);
> >
> > if (!wait_for_hang(&h, prev)) {
> > - pr_err("Failed to start request %x\n",
> > - prev->fence.seqno);
> > + pr_err("Failed to start request %x, at %x\n",
> > + rq->fence.seqno, hws_seqno(&h, rq));
>
> As you pointed out, the debug output here is for the wrong request.
>
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Happy if I drop the wmb() for a later patch and replace it with a
chipset flush instead?
-Chris
* Re: [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests
2017-09-26 13:03 ` Chris Wilson
@ 2017-09-26 13:36 ` Mika Kuoppala
0 siblings, 0 replies; 11+ messages in thread
From: Mika Kuoppala @ 2017-09-26 13:36 UTC
To: Chris Wilson, intel-gfx; +Cc: Jari Tahvanainen
Chris Wilson <chris@chris-wilson.co.uk> writes:
> Quoting Mika Kuoppala (2017-09-26 13:48:17)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>>
>> > If we see the seqno stop progressing, we abandon the test for fear that
>> > the GPU died following the reset. However, during test teardown we still
>> > wait for the GPU to idle before continuing, but we have already
>> > confirmed that the GPU is dead. Furthermore, since we are inside a reset
>> > test, we have disabled the hangchecker, and so there is no safety net and
>> > we wait indefinitely. Detect the stuck GPU and declare it wedged as a
>> > state of emergency so we can escape.
>> >
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> > Cc: Jari Tahvanainen <jari.tahvanainen@intel.com>
>> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> > ---
>> > drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 25 +++++++++++++++++++-----
>> > 1 file changed, 20 insertions(+), 5 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
>> > index 02e52a146ed8..913fe752f6b4 100644
>> > --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
>> > +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
>> > @@ -165,6 +165,7 @@ static int emit_recurse_batch(struct hang *h,
>> > *batch++ = lower_32_bits(vma->node.start);
>> > }
>> > *batch++ = MI_BATCH_BUFFER_END; /* not reached */
>> > + wmb();
>> >
>>
>> Why not the big hammer with i915_gem_chipset_flush() here?
>
> It didn't cross my mind, I was just doodling :)
>
>>
>> > flags = 0;
>> > if (INTEL_GEN(vm->i915) <= 5)
>> > @@ -621,7 +622,12 @@ static int igt_wait_reset(void *arg)
>> > __i915_add_request(rq, true);
>> >
>> > if (!wait_for_hang(&h, rq)) {
>> > - pr_err("Failed to start request %x\n", rq->fence.seqno);
>> > + pr_err("Failed to start request %x, at %x\n",
>> > + rq->fence.seqno, hws_seqno(&h, rq));
>> > +
>> > + i915_reset(i915, 0);
>> > + i915_gem_set_wedged(i915);
>> > +
>> > err = -EIO;
>> > goto out_rq;
>> > }
>> > @@ -708,10 +714,14 @@ static int igt_reset_queue(void *arg)
>> > __i915_add_request(rq, true);
>> >
>> > if (!wait_for_hang(&h, prev)) {
>> > - pr_err("Failed to start request %x\n",
>> > - prev->fence.seqno);
>> > + pr_err("Failed to start request %x, at %x\n",
>> > + rq->fence.seqno, hws_seqno(&h, rq));
>>
>> As you pointed out, the debug output here is for the wrong request.
>>
>> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>
> Happy if I drop the wmb() for a later patch and replace it with a
> chipset flush instead?
Will be happy.
-Mika
> -Chris
Thread overview: 11+ messages
2017-09-15 13:09 [PATCH] drm/i915/selftests: Try to recover from a wedged GPU during reset tests Chris Wilson
2017-09-15 13:29 ` Chris Wilson
2017-09-15 13:31 ` ✓ Fi.CI.BAT: success for " Patchwork
2017-09-15 15:04 ` ✓ Fi.CI.IGT: " Patchwork
2017-09-19 14:18 ` [PATCH] " Chris Wilson
2017-09-19 14:24 ` Tahvanainen, Jari
2017-09-19 14:33 ` Chris Wilson
2017-09-25 12:01 ` Chris Wilson
2017-09-26 12:48 ` Mika Kuoppala
2017-09-26 13:03 ` Chris Wilson
2017-09-26 13:36 ` Mika Kuoppala