All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context
@ 2018-03-30 13:18 Chris Wilson
  2018-03-30 13:30 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
                   ` (7 more replies)
  0 siblings, 8 replies; 11+ messages in thread
From: Chris Wilson @ 2018-03-30 13:18 UTC (permalink / raw)
  To: intel-gfx

We don't handle resetting the kernel context very well, or presumably any
context executing its breadcrumb commands in the ring as opposed to the
batchbuffer and flush. If we trigger a device reset twice in quick
succession while the kernel context is executing, we may end up skipping
the breadcrumb.  This is really only a problem for the selftest as
normally there is a large interlude between resets (hangcheck), or we
focus on resetting just one engine and so avoid repeatedly resetting
innocents.

Something to try would be a preempt-to-idle to quiesce the engine before
reset, so that innocent contexts would be spared the reset.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
CC: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c                  |  3 ++
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 48 ++++++++++++++++++++++--
 2 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index d354627882e3..684060ed8db6 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1886,6 +1886,8 @@ void i915_reset(struct drm_i915_private *i915)
 	int ret;
 	int i;
 
+	GEM_TRACE("flags=%lx\n", error->flags);
+
 	might_sleep();
 	lockdep_assert_held(&i915->drm.struct_mutex);
 	GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
@@ -2016,6 +2018,7 @@ int i915_reset_engine(struct intel_engine_cs *engine, const char *msg)
 	struct i915_request *active_request;
 	int ret;
 
+	GEM_TRACE("%s flags=%lx\n", engine->name, error->flags);
 	GEM_BUG_ON(!test_bit(I915_RESET_ENGINE + engine->id, &error->flags));
 
 	active_request = i915_gem_reset_prepare_engine(engine);
diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index 9e4e0ad62724..122a32e0a69d 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -979,6 +979,23 @@ static int igt_wait_reset(void *arg)
 	return err;
 }
 
+static int wait_for_others(struct drm_i915_private *i915,
+			    struct intel_engine_cs *exclude)
+{
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+
+	for_each_engine(engine, i915, id) {
+		if (engine == exclude)
+			continue;
+
+		if (wait_for(intel_engine_is_idle(engine), 10))
+			return -EIO;
+	}
+
+	return 0;
+}
+
 static int igt_reset_queue(void *arg)
 {
 	struct drm_i915_private *i915 = arg;
@@ -1027,13 +1044,36 @@ static int igt_reset_queue(void *arg)
 			i915_request_get(rq);
 			__i915_request_add(rq, true);
 
+			/*
+			 * XXX We don't handle resetting the kernel context
+			 * very well. If we trigger a device reset twice in
+			 * quick succession while the kernel context is
+			 * executing, we may end up skipping the breadcrumb.
+			 * This is really only a problem for the selftest as
+			 * normally there is a large interlude between resets
+			 * (hangcheck), or we focus on resetting just one
+			 * engine and so avoid repeatedly resetting innocents.
+			 */
+			err = wait_for_others(i915, engine);
+			if (err) {
+				pr_err("%s(%s): Failed to idle other inactive engines after device reset\n",
+				       __func__, engine->name);
+				i915_request_put(rq);
+				i915_request_put(prev);
+
+				GEM_TRACE_DUMP();
+				i915_gem_set_wedged(i915);
+				goto fini;
+			}
+
 			if (!wait_for_hang(&h, prev)) {
 				struct drm_printer p = drm_info_printer(i915->drm.dev);
 
-				pr_err("%s: Failed to start request %x, at %x\n",
-				       __func__, prev->fence.seqno, hws_seqno(&h, prev));
-				intel_engine_dump(prev->engine, &p,
-						  "%s\n", prev->engine->name);
+				pr_err("%s(%s): Failed to start request %x, at %x\n",
+				       __func__, engine->name,
+				       prev->fence.seqno, hws_seqno(&h, prev));
+				intel_engine_dump(engine, &p,
+						  "%s\n", engine->name);
 
 				i915_request_put(rq);
 				i915_request_put(prev);
-- 
2.16.3

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* ✗ Fi.CI.CHECKPATCH: warning for drm/i915/selftests: Avoid repeatedly harming the same innocent context
  2018-03-30 13:18 [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context Chris Wilson
@ 2018-03-30 13:30 ` Patchwork
  2018-03-30 13:47 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Patchwork @ 2018-03-30 13:30 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/selftests: Avoid repeatedly harming the same innocent context
URL   : https://patchwork.freedesktop.org/series/40941/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
477c0fa0c0b5 drm/i915/selftests: Avoid repeatedly harming the same innocent context
-:57: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#57: FILE: drivers/gpu/drm/i915/selftests/intel_hangcheck.c:983:
+static int wait_for_others(struct drm_i915_private *i915,
+			    struct intel_engine_cs *exclude)

total: 0 errors, 0 warnings, 1 checks, 78 lines checked

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* ✓ Fi.CI.BAT: success for drm/i915/selftests: Avoid repeatedly harming the same innocent context
  2018-03-30 13:18 [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context Chris Wilson
  2018-03-30 13:30 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
@ 2018-03-30 13:47 ` Patchwork
  2018-03-30 14:33 ` ✗ Fi.CI.IGT: failure " Patchwork
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Patchwork @ 2018-03-30 13:47 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/selftests: Avoid repeatedly harming the same innocent context
URL   : https://patchwork.freedesktop.org/series/40941/
State : success

== Summary ==

Series 40941v1 drm/i915/selftests: Avoid repeatedly harming the same innocent context
https://patchwork.freedesktop.org/api/1.0/series/40941/revisions/1/mbox/

---- Known issues:

Test prime_vgem:
        Subgroup basic-fence-flip:
                pass       -> FAIL       (fi-ilk-650) fdo#104008

fdo#104008 https://bugs.freedesktop.org/show_bug.cgi?id=104008

fi-bdw-5557u     total:285  pass:264  dwarn:0   dfail:0   fail:0   skip:21  time:435s
fi-bdw-gvtdvm    total:285  pass:261  dwarn:0   dfail:0   fail:0   skip:24  time:441s
fi-blb-e6850     total:285  pass:220  dwarn:1   dfail:0   fail:0   skip:64  time:379s
fi-bsw-n3050     total:285  pass:239  dwarn:0   dfail:0   fail:0   skip:46  time:531s
fi-bwr-2160      total:285  pass:180  dwarn:0   dfail:0   fail:0   skip:105 time:298s
fi-bxt-dsi       total:285  pass:255  dwarn:0   dfail:0   fail:0   skip:30  time:512s
fi-bxt-j4205     total:285  pass:256  dwarn:0   dfail:0   fail:0   skip:29  time:513s
fi-byt-j1900     total:285  pass:250  dwarn:0   dfail:0   fail:0   skip:35  time:519s
fi-byt-n2820     total:285  pass:246  dwarn:0   dfail:0   fail:0   skip:39  time:506s
fi-cfl-8700k     total:285  pass:257  dwarn:0   dfail:0   fail:0   skip:28  time:412s
fi-cfl-s3        total:285  pass:259  dwarn:0   dfail:0   fail:0   skip:26  time:560s
fi-cfl-u         total:285  pass:259  dwarn:0   dfail:0   fail:0   skip:26  time:509s
fi-cnl-y3        total:285  pass:259  dwarn:0   dfail:0   fail:0   skip:26  time:589s
fi-elk-e7500     total:285  pass:225  dwarn:1   dfail:0   fail:0   skip:59  time:426s
fi-gdg-551       total:285  pass:176  dwarn:0   dfail:0   fail:1   skip:108 time:316s
fi-glk-1         total:285  pass:257  dwarn:0   dfail:0   fail:0   skip:28  time:533s
fi-hsw-4770      total:285  pass:258  dwarn:0   dfail:0   fail:0   skip:27  time:402s
fi-ilk-650       total:285  pass:224  dwarn:0   dfail:0   fail:1   skip:60  time:421s
fi-ivb-3520m     total:285  pass:256  dwarn:0   dfail:0   fail:0   skip:29  time:476s
fi-ivb-3770      total:285  pass:252  dwarn:0   dfail:0   fail:0   skip:33  time:427s
fi-kbl-7500u     total:285  pass:260  dwarn:1   dfail:0   fail:0   skip:24  time:472s
fi-kbl-7567u     total:285  pass:265  dwarn:0   dfail:0   fail:0   skip:20  time:461s
fi-kbl-r         total:285  pass:258  dwarn:0   dfail:0   fail:0   skip:27  time:508s
fi-pnv-d510      total:285  pass:219  dwarn:1   dfail:0   fail:0   skip:65  time:658s
fi-skl-6260u     total:285  pass:265  dwarn:0   dfail:0   fail:0   skip:20  time:439s
fi-skl-6600u     total:285  pass:258  dwarn:0   dfail:0   fail:0   skip:27  time:533s
fi-skl-6700k2    total:285  pass:261  dwarn:0   dfail:0   fail:0   skip:24  time:506s
fi-skl-6770hq    total:285  pass:265  dwarn:0   dfail:0   fail:0   skip:20  time:514s
fi-skl-guc       total:285  pass:257  dwarn:0   dfail:0   fail:0   skip:28  time:430s
fi-skl-gvtdvm    total:285  pass:262  dwarn:0   dfail:0   fail:0   skip:23  time:447s
fi-snb-2520m     total:285  pass:245  dwarn:0   dfail:0   fail:0   skip:40  time:581s
fi-snb-2600      total:285  pass:245  dwarn:0   dfail:0   fail:0   skip:40  time:400s
Blacklisted hosts:
fi-cnl-psr       total:285  pass:256  dwarn:3   dfail:0   fail:0   skip:26  time:532s
fi-glk-j4005     total:285  pass:256  dwarn:0   dfail:0   fail:0   skip:29  time:486s

335ef9849310af26d65c54f7d2d2e9dcbce238b9 drm-tip: 2018y-03m-30d-09h-08m-40s UTC integration manifest
477c0fa0c0b5 drm/i915/selftests: Avoid repeatedly harming the same innocent context

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8548/issues.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* ✗ Fi.CI.IGT: failure for drm/i915/selftests: Avoid repeatedly harming the same innocent context
  2018-03-30 13:18 [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context Chris Wilson
  2018-03-30 13:30 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
  2018-03-30 13:47 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-03-30 14:33 ` Patchwork
  2018-03-31 20:24 ` [PATCH] " kbuild test robot
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: Patchwork @ 2018-03-30 14:33 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/selftests: Avoid repeatedly harming the same innocent context
URL   : https://patchwork.freedesktop.org/series/40941/
State : failure

== Summary ==

---- Possible new issues:

Test kms_cursor_legacy:
        Subgroup cursor-vs-flip-atomic-transitions:
                pass       -> FAIL       (shard-snb)
Test kms_frontbuffer_tracking:
        Subgroup fbcpsr-2p-scndscrn-spr-indfb-move:
                skip       -> FAIL       (shard-snb)
        Subgroup fbcpsr-rgb101010-draw-render:
                skip       -> FAIL       (shard-snb)
        Subgroup psr-2p-primscrn-spr-indfb-onoff:
                skip       -> FAIL       (shard-snb)
Test kms_plane_lowres:
        Subgroup pipe-a-tiling-none:
                pass       -> FAIL       (shard-snb)
Test kms_vblank:
        Subgroup pipe-a-query-idle:
                pass       -> FAIL       (shard-snb)

---- Known issues:

Test gem_softpin:
        Subgroup noreloc-s3:
                pass       -> INCOMPLETE (shard-snb) fdo#103375
Test kms_flip:
        Subgroup 2x-dpms-vs-vblank-race:
                fail       -> PASS       (shard-hsw) fdo#103060
        Subgroup 2x-flip-vs-expired-vblank-interruptible:
                fail       -> PASS       (shard-hsw) fdo#102887
        Subgroup flip-vs-blocking-wf-vblank:
                fail       -> PASS       (shard-hsw) fdo#100368
Test kms_setmode:
        Subgroup basic:
                fail       -> PASS       (shard-apl) fdo#99912

fdo#103375 https://bugs.freedesktop.org/show_bug.cgi?id=103375
fdo#103060 https://bugs.freedesktop.org/show_bug.cgi?id=103060
fdo#102887 https://bugs.freedesktop.org/show_bug.cgi?id=102887
fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912

shard-apl        total:3495 pass:1833 dwarn:1   dfail:0   fail:6   skip:1655 time:12857s
shard-hsw        total:3495 pass:1783 dwarn:1   dfail:0   fail:1   skip:1709 time:11682s
shard-snb        total:3398 pass:1320 dwarn:1   dfail:0   fail:10  skip:2066 time:6563s
Blacklisted hosts:
shard-kbl        total:3495 pass:1958 dwarn:1   dfail:0   fail:8   skip:1528 time:9256s

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8548/shards.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context
  2018-03-30 13:18 [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context Chris Wilson
                   ` (2 preceding siblings ...)
  2018-03-30 14:33 ` ✗ Fi.CI.IGT: failure " Patchwork
@ 2018-03-31 20:24 ` kbuild test robot
  2018-03-31 20:41 ` kbuild test robot
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: kbuild test robot @ 2018-03-31 20:24 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 5845 bytes --]

Hi Chris,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on drm-intel/for-linux-next]
[also build test ERROR on v4.16-rc7 next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Chris-Wilson/drm-i915-selftests-Avoid-repeatedly-harming-the-same-innocent-context/20180401-022503
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-randconfig-b0-04010137 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from drivers/gpu/drm/i915/intel_hangcheck.c:465:0:
   drivers/gpu/drm/i915/selftests/intel_hangcheck.c: In function 'igt_reset_queue':
>> drivers/gpu/drm/i915/selftests/intel_hangcheck.c:988:5: error: implicit declaration of function 'GEM_TRACE_DUMP' [-Werror=implicit-function-declaration]
        GEM_TRACE_DUMP();
        ^
   cc1: all warnings being treated as errors

vim +/GEM_TRACE_DUMP +988 drivers/gpu/drm/i915/selftests/intel_hangcheck.c

   922	
   923	static int igt_reset_queue(void *arg)
   924	{
   925		struct drm_i915_private *i915 = arg;
   926		struct intel_engine_cs *engine;
   927		enum intel_engine_id id;
   928		struct hang h;
   929		int err;
   930	
   931		/* Check that we replay pending requests following a hang */
   932	
   933		global_reset_lock(i915);
   934	
   935		mutex_lock(&i915->drm.struct_mutex);
   936		err = hang_init(&h, i915);
   937		if (err)
   938			goto unlock;
   939	
   940		for_each_engine(engine, i915, id) {
   941			struct i915_request *prev;
   942			IGT_TIMEOUT(end_time);
   943			unsigned int count;
   944	
   945			if (!intel_engine_can_store_dword(engine))
   946				continue;
   947	
   948			prev = hang_create_request(&h, engine);
   949			if (IS_ERR(prev)) {
   950				err = PTR_ERR(prev);
   951				goto fini;
   952			}
   953	
   954			i915_request_get(prev);
   955			__i915_request_add(prev, true);
   956	
   957			count = 0;
   958			do {
   959				struct i915_request *rq;
   960				unsigned int reset_count;
   961	
   962				rq = hang_create_request(&h, engine);
   963				if (IS_ERR(rq)) {
   964					err = PTR_ERR(rq);
   965					goto fini;
   966				}
   967	
   968				i915_request_get(rq);
   969				__i915_request_add(rq, true);
   970	
   971				/*
   972				 * XXX We don't handle resetting the kernel context
   973				 * very well. If we trigger a device reset twice in
   974				 * quick succession while the kernel context is
   975				 * executing, we may end up skipping the breadcrumb.
   976				 * This is really only a problem for the selftest as
   977				 * normally there is a large interlude between resets
   978				 * (hangcheck), or we focus on resetting just one
   979				 * engine and so avoid repeatedly resetting innocents.
   980				 */
   981				err = wait_for_others(i915, engine);
   982				if (err) {
   983					pr_err("%s(%s): Failed to idle other inactive engines after device reset\n",
   984					       __func__, engine->name);
   985					i915_request_put(rq);
   986					i915_request_put(prev);
   987	
 > 988					GEM_TRACE_DUMP();
   989					i915_gem_set_wedged(i915);
   990					goto fini;
   991				}
   992	
   993				if (!wait_for_hang(&h, prev)) {
   994					struct drm_printer p = drm_info_printer(i915->drm.dev);
   995	
   996					pr_err("%s(%s): Failed to start request %x, at %x\n",
   997					       __func__, engine->name,
   998					       prev->fence.seqno, hws_seqno(&h, prev));
   999					intel_engine_dump(engine, &p,
  1000							  "%s\n", engine->name);
  1001	
  1002					i915_request_put(rq);
  1003					i915_request_put(prev);
  1004	
  1005					i915_reset(i915, 0);
  1006					i915_gem_set_wedged(i915);
  1007	
  1008					err = -EIO;
  1009					goto fini;
  1010				}
  1011	
  1012				reset_count = fake_hangcheck(prev);
  1013	
  1014				i915_reset(i915, I915_RESET_QUIET);
  1015	
  1016				GEM_BUG_ON(test_bit(I915_RESET_HANDOFF,
  1017						    &i915->gpu_error.flags));
  1018	
  1019				if (prev->fence.error != -EIO) {
  1020					pr_err("GPU reset not recorded on hanging request [fence.error=%d]!\n",
  1021					       prev->fence.error);
  1022					i915_request_put(rq);
  1023					i915_request_put(prev);
  1024					err = -EINVAL;
  1025					goto fini;
  1026				}
  1027	
  1028				if (rq->fence.error) {
  1029					pr_err("Fence error status not zero [%d] after unrelated reset\n",
  1030					       rq->fence.error);
  1031					i915_request_put(rq);
  1032					i915_request_put(prev);
  1033					err = -EINVAL;
  1034					goto fini;
  1035				}
  1036	
  1037				if (i915_reset_count(&i915->gpu_error) == reset_count) {
  1038					pr_err("No GPU reset recorded!\n");
  1039					i915_request_put(rq);
  1040					i915_request_put(prev);
  1041					err = -EINVAL;
  1042					goto fini;
  1043				}
  1044	
  1045				i915_request_put(prev);
  1046				prev = rq;
  1047				count++;
  1048			} while (time_before(jiffies, end_time));
  1049			pr_info("%s: Completed %d resets\n", engine->name, count);
  1050	
  1051			*h.batch = MI_BATCH_BUFFER_END;
  1052			i915_gem_chipset_flush(i915);
  1053	
  1054			i915_request_put(prev);
  1055	
  1056			err = flush_test(i915, I915_WAIT_LOCKED);
  1057			if (err)
  1058				break;
  1059		}
  1060	
  1061	fini:
  1062		hang_fini(&h);
  1063	unlock:
  1064		mutex_unlock(&i915->drm.struct_mutex);
  1065		global_reset_unlock(i915);
  1066	
  1067		if (i915_terminally_wedged(&i915->gpu_error))
  1068			return -EIO;
  1069	
  1070		return err;
  1071	}
  1072	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28573 bytes --]

[-- Attachment #3: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context
  2018-03-30 13:18 [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context Chris Wilson
                   ` (3 preceding siblings ...)
  2018-03-31 20:24 ` [PATCH] " kbuild test robot
@ 2018-03-31 20:41 ` kbuild test robot
  2018-04-01  2:45 ` kbuild test robot
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 11+ messages in thread
From: kbuild test robot @ 2018-03-31 20:41 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, kbuild-all

[-- Attachment #1: Type: text/plain, Size: 5889 bytes --]

Hi Chris,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on drm-intel/for-linux-next]
[also build test ERROR on v4.16-rc7 next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Chris-Wilson/drm-i915-selftests-Avoid-repeatedly-harming-the-same-innocent-context/20180401-022503
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-randconfig-x011-201813 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   In file included from drivers/gpu/drm/i915/intel_hangcheck.c:465:0:
   drivers/gpu/drm/i915/selftests/intel_hangcheck.c: In function 'igt_reset_queue':
>> drivers/gpu/drm/i915/selftests/intel_hangcheck.c:988:5: error: implicit declaration of function 'GEM_TRACE_DUMP'; did you mean 'GEM_TRACE'? [-Werror=implicit-function-declaration]
        GEM_TRACE_DUMP();
        ^~~~~~~~~~~~~~
        GEM_TRACE
   cc1: some warnings being treated as errors

vim +988 drivers/gpu/drm/i915/selftests/intel_hangcheck.c

   922	
   923	static int igt_reset_queue(void *arg)
   924	{
   925		struct drm_i915_private *i915 = arg;
   926		struct intel_engine_cs *engine;
   927		enum intel_engine_id id;
   928		struct hang h;
   929		int err;
   930	
   931		/* Check that we replay pending requests following a hang */
   932	
   933		global_reset_lock(i915);
   934	
   935		mutex_lock(&i915->drm.struct_mutex);
   936		err = hang_init(&h, i915);
   937		if (err)
   938			goto unlock;
   939	
   940		for_each_engine(engine, i915, id) {
   941			struct i915_request *prev;
   942			IGT_TIMEOUT(end_time);
   943			unsigned int count;
   944	
   945			if (!intel_engine_can_store_dword(engine))
   946				continue;
   947	
   948			prev = hang_create_request(&h, engine);
   949			if (IS_ERR(prev)) {
   950				err = PTR_ERR(prev);
   951				goto fini;
   952			}
   953	
   954			i915_request_get(prev);
   955			__i915_request_add(prev, true);
   956	
   957			count = 0;
   958			do {
   959				struct i915_request *rq;
   960				unsigned int reset_count;
   961	
   962				rq = hang_create_request(&h, engine);
   963				if (IS_ERR(rq)) {
   964					err = PTR_ERR(rq);
   965					goto fini;
   966				}
   967	
   968				i915_request_get(rq);
   969				__i915_request_add(rq, true);
   970	
   971				/*
   972				 * XXX We don't handle resetting the kernel context
   973				 * very well. If we trigger a device reset twice in
   974				 * quick succession while the kernel context is
   975				 * executing, we may end up skipping the breadcrumb.
   976				 * This is really only a problem for the selftest as
   977				 * normally there is a large interlude between resets
   978				 * (hangcheck), or we focus on resetting just one
   979				 * engine and so avoid repeatedly resetting innocents.
   980				 */
   981				err = wait_for_others(i915, engine);
   982				if (err) {
   983					pr_err("%s(%s): Failed to idle other inactive engines after device reset\n",
   984					       __func__, engine->name);
   985					i915_request_put(rq);
   986					i915_request_put(prev);
   987	
 > 988					GEM_TRACE_DUMP();
   989					i915_gem_set_wedged(i915);
   990					goto fini;
   991				}
   992	
   993				if (!wait_for_hang(&h, prev)) {
   994					struct drm_printer p = drm_info_printer(i915->drm.dev);
   995	
   996					pr_err("%s(%s): Failed to start request %x, at %x\n",
   997					       __func__, engine->name,
   998					       prev->fence.seqno, hws_seqno(&h, prev));
   999					intel_engine_dump(engine, &p,
  1000							  "%s\n", engine->name);
  1001	
  1002					i915_request_put(rq);
  1003					i915_request_put(prev);
  1004	
  1005					i915_reset(i915, 0);
  1006					i915_gem_set_wedged(i915);
  1007	
  1008					err = -EIO;
  1009					goto fini;
  1010				}
  1011	
  1012				reset_count = fake_hangcheck(prev);
  1013	
  1014				i915_reset(i915, I915_RESET_QUIET);
  1015	
  1016				GEM_BUG_ON(test_bit(I915_RESET_HANDOFF,
  1017						    &i915->gpu_error.flags));
  1018	
  1019				if (prev->fence.error != -EIO) {
  1020					pr_err("GPU reset not recorded on hanging request [fence.error=%d]!\n",
  1021					       prev->fence.error);
  1022					i915_request_put(rq);
  1023					i915_request_put(prev);
  1024					err = -EINVAL;
  1025					goto fini;
  1026				}
  1027	
  1028				if (rq->fence.error) {
  1029					pr_err("Fence error status not zero [%d] after unrelated reset\n",
  1030					       rq->fence.error);
  1031					i915_request_put(rq);
  1032					i915_request_put(prev);
  1033					err = -EINVAL;
  1034					goto fini;
  1035				}
  1036	
  1037				if (i915_reset_count(&i915->gpu_error) == reset_count) {
  1038					pr_err("No GPU reset recorded!\n");
  1039					i915_request_put(rq);
  1040					i915_request_put(prev);
  1041					err = -EINVAL;
  1042					goto fini;
  1043				}
  1044	
  1045				i915_request_put(prev);
  1046				prev = rq;
  1047				count++;
  1048			} while (time_before(jiffies, end_time));
  1049			pr_info("%s: Completed %d resets\n", engine->name, count);
  1050	
  1051			*h.batch = MI_BATCH_BUFFER_END;
  1052			i915_gem_chipset_flush(i915);
  1053	
  1054			i915_request_put(prev);
  1055	
  1056			err = flush_test(i915, I915_WAIT_LOCKED);
  1057			if (err)
  1058				break;
  1059		}
  1060	
  1061	fini:
  1062		hang_fini(&h);
  1063	unlock:
  1064		mutex_unlock(&i915->drm.struct_mutex);
  1065		global_reset_unlock(i915);
  1066	
  1067		if (i915_terminally_wedged(&i915->gpu_error))
  1068			return -EIO;
  1069	
  1070		return err;
  1071	}
  1072	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28470 bytes --]

[-- Attachment #3: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context
  2018-03-30 13:18 [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context Chris Wilson
                   ` (4 preceding siblings ...)
  2018-03-31 20:41 ` kbuild test robot
@ 2018-04-01  2:45 ` kbuild test robot
  2018-04-05 19:29 ` Chris Wilson
  2018-04-06  9:11 ` Tvrtko Ursulin
  7 siblings, 0 replies; 11+ messages in thread
From: kbuild test robot @ 2018-04-01  2:45 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx, kbuild-all

Hi Chris,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on v4.16-rc7 next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Chris-Wilson/drm-i915-selftests-Avoid-repeatedly-harming-the-same-innocent-context/20180401-022503
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   drivers/gpu/drm/i915/selftests/intel_hangcheck.c:988:33: sparse: undefined identifier 'GEM_TRACE_DUMP'
>> drivers/gpu/drm/i915/selftests/intel_hangcheck.c:988:47: sparse: call with no type!
   In file included from drivers/gpu/drm/i915/intel_hangcheck.c:465:0:
   drivers/gpu/drm/i915/selftests/intel_hangcheck.c: In function 'igt_reset_queue':
   drivers/gpu/drm/i915/selftests/intel_hangcheck.c:988:5: error: implicit declaration of function 'GEM_TRACE_DUMP'; did you mean 'GEM_TRACE'? [-Werror=implicit-function-declaration]
        GEM_TRACE_DUMP();
        ^~~~~~~~~~~~~~
        GEM_TRACE
   cc1: some warnings being treated as errors

vim +988 drivers/gpu/drm/i915/selftests/intel_hangcheck.c

   922	
   923	static int igt_reset_queue(void *arg)
   924	{
   925		struct drm_i915_private *i915 = arg;
   926		struct intel_engine_cs *engine;
   927		enum intel_engine_id id;
   928		struct hang h;
   929		int err;
   930	
   931		/* Check that we replay pending requests following a hang */
   932	
   933		global_reset_lock(i915);
   934	
   935		mutex_lock(&i915->drm.struct_mutex);
   936		err = hang_init(&h, i915);
   937		if (err)
   938			goto unlock;
   939	
   940		for_each_engine(engine, i915, id) {
   941			struct i915_request *prev;
   942			IGT_TIMEOUT(end_time);
   943			unsigned int count;
   944	
   945			if (!intel_engine_can_store_dword(engine))
   946				continue;
   947	
   948			prev = hang_create_request(&h, engine);
   949			if (IS_ERR(prev)) {
   950				err = PTR_ERR(prev);
   951				goto fini;
   952			}
   953	
   954			i915_request_get(prev);
   955			__i915_request_add(prev, true);
   956	
   957			count = 0;
   958			do {
   959				struct i915_request *rq;
   960				unsigned int reset_count;
   961	
   962				rq = hang_create_request(&h, engine);
   963				if (IS_ERR(rq)) {
   964					err = PTR_ERR(rq);
   965					goto fini;
   966				}
   967	
   968				i915_request_get(rq);
   969				__i915_request_add(rq, true);
   970	
   971				/*
   972				 * XXX We don't handle resetting the kernel context
   973				 * very well. If we trigger a device reset twice in
   974				 * quick succession while the kernel context is
   975				 * executing, we may end up skipping the breadcrumb.
   976				 * This is really only a problem for the selftest as
   977				 * normally there is a large interlude between resets
   978				 * (hangcheck), or we focus on resetting just one
   979				 * engine and so avoid repeatedly resetting innocents.
   980				 */
   981				err = wait_for_others(i915, engine);
   982				if (err) {
   983					pr_err("%s(%s): Failed to idle other inactive engines after device reset\n",
   984					       __func__, engine->name);
   985					i915_request_put(rq);
   986					i915_request_put(prev);
   987	
 > 988					GEM_TRACE_DUMP();
   989					i915_gem_set_wedged(i915);
   990					goto fini;
   991				}
   992	
   993				if (!wait_for_hang(&h, prev)) {
   994					struct drm_printer p = drm_info_printer(i915->drm.dev);
   995	
   996					pr_err("%s(%s): Failed to start request %x, at %x\n",
   997					       __func__, engine->name,
   998					       prev->fence.seqno, hws_seqno(&h, prev));
   999					intel_engine_dump(engine, &p,
  1000							  "%s\n", engine->name);
  1001	
  1002					i915_request_put(rq);
  1003					i915_request_put(prev);
  1004	
  1005					i915_reset(i915, 0);
  1006					i915_gem_set_wedged(i915);
  1007	
  1008					err = -EIO;
  1009					goto fini;
  1010				}
  1011	
  1012				reset_count = fake_hangcheck(prev);
  1013	
  1014				i915_reset(i915, I915_RESET_QUIET);
  1015	
  1016				GEM_BUG_ON(test_bit(I915_RESET_HANDOFF,
  1017						    &i915->gpu_error.flags));
  1018	
  1019				if (prev->fence.error != -EIO) {
  1020					pr_err("GPU reset not recorded on hanging request [fence.error=%d]!\n",
  1021					       prev->fence.error);
  1022					i915_request_put(rq);
  1023					i915_request_put(prev);
  1024					err = -EINVAL;
  1025					goto fini;
  1026				}
  1027	
  1028				if (rq->fence.error) {
  1029					pr_err("Fence error status not zero [%d] after unrelated reset\n",
  1030					       rq->fence.error);
  1031					i915_request_put(rq);
  1032					i915_request_put(prev);
  1033					err = -EINVAL;
  1034					goto fini;
  1035				}
  1036	
  1037				if (i915_reset_count(&i915->gpu_error) == reset_count) {
  1038					pr_err("No GPU reset recorded!\n");
  1039					i915_request_put(rq);
  1040					i915_request_put(prev);
  1041					err = -EINVAL;
  1042					goto fini;
  1043				}
  1044	
  1045				i915_request_put(prev);
  1046				prev = rq;
  1047				count++;
  1048			} while (time_before(jiffies, end_time));
  1049			pr_info("%s: Completed %d resets\n", engine->name, count);
  1050	
  1051			*h.batch = MI_BATCH_BUFFER_END;
  1052			i915_gem_chipset_flush(i915);
  1053	
  1054			i915_request_put(prev);
  1055	
  1056			err = flush_test(i915, I915_WAIT_LOCKED);
  1057			if (err)
  1058				break;
  1059		}
  1060	
  1061	fini:
  1062		hang_fini(&h);
  1063	unlock:
  1064		mutex_unlock(&i915->drm.struct_mutex);
  1065		global_reset_unlock(i915);
  1066	
  1067		if (i915_terminally_wedged(&i915->gpu_error))
  1068			return -EIO;
  1069	
  1070		return err;
  1071	}
  1072	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context
  2018-03-30 13:18 [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context Chris Wilson
                   ` (5 preceding siblings ...)
  2018-04-01  2:45 ` kbuild test robot
@ 2018-04-05 19:29 ` Chris Wilson
  2018-04-06  9:11 ` Tvrtko Ursulin
  7 siblings, 0 replies; 11+ messages in thread
From: Chris Wilson @ 2018-04-05 19:29 UTC (permalink / raw)
  To: intel-gfx; +Cc: Mika

Quoting Chris Wilson (2018-03-30 14:18:01)
> We don't handle resetting the kernel context very well, or presumably any
> context executing its breadcrumb commands in the ring as opposed to the
> batchbuffer and flush. If we trigger a device reset twice in quick
> succession while the kernel context is executing, we may end up skipping
> the breadcrumb.  This is really only a problem for the selftest as
> normally there is a large interlude between resets (hangcheck), or we
> focus on resetting just one engine and so avoid repeatedly resetting
> innocents.
> 
> Something to try would be a preempt-to-idle to quiesce the engine before
> reset, so that innocent contexts would be spared the reset.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> CC: Michel Thierry <michel.thierry@intel.com>

Anyone?

> ---
>  drivers/gpu/drm/i915/i915_drv.c                  |  3 ++
>  drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 48 ++++++++++++++++++++++--
>  2 files changed, 47 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index d354627882e3..684060ed8db6 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1886,6 +1886,8 @@ void i915_reset(struct drm_i915_private *i915)
>         int ret;
>         int i;
>  
> +       GEM_TRACE("flags=%lx\n", error->flags);
> +
>         might_sleep();
>         lockdep_assert_held(&i915->drm.struct_mutex);
>         GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
> @@ -2016,6 +2018,7 @@ int i915_reset_engine(struct intel_engine_cs *engine, const char *msg)
>         struct i915_request *active_request;
>         int ret;
>  
> +       GEM_TRACE("%s flags=%lx\n", engine->name, error->flags);
>         GEM_BUG_ON(!test_bit(I915_RESET_ENGINE + engine->id, &error->flags));
>  
>         active_request = i915_gem_reset_prepare_engine(engine);
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 9e4e0ad62724..122a32e0a69d 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -979,6 +979,23 @@ static int igt_wait_reset(void *arg)
>         return err;
>  }
>  
> +static int wait_for_others(struct drm_i915_private *i915,
> +                           struct intel_engine_cs *exclude)
> +{
> +       struct intel_engine_cs *engine;
> +       enum intel_engine_id id;
> +
> +       for_each_engine(engine, i915, id) {
> +               if (engine == exclude)
> +                       continue;
> +
> +               if (wait_for(intel_engine_is_idle(engine), 10))
> +                       return -EIO;
> +       }
> +
> +       return 0;
> +}
> +
>  static int igt_reset_queue(void *arg)
>  {
>         struct drm_i915_private *i915 = arg;
> @@ -1027,13 +1044,36 @@ static int igt_reset_queue(void *arg)
>                         i915_request_get(rq);
>                         __i915_request_add(rq, true);
>  
> +                       /*
> +                        * XXX We don't handle resetting the kernel context
> +                        * very well. If we trigger a device reset twice in
> +                        * quick succession while the kernel context is
> +                        * executing, we may end up skipping the breadcrumb.
> +                        * This is really only a problem for the selftest as
> +                        * normally there is a large interlude between resets
> +                        * (hangcheck), or we focus on resetting just one
> +                        * engine and so avoid repeatedly resetting innocents.
> +                        */
> +                       err = wait_for_others(i915, engine);
> +                       if (err) {
> +                               pr_err("%s(%s): Failed to idle other inactive engines after device reset\n",
> +                                      __func__, engine->name);
> +                               i915_request_put(rq);
> +                               i915_request_put(prev);
> +
> +                               GEM_TRACE_DUMP();
> +                               i915_gem_set_wedged(i915);
> +                               goto fini;
> +                       }
> +
>                         if (!wait_for_hang(&h, prev)) {
>                                 struct drm_printer p = drm_info_printer(i915->drm.dev);
>  
> -                               pr_err("%s: Failed to start request %x, at %x\n",
> -                                      __func__, prev->fence.seqno, hws_seqno(&h, prev));
> -                               intel_engine_dump(prev->engine, &p,
> -                                                 "%s\n", prev->engine->name);
> +                               pr_err("%s(%s): Failed to start request %x, at %x\n",
> +                                      __func__, engine->name,
> +                                      prev->fence.seqno, hws_seqno(&h, prev));
> +                               intel_engine_dump(engine, &p,
> +                                                 "%s\n", engine->name);
>  
>                                 i915_request_put(rq);
>                                 i915_request_put(prev);
> -- 
> 2.16.3
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context
  2018-03-30 13:18 [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context Chris Wilson
                   ` (6 preceding siblings ...)
  2018-04-05 19:29 ` Chris Wilson
@ 2018-04-06  9:11 ` Tvrtko Ursulin
  2018-04-06  9:40   ` Chris Wilson
  7 siblings, 1 reply; 11+ messages in thread
From: Tvrtko Ursulin @ 2018-04-06  9:11 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 30/03/2018 14:18, Chris Wilson wrote:
> We don't handle resetting the kernel context very well, or presumably any
> context executing its breadcrumb commands in the ring as opposed to the
> batchbuffer and flush. If we trigger a device reset twice in quick
> succession while the kernel context is executing, we may end up skipping
> the breadcrumb.  This is really only a problem for the selftest as
> normally there is a large interlude between resets (hangcheck), or we
> focus on resetting just one engine and so avoid repeatedly resetting
> innocents.
> 
> Something to try would be a preempt-to-idle to quiesce the engine before
> reset, so that innocent contexts would be spared the reset.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> CC: Michel Thierry <michel.thierry@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.c                  |  3 ++
>   drivers/gpu/drm/i915/selftests/intel_hangcheck.c | 48 ++++++++++++++++++++++--
>   2 files changed, 47 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index d354627882e3..684060ed8db6 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1886,6 +1886,8 @@ void i915_reset(struct drm_i915_private *i915)
>   	int ret;
>   	int i;
>   
> +	GEM_TRACE("flags=%lx\n", error->flags);
> +
>   	might_sleep();
>   	lockdep_assert_held(&i915->drm.struct_mutex);
>   	GEM_BUG_ON(!test_bit(I915_RESET_BACKOFF, &error->flags));
> @@ -2016,6 +2018,7 @@ int i915_reset_engine(struct intel_engine_cs *engine, const char *msg)
>   	struct i915_request *active_request;
>   	int ret;
>   
> +	GEM_TRACE("%s flags=%lx\n", engine->name, error->flags);
>   	GEM_BUG_ON(!test_bit(I915_RESET_ENGINE + engine->id, &error->flags));
>   
>   	active_request = i915_gem_reset_prepare_engine(engine);
> diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> index 9e4e0ad62724..122a32e0a69d 100644
> --- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> +++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
> @@ -979,6 +979,23 @@ static int igt_wait_reset(void *arg)
>   	return err;
>   }
>   
> +static int wait_for_others(struct drm_i915_private *i915,
> +			    struct intel_engine_cs *exclude)
> +{
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +
> +	for_each_engine(engine, i915, id) { > +		if (engine == exclude)> +			continue;

Bike-shedder in me says:

for_each_engine_masked(engine, i915, ~BIT(exclude->id), id)

:)

Although for_each_engine_masked failed to enclose its arguments in 
parentheses so never mind.

> +
> +		if (wait_for(intel_engine_is_idle(engine), 10))
> +			return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
>   static int igt_reset_queue(void *arg)
>   {
>   	struct drm_i915_private *i915 = arg;
> @@ -1027,13 +1044,36 @@ static int igt_reset_queue(void *arg)
>   			i915_request_get(rq);
>   			__i915_request_add(rq, true);
>   
> +			/*
> +			 * XXX We don't handle resetting the kernel context
> +			 * very well. If we trigger a device reset twice in
> +			 * quick succession while the kernel context is
> +			 * executing, we may end up skipping the breadcrumb.
> +			 * This is really only a problem for the selftest as
> +			 * normally there is a large interlude between resets
> +			 * (hangcheck), or we focus on resetting just one
> +			 * engine and so avoid repeatedly resetting innocents.
> +			 */
> +			err = wait_for_others(i915, engine);
> +			if (err) {
> +				pr_err("%s(%s): Failed to idle other inactive engines after device reset\n",
> +				       __func__, engine->name);
> +				i915_request_put(rq);
> +				i915_request_put(prev);
> +
> +				GEM_TRACE_DUMP();
> +				i915_gem_set_wedged(i915);
> +				goto fini;
> +			}
> +
>   			if (!wait_for_hang(&h, prev)) {

This was confusing me a period since I was assuming the seqno check is 
against the breadcrumb, but it is actually about the same number written 
to a different place. So it actually means 
wait_for_request_to_start_executing.

>   				struct drm_printer p = drm_info_printer(i915->drm.dev);
>   
> -				pr_err("%s: Failed to start request %x, at %x\n",
> -				       __func__, prev->fence.seqno, hws_seqno(&h, prev));
> -				intel_engine_dump(prev->engine, &p,
> -						  "%s\n", prev->engine->name);
> +				pr_err("%s(%s): Failed to start request %x, at %x\n",
> +				       __func__, engine->name,
> +				       prev->fence.seqno, hws_seqno(&h, prev));
> +				intel_engine_dump(engine, &p,
> +						  "%s\n", engine->name);
>   
>   				i915_request_put(rq);
>   				i915_request_put(prev);
> 

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context
  2018-04-06  9:11 ` Tvrtko Ursulin
@ 2018-04-06  9:40   ` Chris Wilson
  2018-04-06  9:59     ` Tvrtko Ursulin
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Wilson @ 2018-04-06  9:40 UTC (permalink / raw)
  To: Tvrtko Ursulin, intel-gfx

Quoting Tvrtko Ursulin (2018-04-06 10:11:57)
> 
> On 30/03/2018 14:18, Chris Wilson wrote:
> >                       if (!wait_for_hang(&h, prev)) {
> 
> This was confusing me a period since I was assuming the seqno check is 
> against the breadcrumb, but it is actually about the same number written 
> to a different place. So it actually means 
> wait_for_request_to_start_executing.

Bonus points for a better name than struct hang.

struct spinner;

spinner_wait_until_active() ?
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context
  2018-04-06  9:40   ` Chris Wilson
@ 2018-04-06  9:59     ` Tvrtko Ursulin
  0 siblings, 0 replies; 11+ messages in thread
From: Tvrtko Ursulin @ 2018-04-06  9:59 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 06/04/2018 10:40, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-04-06 10:11:57)
>>
>> On 30/03/2018 14:18, Chris Wilson wrote:
>>>                        if (!wait_for_hang(&h, prev)) {
>>
>> This was confusing me a period since I was assuming the seqno check is
>> against the breadcrumb, but it is actually about the same number written
>> to a different place. So it actually means
>> wait_for_request_to_start_executing.
> 
> Bonus points for a better name than struct hang.
> 
> struct spinner;
> 
> spinner_wait_until_active() ?

Maybe just rename the helper to wait_until_running if you feel like it. 
To keep things simple and scope in check.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-04-06  9:59 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-30 13:18 [PATCH] drm/i915/selftests: Avoid repeatedly harming the same innocent context Chris Wilson
2018-03-30 13:30 ` ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
2018-03-30 13:47 ` ✓ Fi.CI.BAT: success " Patchwork
2018-03-30 14:33 ` ✗ Fi.CI.IGT: failure " Patchwork
2018-03-31 20:24 ` [PATCH] " kbuild test robot
2018-03-31 20:41 ` kbuild test robot
2018-04-01  2:45 ` kbuild test robot
2018-04-05 19:29 ` Chris Wilson
2018-04-06  9:11 ` Tvrtko Ursulin
2018-04-06  9:40   ` Chris Wilson
2018-04-06  9:59     ` Tvrtko Ursulin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.