* [igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog
@ 2019-05-28 21:35 Carlos Santa
  2019-05-28 22:03 ` [igt-dev] ✓ Fi.CI.BAT: success for tests/gem_watchdog: Initial set of tests for GPU watchdog (rev7) Patchwork
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Carlos Santa @ 2019-05-28 21:35 UTC (permalink / raw)
  To: igt-dev; +Cc: Ursulin Tvrtko

This test adds a basic set of tests to reset the different
GPU engines through the GPU watchdog timer.

Credits to Antonio for the original codebase this is based on.

v2: remove gem_context_get_param() during set (Antonio)
    remove clearing of the engines_threshold[] in the default case
    inside context_set_watchdog(). (Antonio)
    fix indexing when creating low/high priority contexts
    get rid of 2 threads idea (Antonio)
    fix context prio bug due to wrong indexing (Antonio)
v3: no need to force clear drm_i915_gem_watchdog_timeout struct
    when setting the watchdog timeouts (Antonio)
    -use gem_context_set_priority() instead to use error checking
    -always use a looper to test all engines (Antonio)
    -modify test gpu_watchdog_hang_long_batch_single_engine()
    to use a couple of batches and set the timeout in between them (Antonio)
    -make use of for_each_engine_class_instance() (Antonio)
v4: -tests should be able to handle s/w watchdog timeout (Tvrtko)
    -remove dead code inside batch_buffer factory (Tvrtko)
    -set_watchdog() should return ENODEV for fail cases (Tvrtko)
    -enclose name, priority and reset params inside array (Tvrtko)
v5: -Rebased. Tests 5,6 and 8 from the test plan are now added

Test Plan:

Assumptions:
1. Use fence status to figure out which ctx was reset
2. Use spin batches w/ user space timers to control duration
and corking to control ordering
3. Use context priorities to force preemption

Tests:
1. ctx1, long_batch -> execute
2. ctx1, set_watchdog -> reset
3. ctx2/ctx1 -> execute, reset
4. ctx1/ctx2 -> reset, execute
5. ctx1_just_below_threshold -> execute
6. ctx_over_the_threshold -> reset
7. set watchdog on some engines -> execute
8. submit long_batch and after half of the
expected runtime submit higher prio batch
9. submit low prio batch w/o watchdog then
higher prio with watchdog

Unresolved items:
1. The check for sync_fence_status(*fence)
returns -1 instead of EIO.
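
For reference, a rough sketch of the flow assumptions 1-3 imply, built on
IGT's spinner helpers. This is illustrative only and not part of the patch;
"engine", "name", "ctx" and "threshold" are placeholders, and
context_set_watchdog() is the helper added by this patch:

    igt_spin_t *spin;

    context_set_watchdog(fd, engine, name, ctx, threshold);
    spin = igt_spin_new(fd, .ctx = ctx, .engine = engine,
                        .flags = IGT_SPIN_FENCE_OUT);
    sync_fence_wait(spin->out_fence, -1);
    /* per assumption 1 / unresolved item 1 above */
    igt_assert_eq(sync_fence_status(spin->out_fence), -EIO);
    igt_spin_free(fd, spin);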

Cc: Ursulin Tvrtko <tvrtko.ursulin@intel.com>
Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 tests/Makefile.sources    |   3 +
 tests/i915/gem_watchdog.c | 490 ++++++++++++++++++++++++++++++++++++++
 tests/meson.build         |   1 +
 3 files changed, 494 insertions(+)
 create mode 100644 tests/i915/gem_watchdog.c

diff --git a/tests/Makefile.sources b/tests/Makefile.sources
index 2d5c929e32fc..65b868e7d927 100644
--- a/tests/Makefile.sources
+++ b/tests/Makefile.sources
@@ -446,6 +446,9 @@ gem_userptr_blits_SOURCES = i915/gem_userptr_blits.c
 TESTS_progs += gem_wait
 gem_wait_SOURCES = i915/gem_wait.c
 
+TESTS_progs += gem_watchdog
+gem_watchdog_SOURCES = i915/gem_watchdog.c
+
 TESTS_progs += gem_workarounds
 gem_workarounds_SOURCES = i915/gem_workarounds.c
 
diff --git a/tests/i915/gem_watchdog.c b/tests/i915/gem_watchdog.c
new file mode 100644
index 000000000000..e9cdb886d3d6
--- /dev/null
+++ b/tests/i915/gem_watchdog.c
@@ -0,0 +1,490 @@
+/*
+ * Copyright © 2019 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+#include "igt.h"
+#include "igt_sysfs.h"
+#include "sw_sync.h"
+
+#include <pthread.h>
+#include <fcntl.h>
+
+#include <sys/time.h>
+#include <sys/ioctl.h>
+#include <sys/poll.h>
+#include <sys/signal.h>
+#include "i915/gem_ring.h"
+
+#define LOCAL_I915_EXEC_BSD_SHIFT	(13)
+#define LOCAL_I915_EXEC_BSD_RING1 	(1 << LOCAL_I915_EXEC_BSD_SHIFT)
+#define LOCAL_I915_EXEC_BSD_RING2 	(2 << LOCAL_I915_EXEC_BSD_SHIFT)
+
+#define MAX_PRIO LOCAL_I915_CONTEXT_MAX_USER_PRIORITY
+#define DEFAULT_PRIO LOCAL_I915_CONTEXT_DEFAULT_PRIORITY
+#define MIN_PRIO LOCAL_I915_CONTEXT_MIN_USER_PRIORITY
+#define HIGH 1
+#define LOW 0
+#define WATCHDOG_OVER_THRESHOLD (2070000)
+#define WATCHDOG_BELOW_THRESHOLD (1200000)
+#define WATCHDOG_THRESHOLD (1200)
+#define MAX_ENGINES 5
+#define RENDER_CLASS 0
+#define VIDEO_DECODE_CLASS 1
+#define VIDEO_ENHANCEMENT_CLASS 2
+#define COPY_ENGINE_CLASS 3
+#define LOCAL_I915_CONTEXT_PARAM_WATCHDOG 0x10
+
+#define GET_RESET_STATS_IOCTL DRM_IOWR(DRM_COMMAND_BASE + 0x32, struct local_drm_i915_reset_stats)
+struct local_drm_i915_reset_stats {
+	__u32 ctx_id;
+	__u32 flags;
+	__u32 reset_count;
+	__u32 batch_active;
+	__u32 batch_pending;
+	__u32 pad;
+};
+
+const uint64_t timeout_100ms = 100000000LL;
+float timedifference_msec(struct timeval t0, struct timeval t1);
+
+struct drm_i915_gem_watchdog_timeout {
+	union {
+		struct {
+			/*
+			 * Engine class & instance to be configured or queried.
+			 */
+			__u16 engine_class;
+			__u16 engine_instance;
+		};
+		/* Index based addressing mode */
+		__u32 index;
+	};
+	/* GPU Engine watchdog resets timeout in us */
+	__u32 timeout_us;
+};
+
+static void clear_error_state(int fd)
+{
+	int dir;
+
+	dir = igt_sysfs_open(fd);
+
+	if (dir < 0)
+		return;
+
+	/* Any write to the error state clears it */
+	igt_sysfs_set(dir, "error", "");
+	close(dir);
+}
+
+static int context_set_watchdog(int fd, unsigned engine_id,
+				 const char *engine_name,
+                                 unsigned ctx_id, unsigned threshold)
+{
+	struct drm_i915_gem_watchdog_timeout engines_threshold[MAX_ENGINES];
+	struct drm_i915_gem_context_param arg = {
+		.param = LOCAL_I915_CONTEXT_PARAM_WATCHDOG,
+		.ctx_id = ctx_id,
+		.size = sizeof(engines_threshold),
+		.value = (uint64_t)&engines_threshold
+	};
+
+	memset(&engines_threshold, 0, sizeof(engines_threshold));
+
+	switch (engine_id & I915_EXEC_RING_MASK) {
+	case I915_EXEC_RENDER:
+		engines_threshold[0].timeout_us = threshold;
+		engines_threshold[0].engine_class = RENDER_CLASS;
+		engines_threshold[0].engine_instance = 0;
+		break;
+	case I915_EXEC_BLT:
+		if (__gem_context_get_param(fd, &arg) == -ENODEV)
+			return -ENODEV;
+		else
+			engines_threshold[3].timeout_us = threshold;
+			engines_threshold[3].engine_class = COPY_ENGINE_CLASS;
+			engines_threshold[3].engine_instance = 0;
+		break;
+	case I915_EXEC_BSD:
+		engines_threshold[1].timeout_us = threshold;
+		engines_threshold[1].engine_class = VIDEO_DECODE_CLASS;
+		engines_threshold[1].engine_instance = 0;
+		break;
+	case I915_EXEC_VEBOX:
+		engines_threshold[2].timeout_us = threshold;
+		engines_threshold[2].engine_class = VIDEO_ENHANCEMENT_CLASS;
+		engines_threshold[2].engine_instance = 0;
+		break;
+	default:
+		break;
+	}
+
+	gem_context_set_param(fd, &arg);
+
+	return 0;
+}
+
+static void batch_buffer_factory(uint32_t fd, uint32_t ctx_id, unsigned exec_id,
+				 uint32_t target, uint32_t offset,
+				 uint32_t *handle, uint64_t timeout,
+				 int *fence, int fence_index)
+{
+    struct drm_i915_gem_exec_object2 obj[2];
+    struct drm_i915_gem_relocation_entry reloc;
+    struct drm_i915_gem_execbuffer2 execbuf;
+    igt_spin_t *spin = NULL;
+    const uint32_t bbe = MI_BATCH_BUFFER_END;
+    int i = 0;
+
+    gem_quiescent_gpu(fd);
+
+    memset(&execbuf, 0, sizeof(execbuf));
+    memset(&obj, 0, sizeof(obj));
+    memset(&reloc, 0, sizeof(reloc));
+
+    execbuf.buffers_ptr = to_user_pointer(obj);
+
+    execbuf.buffer_count = 2;
+    execbuf.flags = exec_id | I915_EXEC_FENCE_OUT ;
+
+    obj[0].handle = target;
+    obj[1].handle = gem_create(fd, 4096);
+
+    obj[1].relocation_count = 1;
+    obj[1].relocs_ptr = to_user_pointer(&reloc);
+
+    reloc.target_handle = obj[0].handle;
+    reloc.read_domains = I915_GEM_DOMAIN_COMMAND;
+    reloc.write_domain = I915_GEM_DOMAIN_COMMAND;
+    reloc.delta = offset * sizeof(uint32_t);
+
+    reloc.offset = i * sizeof(uint32_t);
+    gem_write(fd, obj[1].handle, 0, &bbe, sizeof(bbe));
+
+    __sync_synchronize();
+
+    gem_sync(fd, obj[1].handle);
+    execbuf.rsvd1 = ctx_id;
+    execbuf.rsvd2 = -1;
+
+    spin = igt_spin_new(fd, .dependency = obj[0].handle);
+    igt_spin_set_timeout(spin, timeout);
+    igt_assert(gem_bo_busy(fd, obj[0].handle));
+
+    gem_execbuf_wr(fd, &execbuf);
+    igt_spin_free(fd, spin);
+
+    fence[fence_index] = execbuf.rsvd2 >> 32;
+
+    gem_close(fd, obj[1].handle);
+    gem_quiescent_gpu(fd);
+}
+
+static void inject_hang(uint32_t fd, unsigned engine_id, uint32_t ctx_id,
+			 unsigned flags)
+{
+	igt_hang_t hang;
+	hang = igt_hang_ctx(fd, ctx_id, engine_id, flags);
+	gem_sync(fd, hang.spin->handle);
+}
+
+/* Test#1: create some work and let it run on all engines */
+static void long_batch_test1(int fd, int prio1, int prio2, int reset_ctx1,
+			int reset_ctx2, unsigned threshold)
+{
+	unsigned engine_id = 0;
+	unsigned nengine = 0;
+	struct intel_execution_engine2 *e_;
+	uint32_t ctx[8];
+	uint32_t scratch[8];
+	uint32_t ctx2[8];
+	uint32_t scratch2[8];
+	int *fence, i = 0;
+	const uint64_t batch_timeout_ms = timeout_100ms * 4;
+
+	fence = (int *)malloc(sizeof(int)*8);
+	igt_assert(fence);
+
+	__for_each_physical_engine(fd, e_) {
+		engine_id = e_->flags;
+
+		scratch2[nengine] = gem_create(fd, 4096);
+		ctx2[nengine] = gem_context_create(fd);
+		gem_context_set_priority(fd, ctx2[nengine], prio2);
+		batch_buffer_factory(fd, ctx2[nengine], engine_id,
+				 scratch2[nengine], 0, NULL,
+				 batch_timeout_ms, fence, nengine);
+
+		scratch[nengine] = gem_create(fd, 4096);
+		ctx[nengine] = gem_context_create(fd);
+		gem_context_set_priority(fd, ctx[nengine], prio1);
+		batch_buffer_factory(fd, ctx[nengine], engine_id,
+				 scratch[nengine], 0, NULL,
+				 batch_timeout_ms, fence, nengine);
+		igt_info("Test #1: ctx1/ctx2 --> execute on %s\n",e_->name);
+		nengine++;
+	}
+
+	for (i = 0; i < nengine; i++) {
+		close(fence[i]);
+		gem_context_destroy(fd, ctx[i]);
+		gem_context_destroy(fd, ctx2[i]);
+		gem_close(fd, scratch[i]);
+		gem_close(fd, scratch2[i]);
+	}
+}
+
+float timedifference_msec(struct timeval t0, struct timeval t1)
+{
+    return (t1.tv_sec - t0.tv_sec) * 1000.0f + (t1.tv_usec - t0.tv_usec) / 1000.0f;
+}
+
+/* Test#2: submit a long batch and after half of the expected runtime
+   submit a higher priority batch and then try to cancel the execution*/
+static void long_batch_test2(int fd, int prio1, int prio2, int reset_ctx1,
+			int reset_ctx2, unsigned threshold)
+{
+	unsigned engine_id = 0;
+	unsigned nengine = 0;
+	struct intel_execution_engine2 *e_;
+	uint32_t ctx[8];
+	uint32_t scratch[8];
+	uint32_t ctx2[8];
+	uint32_t scratch2[8];
+	int *fence, i = 0;
+	unsigned flags = HANG_ALLOW_CAPTURE;
+	const uint64_t batch_timeout_ms = timeout_100ms * 4;
+	struct timeval t0;
+	struct timeval t1;
+	float elapsed;
+
+	fence = (int *)malloc(sizeof(int)*8);
+	igt_assert(fence);
+
+	nengine = 0;
+
+	__for_each_physical_engine(fd, e_) {
+		engine_id = e_->flags;
+
+		scratch2[nengine] = gem_create(fd, 4096);
+		ctx2[nengine] = gem_context_create(fd);
+		gem_context_set_priority(fd, ctx2[nengine], prio2);
+		batch_buffer_factory(fd, ctx2[nengine], engine_id,
+				 scratch2[nengine], 0, NULL,
+				 batch_timeout_ms, fence, nengine);
+
+		gettimeofday(&t0, 0);
+		while(elapsed < 200) {
+			gettimeofday(&t1, 0);
+			elapsed += timedifference_msec(t0, t1);
+		}
+
+		scratch[nengine] = gem_create(fd, 4096);
+		ctx[nengine] = gem_context_create(fd);
+		gem_context_set_priority(fd, ctx[nengine], prio1);
+		batch_buffer_factory(fd, ctx[nengine], engine_id,
+				 scratch[nengine], 0, NULL,
+				 batch_timeout_ms, fence, nengine);
+
+		if (context_set_watchdog(fd, engine_id, e_->name,
+					 ctx[nengine],
+					 threshold) == -ENODEV) {
+			igt_info("No support for gpu h/w watchdog on %s\n",
+			 e_->name);
+			goto skip_case;
+		}
+		clear_error_state(fd);
+		inject_hang(fd, engine_id, ctx[nengine], flags);
+		/* Now check the engine was reset successfully */
+		//igt_assert_eq(sync_fence_status(*fence), -EIO);
+		igt_info("Test #2 ctx1/ctx2 --> set watchdog and"
+				" cancel ctx2 at half expected run with higher"
+				" priority engine: %s, fence status: 0x%x \n",
+					e_->name, sync_fence_status(*fence));
+skip_case:
+		nengine++;
+	}
+
+	for (i = 0; i < nengine; i++) {
+		close(fence[i]);
+		gem_context_destroy(fd, ctx[i]);
+		gem_context_destroy(fd, ctx2[i]);
+		gem_close(fd, scratch[i]);
+		gem_close(fd, scratch2[i]);
+	}
+}
+
+/* Test#3: create 2 ctx, set a gpu watchdog timeout on both,
+   and either execute or cancel */
+static void long_batch_test3(int fd, int prio1, int prio2, int reset_ctx1,
+			int reset_ctx2, unsigned threshold)
+{
+	unsigned engine_id = 0;
+	unsigned nengine = 0;
+	struct intel_execution_engine2 *e_;
+	uint32_t ctx[8];
+	uint32_t scratch[8];
+	uint32_t ctx2[8];
+	uint32_t scratch2[8];
+	int *fence, i = 0;
+	unsigned flags = HANG_ALLOW_CAPTURE;
+	const uint64_t batch_timeout_ms = timeout_100ms * 4;
+
+	fence = (int *)malloc(sizeof(int)*8);
+	igt_assert(fence);
+
+	nengine = 0;
+
+	__for_each_physical_engine(fd, e_) {
+		engine_id = e_->flags;
+
+		scratch2[nengine] = gem_create(fd, 4096);
+		ctx2[nengine] = gem_context_create(fd);
+		gem_context_set_priority(fd, ctx2[nengine], prio2);
+		batch_buffer_factory(fd, ctx2[nengine], engine_id,
+				 scratch2[nengine], 0, NULL,
+				 batch_timeout_ms, fence, nengine);
+
+		scratch[nengine] = gem_create(fd, 4096);
+		ctx[nengine] = gem_context_create(fd);
+		gem_context_set_priority(fd, ctx[nengine], prio1);
+		batch_buffer_factory(fd, ctx[nengine], engine_id,
+				 scratch[nengine], 0, NULL,
+				 batch_timeout_ms, fence, nengine);
+
+		if (context_set_watchdog(fd, engine_id, e_->name,
+					 ctx2[nengine],
+					 threshold) == -ENODEV) {
+			igt_info("No support for gpu h/w watchdog on %s\n",
+			 e_->name);
+			goto skip_case;
+		}
+
+		if (reset_ctx2) {
+			clear_error_state(fd);
+			inject_hang(fd, engine_id, ctx2[nengine], flags);
+
+			/* Now check the engine was reset successfully */
+			//igt_assert_eq(sync_fence_status(*fence), -EIO);
+			igt_info("Test #3 ctx1/ctx2 --> set watchdog and"
+				" cancel ctx2 on %s with fence status: 0x%x \n",
+					e_->name, sync_fence_status(*fence));
+		}
+
+		context_set_watchdog(fd, engine_id, e_->name,
+					 ctx[nengine],
+					 threshold);
+		if (reset_ctx1) {
+			clear_error_state(fd);
+			inject_hang(fd, engine_id, ctx[nengine], flags);
+
+			/* Now check the engine was reset successfully */
+			//igt_assert_eq(sync_fence_status(*fence), -EIO);
+			igt_info("Test #3: ctx1/ctx2 --> set watchdog and"
+				" cancel ctx1 on %s with fence status: 0x%x\n",
+				e_->name, sync_fence_status(*fence));
+		}
+skip_case:
+		nengine++;
+	}
+
+	for (i = 0; i < nengine; i++) {
+		close(fence[i]);
+		gem_context_destroy(fd, ctx[i]);
+		gem_context_destroy(fd, ctx2[i]);
+		gem_close(fd, scratch[i]);
+		gem_close(fd, scratch2[i]);
+	}
+}
+
+igt_main
+{
+	int fd;
+	unsigned int i=0;
+	struct {
+		char *name;
+		int prio[2];
+		bool reset[2];
+		unsigned threshold;
+	} tests[] = {
+		{"ctx1-exec-ctx2-exec-all-engines",
+		{DEFAULT_PRIO, DEFAULT_PRIO}, {false, false}, WATCHDOG_THRESHOLD},
+		{"ctx1-low-prio-exec-ctx2-high-prio-reset-after-half-time-all-engines",
+		{MAX_PRIO, MIN_PRIO}, {false, false}, WATCHDOG_THRESHOLD},
+		{"ctx1-reset-ctx2-exec-all-engines",
+		{DEFAULT_PRIO, DEFAULT_PRIO}, {true, false}, WATCHDOG_THRESHOLD},
+		{"ctx1-reset-ctx2-exec-all-engines-below-threshold",
+		{DEFAULT_PRIO, DEFAULT_PRIO}, {true, false},
+		 WATCHDOG_BELOW_THRESHOLD},
+		{"ctx1-reset-ctx2-exec-all-engines-over-threshold",
+		{DEFAULT_PRIO, DEFAULT_PRIO}, {true, false},
+		 WATCHDOG_OVER_THRESHOLD},
+		{"ctx2-reset-ctx1-exec-all-engines",
+		{DEFAULT_PRIO, DEFAULT_PRIO}, {false, true}, WATCHDOG_THRESHOLD},
+		{"ctx2-reset-ctx1-reset-all-engines",
+		{DEFAULT_PRIO, DEFAULT_PRIO}, {true, true}, WATCHDOG_THRESHOLD},
+		{"ctx1-high-prio-reset-ctx2-low-prio-exec-all-engines",
+		{MAX_PRIO, MIN_PRIO}, {true, false}, WATCHDOG_THRESHOLD},
+		{"ctx1-low-prio-reset-ctx2-high-prio-exec-all-engines",
+		{MIN_PRIO, MAX_PRIO}, {true, false}, WATCHDOG_THRESHOLD},
+		{"ctx1-low-prio-reset-ctx2-high-prio-reset-all-engines",
+		{MIN_PRIO, MAX_PRIO}, {true, true}, WATCHDOG_THRESHOLD},
+	};
+
+	igt_skip_on_simulation();
+
+	igt_fixture {
+		fd = drm_open_driver(DRIVER_INTEL);
+		igt_require_gem(fd);
+	}
+
+	igt_subtest_group {
+		igt_subtest_f("%s", tests[0].name) {
+			long_batch_test1( fd, tests[0].prio[0],
+					tests[0].prio[1],
+					tests[0].reset[0],
+					tests[0].reset[1],
+					tests[0].threshold);
+		}
+
+		igt_subtest_f("%s", tests[1].name) {
+			long_batch_test2( fd, tests[1].prio[0],
+					tests[1].prio[1],
+					tests[1].reset[0],
+					tests[1].reset[1],
+					tests[1].threshold);
+		}
+
+		for(i = 2; i < ARRAY_SIZE(tests); i++) {
+			igt_subtest_f("%s", tests[i].name) {
+				long_batch_test3( fd, tests[i].prio[0],
+						tests[i].prio[1],
+						tests[i].reset[0],
+						tests[i].reset[1],
+						tests[i].threshold);
+			}
+		}
+	}
+
+    igt_fixture {
+	close(fd);
+    }
+}
diff --git a/tests/meson.build b/tests/meson.build
index 806766e51667..ab73a0043c7c 100644
--- a/tests/meson.build
+++ b/tests/meson.build
@@ -216,6 +216,7 @@ i915_progs = [
 	'gem_unref_active_buffers',
 	'gem_userptr_blits',
 	'gem_wait',
+        'gem_watchdog',
 	'gem_workarounds',
 	'gem_write_read_ring_switch',
 	'i915_fb_tiling',
-- 
2.19.1


* [igt-dev] ✓ Fi.CI.BAT: success for tests/gem_watchdog: Initial set of tests for GPU watchdog (rev7)
  2019-05-28 21:35 [igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog Carlos Santa
@ 2019-05-28 22:03 ` Patchwork
  2019-05-29  8:23 ` [igt-dev] ✓ Fi.CI.IGT: " Patchwork
  2019-05-29  9:03 ` [igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog Chris Wilson
  2 siblings, 0 replies; 6+ messages in thread
From: Patchwork @ 2019-05-28 22:03 UTC (permalink / raw)
  To: Carlos Santa; +Cc: igt-dev

== Series Details ==

Series: tests/gem_watchdog: Initial set of tests for GPU watchdog (rev7)
URL   : https://patchwork.freedesktop.org/series/50041/
State : success

== Summary ==

CI Bug Log - changes from IGT_5022 -> IGTPW_3068
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/50041/revisions/7/mbox/

Known issues
------------

  Here are the changes found in IGTPW_3068 that come from known issues:

### IGT changes ###

#### Possible fixes ####

  * igt@kms_chamelium@common-hpd-after-suspend:
    - fi-kbl-7567u:       [WARN][1] ([fdo#109380]) -> [PASS][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/fi-kbl-7567u/igt@kms_chamelium@common-hpd-after-suspend.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/fi-kbl-7567u/igt@kms_chamelium@common-hpd-after-suspend.html

  * igt@kms_pipe_crc_basic@read-crc-pipe-c:
    - fi-kbl-7567u:       [SKIP][3] ([fdo#109271]) -> [PASS][4] +23 similar issues
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/fi-kbl-7567u/igt@kms_pipe_crc_basic@read-crc-pipe-c.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/fi-kbl-7567u/igt@kms_pipe_crc_basic@read-crc-pipe-c.html

  
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109380]: https://bugs.freedesktop.org/show_bug.cgi?id=109380


Participating hosts (46 -> 41)
------------------------------

  Additional (2): fi-byt-n2820 fi-pnv-d510 
  Missing    (7): fi-kbl-soraka fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-bsw-cyan fi-byt-clapper fi-bdw-samus 


Build changes
-------------

  * IGT: IGT_5022 -> IGTPW_3068

  CI_DRM_6159: f280d33ae895624a247d4431ece09b5088e6f021 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGTPW_3068: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/
  IGT_5022: b7e700567803237e2f3d7a8753fd4775f712d353 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools



== Testlist changes ==

+igt@gem_watchdog@ctx1-exec-ctx2-exec-all-engines
+igt@gem_watchdog@ctx1-high-prio-reset-ctx2-low-prio-exec-all-engines
+igt@gem_watchdog@ctx1-low-prio-exec-ctx2-high-prio-reset-after-half-time-all-engines
+igt@gem_watchdog@ctx1-low-prio-reset-ctx2-high-prio-exec-all-engines
+igt@gem_watchdog@ctx1-low-prio-reset-ctx2-high-prio-reset-all-engines
+igt@gem_watchdog@ctx1-reset-ctx2-exec-all-engines
+igt@gem_watchdog@ctx1-reset-ctx2-exec-all-engines-below-threshold
+igt@gem_watchdog@ctx1-reset-ctx2-exec-all-engines-over-threshold
+igt@gem_watchdog@ctx2-reset-ctx1-exec-all-engines
+igt@gem_watchdog@ctx2-reset-ctx1-reset-all-engines

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/

* [igt-dev] ✓ Fi.CI.IGT: success for tests/gem_watchdog: Initial set of tests for GPU watchdog (rev7)
  2019-05-28 21:35 [igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog Carlos Santa
  2019-05-28 22:03 ` [igt-dev] ✓ Fi.CI.BAT: success for tests/gem_watchdog: Initial set of tests for GPU watchdog (rev7) Patchwork
@ 2019-05-29  8:23 ` Patchwork
  2019-05-29  9:03 ` [igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog Chris Wilson
  2 siblings, 0 replies; 6+ messages in thread
From: Patchwork @ 2019-05-29  8:23 UTC (permalink / raw)
  To: Carlos Santa; +Cc: igt-dev

== Series Details ==

Series: tests/gem_watchdog: Initial set of tests for GPU watchdog (rev7)
URL   : https://patchwork.freedesktop.org/series/50041/
State : success

== Summary ==

CI Bug Log - changes from IGT_5022_full -> IGTPW_3068_full
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/50041/revisions/7/mbox/

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in IGTPW_3068_full:

### IGT changes ###

#### Possible regressions ####

  * {igt@gem_watchdog@ctx1-high-prio-reset-ctx2-low-prio-exec-all-engines} (NEW):
    - shard-hsw:          NOTRUN -> [FAIL][1] +8 similar issues
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-hsw1/igt@gem_watchdog@ctx1-high-prio-reset-ctx2-low-prio-exec-all-engines.html
    - shard-glk:          NOTRUN -> [FAIL][2] +8 similar issues
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-glk2/igt@gem_watchdog@ctx1-high-prio-reset-ctx2-low-prio-exec-all-engines.html

  * {igt@gem_watchdog@ctx1-low-prio-exec-ctx2-high-prio-reset-after-half-time-all-engines} (NEW):
    - shard-snb:          NOTRUN -> [FAIL][3] +9 similar issues
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-snb2/igt@gem_watchdog@ctx1-low-prio-exec-ctx2-high-prio-reset-after-half-time-all-engines.html

  * {igt@gem_watchdog@ctx1-reset-ctx2-exec-all-engines} (NEW):
    - shard-kbl:          NOTRUN -> [FAIL][4] +8 similar issues
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-kbl6/igt@gem_watchdog@ctx1-reset-ctx2-exec-all-engines.html

  * {igt@gem_watchdog@ctx2-reset-ctx1-exec-all-engines} (NEW):
    - shard-apl:          NOTRUN -> [FAIL][5] +7 similar issues
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-apl7/igt@gem_watchdog@ctx2-reset-ctx1-exec-all-engines.html

  
New tests
---------

  New tests have been introduced between IGT_5022_full and IGTPW_3068_full:

### New IGT tests (10) ###

  * igt@gem_watchdog@ctx1-exec-ctx2-exec-all-engines:
    - Statuses : 2 fail(s) 3 pass(s)
    - Exec time: [0.07, 0.13] s

  * igt@gem_watchdog@ctx1-high-prio-reset-ctx2-low-prio-exec-all-engines:
    - Statuses : 5 fail(s)
    - Exec time: [0.03, 0.16] s

  * igt@gem_watchdog@ctx1-low-prio-exec-ctx2-high-prio-reset-after-half-time-all-engines:
    - Statuses : 5 fail(s)
    - Exec time: [0.01, 0.10] s

  * igt@gem_watchdog@ctx1-low-prio-reset-ctx2-high-prio-exec-all-engines:
    - Statuses : 5 fail(s)
    - Exec time: [0.03, 0.10] s

  * igt@gem_watchdog@ctx1-low-prio-reset-ctx2-high-prio-reset-all-engines:
    - Statuses : 4 fail(s)
    - Exec time: [0.07, 0.13] s

  * igt@gem_watchdog@ctx1-reset-ctx2-exec-all-engines:
    - Statuses : 5 fail(s)
    - Exec time: [0.03, 0.16] s

  * igt@gem_watchdog@ctx1-reset-ctx2-exec-all-engines-below-threshold:
    - Statuses : 5 fail(s)
    - Exec time: [0.09, 0.15] s

  * igt@gem_watchdog@ctx1-reset-ctx2-exec-all-engines-over-threshold:
    - Statuses : 5 fail(s)
    - Exec time: [0.03, 0.17] s

  * igt@gem_watchdog@ctx2-reset-ctx1-exec-all-engines:
    - Statuses : 5 fail(s)
    - Exec time: [0.05, 0.13] s

  * igt@gem_watchdog@ctx2-reset-ctx1-reset-all-engines:
    - Statuses : 4 fail(s)
    - Exec time: [0.07, 0.10] s

  

Known issues
------------

  Here are the changes found in IGTPW_3068_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_eio@in-flight-suspend:
    - shard-apl:          [PASS][6] -> [DMESG-WARN][7] ([fdo#108566]) +4 similar issues
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-apl5/igt@gem_eio@in-flight-suspend.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-apl4/igt@gem_eio@in-flight-suspend.html

  * igt@kms_flip@2x-plain-flip-ts-check:
    - shard-hsw:          [PASS][8] -> [INCOMPLETE][9] ([fdo#103540])
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-hsw6/igt@kms_flip@2x-plain-flip-ts-check.html
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-hsw4/igt@kms_flip@2x-plain-flip-ts-check.html

  * igt@kms_frontbuffer_tracking@fbc-2p-primscrn-cur-indfb-draw-pwrite:
    - shard-hsw:          [PASS][10] -> [SKIP][11] ([fdo#109271]) +27 similar issues
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-hsw4/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-cur-indfb-draw-pwrite.html
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-hsw1/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-cur-indfb-draw-pwrite.html

  * igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c:
    - shard-kbl:          [PASS][12] -> [DMESG-WARN][13] ([fdo#108566]) +1 similar issue
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-kbl3/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c.html
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-kbl7/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-c.html

  
#### Possible fixes ####

  * igt@gem_ctx_isolation@rcs0-s3:
    - shard-kbl:          [DMESG-WARN][14] ([fdo#108566]) -> [PASS][15] +3 similar issues
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-kbl7/igt@gem_ctx_isolation@rcs0-s3.html
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-kbl2/igt@gem_ctx_isolation@rcs0-s3.html

  * igt@gem_ctx_isolation@vcs1-s3:
    - shard-kbl:          [INCOMPLETE][16] ([fdo#103665]) -> [PASS][17]
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-kbl3/igt@gem_ctx_isolation@vcs1-s3.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-kbl2/igt@gem_ctx_isolation@vcs1-s3.html

  * igt@gem_eio@suspend:
    - shard-kbl:          [DMESG-WARN][18] ([fdo#103558] / [fdo#105602]) -> [PASS][19] +10 similar issues
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-kbl3/igt@gem_eio@suspend.html
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-kbl6/igt@gem_eio@suspend.html

  * igt@gem_tiled_swapping@non-threaded:
    - shard-kbl:          [DMESG-WARN][20] ([fdo#108686]) -> [PASS][21]
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-kbl7/igt@gem_tiled_swapping@non-threaded.html
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-kbl1/igt@gem_tiled_swapping@non-threaded.html

  * igt@kms_cursor_crc@pipe-a-cursor-256x85-sliding:
    - shard-kbl:          [FAIL][22] ([fdo#103232]) -> [PASS][23]
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-kbl6/igt@kms_cursor_crc@pipe-a-cursor-256x85-sliding.html
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-kbl2/igt@kms_cursor_crc@pipe-a-cursor-256x85-sliding.html
    - shard-apl:          [FAIL][24] ([fdo#103232]) -> [PASS][25]
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-apl7/igt@kms_cursor_crc@pipe-a-cursor-256x85-sliding.html
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-apl5/igt@kms_cursor_crc@pipe-a-cursor-256x85-sliding.html

  * igt@kms_cursor_legacy@cursor-vs-flip-atomic-transitions:
    - shard-hsw:          [FAIL][26] ([fdo#103355]) -> [PASS][27]
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-hsw6/igt@kms_cursor_legacy@cursor-vs-flip-atomic-transitions.html
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-hsw6/igt@kms_cursor_legacy@cursor-vs-flip-atomic-transitions.html

  * igt@kms_flip@2x-plain-flip-ts-check-interruptible:
    - shard-hsw:          [SKIP][28] ([fdo#109271]) -> [PASS][29] +24 similar issues
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-hsw1/igt@kms_flip@2x-plain-flip-ts-check-interruptible.html
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-hsw1/igt@kms_flip@2x-plain-flip-ts-check-interruptible.html

  * igt@kms_setmode@basic:
    - shard-apl:          [FAIL][30] ([fdo#99912]) -> [PASS][31]
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-apl8/igt@kms_setmode@basic.html
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-apl4/igt@kms_setmode@basic.html

  * igt@kms_vblank@pipe-c-ts-continuation-suspend:
    - shard-apl:          [DMESG-WARN][32] ([fdo#108566]) -> [PASS][33] +6 similar issues
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-apl8/igt@kms_vblank@pipe-c-ts-continuation-suspend.html
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-apl5/igt@kms_vblank@pipe-c-ts-continuation-suspend.html

  * igt@perf_pmu@rc6-runtime-pm-long:
    - shard-glk:          [FAIL][34] ([fdo#105010]) -> [PASS][35]
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-glk9/igt@perf_pmu@rc6-runtime-pm-long.html
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-glk1/igt@perf_pmu@rc6-runtime-pm-long.html

  
#### Warnings ####

  * igt@kms_cursor_crc@pipe-a-cursor-512x170-random:
    - shard-kbl:          [SKIP][36] ([fdo#105602] / [fdo#109271]) -> [SKIP][37] ([fdo#109271]) +3 similar issues
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-kbl3/igt@kms_cursor_crc@pipe-a-cursor-512x170-random.html
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-kbl3/igt@kms_cursor_crc@pipe-a-cursor-512x170-random.html

  * igt@prime_vgem@wait-bsd1:
    - shard-snb:          [INCOMPLETE][38] ([fdo#105411]) -> [FAIL][39] ([fdo#110764]) +1 similar issue
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5022/shard-snb5/igt@prime_vgem@wait-bsd1.html
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/shard-snb7/igt@prime_vgem@wait-bsd1.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#103232]: https://bugs.freedesktop.org/show_bug.cgi?id=103232
  [fdo#103355]: https://bugs.freedesktop.org/show_bug.cgi?id=103355
  [fdo#103540]: https://bugs.freedesktop.org/show_bug.cgi?id=103540
  [fdo#103558]: https://bugs.freedesktop.org/show_bug.cgi?id=103558
  [fdo#103665]: https://bugs.freedesktop.org/show_bug.cgi?id=103665
  [fdo#105010]: https://bugs.freedesktop.org/show_bug.cgi?id=105010
  [fdo#105411]: https://bugs.freedesktop.org/show_bug.cgi?id=105411
  [fdo#105602]: https://bugs.freedesktop.org/show_bug.cgi?id=105602
  [fdo#108566]: https://bugs.freedesktop.org/show_bug.cgi?id=108566
  [fdo#108686]: https://bugs.freedesktop.org/show_bug.cgi?id=108686
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#110764]: https://bugs.freedesktop.org/show_bug.cgi?id=110764
  [fdo#99912]: https://bugs.freedesktop.org/show_bug.cgi?id=99912


Participating hosts (6 -> 5)
------------------------------

  Missing    (1): shard-skl 


Build changes
-------------

  * IGT: IGT_5022 -> IGTPW_3068

  CI_DRM_6159: f280d33ae895624a247d4431ece09b5088e6f021 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGTPW_3068: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/
  IGT_5022: b7e700567803237e2f3d7a8753fd4775f712d353 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3068/

* Re: [igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog
  2019-05-28 21:35 [igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog Carlos Santa
  2019-05-28 22:03 ` [igt-dev] ✓ Fi.CI.BAT: success for tests/gem_watchdog: Initial set of tests for GPU watchdog (rev7) Patchwork
  2019-05-29  8:23 ` [igt-dev] ✓ Fi.CI.IGT: " Patchwork
@ 2019-05-29  9:03 ` Chris Wilson
  2019-06-11  1:55   ` Carlos Santa
  2019-06-18  1:35   ` Carlos Santa
  2 siblings, 2 replies; 6+ messages in thread
From: Chris Wilson @ 2019-05-29  9:03 UTC (permalink / raw)
  To: Carlos Santa, igt-dev; +Cc: Ursulin Tvrtko

Quoting Carlos Santa (2019-05-28 22:35:59)
> This test adds a basic set of tests to reset the different
> GPU engines through the GPU watchdog timer.
> 
> Credits to Antonio for the original codebase this is based on.
> 
> v2: remove gem_context_get_param() during set (Antonio)
>     remove clearing of the engines_threshold[] in the default case
>     inside context_set_watchdog(). (Antonio)
>     fix indexing when creating low/high priority contexts
>     get rid of 2 threads idea (Antonio)
>     fix context prio bug due to wrong indexing (Antonio)
> v3: no need to force clear drm_i915_gem_watchdog_timeout struct
>     when setting the watchdog timeouts (Antonio)
>     -use gem_context_set_priority() instead to use error checking
>     -always use a looper to test all engines (Antonio)
>     -modify test gpu_watchdog_hang_long_batch_single_engine()
>     to use a couple of batches and set the timeout in between them (Antonio)
>     -make use of for_each_engine_class_instance() (Antonio)
> v4: -tests should be able to handle s/w watchdog timeout (Tvrtko)
>     -remove dead code inside batch_buffer factory (Tvrtko)
>     -set_watchdog() should return ENODEV for fail cases (Tvrtko)
>     -enclose name, priority and reset params inside array (Tvrtko)
> v5: -Rebased. Tests 5,6 and 8 from the test plan are now added
> 
> Test Plan:
> 
> Assumptions:
> 1. Use fence status to figure out which ctx was reset
> 2. Use spin batches w/ user space timers to control duration
> and corking to control ordering
> 3. Use context priorities to force preemption
> 
> Tests:
> 1. ctx1, long_batch -> execute
> 2. ctx1, set_watchdog -> reset
> 3. ctx2/ctx1 -> execute, reset
> 4. ctx1/ctx2 -> reset, execute
> 5. ctx1_just_below_threshold -> execute
> 6. ctx_over_the_threshold -> reset
> 7. set watchdog on some engines -> execute
> 8. submit long_batch and after half of the
> expected runtime submit higher prio batch
> 9. submit low prio batch w/o watchdog then
> higher prio with watchdog
> 
> Unresolved items:
> 1. The check for sync_fence_status(*fence)
> returns -1 instead of EIO.
> 
> Cc: Ursulin Tvrtko <tvrtko.ursulin@intel.com>
> Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> ---
>  tests/Makefile.sources    |   3 +
>  tests/i915/gem_watchdog.c | 490 ++++++++++++++++++++++++++++++++++++++
>  tests/meson.build         |   1 +
>  3 files changed, 494 insertions(+)
>  create mode 100644 tests/i915/gem_watchdog.c
> 
> diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> index 2d5c929e32fc..65b868e7d927 100644
> --- a/tests/Makefile.sources
> +++ b/tests/Makefile.sources
> @@ -446,6 +446,9 @@ gem_userptr_blits_SOURCES = i915/gem_userptr_blits.c
>  TESTS_progs += gem_wait
>  gem_wait_SOURCES = i915/gem_wait.c
>  
> +TESTS_progs += gem_watchdog
> +gem_watchdog_SOURCES = i915/gem_watchdog.c
> +
>  TESTS_progs += gem_workarounds
>  gem_workarounds_SOURCES = i915/gem_workarounds.c
>  
> diff --git a/tests/i915/gem_watchdog.c b/tests/i915/gem_watchdog.c
> new file mode 100644
> index 000000000000..e9cdb886d3d6
> --- /dev/null
> +++ b/tests/i915/gem_watchdog.c
> @@ -0,0 +1,490 @@
> +/*
> + * Copyright © 2019 Intel Corporation
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a
> + * copy of this software and associated documentation files (the "Software"),
> + * to deal in the Software without restriction, including without limitation
> + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> + * and/or sell copies of the Software, and to permit persons to whom the
> + * Software is furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice (including the next
> + * paragraph) shall be included in all copies or substantial portions of the
> + * Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
> + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
> + * IN THE SOFTWARE.
> + */
> +#include "igt.h"
> +#include "igt_sysfs.h"
> +#include "sw_sync.h"
> +
> +#include <pthread.h>
> +#include <fcntl.h>
> +
> +#include <sys/time.h>
> +#include <sys/ioctl.h>
> +#include <sys/poll.h>
> +#include <sys/signal.h>
> +#include "i915/gem_ring.h"
> +
> +#define LOCAL_I915_EXEC_BSD_SHIFT      (13)
> +#define LOCAL_I915_EXEC_BSD_RING1      (1 << LOCAL_I915_EXEC_BSD_SHIFT)
> +#define LOCAL_I915_EXEC_BSD_RING2      (2 << LOCAL_I915_EXEC_BSD_SHIFT)
> +
> +#define MAX_PRIO LOCAL_I915_CONTEXT_MAX_USER_PRIORITY
> +#define DEFAULT_PRIO LOCAL_I915_CONTEXT_DEFAULT_PRIORITY
> +#define MIN_PRIO LOCAL_I915_CONTEXT_MIN_USER_PRIORITY
> +#define HIGH 1
> +#define LOW 0
> +#define WATCHDOG_OVER_THRESHOLD (2070000)
> +#define WATCHDOG_BELOW_THRESHOLD (1200000)
> +#define WATCHDOG_THRESHOLD (1200)
> +#define MAX_ENGINES 5
> +#define RENDER_CLASS 0
> +#define VIDEO_DECODE_CLASS 1
> +#define VIDEO_ENHANCEMENT_CLASS 2
> +#define COPY_ENGINE_CLASS 3
> +#define LOCAL_I915_CONTEXT_PARAM_WATCHDOG 0x10
> +
> +#define GET_RESET_STATS_IOCTL DRM_IOWR(DRM_COMMAND_BASE + 0x32, struct local_drm_i915_reset_stats)
> +struct local_drm_i915_reset_stats {
> +       __u32 ctx_id;
> +       __u32 flags;
> +       __u32 reset_count;
> +       __u32 batch_active;
> +       __u32 batch_pending;
> +       __u32 pad;
> +};
> +
> +const uint64_t timeout_100ms = 100000000LL;
> +float timedifference_msec(struct timeval t0, struct timeval t1);
> +
> +struct drm_i915_gem_watchdog_timeout {
> +       union {
> +               struct {
> +                       /*
> +                        * Engine class & instance to be configured or queried.
> +                        */
> +                       __u16 engine_class;
> +                       __u16 engine_instance;
> +               };
> +               /* Index based addressing mode */
> +               __u32 index;
> +       };
> +       /* GPU Engine watchdog resets timeout in us */
> +       __u32 timeout_us;
> +};
> +
> +static void clear_error_state(int fd)
> +{
> +       int dir;
> +
> +       dir = igt_sysfs_open(fd);
> +
> +       if (dir < 0)
> +               return;
> +
> +       /* Any write to the error state clears it */
> +       igt_sysfs_set(dir, "error", "");
> +       close(dir);
> +}
> +
> +static int context_set_watchdog(int fd, unsigned engine_id,
> +                                const char *engine_name,
> +                                 unsigned ctx_id, unsigned threshold)
> +{
> +       struct drm_i915_gem_watchdog_timeout engines_threshold[MAX_ENGINES];
> +       struct drm_i915_gem_context_param arg = {
> +               .param = LOCAL_I915_CONTEXT_PARAM_WATCHDOG,
> +               .ctx_id = ctx_id,
> +               .size = sizeof(engines_threshold),

Won't the kernel complain about the invalid values?

> +               .value = (uint64_t)&engines_threshold
> +       };
> +
> +       memset(&engines_threshold, 0, sizeof(engines_threshold));

You set all the unspecified values to be rcs0 with 0 timeout.

> +
> +       switch (engine_id & I915_EXEC_RING_MASK) {
> +       case I915_EXEC_RENDER:
> +               engines_threshold[0].timeout_us = threshold;

I presume this is a u64, so might as well use _ns for consistency.

> +               engines_threshold[0].engine_class = RENDER_CLASS;
> +               engines_threshold[0].engine_instance = 0;

Use struct i915_engine_class_instance.
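For illustration (assuming the uapi header is included), that could look
like:

    struct i915_engine_class_instance ci = {
            .engine_class = I915_ENGINE_CLASS_RENDER,
            .engine_instance = 0,
    };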

> +               break;
> +       case I915_EXEC_BLT:
> +               if (__gem_context_get_param(fd, &arg) == -ENODEV)
> +                       return -ENODEV;

And otherwise setting random values into engine_thresholds.

> +               else
> +                       engines_threshold[3].timeout_us = threshold;
> +                       engines_threshold[3].engine_class = COPY_ENGINE_CLASS;
> +                       engines_threshold[3].engine_instance = 0;
> +               break;
> +       case I915_EXEC_BSD:
> +               engines_threshold[1].timeout_us = threshold;
> +               engines_threshold[1].engine_class = VIDEO_DECODE_CLASS;
> +               engines_threshold[1].engine_instance = 0;
> +               break;
> +       case I915_EXEC_VEBOX:
> +               engines_threshold[2].timeout_us = threshold;
> +               engines_threshold[2].engine_class = VIDEO_ENHANCEMENT_CLASS;
> +               engines_threshold[2].engine_instance = 0;
> +               break;
> +       default:
> +               break;
> +       }

Ok, so you only set one engine and the rest to zero. That seems like a
bad example in how to use the uAPI.
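
A sketch of filling in every engine instead, reusing the array as this
patch defines it plus IGT's engine iterator (illustrative only; whether
the proposed setparam accepts a trimmed array is an assumption):

    struct intel_execution_engine2 *e;
    unsigned int n = 0;

    __for_each_physical_engine(fd, e) {
            igt_assert(n < MAX_ENGINES);
            engines_threshold[n].engine_class = e->class;
            engines_threshold[n].engine_instance = e->instance;
            engines_threshold[n].timeout_us = threshold;
            n++;
    }
    arg.size = n * sizeof(engines_threshold[0]);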

> +       gem_context_set_param(fd, &arg);
> +
> +       return 0;
> +}
> +
> +static void batch_buffer_factory(uint32_t fd, uint32_t ctx_id, unsigned exec_id,
> +                                uint32_t target, uint32_t offset,
> +                                uint32_t *handle, uint64_t timeout,
> +                                int *fence, int fence_index)
> +{
> +    struct drm_i915_gem_exec_object2 obj[2];
> +    struct drm_i915_gem_relocation_entry reloc;
> +    struct drm_i915_gem_execbuffer2 execbuf;
> +    igt_spin_t *spin = NULL;
> +    const uint32_t bbe = MI_BATCH_BUFFER_END;
> +    int i = 0;
> +
> +    gem_quiescent_gpu(fd);
> +
> +    memset(&execbuf, 0, sizeof(execbuf));
> +    memset(&obj, 0, sizeof(obj));
> +    memset(&reloc, 0, sizeof(reloc));
> +
> +    execbuf.buffers_ptr = to_user_pointer(obj);
> +
> +    execbuf.buffer_count = 2;
> +    execbuf.flags = exec_id | I915_EXEC_FENCE_OUT ;
> +
> +    obj[0].handle = target;
> +    obj[1].handle = gem_create(fd, 4096);
> +
> +    obj[1].relocation_count = 1;
> +    obj[1].relocs_ptr = to_user_pointer(&reloc);
> +
> +    reloc.target_handle = obj[0].handle;
> +    reloc.read_domains = I915_GEM_DOMAIN_COMMAND;
> +    reloc.write_domain = I915_GEM_DOMAIN_COMMAND;
> +    reloc.delta = offset * sizeof(uint32_t);
> +
> +    reloc.offset = i * sizeof(uint32_t);
> +    gem_write(fd, obj[1].handle, 0, &bbe, sizeof(bbe));
> +
> +    __sync_synchronize();

What?

> +
> +    gem_sync(fd, obj[1].handle);
> +    execbuf.rsvd1 = ctx_id;
> +    execbuf.rsvd2 = -1;
> +
> +    spin = igt_spin_new(fd, .dependency = obj[0].handle);
> +    igt_spin_set_timeout(spin, timeout);
> +    igt_assert(gem_bo_busy(fd, obj[0].handle));
> +
> +    gem_execbuf_wr(fd, &execbuf);
> +    igt_spin_free(fd, spin);

This still doesn't do what you think it does. Hint, this never runs to
timeout.

> +    fence[fence_index] = execbuf.rsvd2 >> 32;

You can just ask the spinner to return a fence.
spin = igt_spin_new(fd, .engine = engine, .flags = IGT_SPIN_FENCE_OUT);
then fence = spin->out_fence.
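
Applied to this test's variables, that would look roughly like:

    spin = igt_spin_new(fd, .ctx = ctx[nengine], .engine = engine_id,
                        .flags = IGT_SPIN_FENCE_OUT);
    fence[nengine] = spin->out_fence;
    /* keep the spinner alive until the fence has been checked;
     * igt_spin_free() also releases the out-fence */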

> +    gem_close(fd, obj[1].handle);
> +    gem_quiescent_gpu(fd);
> +}
> +
> +static void inject_hang(uint32_t fd, unsigned engine_id, uint32_t ctx_id,
> +                        unsigned flags)
> +{
> +       igt_hang_t hang;
> +       hang = igt_hang_ctx(fd, ctx_id, engine_id, flags);
> +       gem_sync(fd, hang.spin->handle);
> +}
> +
> +/* Test#1: create some work and let it run on all engines */
> +static void long_batch_test1(int fd, int prio1, int prio2, int reset_ctx1,
> +                       int reset_ctx2, unsigned threshold)
> +{
> +       unsigned engine_id = 0;
> +       unsigned nengine = 0;
> +       struct intel_execution_engine2 *e_;
> +       uint32_t ctx[8];
> +       uint32_t scratch[8];
> +       uint32_t ctx2[8];
> +       uint32_t scratch2[8];

A struct might spell it out a little more concisely.

And 8! That's cutting it a little fine. MAX_ENGINES is 64 and we are
already dreading the day we exceed that.
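
Something along these lines, perhaps (illustrative only):

    struct client {
            uint32_t ctx, scratch;
            int fence;
    } low[64], high[64];    /* one slot per possible physical engine */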

> +       int *fence, i = 0;
> +       const uint64_t batch_timeout_ms = timeout_100ms * 4;
> +
> +       fence = (int *)malloc(sizeof(int)*8);

Please be kind to readers. The cast is implicit, and spaces are required
around '*'.
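i.e. something like:

    fence = malloc(sizeof(*fence) * 8);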

> +       igt_assert(fence);
> +
> +       __for_each_physical_engine(fd, e_) {
> +               engine_id = e_->flags;
> +
> +               scratch2[nengine] = gem_create(fd, 4096);
> +               ctx2[nengine] = gem_context_create(fd);
> +               gem_context_set_priority(fd, ctx2[nengine], prio2);
> +               batch_buffer_factory(fd, ctx2[nengine], engine_id,
> +                                scratch2[nengine], 0, NULL,
> +                                batch_timeout_ms, fence, nengine);
> +
> +               scratch[nengine] = gem_create(fd, 4096);
> +               ctx[nengine] = gem_context_create(fd);
> +               gem_context_set_priority(fd, ctx[nengine], prio1);
> +               batch_buffer_factory(fd, ctx[nengine], engine_id,
> +                                scratch[nengine], 0, NULL,
> +                                batch_timeout_ms, fence, nengine);
> +               igt_info("Test #1: ctx1/ctx2 --> execute on %s\n",e_->name);
> +               nengine++;

So that doesn't test anything at all.

> +       }
> +
> +       for (i = 0; i < nengine; i++) {
> +               close(fence[i]);
> +               gem_context_destroy(fd, ctx[i]);
> +               gem_context_destroy(fd, ctx2[i]);
> +               gem_close(fd, scratch[i]);
> +               gem_close(fd, scratch2[i]);
> +       }
> +}
> +
> +float timedifference_msec(struct timeval t0, struct timeval t1)
> +{
> +    return (t1.tv_sec - t0.tv_sec) * 1000.0f + (t1.tv_usec - t0.tv_usec) / 1000.0f;
> +}
> +
> +/* Test#2: submit a long batch and after half of the expected runtime
> +   submit a higher priority batch and then try to cancel the execution*/
> +static void long_batch_test2(int fd, int prio1, int prio2, int reset_ctx1,
> +                       int reset_ctx2, unsigned threshold)
> +{
> +       unsigned engine_id = 0;
> +       unsigned nengine = 0;
> +       struct intel_execution_engine2 *e_;
> +       uint32_t ctx[8];
> +       uint32_t scratch[8];
> +       uint32_t ctx2[8];
> +       uint32_t scratch2[8];
> +       int *fence, i = 0;
> +       unsigned flags = HANG_ALLOW_CAPTURE;
> +       const uint64_t batch_timeout_ms = timeout_100ms * 4;
> +       struct timeval t0;
> +       struct timeval t1;
> +       float elapsed;
> +
> +       fence = (int *)malloc(sizeof(int)*8);
> +       igt_assert(fence);
> +
> +       nengine = 0;
> +
> +       __for_each_physical_engine(fd, e_) {
> +               engine_id = e_->flags;
> +
> +               scratch2[nengine] = gem_create(fd, 4096);
> +               ctx2[nengine] = gem_context_create(fd);
> +               gem_context_set_priority(fd, ctx2[nengine], prio2);
> +               batch_buffer_factory(fd, ctx2[nengine], engine_id,
> +                                scratch2[nengine], 0, NULL,
> +                                batch_timeout_ms, fence, nengine);
> +
> +               gettimeofday(&t0, 0);
> +               while(elapsed < 200) {
> +                       gettimeofday(&t1, 0);
> +                       elapsed += timedifference_msec(t0, t1);
> +               }

See usleep() or nanosleep()
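e.g. simply:

    usleep(200 * 1000); /* ~200ms, without spinning on gettimeofday() */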

> +               scratch[nengine] = gem_create(fd, 4096);
> +               ctx[nengine] = gem_context_create(fd);
> +               gem_context_set_priority(fd, ctx[nengine], prio1);
> +               batch_buffer_factory(fd, ctx[nengine], engine_id,
> +                                scratch[nengine], 0, NULL,
> +                                batch_timeout_ms, fence, nengine);
> +
> +               if (context_set_watchdog(fd, engine_id, e_->name,
> +                                        ctx[nengine],
> +                                        threshold) == -ENODEV) {
> +                       igt_info("No support for gpu h/w watchdog on %s\n",
> +                        e_->name);
> +                       goto skip_case;
> +               }
> +               clear_error_state(fd);
> +               inject_hang(fd, engine_id, ctx[nengine], flags);
> +               /* Now check the engine was reset successfully */
> +               //igt_assert_eq(sync_fence_status(*fence), -EIO);

Yeah, that fence has nothing to do with anything.

> +               igt_info("Test #2 ctx1/ctx2 --> set watchdog and"
> +                               " cancel ctx2 at half expected run with higher"
> +                               " priority engine: %s, fence status: 0x%x \n",
> +                                       e_->name, sync_fence_status(*fence));

You've set a watchdog, but don't validate that it had any effect.

May I suggest, you using a timing batch. Write the initial timestamp to
scratch, and write a second timestamp and recurse back to writing the
second timestamp. After that batch is forcibly reset, you can compare
the two timestamps to see how long it executed for (without having to
dig into the context image to extract the ppHWSP runtimes). The contract
given by the watchdog uAPI would seem to be that the batch is hung after
that runtime exceeds the timeout threshold (though how strictly we fail
for exceeding the timeout???) and most definitely not be reset before we
reach the threshold (give or take a few microseconds required for the
recursion and context switching/execution that will be charged to the
watchdog and not recorded by the batch).
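
A sketch of the check that suggestion implies (hypothetical: it assumes the
batch stored its first timestamp in dword 0 of the scratch bo, kept
overwriting dword 1 until the reset, and that ts_to_ns(), threshold_ns and
slack_ns are provided by the test):

    uint32_t *ts;
    uint64_t runtime;

    ts = gem_mmap__cpu(fd, scratch, 0, 4096, PROT_READ);
    gem_set_domain(fd, scratch, I915_GEM_DOMAIN_CPU, 0);
    runtime = ts_to_ns(ts[1] - ts[0]);
    igt_assert(runtime >= threshold_ns);
    igt_assert(runtime <= threshold_ns + slack_ns);
    munmap(ts, 4096);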
-Chris

* Re: [igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog
  2019-05-29  9:03 ` [igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog Chris Wilson
@ 2019-06-11  1:55   ` Carlos Santa
  2019-06-18  1:35   ` Carlos Santa
  1 sibling, 0 replies; 6+ messages in thread
From: Carlos Santa @ 2019-06-11  1:55 UTC (permalink / raw)
  To: Chris Wilson, igt-dev; +Cc: Ursulin Tvrtko

On Wed, 2019-05-29 at 10:03 +0100, Chris Wilson wrote:
> Quoting Carlos Santa (2019-05-28 22:35:59)
> > This test adds a basic set of tests to reset the different
> > GPU engines through the GPU watchdog timer.
> > 
> > Credits to Antonio for the original codebase this is based on.
> > 
> > v2: remove gem_context_get_param() during set (Antonio)
> >     remove clearing of the engines_threshold[] in the default case
> >     inside context_set_watchdog(). (Antonio)
> >     fix indexing when creating low/high priority contexts
> >     get rid of 2 threads idea (Antonio)
> >     fix context prio bug due to wrong indexing (Antonio)
> > v3: no need to force clear drm_i915_gem_watchdog_timeout struct
> >     when setting the watchdog timeouts (Antonio)
> >     -use gem_context_set_priority() instead to use error checking
> >     -always use a looper to test all engines (Antonio)
> >     -modify test gpu_watchdog_hang_long_batch_single_engine()
> >     to use a couple of batches and set the timeout in between them
> > (Antonio)
> >     -make use of for_each_engine_class_instance() (Antonio)
> > v4: -tests should be able to handle s/w watchdog timeout (Tvrtko)
> >     -remove dead code inside batch_buffer factory (Tvrtko)
> >     -set_watchdog() should return ENODEV for fail cases (Tvrtko)
> >     -enclose name, priority and reset params inside array (Tvrtko)
> > v5: -Rebased. Tests 5,6 and 8 from the test plan are now added
> > 
> > Test Plan:
> > 
> > Assumptions:
> > 1. Use fence status to figure out which ctx was reset
> > 2. Use spin batches w/ user space timers to control duration
> > and corking to control ordering
> > 3. Use context priorities to force preemption
> > 
> > Tests:
> > 1. ctx1, long_batch -> execute
> > 2. ctx1, set_watchdog -> reset
> > 3. ctx2/ctx1 -> execute, reset
> > 4. ctx1/ctx2 -> reset, execute
> > 5. ctx1_just_below_threshold -> execute
> > 6. ctx_over_the_threshold -> reset
> > 7. set watchdog on some engines -> execute
> > 8. submit long_batch and after half of the
> > expected runtime submit higher prio batch
> > 9. submit low prio batch w/o watchdog then
> > higher prio with watchdog
> > 
> > Unresolved items:
> > 1. The check for sync_fence_status(*fence)
> > returns -1 instead of EIO.
> > 
> > Cc: Ursulin Tvrtko <tvrtko.ursulin@intel.com>
> > Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> > ---
> >  tests/Makefile.sources    |   3 +
> >  tests/i915/gem_watchdog.c | 490
> > ++++++++++++++++++++++++++++++++++++++
> >  tests/meson.build         |   1 +
> >  3 files changed, 494 insertions(+)
> >  create mode 100644 tests/i915/gem_watchdog.c
> > 
> > diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> > index 2d5c929e32fc..65b868e7d927 100644
> > --- a/tests/Makefile.sources
> > +++ b/tests/Makefile.sources
> > @@ -446,6 +446,9 @@ gem_userptr_blits_SOURCES =
> > i915/gem_userptr_blits.c
> >  TESTS_progs += gem_wait
> >  gem_wait_SOURCES = i915/gem_wait.c
> >  
> > +TESTS_progs += gem_watchdog
> > +gem_watchdog_SOURCES = i915/gem_watchdog.c
> > +
> >  TESTS_progs += gem_workarounds
> >  gem_workarounds_SOURCES = i915/gem_workarounds.c
> >  
> > diff --git a/tests/i915/gem_watchdog.c b/tests/i915/gem_watchdog.c
> > new file mode 100644
> > index 000000000000..e9cdb886d3d6
> > --- /dev/null
> > +++ b/tests/i915/gem_watchdog.c
> > @@ -0,0 +1,490 @@
> > +/*
> > + * Copyright © 2019 Intel Corporation
> > + *
> > + * Permission is hereby granted, free of charge, to any person
> > obtaining a
> > + * copy of this software and associated documentation files (the
> > "Software"),
> > + * to deal in the Software without restriction, including without
> > limitation
> > + * the rights to use, copy, modify, merge, publish, distribute,
> > sublicense,
> > + * and/or sell copies of the Software, and to permit persons to
> > whom the
> > + * Software is furnished to do so, subject to the following
> > conditions:
> > + *
> > + * The above copyright notice and this permission notice
> > (including the next
> > + * paragraph) shall be included in all copies or substantial
> > portions of the
> > + * Software.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > EXPRESS OR
> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > MERCHANTABILITY,
> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO
> > EVENT SHALL
> > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
> > DAMAGES OR OTHER
> > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > ARISING
> > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > OTHER DEALINGS
> > + * IN THE SOFTWARE.
> > + */
> > +#include "igt.h"
> > +#include "igt_sysfs.h"
> > +#include "sw_sync.h"
> > +
> > +#include <pthread.h>
> > +#include <fcntl.h>
> > +
> > +#include <sys/time.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/poll.h>
> > +#include <sys/signal.h>
> > +#include "i915/gem_ring.h"
> > +
> > +#define LOCAL_I915_EXEC_BSD_SHIFT      (13)
> > +#define LOCAL_I915_EXEC_BSD_RING1      (1 <<
> > LOCAL_I915_EXEC_BSD_SHIFT)
> > +#define LOCAL_I915_EXEC_BSD_RING2      (2 <<
> > LOCAL_I915_EXEC_BSD_SHIFT)
> > +
> > +#define MAX_PRIO LOCAL_I915_CONTEXT_MAX_USER_PRIORITY
> > +#define DEFAULT_PRIO LOCAL_I915_CONTEXT_DEFAULT_PRIORITY
> > +#define MIN_PRIO LOCAL_I915_CONTEXT_MIN_USER_PRIORITY
> > +#define HIGH 1
> > +#define LOW 0
> > +#define WATCHDOG_OVER_THRESHOLD (2070000)
> > +#define WATCHDOG_BELOW_THRESHOLD (1200000)
> > +#define WATCHDOG_THRESHOLD (1200)
> > +#define MAX_ENGINES 5
> > +#define RENDER_CLASS 0
> > +#define VIDEO_DECODE_CLASS 1
> > +#define VIDEO_ENHANCEMENT_CLASS 2
> > +#define COPY_ENGINE_CLASS 3
> > +#define LOCAL_I915_CONTEXT_PARAM_WATCHDOG 0x10
> > +
> > +#define GET_RESET_STATS_IOCTL DRM_IOWR(DRM_COMMAND_BASE + 0x32,
> > struct local_drm_i915_reset_stats)
> > +struct local_drm_i915_reset_stats {
> > +       __u32 ctx_id;
> > +       __u32 flags;
> > +       __u32 reset_count;
> > +       __u32 batch_active;
> > +       __u32 batch_pending;
> > +       __u32 pad;
> > +};
> > +
> > +const uint64_t timeout_100ms = 100000000LL;
> > +float timedifference_msec(struct timeval t0, struct timeval t1);
> > +
> > +struct drm_i915_gem_watchdog_timeout {
> > +       union {
> > +               struct {
> > +                       /*
> > +                        * Engine class & instance to be configured
> > or queried.
> > +                        */
> > +                       __u16 engine_class;
> > +                       __u16 engine_instance;
> > +               };
> > +               /* Index based addressing mode */
> > +               __u32 index;
> > +       };
> > +       /* GPU Engine watchdog resets timeout in us */
> > +       __u32 timeout_us;
> > +};
> > +
> > +static void clear_error_state(int fd)
> > +{
> > +       int dir;
> > +
> > +       dir = igt_sysfs_open(fd);
> > +
> > +       if (dir < 0)
> > +               return;
> > +
> > +       /* Any write to the error state clears it */
> > +       igt_sysfs_set(dir, "error", "");
> > +       close(dir);
> > +}
> > +
> > +static int context_set_watchdog(int fd, unsigned engine_id,
> > +                                const char *engine_name,
> > +                                 unsigned ctx_id, unsigned
> > threshold)
> > +{
> > +       struct drm_i915_gem_watchdog_timeout
> > engines_threshold[MAX_ENGINES];
> > +       struct drm_i915_gem_context_param arg = {
> > +               .param = LOCAL_I915_CONTEXT_PARAM_WATCHDOG,
> > +               .ctx_id = ctx_id,
> > +               .size = sizeof(engines_threshold),
> 
> Won't the kernel complain about the invalid values?
> 
> > +               .value = (uint64_t)&engines_threshold
> > +       };
> > +
> > +       memset(&engines_threshold, 0, sizeof(engines_threshold));
> 
> You set all the unspecified values to be rcs0 with 0 timeout.
> 
> > +
> > +       switch (engine_id & I915_EXEC_RING_MASK) {
> > +       case I915_EXEC_RENDER:
> > +               engines_threshold[0].timeout_us = threshold;
> 
> I presume this is a u64, so might as well use _ns for consistency.
> 
> > +               engines_threshold[0].engine_class = RENDER_CLASS;
> > +               engines_threshold[0].engine_instance = 0;
> 
> Use struct i915_engine_class_instance.
> 
> > +               break;
> > +       case I915_EXEC_BLT:
> > +               if (__gem_context_get_param(fd, &arg) == -ENODEV)
> > +                       return -ENODEV;
> 
> And otherwise setting random values into engine_thresholds.
> 
> > +               else
> > +                       engines_threshold[3].timeout_us =
> > threshold;
> > +                       engines_threshold[3].engine_class =
> > COPY_ENGINE_CLASS;
> > +                       engines_threshold[3].engine_instance = 0;
> > +               break;
> > +       case I915_EXEC_BSD:
> > +               engines_threshold[1].timeout_us = threshold;
> > +               engines_threshold[1].engine_class =
> > VIDEO_DECODE_CLASS;
> > +               engines_threshold[1].engine_instance = 0;
> > +               break;
> > +       case I915_EXEC_VEBOX:
> > +               engines_threshold[2].timeout_us = threshold;
> > +               engines_threshold[2].engine_class =
> > VIDEO_ENHANCEMENT_CLASS;
> > +               engines_threshold[2].engine_instance = 0;
> > +               break;
> > +       default:
> > +               break;
> > +       }
> 
> Ok, so you only set one engine and the rest to zero. That seems like
> a
> bad example in how to use the uAPI.
> 
> > +       gem_context_set_param(fd, &arg);
> > +
> > +       return 0;
> > +}
> > +
> > +static void batch_buffer_factory(uint32_t fd, uint32_t ctx_id,
> > unsigned exec_id,
> > +                                uint32_t target, uint32_t offset,
> > +                                uint32_t *handle, uint64_t
> > timeout,
> > +                                int *fence, int fence_index)
> > +{
> > +    struct drm_i915_gem_exec_object2 obj[2];
> > +    struct drm_i915_gem_relocation_entry reloc;
> > +    struct drm_i915_gem_execbuffer2 execbuf;
> > +    igt_spin_t *spin = NULL;
> > +    const uint32_t bbe = MI_BATCH_BUFFER_END;
> > +    int i = 0;
> > +
> > +    gem_quiescent_gpu(fd);
> > +
> > +    memset(&execbuf, 0, sizeof(execbuf));
> > +    memset(&obj, 0, sizeof(obj));
> > +    memset(&reloc, 0, sizeof(reloc));
> > +
> > +    execbuf.buffers_ptr = to_user_pointer(obj);
> > +
> > +    execbuf.buffer_count = 2;
> > +    execbuf.flags = exec_id | I915_EXEC_FENCE_OUT ;
> > +
> > +    obj[0].handle = target;
> > +    obj[1].handle = gem_create(fd, 4096);
> > +
> > +    obj[1].relocation_count = 1;
> > +    obj[1].relocs_ptr = to_user_pointer(&reloc);
> > +
> > +    reloc.target_handle = obj[0].handle;
> > +    reloc.read_domains = I915_GEM_DOMAIN_COMMAND;
> > +    reloc.write_domain = I915_GEM_DOMAIN_COMMAND;
> > +    reloc.delta = offset * sizeof(uint32_t);
> > +
> > +    reloc.offset = i * sizeof(uint32_t);
> > +    gem_write(fd, obj[1].handle, 0, &bbe, sizeof(bbe));
> > +
> > +    __sync_synchronize();
> 
> What?
> 
> > +
> > +    gem_sync(fd, obj[1].handle);
> > +    execbuf.rsvd1 = ctx_id;
> > +    execbuf.rsvd2 = -1;
> > +
> > +    spin = igt_spin_new(fd, .dependency = obj[0].handle);
> > +    igt_spin_set_timeout(spin, timeout);
> > +    igt_assert(gem_bo_busy(fd, obj[0].handle));
> > +
> > +    gem_execbuf_wr(fd, &execbuf);
> > +    igt_spin_free(fd, spin);
> 
> This still doesn't do what you think it does. Hint, this never runs
> to
> timeout.
> 
> > +    fence[fence_index] = execbuf.rsvd2 >> 32;
> 
> You can just ask the spinner to return a fence.
> spin = igt_spin_new(fd, .engine =engine, .flags =
> IGT_SPIN_FENCE_OUT);
> then fence = spin->out_fence.

Hi Chris,

Can you help me understand the above a little bit more?

The spinner looks like this now:

    spin = igt_spin_new(fd,
			.dependency = obj[0].handle,
			.engine = exec_id,
			.flags = IGT_SPIN_FENCE_OUT);
    igt_assert(spin->out_fence != -1);
    igt_spin_set_timeout(spin, timeout);
    igt_assert(gem_bo_busy(fd, obj[0].handle));

And then I store the fence as:

fence[fence_index] = spin->out_fence;

If I check the fence and its status at this point (when the
batch_buffer work is created), I get fence: 5 and status: 0:

printf("fence: %d status: %d \n",spin-
>out_fence,sync_fence_status(spin->out_fence));

The 0, I believe, means the fence is still active, which seems
correct.

However, after I force a reset of the engines through the watchdog
(some delta of time later), the fence status comes back as -9.

Why am I not getting -EIO (-5) after the reset?
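
Something like the following is what I have in mind (just a sketch:
the -EIO expectation assumes the watchdog reset is meant to surface as
an error on the fence, and the -9 (-EBADF) I am seeing makes me
suspect the fence fd has already been closed by the time I query it,
since igt_spin_free() also closes out_fence as far as I can tell):

    spin = igt_spin_new(fd,
			.ctx = ctx_id,
			.engine = exec_id,
			.flags = IGT_SPIN_FENCE_OUT);
    igt_assert(spin->out_fence != -1);

    /* ...arm the watchdog and give it time to fire... */

    /* query the status before igt_spin_free() closes out_fence */
    igt_assert_eq(sync_fence_status(spin->out_fence), -EIO);
    igt_spin_free(fd, spin);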

Carlos 

> 
> > +    gem_close(fd, obj[1].handle);
> > +    gem_quiescent_gpu(fd);
> > +}
> > +
> > +static void inject_hang(uint32_t fd, unsigned engine_id, uint32_t
> > ctx_id,
> > +                        unsigned flags)
> > +{
> > +       igt_hang_t hang;
> > +       hang = igt_hang_ctx(fd, ctx_id, engine_id, flags);
> > +       gem_sync(fd, hang.spin->handle);
> > +}
> > +
> > +/* Test#1: create some work and let it run on all engines */
> > +static void long_batch_test1(int fd, int prio1, int prio2, int
> > reset_ctx1,
> > +                       int reset_ctx2, unsigned threshold)
> > +{
> > +       unsigned engine_id = 0;
> > +       unsigned nengine = 0;
> > +       struct intel_execution_engine2 *e_;
> > +       uint32_t ctx[8];
> > +       uint32_t scratch[8];
> > +       uint32_t ctx2[8];
> > +       uint32_t scratch2[8];
> 
> A struct might spell it out a little more concisely.
> 
> And 8! That's cutting it a little fine. MAX_ENGINES is 64 and we are
> already dreading the day we exceed that.
> 
> > +       int *fence, i = 0;
> > +       const uint64_t batch_timeout_ms = timeout_100ms * 4;
> > +
> > +       fence = (int *)malloc(sizeof(int)*8);
> 
> Please be kind to readers. The cast is implicit, and spaces are
> required
> around '*'.
> 
> > +       igt_assert(fence);
> > +
> > +       __for_each_physical_engine(fd, e_) {
> > +               engine_id = e_->flags;
> > +
> > +               scratch2[nengine] = gem_create(fd, 4096);
> > +               ctx2[nengine] = gem_context_create(fd);
> > +               gem_context_set_priority(fd, ctx2[nengine], prio2);
> > +               batch_buffer_factory(fd, ctx2[nengine], engine_id,
> > +                                scratch2[nengine], 0, NULL,
> > +                                batch_timeout_ms, fence, nengine);
> > +
> > +               scratch[nengine] = gem_create(fd, 4096);
> > +               ctx[nengine] = gem_context_create(fd);
> > +               gem_context_set_priority(fd, ctx[nengine], prio1);
> > +               batch_buffer_factory(fd, ctx[nengine], engine_id,
> > +                                scratch[nengine], 0, NULL,
> > +                                batch_timeout_ms, fence, nengine);
> > +               igt_info("Test #1: ctx1/ctx2 --> execute on
> > %s\n",e_->name);
> > +               nengine++;
> 
> So that doesn't test anything at all.
> 
> > +       }
> > +
> > +       for (i = 0; i < nengine; i++) {
> > +               close(fence[i]);
> > +               gem_context_destroy(fd, ctx[i]);
> > +               gem_context_destroy(fd, ctx2[i]);
> > +               gem_close(fd, scratch[i]);
> > +               gem_close(fd, scratch2[i]);
> > +       }
> > +}
> > +
> > +float timedifference_msec(struct timeval t0, struct timeval t1)
> > +{
> > +    return (t1.tv_sec - t0.tv_sec) * 1000.0f + (t1.tv_usec -
> > t0.tv_usec) / 1000.0f;
> > +}
> > +
> > +/* Test#2: submit a long batch and after half of the expected
> > runtime
> > +   submit a higher priority batch and then try to cancel the
> > execution*/
> > +static void long_batch_test2(int fd, int prio1, int prio2, int
> > reset_ctx1,
> > +                       int reset_ctx2, unsigned threshold)
> > +{
> > +       unsigned engine_id = 0;
> > +       unsigned nengine = 0;
> > +       struct intel_execution_engine2 *e_;
> > +       uint32_t ctx[8];
> > +       uint32_t scratch[8];
> > +       uint32_t ctx2[8];
> > +       uint32_t scratch2[8];
> > +       int *fence, i = 0;
> > +       unsigned flags = HANG_ALLOW_CAPTURE;
> > +       const uint64_t batch_timeout_ms = timeout_100ms * 4;
> > +       struct timeval t0;
> > +       struct timeval t1;
> > +       float elapsed;
> > +
> > +       fence = (int *)malloc(sizeof(int)*8);
> > +       igt_assert(fence);
> > +
> > +       nengine = 0;
> > +
> > +       __for_each_physical_engine(fd, e_) {
> > +               engine_id = e_->flags;
> > +
> > +               scratch2[nengine] = gem_create(fd, 4096);
> > +               ctx2[nengine] = gem_context_create(fd);
> > +               gem_context_set_priority(fd, ctx2[nengine], prio2);
> > +               batch_buffer_factory(fd, ctx2[nengine], engine_id,
> > +                                scratch2[nengine], 0, NULL,
> > +                                batch_timeout_ms, fence, nengine);
> > +
> > +               gettimeofday(&t0, 0);
> > +               while(elapsed < 200) {
> > +                       gettimeofday(&t1, 0);
> > +                       elapsed += timedifference_msec(t0, t1);
> > +               }
> 
> See usleep() or nanosleep()
> 
> > +               scratch[nengine] = gem_create(fd, 4096);
> > +               ctx[nengine] = gem_context_create(fd);
> > +               gem_context_set_priority(fd, ctx[nengine], prio1);
> > +               batch_buffer_factory(fd, ctx[nengine], engine_id,
> > +                                scratch[nengine], 0, NULL,
> > +                                batch_timeout_ms, fence, nengine);
> > +
> > +               if (context_set_watchdog(fd, engine_id, e_->name,
> > +                                        ctx[nengine],
> > +                                        threshold) == -ENODEV) {
> > +                       igt_info("No support for gpu h/w watchdog
> > on %s\n",
> > +                        e_->name);
> > +                       goto skip_case;
> > +               }
> > +               clear_error_state(fd);
> > +               inject_hang(fd, engine_id, ctx[nengine], flags);
> > +               /* Now check the engine was reset successfully */
> > +               //igt_assert_eq(sync_fence_status(*fence), -EIO);
> 
> Yeah, that fence has nothing to do with anything.
> 
> > +               igt_info("Test #2 ctx1/ctx2 --> set watchdog and"
> > +                               " cancel ctx2 at half expected run
> > with higher"
> > +                               " priority engine: %s, fence
> > status: 0x%x \n",
> > +                                       e_->name,
> > sync_fence_status(*fence));
> 
> You've set a watchdog, but don't validate that it had any effect.
> 
> May I suggest, you using a timing batch. Write the initial timestamp
> to
> scratch, and write a second timestamp and recurse back to writing the
> second timestamp. After that batch is forcibly reset, you can compare
> the two timestamps to see how long it executed for (without having to
> dig into the context image to extract the ppHWSP runtimes). The
> contract
> given by the watchdog uAPI would seem to be that the batch is hung
> after
> that runtime exceeds the timeout threshold (though how strictly we
> fail
> for exceeding the timeout???) and most definitely not be reset before
> we
> reach the threshold (give or take a few microseconds required for the
> recursion and context switching/execution that will be charged to the
> watchdog and not recorded by the batch).
> -Chris


* Re: [igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog
  2019-05-29  9:03 ` [igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog Chris Wilson
  2019-06-11  1:55   ` Carlos Santa
@ 2019-06-18  1:35   ` Carlos Santa
  1 sibling, 0 replies; 6+ messages in thread
From: Carlos Santa @ 2019-06-18  1:35 UTC (permalink / raw)
  To: Chris Wilson, igt-dev; +Cc: Ursulin Tvrtko

On Wed, 2019-05-29 at 10:03 +0100, Chris Wilson wrote:
> Quoting Carlos Santa (2019-05-28 22:35:59)
> > This test adds basic set of tests to reset the different
> > GPU engines through the gpu watchdog timer.
> > 
> > Credits to Antonio for the original codebase this is based on.
> > 
> > v2: remove gem_context_get_param() during set (Antonio)
> >     remove clearing of the engines_threshold[] in the default case
> >     inside context_set_watchdog(). (Antonio)
> >     fix indexing when creating low/high priority contexts
> >     get rid of 2 threads idea (Antonio)
> >     fix context prio bug due to wrong indexing (Antonio)
> > v3: no need to force clear drm_i915_gem_watchdog_timeout struct
> >     when setting the watchdog timeouts (Antonio)
> >     -use gem_context_set_priority() instead to use error checking
> >     -always use a looper to test all engines (Antonio)
> >     -modify test gpu_watchodg_hang_long_batch_single_engine()
> >     to use a couple of batches and set the timeout in between them
> > (Antonio)
> >     -make use of for_each_engine_class_instance() (Antonio)
> > v4: -tests should be able to handle s/w watchdog timeout (Tvrtko)
> >     -remove dead code inside batch_buffer factory (Tvrtko)
> >     -set_watchdog() should return ENODEV for fail cases (Tvrtko)
> >     -enclose name, priority and reset params inside array (Tvrtko)
> > v5: -Rebased. Tests 5,6 and 8 from the test plan are now added
> > 
> > Test Plan:
> > 
> > Assumptions:
> > 1. Use fence status to figure out which ctx was reset
> > 2. Use spin batches w/ user space timers to control duration
> > and corking to control ordering
> > 3. Use context priorities to force preemption
> > 
> > Tests:
> > 1. ctx1, long_batch -> execute
> > 2. ctx1, set_watchdog -> reset
> > 3. ctx2/ctx1 -> execute, reset
> > 4. ctx1/ctx2 -> reset, execute
> > 5. ctx1_just_below_threshold -> execute
> > 6. ctx_over_the_threshold -> reset
> > 7. set watchdog on some engines -> execute
> > 8. submit long_batch and after half of the
> > expected runtime submit higher prio batch
> > 9. submit low prio batch w/o watchdog then
> > higher prio with watchdog
> > 
> > Unsresolved items:
> > 1. The check for sync_fence_status(*fence)
> > returns -1 instead of EIO.
> > 
> > Cc: Ursulin Tvrtko <tvrtko.ursulin@intel.com>
> > Cc: Antonio Argenziano <antonio.argenziano@intel.com>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Signed-off-by: Carlos Santa <carlos.santa@intel.com>
> > ---
> >  tests/Makefile.sources    |   3 +
> >  tests/i915/gem_watchdog.c | 490
> > ++++++++++++++++++++++++++++++++++++++
> >  tests/meson.build         |   1 +
> >  3 files changed, 494 insertions(+)
> >  create mode 100644 tests/i915/gem_watchdog.c
> > 
> > diff --git a/tests/Makefile.sources b/tests/Makefile.sources
> > index 2d5c929e32fc..65b868e7d927 100644
> > --- a/tests/Makefile.sources
> > +++ b/tests/Makefile.sources
> > @@ -446,6 +446,9 @@ gem_userptr_blits_SOURCES =
> > i915/gem_userptr_blits.c
> >  TESTS_progs += gem_wait
> >  gem_wait_SOURCES = i915/gem_wait.c
> >  
> > +TESTS_progs += gem_watchdog
> > +gem_watchdog_SOURCES = i915/gem_watchdog.c
> > +
> >  TESTS_progs += gem_workarounds
> >  gem_workarounds_SOURCES = i915/gem_workarounds.c
> >  
> > diff --git a/tests/i915/gem_watchdog.c b/tests/i915/gem_watchdog.c
> > new file mode 100644
> > index 000000000000..e9cdb886d3d6
> > --- /dev/null
> > +++ b/tests/i915/gem_watchdog.c
> > @@ -0,0 +1,490 @@
> > +/*
> > + * Copyright © 2019 Intel Corporation
> > + *
> > + * Permission is hereby granted, free of charge, to any person
> > obtaining a
> > + * copy of this software and associated documentation files (the
> > "Software"),
> > + * to deal in the Software without restriction, including without
> > limitation
> > + * the rights to use, copy, modify, merge, publish, distribute,
> > sublicense,
> > + * and/or sell copies of the Software, and to permit persons to
> > whom the
> > + * Software is furnished to do so, subject to the following
> > conditions:
> > + *
> > + * The above copyright notice and this permission notice
> > (including the next
> > + * paragraph) shall be included in all copies or substantial
> > portions of the
> > + * Software.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > EXPRESS OR
> > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > MERCHANTABILITY,
> > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO
> > EVENT SHALL
> > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
> > DAMAGES OR OTHER
> > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
> > ARISING
> > + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> > OTHER DEALINGS
> > + * IN THE SOFTWARE.
> > + */
> > +#include "igt.h"
> > +#include "igt_sysfs.h"
> > +#include "sw_sync.h"
> > +
> > +#include <pthread.h>
> > +#include <fcntl.h>
> > +
> > +#include <sys/time.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/poll.h>
> > +#include <sys/signal.h>
> > +#include "i915/gem_ring.h"
> > +
> > +#define LOCAL_I915_EXEC_BSD_SHIFT      (13)
> > +#define LOCAL_I915_EXEC_BSD_RING1      (1 <<
> > LOCAL_I915_EXEC_BSD_SHIFT)
> > +#define LOCAL_I915_EXEC_BSD_RING2      (2 <<
> > LOCAL_I915_EXEC_BSD_SHIFT)
> > +
> > +#define MAX_PRIO LOCAL_I915_CONTEXT_MAX_USER_PRIORITY
> > +#define DEFAULT_PRIO LOCAL_I915_CONTEXT_DEFAULT_PRIORITY
> > +#define MIN_PRIO LOCAL_I915_CONTEXT_MIN_USER_PRIORITY
> > +#define HIGH 1
> > +#define LOW 0
> > +#define WATCHDOG_OVER_THRESHOLD (2070000)
> > +#define WATCHDOG_BELOW_THRESHOLD (1200000)
> > +#define WATCHDOG_THRESHOLD (1200)
> > +#define MAX_ENGINES 5
> > +#define RENDER_CLASS 0
> > +#define VIDEO_DECODE_CLASS 1
> > +#define VIDEO_ENHANCEMENT_CLASS 2
> > +#define COPY_ENGINE_CLASS 3
> > +#define LOCAL_I915_CONTEXT_PARAM_WATCHDOG 0x10
> > +
> > +#define GET_RESET_STATS_IOCTL DRM_IOWR(DRM_COMMAND_BASE + 0x32,
> > struct local_drm_i915_reset_stats)
> > +struct local_drm_i915_reset_stats {
> > +       __u32 ctx_id;
> > +       __u32 flags;
> > +       __u32 reset_count;
> > +       __u32 batch_active;
> > +       __u32 batch_pending;
> > +       __u32 pad;
> > +};
> > +
> > +const uint64_t timeout_100ms = 100000000LL;
> > +float timedifference_msec(struct timeval t0, struct timeval t1);
> > +
> > +struct drm_i915_gem_watchdog_timeout {
> > +       union {
> > +               struct {
> > +                       /*
> > +                        * Engine class & instance to be configured
> > or queried.
> > +                        */
> > +                       __u16 engine_class;
> > +                       __u16 engine_instance;
> > +               };
> > +               /* Index based addressing mode */
> > +               __u32 index;
> > +       };
> > +       /* GPU Engine watchdog resets timeout in us */
> > +       __u32 timeout_us;
> > +};
> > +
> > +static void clear_error_state(int fd)
> > +{
> > +       int dir;
> > +
> > +       dir = igt_sysfs_open(fd);
> > +
> > +       if (dir < 0)
> > +               return;
> > +
> > +       /* Any write to the error state clears it */
> > +       igt_sysfs_set(dir, "error", "");
> > +       close(dir);
> > +}
> > +
> > +static int context_set_watchdog(int fd, unsigned engine_id,
> > +                                const char *engine_name,
> > +                                 unsigned ctx_id, unsigned
> > threshold)
> > +{
> > +       struct drm_i915_gem_watchdog_timeout
> > engines_threshold[MAX_ENGINES];
> > +       struct drm_i915_gem_context_param arg = {
> > +               .param = LOCAL_I915_CONTEXT_PARAM_WATCHDOG,
> > +               .ctx_id = ctx_id,
> > +               .size = sizeof(engines_threshold),
> 
> Won't the kernel complain about the invalid values?

No, but I'll check again since I had some extra debugging added...
> 
> > +               .value = (uint64_t)&engines_threshold
> > +       };
> > +
> > +       memset(&engines_threshold, 0, sizeof(engines_threshold));
> 
> You set all the unspecified values to be rcs0 with 0 timeout.

I think I forced everything to 0 to avoid garbage data...

> 
> > +
> > +       switch (engine_id & I915_EXEC_RING_MASK) {
> > +       case I915_EXEC_RENDER:
> > +               engines_threshold[0].timeout_us = threshold;
> 
> I presume this is a u64, so might as well use _ns for consistency.

The uAPI field is a u32, but I'll change it to a u64.
> 
> > +               engines_threshold[0].engine_class = RENDER_CLASS;
> > +               engines_threshold[0].engine_instance = 0;
> 
> Use struct i915_engine_class_instance.

Ok.

> 
> > +               break;
> > +       case I915_EXEC_BLT:
> > +               if (__gem_context_get_param(fd, &arg) == -ENODEV)
> > +                       return -ENODEV;
> 
> And otherwise setting random values into engine_thresholds.

Hmm, Ursulin still wanted to add support for a CPU watchdog, so this
is how we check whether there's watchdog support on BLT; it returns
-ENODEV in that case...
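
To make that explicit, the probe could be done once up front rather
than only in the BLT path; a sketch of the intent (same context param
as in the patch):

	struct drm_i915_gem_context_param probe = {
		.ctx_id = ctx_id,
		.param = LOCAL_I915_CONTEXT_PARAM_WATCHDOG,
	};

	/* the kernel reports -ENODEV when no watchdog support exists */
	if (__gem_context_get_param(fd, &probe) == -ENODEV)
		return -ENODEV;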

> 
> > +               else
> > +                       engines_threshold[3].timeout_us =
> > threshold;
> > +                       engines_threshold[3].engine_class =
> > COPY_ENGINE_CLASS;
> > +                       engines_threshold[3].engine_instance = 0;
> > +               break;
> > +       case I915_EXEC_BSD:
> > +               engines_threshold[1].timeout_us = threshold;
> > +               engines_threshold[1].engine_class =
> > VIDEO_DECODE_CLASS;
> > +               engines_threshold[1].engine_instance = 0;
> > +               break;
> > +       case I915_EXEC_VEBOX:
> > +               engines_threshold[2].timeout_us = threshold;
> > +               engines_threshold[2].engine_class =
> > VIDEO_ENHANCEMENT_CLASS;
> > +               engines_threshold[2].engine_instance = 0;
> > +               break;
> > +       default:
> > +               break;
> > +       }
> 
> Ok, so you only set one engine and the rest to zero. That seems like
> a
> bad example in how to use the uAPI.

The alternative would be to set them all at once? But then what about
a media decode client that only cares about the media engines;
wouldn't that case still end up setting the threshold for just one
specific engine? Also, I am looping through all engines, so eventually
they all get populated.
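
For illustration, setting thresholds for all classes in one go could
look roughly like this (a sketch reusing the struct and param proposed
in the patch; whether the kernel accepts a full array like this is an
assumption on my part):

	struct drm_i915_gem_watchdog_timeout thresholds[] = {
		/* engine_instance defaults to 0 for each entry */
		{ .engine_class = RENDER_CLASS, .timeout_us = threshold },
		{ .engine_class = VIDEO_DECODE_CLASS, .timeout_us = threshold },
		{ .engine_class = VIDEO_ENHANCEMENT_CLASS, .timeout_us = threshold },
		{ .engine_class = COPY_ENGINE_CLASS, .timeout_us = threshold },
	};
	struct drm_i915_gem_context_param arg = {
		.ctx_id = ctx_id,
		.param = LOCAL_I915_CONTEXT_PARAM_WATCHDOG,
		.size = sizeof(thresholds),
		.value = to_user_pointer(thresholds),
	};

	gem_context_set_param(fd, &arg);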


> 
> > +       gem_context_set_param(fd, &arg);
> > +
> > +       return 0;
> > +}
> > +
> > +static void batch_buffer_factory(uint32_t fd, uint32_t ctx_id,
> > unsigned exec_id,
> > +                                uint32_t target, uint32_t offset,
> > +                                uint32_t *handle, uint64_t
> > timeout,
> > +                                int *fence, int fence_index)
> > +{
> > +    struct drm_i915_gem_exec_object2 obj[2];
> > +    struct drm_i915_gem_relocation_entry reloc;
> > +    struct drm_i915_gem_execbuffer2 execbuf;
> > +    igt_spin_t *spin = NULL;
> > +    const uint32_t bbe = MI_BATCH_BUFFER_END;
> > +    int i = 0;
> > +
> > +    gem_quiescent_gpu(fd);
> > +
> > +    memset(&execbuf, 0, sizeof(execbuf));
> > +    memset(&obj, 0, sizeof(obj));
> > +    memset(&reloc, 0, sizeof(reloc));
> > +
> > +    execbuf.buffers_ptr = to_user_pointer(obj);
> > +
> > +    execbuf.buffer_count = 2;
> > +    execbuf.flags = exec_id | I915_EXEC_FENCE_OUT ;
> > +
> > +    obj[0].handle = target;
> > +    obj[1].handle = gem_create(fd, 4096);
> > +
> > +    obj[1].relocation_count = 1;
> > +    obj[1].relocs_ptr = to_user_pointer(&reloc);
> > +
> > +    reloc.target_handle = obj[0].handle;
> > +    reloc.read_domains = I915_GEM_DOMAIN_COMMAND;
> > +    reloc.write_domain = I915_GEM_DOMAIN_COMMAND;
> > +    reloc.delta = offset * sizeof(uint32_t);
> > +
> > +    reloc.offset = i * sizeof(uint32_t);
> > +    gem_write(fd, obj[1].handle, 0, &bbe, sizeof(bbe));
> > +
> > +    __sync_synchronize();
> 
> What?

Not sure; I'll remove it. It's left over from the older reference code
this test is based on.
> 
> > +
> > +    gem_sync(fd, obj[1].handle);
> > +    execbuf.rsvd1 = ctx_id;
> > +    execbuf.rsvd2 = -1;
> > +
> > +    spin = igt_spin_new(fd, .dependency = obj[0].handle);
> > +    igt_spin_set_timeout(spin, timeout);
> > +    igt_assert(gem_bo_busy(fd, obj[0].handle));
> > +
> > +    gem_execbuf_wr(fd, &execbuf);
> > +    igt_spin_free(fd, spin);
> 
> This still doesn't do what you think it does. Hint, this never runs
> to
> timeout.
> 
> > +    fence[fence_index] = execbuf.rsvd2 >> 32;
> 
> You can just ask the spinner to return a fence.
> spin = igt_spin_new(fd, .engine =engine, .flags =
> IGT_SPIN_FENCE_OUT);
> then fence = spin->out_fence.
> 

Ok, I did the above, but please review the new implementation when I
send it (hopefully tomorrow). I did just that, yet I must still be
missing something, as I now get a -9 when I check the status of the
fence.

> > +    gem_close(fd, obj[1].handle);
> > +    gem_quiescent_gpu(fd);
> > +}
> > +
> > +static void inject_hang(uint32_t fd, unsigned engine_id, uint32_t
> > ctx_id,
> > +                        unsigned flags)
> > +{
> > +       igt_hang_t hang;
> > +       hang = igt_hang_ctx(fd, ctx_id, engine_id, flags);
> > +       gem_sync(fd, hang.spin->handle);
> > +}
> > +
> > +/* Test#1: create some work and let it run on all engines */
> > +static void long_batch_test1(int fd, int prio1, int prio2, int
> > reset_ctx1,
> > +                       int reset_ctx2, unsigned threshold)
> > +{
> > +       unsigned engine_id = 0;
> > +       unsigned nengine = 0;
> > +       struct intel_execution_engine2 *e_;
> > +       uint32_t ctx[8];
> > +       uint32_t scratch[8];
> > +       uint32_t ctx2[8];
> > +       uint32_t scratch2[8];
> 
> A struct might spell it out a little more concisely.

Agreed.

> 
> And 8! That's cutting it a little fine. MAX_ENGINES is 64 and we are
> already dreading the day we exceed that.

Ok. 
> 
> > +       int *fence, i = 0;
> > +       const uint64_t batch_timeout_ms = timeout_100ms * 4;
> > +
> > +       fence = (int *)malloc(sizeof(int)*8);
> 
> Please be kind to readers. The cast is implicit, and spaces are
> required
> around '*'.

Sorry about that.

> 
> > +       igt_assert(fence);
> > +
> > +       __for_each_physical_engine(fd, e_) {
> > +               engine_id = e_->flags;
> > +
> > +               scratch2[nengine] = gem_create(fd, 4096);
> > +               ctx2[nengine] = gem_context_create(fd);
> > +               gem_context_set_priority(fd, ctx2[nengine], prio2);
> > +               batch_buffer_factory(fd, ctx2[nengine], engine_id,
> > +                                scratch2[nengine], 0, NULL,
> > +                                batch_timeout_ms, fence, nengine);
> > +
> > +               scratch[nengine] = gem_create(fd, 4096);
> > +               ctx[nengine] = gem_context_create(fd);
> > +               gem_context_set_priority(fd, ctx[nengine], prio1);
> > +               batch_buffer_factory(fd, ctx[nengine], engine_id,
> > +                                scratch[nengine], 0, NULL,
> > +                                batch_timeout_ms, fence, nengine);
> > +               igt_info("Test #1: ctx1/ctx2 --> execute on
> > %s\n",e_->name);
> > +               nengine++;
> 
> So that doesn't test anything at all.

Hmm, Ursulin provided a test plan some time back (please see the
commit message above) and this is part of it. We were trying to cover
different cases, and this is one of them: it simply creates some work
and lets it run on all the different engines...
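
To give the subtest something to assert, one option (a sketch,
assuming the out-fences are kept open until here) is to wait on each
fence and check it signalled cleanly, since no watchdog is armed in
this case:

	for (i = 0; i < nengine; i++) {
		/* no watchdog set, so every batch must complete normally */
		igt_assert_eq(sync_fence_wait(fence[i], -1), 0);
		igt_assert_eq(sync_fence_status(fence[i]), 1);
	}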

> 
> > +       }
> > +
> > +       for (i = 0; i < nengine; i++) {
> > +               close(fence[i]);
> > +               gem_context_destroy(fd, ctx[i]);
> > +               gem_context_destroy(fd, ctx2[i]);
> > +               gem_close(fd, scratch[i]);
> > +               gem_close(fd, scratch2[i]);
> > +       }
> > +}
> > +
> > +float timedifference_msec(struct timeval t0, struct timeval t1)
> > +{
> > +    return (t1.tv_sec - t0.tv_sec) * 1000.0f + (t1.tv_usec -
> > t0.tv_usec) / 1000.0f;
> > +}
> > +
> > +/* Test#2: submit a long batch and after half of the expected
> > runtime
> > +   submit a higher priority batch and then try to cancel the
> > execution*/
> > +static void long_batch_test2(int fd, int prio1, int prio2, int
> > reset_ctx1,
> > +                       int reset_ctx2, unsigned threshold)
> > +{
> > +       unsigned engine_id = 0;
> > +       unsigned nengine = 0;
> > +       struct intel_execution_engine2 *e_;
> > +       uint32_t ctx[8];
> > +       uint32_t scratch[8];
> > +       uint32_t ctx2[8];
> > +       uint32_t scratch2[8];
> > +       int *fence, i = 0;
> > +       unsigned flags = HANG_ALLOW_CAPTURE;
> > +       const uint64_t batch_timeout_ms = timeout_100ms * 4;
> > +       struct timeval t0;
> > +       struct timeval t1;
> > +       float elapsed;
> > +
> > +       fence = (int *)malloc(sizeof(int)*8);
> > +       igt_assert(fence);
> > +
> > +       nengine = 0;
> > +
> > +       __for_each_physical_engine(fd, e_) {
> > +               engine_id = e_->flags;
> > +
> > +               scratch2[nengine] = gem_create(fd, 4096);
> > +               ctx2[nengine] = gem_context_create(fd);
> > +               gem_context_set_priority(fd, ctx2[nengine], prio2);
> > +               batch_buffer_factory(fd, ctx2[nengine], engine_id,
> > +                                scratch2[nengine], 0, NULL,
> > +                                batch_timeout_ms, fence, nengine);
> > +
> > +               gettimeofday(&t0, 0);
> > +               while(elapsed < 200) {
> > +                       gettimeofday(&t1, 0);
> > +                       elapsed += timedifference_msec(t0, t1);
> > +               }
> 
> See usleep() or nanosleep()

Ok. 
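
i.e. something as simple as this, assuming ~200ms (half the expected
batch runtime) is still the intended delay:

	/* sleep for ~200ms instead of busy-waiting on gettimeofday() */
	usleep(200 * 1000);
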
> 
> > +               scratch[nengine] = gem_create(fd, 4096);
> > +               ctx[nengine] = gem_context_create(fd);
> > +               gem_context_set_priority(fd, ctx[nengine], prio1);
> > +               batch_buffer_factory(fd, ctx[nengine], engine_id,
> > +                                scratch[nengine], 0, NULL,
> > +                                batch_timeout_ms, fence, nengine);
> > +
> > +               if (context_set_watchdog(fd, engine_id, e_->name,
> > +                                        ctx[nengine],
> > +                                        threshold) == -ENODEV) {
> > +                       igt_info("No support for gpu h/w watchdog
> > on %s\n",
> > +                        e_->name);
> > +                       goto skip_case;
> > +               }
> > +               clear_error_state(fd);
> > +               inject_hang(fd, engine_id, ctx[nengine], flags);
> > +               /* Now check the engine was reset successfully */
> > +               //igt_assert_eq(sync_fence_status(*fence), -EIO);
> 
> Yeah, that fence has nothing to do with anything.
> 
> > +               igt_info("Test #2 ctx1/ctx2 --> set watchdog and"
> > +                               " cancel ctx2 at half expected run
> > with higher"
> > +                               " priority engine: %s, fence
> > status: 0x%x \n",
> > +                                       e_->name,
> > sync_fence_status(*fence));
> 
> You've set a watchdog, but don't validate that it had any effect.
> 
> May I suggest, you using a timing batch. Write the initial timestamp
> to
> scratch, and write a second timestamp and recurse back to writing the
> second timestamp. After that batch is forcibly reset, you can compare
> the two timestamps to see how long it executed for (without having to
> dig into the context image to extract the ppHWSP runtimes). The
> contract
> given by the watchdog uAPI would seem to be that the batch is hung
> after
> that runtime exceeds the timeout threshold (though how strictly we
> fail
> for exceeding the timeout???) and most definitely not be reset before
> we
> reach the threshold (give or take a few microseconds required for the
> recursion and context switching/execution that will be charged to the
> watchdog and not recorded by the batch).
> -Chris

I synced with Antonio, and we think we have what you're describing
above; however, the calculated delta seems to be way too big for the
work being submitted. Please review when I send the new version.
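
The validation I have in mind is roughly the following (a sketch with
placeholder names: TS_FREQ_MHZ and the two-dword scratch layout stand
in for whatever the timing batch actually writes; this is not the
actual implementation):

	uint32_t ts[2];
	uint64_t runtime_us;

	/* the timing batch wrote a start timestamp to dword 0 and kept
	 * rewriting the end timestamp in dword 1 until it was reset */
	gem_read(fd, scratch[nengine], 0, ts, sizeof(ts));

	/* TS_FREQ_MHZ: command streamer timestamp frequency in MHz */
	runtime_us = (uint64_t)(ts[1] - ts[0]) / TS_FREQ_MHZ;

	/* must not be reset before the threshold is reached... */
	igt_assert(runtime_us >= threshold);
	/* ...and should be cut off soon after exceeding it */
	igt_assert(runtime_us < threshold + threshold / 2);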

Regards,
Carlos



Thread overview: 6+ messages
2019-05-28 21:35 [igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog Carlos Santa
2019-05-28 22:03 ` [igt-dev] ✓ Fi.CI.BAT: success for tests/gem_watchdog: Initial set of tests for GPU watchdog (rev7) Patchwork
2019-05-29  8:23 ` [igt-dev] ✓ Fi.CI.IGT: " Patchwork
2019-05-29  9:03 ` [igt-dev] [RFC v5] tests/gem_watchdog: Initial set of tests for GPU watchdog Chris Wilson
2019-06-11  1:55   ` Carlos Santa
2019-06-18  1:35   ` Carlos Santa
