* [PATCH i-g-t 0/2] Workload simulation and tracing
@ 2017-03-31 14:58 Tvrtko Ursulin
  2017-03-31 14:58 ` [PATCH i-g-t 1/2] benchmarks/gem_wsim: Command submission workload simulator Tvrtko Ursulin
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-03-31 14:58 UTC (permalink / raw)
  To: Intel-gfx

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Another WIP posting, with some interoperability improvements this time.

Example usage:

root@scnuc:~/intel-gpu-tools# benchmarks/gem_wsim
Calibrating nop delay with 1% tolerance...
Nop calibration for 1000us delay is 438660.
root@scnuc:~/intel-gpu-tools# scripts/trace.pl --trace benchmarks/gem_wsim -n 438660 -r 600 -w benchmarks/wsim/workload1 -c 2 -x
Using 438660 nop calibration for 1000us delay.
2 clients.
Swapping VCS rings between clients.
0: 10.222053s elapsed (58.696622 workloads/s)
1: 10.225807s elapsed (58.675078 workloads/s)
10.226307s elapsed (117.344411 workloads/s)
[ perf record: Woken up 16 times to write data ]
[ perf record: Captured and wrote 3.929 MB perf.data (44688 samples) ]
root@scnuc:~/intel-gpu-tools# perf script | scripts/trace.pl -i 1 -i 4
Ring0: 4832 batches, 4329.07 (4348.11) avg batch us, 0.15% idle, 138.73% busy, 23.88% runnable, 79.66% queued, 0.01% wait)
Ring2: 1808 batches, 1715.68 (1719.99) avg batch us, 69.68% idle, 30.38% busy, 0.12% runnable, 112.04% queued, 99.32% wait)
Ring3: 1810 batches, 1727.59 (1739.31) avg batch us, 69.28% idle, 30.75% busy, 0.11% runnable, 105.37% queued, 99.46% wait)

The most interesting metric here is the engine idle time, which will come into
play as I start adding the load balancing options to gem_wsim.

Secondary mode here would be:

root@scnuc:~/intel-gpu-tools# perf script | scripts/trace.pl -i 1 -i 4 --html >graph.html

This would enable a timeline of GPU request execution to be viewed, with a
little bit of local setup as described in the trace.pl help text.
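
For reference, per the trace.pl help text, the local setup amounts to
installing the 'vis' JavaScript module next to the generated HTML, for
example:

	apt-get install npm
	npm install vis

and then opening graph.html from that directory in a browser.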

Tvrtko Ursulin (2):
  benchmarks/gem_wsim: Command submission workload simulator
  igt/scripts: trace.pl to parse the i915 tracepoints

 benchmarks/Makefile.sources |   1 +
 benchmarks/gem_wsim.c       | 593 +++++++++++++++++++++++++++
 benchmarks/wsim/workload1   |   7 +
 scripts/Makefile.am         |   2 +-
 scripts/trace.pl            | 946 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 1548 insertions(+), 1 deletion(-)
 create mode 100644 benchmarks/gem_wsim.c
 create mode 100644 benchmarks/wsim/workload1
 create mode 100755 scripts/trace.pl

-- 
2.9.3

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH i-g-t 1/2] benchmarks/gem_wsim: Command submission workload simulator
  2017-03-31 14:58 [PATCH i-g-t 0/2] Workload simulation and tracing Tvrtko Ursulin
@ 2017-03-31 14:58 ` Tvrtko Ursulin
  2017-03-31 15:19   ` Chris Wilson
                     ` (2 more replies)
  2017-03-31 14:58 ` [PATCH i-g-t 2/2] igt/scripts: trace.pl to parse the i915 tracepoints Tvrtko Ursulin
  2017-04-24 14:42 ` [PATCH i-g-t v4] " Tvrtko Ursulin
  2 siblings, 3 replies; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-03-31 14:58 UTC (permalink / raw)
  To: Intel-gfx

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Tool which emits batch buffers to engines with configurable
sequences, durations, contexts, dependencies and userspace waits.

Unfinished but shows promise so sending out for early feedback.

v2:
 * Load workload descriptors from files. (also -w)
 * Help text.
 * Calibration control if needed. (-t)
 * NORELOC | LUT to eb flags.
 * Added sample workload to wsim/workload1.

TODO list:

 * Better error handling.
 * Multi-context support for individual clients.
 * Random/variable batch length.
 * Load balancing plug-in.
 * ... ?

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>

---
 benchmarks/Makefile.sources |   1 +
 benchmarks/gem_wsim.c       | 593 ++++++++++++++++++++++++++++++++++++++++++++
 benchmarks/wsim/workload1   |   7 +
 3 files changed, 601 insertions(+)
 create mode 100644 benchmarks/gem_wsim.c
 create mode 100644 benchmarks/wsim/workload1

diff --git a/benchmarks/Makefile.sources b/benchmarks/Makefile.sources
index 3af54ebe36f2..3a941150abb3 100644
--- a/benchmarks/Makefile.sources
+++ b/benchmarks/Makefile.sources
@@ -14,6 +14,7 @@ benchmarks_prog_list =			\
 	gem_prw				\
 	gem_set_domain			\
 	gem_syslatency			\
+	gem_wsim			\
 	kms_vblank			\
 	prime_lookup			\
 	vgem_mmap			\
diff --git a/benchmarks/gem_wsim.c b/benchmarks/gem_wsim.c
new file mode 100644
index 000000000000..029967281251
--- /dev/null
+++ b/benchmarks/gem_wsim.c
@@ -0,0 +1,593 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <errno.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <time.h>
+#include <assert.h>
+
+#include "drm.h"
+#include "ioctl_wrappers.h"
+#include "drmtest.h"
+#include "intel_io.h"
+
+struct w_step
+{
+	/* Workload step metadata */
+	unsigned int context;
+	unsigned int engine;
+	unsigned int duration;
+	int dependency;
+	int wait;
+
+	/* Implementation details */
+	struct drm_i915_gem_execbuffer2 eb;
+	struct drm_i915_gem_exec_object2 obj[3];
+};
+
+struct workload
+{
+	unsigned int nr_steps;
+	struct w_step *steps;
+
+	uint32_t ctx_id;
+};
+
+enum intel_engine_id {
+	RCS,
+	BCS,
+	balance_VCS,
+	VCS,
+	VCS1,
+	VCS2,
+	VECS,
+	NUM_ENGINES
+};
+
+static const unsigned int eb_engine_map[NUM_ENGINES] = {
+	[RCS] = I915_EXEC_RENDER,
+	[BCS] = I915_EXEC_BLT,
+	[balance_VCS] = I915_EXEC_BSD,
+	[VCS] = I915_EXEC_BSD,
+	[VCS1] = I915_EXEC_BSD | I915_EXEC_BSD_RING1,
+	[VCS2] = I915_EXEC_BSD | I915_EXEC_BSD_RING2,
+	[VECS] = I915_EXEC_VEBOX };
+
+static const uint32_t bbe = 0xa << 23;
+static const unsigned int nop_calibration_us = 1000;
+static unsigned long nop_calibration;
+
+static bool quiet;
+static int fd;
+
+/*
+ * Workload descriptor:
+ *
+ * ctx.engine.duration.dependency.wait,...
+ * <uint>.<str>.<uint>.<int <= 0>.<0|1>,...
+ *
+ * Engine ids: RCS, BCS, balance_VCS, VCS, VCS1, VCS2, VECS
+ *
+ * "1.VCS1.3000.0.1,1.RCS.1000.-1.0,1.RCS.3700.0.0,1.RCS.1000.-2.0,1.VCS2.2300.-2.0,1.RCS.4700.-1.0,1.VCS2.600.-1.1"
+ */
+
+static struct workload *parse_workload(char *desc)
+{
+	struct workload *wrk;
+	unsigned int nr_steps = 0;
+	char *token, *tctx, *tstart = desc;
+	char *field, *fctx, *fstart;
+	struct w_step step, *steps = NULL;
+	unsigned int valid;
+	int tmp;
+
+	while ((token = strtok_r(tstart, ",", &tctx)) != NULL) {
+		tstart = NULL;
+		fstart = token;
+		valid = 0;
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			tmp = atoi(field);
+			if (tmp != 1) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid ctx id at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.context = tmp;
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			if (!strcasecmp(field, "RCS")) {
+				step.engine = RCS;
+				valid++;
+			} else if (!strcasecmp(field, "BCS")) {
+				step.engine = BCS;
+				valid++;
+			} else if (!strcasecmp(field, "balance_VCS")) {
+				step.engine = balance_VCS;
+				valid++;
+			} else if (!strcasecmp(field, "VCS")) {
+				step.engine = VCS;
+				valid++;
+			} else if (!strcasecmp(field, "VCS1")) {
+				step.engine = VCS1;
+				valid++;
+			} else if (!strcasecmp(field, "VCS2")) {
+				step.engine = VCS2;
+				valid++;
+			} else if (!strcasecmp(field, "VECS")) {
+				step.engine = VECS;
+				valid++;
+			} else {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid engine id at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			tmp = atoi(field);
+			if (tmp <= 0) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid duration at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.duration = tmp;
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			tmp = atoi(field);
+			if (tmp > 0) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid forward dependency at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.dependency = tmp;
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			tmp = atoi(field);
+			if (tmp != 0 && tmp != 1) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid wait boolean at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.wait = tmp;
+
+			valid++;
+		}
+
+		if (valid != 5) {
+			if (!quiet)
+				fprintf(stderr, "Invalid record at step %u!\n",
+					nr_steps);
+			return NULL;
+		}
+
+		nr_steps++;
+		steps = realloc(steps, sizeof(step) * nr_steps);
+		igt_assert(steps);
+
+		memcpy(&steps[nr_steps - 1], &step, sizeof(step));
+	}
+
+	wrk = malloc(sizeof(*wrk));
+	igt_assert(wrk);
+
+	wrk->nr_steps = nr_steps;
+	wrk->steps = steps;
+
+	return wrk;
+}
+
+static struct workload *
+clone_workload(struct workload *_wrk)
+{
+	struct workload *wrk;
+
+	wrk = malloc(sizeof(*wrk));
+	igt_assert(wrk);
+
+	wrk->nr_steps = _wrk->nr_steps;
+	wrk->steps = malloc(sizeof(struct w_step) * wrk->nr_steps);
+	igt_assert(wrk->steps);
+
+	memcpy(wrk->steps, _wrk->steps, sizeof(struct w_step) * wrk->nr_steps);
+
+	return wrk;
+}
+
+static void prepare_workload(struct workload *wrk, bool swap_vcs)
+{
+	struct drm_i915_gem_context_create arg = {};
+	struct w_step *w;
+	int i;
+
+	drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &arg);
+	wrk->ctx_id = arg.ctx_id;
+
+	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+		memset(&w->eb, 0, sizeof(w->eb));
+		memset(&w->obj, 0, sizeof(w->obj));
+	}
+
+	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+		unsigned long sz;
+		enum intel_engine_id engine = w->engine;
+
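+		/*
+		 * Size the nop batch so that, per the nop calibration,
+		 * executing it takes roughly 'duration' microseconds.
+		 */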
+		sz = ALIGN(w->duration * nop_calibration * sizeof(uint32_t) /
+			   nop_calibration_us, sizeof(uint32_t));
+
+		igt_assert(w->context == 1); /* TODO */
+
+		w->obj[0].handle = gem_create(fd, 4096);
+		w->obj[0].flags = EXEC_OBJECT_WRITE;
+
+		w->obj[1].handle = gem_create(fd, sz);
+		gem_write(fd, w->obj[1].handle, sz - sizeof(bbe), &bbe,
+			  sizeof(bbe));
+
+		w->eb.buffer_count = 2;
+		w->eb.buffers_ptr = to_user_pointer(w->obj);
+		if (swap_vcs && engine == VCS1)
+			engine = VCS2;
+		else if (swap_vcs && engine == VCS2)
+			engine = VCS1;
+		w->eb.flags = eb_engine_map[engine];
+		w->eb.flags |= I915_EXEC_NO_RELOC;
+		w->eb.flags |= I915_EXEC_HANDLE_LUT;
+		w->eb.rsvd1 = wrk->ctx_id;
+
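+		/*
+		 * A negative dependency refers back to an earlier step. The
+		 * batch must stay last in the buffer list, so it moves to
+		 * obj[2] while the dependency's write target takes obj[1],
+		 * letting the kernel order this batch after the dependency.
+		 */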
+		igt_assert(w->dependency <= 0);
+		if (w->dependency) {
+			int dep_idx = i + w->dependency;
+
+			igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
+
+			w->obj[2].handle = w->obj[1].handle;
+			w->obj[1].handle = wrk->steps[dep_idx].obj[0].handle;
+			w->eb.buffer_count = 3;
+		}
+
+#ifdef DEBUG
+		printf("%u: %u:%x|%x|%x %10lu flags=%llx\n",
+		       i, w->eb.buffer_count,
+		       w->obj[0].handle, w->obj[1].handle, w->obj[2].handle,
+		       sz, w->eb.flags);
+#endif
+	}
+}
+
+static double elapsed(const struct timespec *start, const struct timespec *end)
+{
+	return (end->tv_sec - start->tv_sec) +
+	       (end->tv_nsec - start->tv_nsec) / 1e9;
+}
+
+static void
+run_workload(unsigned int id, struct workload *wrk, unsigned int repeat)
+{
+	struct timespec t_start, t_end;
+	struct w_step *w;
+	double t;
+	int i, j;
+
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+
+	for (j = 0; j < repeat; j++) {
+		for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+			gem_execbuf(fd, &w->eb);
+			if (w->wait)
+				gem_sync(fd, w->obj[0].handle);
+		}
+	}
+
+	clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+	t = elapsed(&t_start, &t_end);
+	if (!quiet)
+		printf("%u: %fs elapsed (%f workloads/s)\n", id, t, repeat / t);
+}
+
+static void fini_workload(struct workload *wrk)
+{
+	free(wrk->steps);
+	free(wrk);
+}
+
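+/*
+ * Iteratively size a nop batch so that executing it takes roughly
+ * nop_calibration_us: time 'loops' executions of the current size, rescale
+ * the size towards the target duration and stop once the estimate is stable
+ * within the requested tolerance. Returns the calibrated size in dwords.
+ */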
+static unsigned long calibrate_nop(unsigned int tolerance_pct)
+{
+	unsigned int loops = 17;
+	unsigned int usecs = nop_calibration_us;
+	struct drm_i915_gem_exec_object2 obj = {};
+	struct drm_i915_gem_execbuffer2 eb =
+		{ .buffer_count = 1, .buffers_ptr = (uintptr_t)&obj};
+	long size, last_size;
+	struct timespec t_0, t_end;
+
+	clock_gettime(CLOCK_MONOTONIC, &t_0);
+
+	size = 256 * 1024;
+	do {
+		struct timespec t_start;
+
+		obj.handle = gem_create(fd, size);
+		gem_write(fd, obj.handle, size - sizeof(bbe), &bbe,
+			  sizeof(bbe));
+		gem_execbuf(fd, &eb);
+		gem_sync(fd, obj.handle);
+
+		clock_gettime(CLOCK_MONOTONIC, &t_start);
+		for (int loop = 0; loop < loops; loop++)
+			gem_execbuf(fd, &eb);
+		gem_sync(fd, obj.handle);
+		clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+		gem_close(fd, obj.handle);
+
+		last_size = size;
+		size = loops * size / elapsed(&t_start, &t_end) / 1e6 * usecs;
+		size = ALIGN(size, sizeof(uint32_t));
+	} while (elapsed(&t_0, &t_end) < 5 ||
+		 abs(size - last_size) > (size * tolerance_pct / 100));
+
+	return size / sizeof(uint32_t);
+}
+
+static void print_help(void)
+{
+	puts(
+"Usage: gem_wsim [OPTIONS]\n"
+"\n"
+"Runs a simulated workload on the GPU.\n"
+"When ran without arguments performs a GPU calibration result of which needs\n"
+"to be provided when running the simulation in subsequent invocations.\n"
+"\n"
+"Options:\n"
+"	-h		This text.\n"
+"	-q		Be quiet - do not output anything to stdout.\n"
+"	-n <n>		Nop calibration value.\n"
+"	-t <n>		Nop calibration tolerance percentage.\n"
+"			Use when there is a difficuly obtaining calibration\n"
+"			with the default settings.\n"
+"	-w <desc|path>	Filename or a workload descriptor.\n"
+"	-r <n>		How many times to emit the workload.\n"
+"	-c <n>		Fork n clients emitting the workload simultaneously.\n"
+"	-x		Swap VCS1 and VCS2 engines in every other client.\n"
+"\n"
+"Workload descriptor format:\n"
+"\n"
+"	ctx.engine.duration_us.dependency.wait,...\n"
+"	<uint>.<str>.<uint>.<int <= 0>.<0|1>,...\n"
+"\n"
+"	Engine ids: RCS, BCS, balance_VCS, VCS, VCS1, VCS2, VECS\n"
+"\n"
+"Example:\n"
+"	1.VCS1.3000.0.1\n"
+"	1.RCS.1000.-1.0\n"
+"	1.RCS.3700.0.0\n"
+"	1.RCS.1000.-2.0\n"
+"	1.VCS2.2300.-2.0\n"
+"	1.RCS.4700.-1.0\n"
+"	1.VCS2.600.-1.1\n"
+"\n"
+"The above workload described in human language works like this:\n"
+"A batch is sent to the VCS1 engine which will be executing for 3ms on the\n"
+"GPU and userspace will wait until it is finished before proceeding.\n"
+"Now three batches are sent to RCS with durations of 1ms, 3.7ms and 1ms\n"
+"respectively. The first batch has a data dependency on the preceding VCS1\n"
+"batch, and the last of the group depends on the first from the group.\n"
+"Now a 2.3ms batch is sent to VCS2, with a data dependency on the 3.7ms RCS\n"
+"batch, followed by a 4.7ms RCS batch with a data dependency on the 2.3ms\n"
+"VCS2 batch, and finally a 0.6ms VCS2 batch depending on the previous RCS one.\n"
+"The tool is then told to wait for the last one to complete before optionally\n"
+"starting the next iteration (-r).\n"
+"\n"
+"When workload descriptors are provided on the command line, commas must be\n"
+"used instead of newlines.\n"
+	);
+}
+
+static char *load_workload_descriptor(char *filename)
+{
+	struct stat sbuf;
+	char *buf;
+	int infd, ret, i;
+	ssize_t len;
+
+	ret = stat(filename, &sbuf);
+	if (ret || !S_ISREG(sbuf.st_mode))
+		return filename;
+
+	igt_assert(sbuf.st_size < 1024 * 1024); /* Just so. */
+	buf = malloc(sbuf.st_size);
+	igt_assert(buf);
+
+	infd = open(filename, O_RDONLY);
+	igt_assert(infd >= 0);
+	len = read(infd, buf, sbuf.st_size);
+	igt_assert(len == sbuf.st_size);
+	close(infd);
+
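+	/* Convert newlines to commas so a file can hold one step per line. */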
+	for (i = 0; i < len; i++) {
+		if (buf[i] == '\n')
+			buf[i] = ',';
+	}
+
+	return buf;
+}
+
+int main(int argc, char **argv)
+{
+	unsigned int repeat = 1;
+	unsigned int clients = 1;
+	bool swap_vcs = false;
+	struct timespec t_start, t_end;
+	struct workload **w, *wrk;
+	char *w_str = NULL;
+	unsigned int tolerance_pct = 1;
+	double t;
+	int i, c;
+
+	fd = drm_open_driver(DRIVER_INTEL);
+
+	while ((c = getopt(argc, argv, "c:n:r:qxw:t:h")) != -1) {
+		switch (c) {
+		case 'w':
+			w_str = optarg;
+			break;
+		case 'c':
+			clients = strtol(optarg, NULL, 0);
+			break;
+		case 't':
+			tolerance_pct = strtol(optarg, NULL, 0);
+			break;
+		case 'n':
+			nop_calibration = strtol(optarg, NULL, 0);
+			break;
+		case 'r':
+			repeat = strtol(optarg, NULL, 0);
+			break;
+		case 'q':
+			quiet = true;
+			break;
+		case 'x':
+			swap_vcs = true;
+			break;
+		case 'h':
+			print_help();
+			return 0;
+		default:
+			return 1;
+		}
+	}
+
+	if (!nop_calibration) {
+		if (!quiet)
+			printf("Calibrating nop delay with %u%% tolerance...\n",
+				tolerance_pct);
+		nop_calibration = calibrate_nop(tolerance_pct);
+		if (!quiet)
+			printf("Nop calibration for %uus delay is %lu.\n",
+			       nop_calibration_us, nop_calibration);
+
+		return 0;
+	} else {
+		if (!w_str) {
+			if (!quiet)
+				fprintf(stderr,
+					"Workload descriptor missing!\n");
+			return 1;
+		}
+
+		w_str = load_workload_descriptor(w_str);
+		if (!w_str) {
+			if (!quiet)
+				fprintf(stderr,
+					"Failed to load workload descriptor!\n");
+			return 1;
+		}
+
+		wrk = parse_workload(w_str);
+		if (!wrk) {
+			if (!quiet)
+				fprintf(stderr, "Failed to parse workload!\n");
+			return 1;
+		}
+	}
+
+	if (!quiet) {
+		printf("Using %lu nop calibration for %uus delay.\n",
+		       nop_calibration, nop_calibration_us);
+		printf("%u client%s.\n", clients, clients > 1 ? "s" : "");
+		if (swap_vcs)
+			printf("Swapping VCS rings between clients.\n");
+	}
+
+	w = malloc(sizeof(struct workload *) * clients);
+	igt_assert(w);
+
+	for (i = 0; i < clients; i++) {
+		w[i] = clone_workload(wrk);
+		prepare_workload(w[i], swap_vcs && (i & 1));
+	}
+
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+
+	igt_fork(child, clients)
+		run_workload(child, w[child], repeat);
+
+	igt_waitchildren();
+
+	clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+	t = elapsed(&t_start, &t_end);
+	if (!quiet)
+		printf("%fs elapsed (%f workloads/s)\n",
+		       t, clients * repeat / t);
+
+	for (i = 0; i < clients; i++)
+		fini_workload(w[i]);
+
+	free(w);
+	fini_workload(wrk);
+
+	return 0;
+}
diff --git a/benchmarks/wsim/workload1 b/benchmarks/wsim/workload1
new file mode 100644
index 000000000000..5f533d8e168b
--- /dev/null
+++ b/benchmarks/wsim/workload1
@@ -0,0 +1,7 @@
+1.VCS1.3000.0.1
+1.RCS.1000.-1.0
+1.RCS.3700.0.0
+1.RCS.1000.-2.0
+1.VCS2.2300.-2.0
+1.RCS.4700.-1.0
+1.VCS2.600.-1.1
-- 
2.9.3

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH i-g-t 2/2] igt/scripts: trace.pl to parse the i915 tracepoints
  2017-03-31 14:58 [PATCH i-g-t 0/2] Workload simulation and tracing Tvrtko Ursulin
  2017-03-31 14:58 ` [PATCH i-g-t 1/2] benchmarks/gem_wsim: Command submission workload simulator Tvrtko Ursulin
@ 2017-03-31 14:58 ` Tvrtko Ursulin
  2017-04-24 14:42 ` [PATCH i-g-t v4] " Tvrtko Ursulin
  2 siblings, 0 replies; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-03-31 14:58 UTC (permalink / raw)
  To: Intel-gfx; +Cc: Harri Syrja

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Given a log file created via perf with some interesting trace
events enabled, this tool can generate the timeline graph of
requests getting queued, their dependencies resolved, sent to
the GPU for execution and finally completed.

This can be useful when analyzing certain classes of performance
issues. More help is available in the tool itself.

The tool will also calculate some overall per-engine statistics,
like the total time an engine was idle and similar.

v2:
 * Address missing git add.
 * Make html output optional (--html switch) and by default
   just output aggregated per engine stats to stdout.

v3:
 * Added --trace option which invokes perf with the correct
   options automatically.
 * Added --avg-delay-stats which prints averages for things
   like waiting on ready, waiting on GPU and context save
   duration.
 * Fix warnings when no waits on an engine.
 * Correct help text.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Harri Syrja <harri.syrja@intel.com>
Cc: Krzysztof E Olinski <krzysztof.e.olinski@intel.com>
---
 scripts/Makefile.am |   2 +-
 scripts/trace.pl    | 946 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 947 insertions(+), 1 deletion(-)
 create mode 100755 scripts/trace.pl

diff --git a/scripts/Makefile.am b/scripts/Makefile.am
index 85d4a5cf4e5c..641715294936 100644
--- a/scripts/Makefile.am
+++ b/scripts/Makefile.am
@@ -1,2 +1,2 @@
-dist_noinst_SCRIPTS = intel-gfx-trybot who.sh run-tests.sh
+dist_noinst_SCRIPTS = intel-gfx-trybot who.sh run-tests.sh trace.pl
 noinst_PYTHON = throttle.py
diff --git a/scripts/trace.pl b/scripts/trace.pl
new file mode 100755
index 000000000000..6bf97ef63560
--- /dev/null
+++ b/scripts/trace.pl
@@ -0,0 +1,946 @@
+#! /usr/bin/perl
+#
+# Copyright © 2017 Intel Corporation
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice (including the next
+# paragraph) shall be included in all copies or substantial portions of the
+# Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+#
+
+use strict;
+use warnings;
+use 5.010;
+
+my $gid = 0;
+my (%db, %queue, %submit, %notify, %rings, %ctxdb, %ringmap, %reqwait);
+my @freqs;
+
+my $max_items = 3000;
+my $width_us = 32000;
+my $correct_durations = 0;
+my %ignore_ring;
+my %skip_box;
+my $html = 0;
+my $trace = 0;
+my $avg_delay_stats = 0;
+
+my @args;
+
+sub arg_help
+{
+	return unless scalar(@_);
+
+	if ($_[0] eq '--help' or $_[0] eq '-h') {
+		shift @_;
+print <<ENDHELP;
+Notes:
+
+   The tool parses the output generated by the 'perf script' command after the
+   correct set of i915 tracepoints has been collected via perf record.
+
+   To collect the data:
+
+	./trace.pl --trace [command-to-be-profiled]
+
+   The above will invoke perf record, or alternatively it can be done directly:
+
+	perf record -a -c 1 -e i915:intel_gpu_freq_change, \
+			       i915:i915_gem_request_add, \
+			       i915:i915_gem_request_submit, \
+			       i915:i915_gem_request_in, \
+			       i915:i915_gem_request_out, \
+			       i915:intel_engine_notify, \
+			       i915:i915_gem_request_wait_begin, \
+			       i915:i915_gem_request_wait_end \
+			       [command-to-be-profiled]
+
+   Then create the log file with:
+
+	perf script >trace.log
+
+   This file should in turn be piped into this tool, which will generate some
+   statistics from it, or HTML output if --html was given.
+
+   HTML can be viewed from a directory containing the 'vis' JavaScript module.
+   On Ubuntu this can be installed like this:
+
+	apt-get install npm
+	npm install vis
+
+Usage:
+   trace.pl <options> <input-file >output-file
+
+      --help / -h			This help text
+      --max-items=num / -m num		Maximum number of boxes to put on the
+					timeline. More boxes means more work for
+					the JavaScript engine in the browser.
+      --zoom-width-ms=ms / -z ms	Width of the initial timeline zoom
+      --split-requests / -s		Try to split out requests which were
+					submitted together due to coalescing in the
+					driver. May not be 100% accurate and may
+					influence the per-engine statistics so
+					use with care.
+      --ignore-ring=id / -i id		Ignore ring with the numerical id when
+					parsing the log (enum intel_engine_id).
+					Can be given multiple times.
+      --skip-box=name / -x name		Do not put a certain type of box on
+					the timeline. One of: queue, ready,
+					execute and ctxsave.
+					Can be given multiple times.
+      --html				Generate HTML output.
+      --trace cmd			Trace the following command.
+      --avg-delay-stats			Print average delay stats.
+ENDHELP
+
+		exit 0;
+	}
+
+	return @_;
+}
+
+sub arg_html
+{
+	return unless scalar(@_);
+
+	if ($_[0] eq '--html') {
+		shift @_;
+		$html = 1;
+	}
+
+	return @_;
+}
+
+sub arg_avg_delay_stats
+{
+	return unless scalar(@_);
+
+	if ($_[0] eq '--avg-delay-stats') {
+		shift @_;
+		$avg_delay_stats = 1;
+	}
+
+	return @_;
+}
+
+sub arg_trace
+{
+	my @events = ( 'i915:intel_gpu_freq_change',
+		       'i915:i915_gem_request_add',
+		       'i915:i915_gem_request_submit',
+		       'i915:i915_gem_request_in',
+		       'i915:i915_gem_request_out',
+		       'i915:intel_engine_notify',
+		       'i915:i915_gem_request_wait_begin',
+		       'i915:i915_gem_request_wait_end' );
+
+	return unless scalar(@_);
+
+	if ($_[0] eq '--trace') {
+		shift @_;
+
+		unshift @_, join(',', @events);
+		unshift @_, ('perf', 'record', '-a', '-c', '1', '-e');
+
+		exec @_;
+	}
+
+	return @_;
+}
+
+sub arg_max_items
+{
+	my $val;
+
+	return unless scalar(@_);
+
+	if ($_[0] eq '--max-items' or $_[0] eq '-m') {
+		shift @_;
+		$val = shift @_;
+	} elsif ($_[0] =~ /--max-items=(\d+)/) {
+		shift @_;
+		$val = $1;
+	}
+
+	$max_items = int($val) if defined $val;
+
+	return @_;
+}
+
+sub arg_zoom_width
+{
+	my $val;
+
+	return unless scalar(@_);
+
+	if ($_[0] eq '--zoom-width-ms' or $_[0] eq '-z') {
+		shift @_;
+		$val = shift @_;
+	} elsif ($_[0] =~ /--zoom-width-ms=(\d+)/) {
+		shift @_;
+		$val = $1;
+	}
+
+	$width_us = int($val) * 1000 if defined $val;
+
+	return @_;
+}
+
+sub arg_split_requests
+{
+	return unless scalar(@_);
+
+	if ($_[0] eq '--split-requests' or $_[0] eq '-s') {
+		shift @_;
+		$correct_durations = 1;
+	}
+
+	return @_;
+}
+
+sub arg_ignore_ring
+{
+	my $val;
+
+	return unless scalar(@_);
+
+	if ($_[0] eq '--ignore-ring' or $_[0] eq '-i') {
+		shift @_;
+		$val = shift @_;
+	} elsif ($_[0] =~ /--ignore-ring=(\d+)/) {
+		shift @_;
+		$val = $1;
+	}
+
+	$ignore_ring{$val} = 1 if defined $val;
+
+	return @_;
+}
+
+sub arg_skip_box
+{
+	my $val;
+
+	return unless scalar(@_);
+
+	if ($_[0] eq '--skip-box' or $_[0] eq '-x') {
+		shift @_;
+		$val = shift @_;
+	} elsif ($_[0] =~ /--skip-box=(\w+)/) {
+		shift @_;
+		$val = $1;
+	}
+
+	$skip_box{$val} = 1 if defined $val;
+
+	return @_;
+}
+
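+# Run every argument parser over the remaining arguments until none of them
+# consumes anything more; any leftover argument is an error.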
+@args = @ARGV;
+while (@args) {
+	my $left = scalar(@args);
+
+	@args = arg_help(@args);
+	@args = arg_html(@args);
+	@args = arg_avg_delay_stats(@args);
+	@args = arg_trace(@args);
+	@args = arg_max_items(@args);
+	@args = arg_zoom_width(@args);
+	@args = arg_split_requests(@args);
+	@args = arg_ignore_ring(@args);
+	@args = arg_skip_box(@args);
+
+	last if $left == scalar(@args);
+}
+
+die if scalar(@args);
+
+@ARGV = @args;
+
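+# Extract (secs, usecs, ring, ctx, seqno, global seqno) from a 'perf script'
+# line for the given tracepoint, i.e. a line containing something like
+# '... 1234.567890: i915:i915_gem_request_add: ... ring=0, ctx=1, seqno=2,
+# global=3' (illustrative values); parse_req_hw additionally extracts the port.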
+sub parse_req
+{
+	my ($line, $tp) = @_;
+	state %cache;
+
+	$cache{$tp} = qr/(\d+)\.(\d+):.*$tp.*ring=(\d+), ctx=(\d+), seqno=(\d+), global(?:_seqno)?=(\d+)/ unless exists $cache{$tp};
+
+	if ($line =~ $cache{$tp}) {
+		return ($1, $2, $3, $4, $5, $6);
+	} else {
+		return undef;
+	}
+}
+
+sub parse_req_hw
+{
+	my ($line, $tp) = @_;
+	state %cache;
+
+	$cache{$tp} = qr/(\d+)\.(\d+):.*$tp.*ring=(\d+), ctx=(\d+), seqno=(\d+), global(?:_seqno)?=(\d+), port=(\d+)/ unless exists $cache{$tp};
+
+	if ($line =~ $cache{$tp}) {
+		return ($1, $2, $3, $4, $5, $6, $7);
+	} else {
+		return undef;
+	}
+}
+
+sub parse_req_wait_begin
+{
+	my ($line, $tp) = @_;
+
+	if ($line =~ /(\d+)\.(\d+):.*i915_gem_request_wait_begin.*ring=(\d+), ctx=(\d+), seqno=(\d+)/) {
+		return ($1, $2, $3, $4, $5);
+	} else {
+		return undef;
+	}
+}
+
+sub parse_notify
+{
+	my ($line) = @_;
+
+	if ($line =~ /(\d+)\.(\d+):.*intel_engine_notify.*ring=(\d+), seqno=(\d+)/) {
+		return ($1, $2, $3, $4);
+	} else {
+		return undef;
+	}
+}
+
+sub parse_freq
+{
+	my ($line) = @_;
+
+	if ($line =~ /(\d+)\.(\d+):.*intel_gpu_freq_change.*new_freq=(\d+)/) {
+		return ($1, $2, $3);
+	} else {
+		return undef;
+	}
+}
+
+sub us
+{
+	my ($s, $us) = @_;
+
+	return $s * 1000000 + $us;
+}
+
+sub db_key
+{
+	my ($ring, $ctx, $seqno) = @_;
+
+	return $ring . '/' . $ctx . '/' . $seqno;
+}
+
+sub global_key
+{
+	my ($ring, $seqno) = @_;
+
+	return $ring . '/' . $seqno;
+}
+
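+# Context ids can be reused, so when a duplicate request_add key is seen the
+# per-context generation counter in %ctxdb is bumped and appended here to keep
+# database keys unique.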
+sub sanitize_ctx
+{
+	my ($ctx) = @_;
+
+	if (exists $ctxdb{$ctx}) {
+		return $ctx . '.' . $ctxdb{$ctx};
+	} else {
+		return $ctx;
+	}
+}
+
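+# Present microsecond trace offsets as timestamps on an arbitrary fixed date
+# for the vis.js timeline.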
+sub ts
+{
+	my ($us) = @_;
+	my ($h, $m, $s);
+
+	$s = int($us / 1000000);
+	$us = $us % 1000000;
+
+	$m = int($s / 60);
+	$s = $s % 60;
+
+	$h = int($m / 60);
+	$m = $m % 60;
+
+	return sprintf('2017-01-01 %02u:%02u:%02u.%06u', int($h), int($m), int($s), int($us));
+}
+
+# Main input loop - parse lines and build the internal representation of the
+# trace using a hash of requests and some auxiliary data structures.
+my $prev_freq = 0;
+my $prev_freq_ts = 0;
+my $oldkernelwa = 0;
+my ($no_queue, $no_in);
+while (<>) {
+	my ($s, $us, $ring, $ctx, $seqno, $global_seqno, $port);
+	my $freq;
+	my $key;
+
+	chomp;
+
+	($s, $us, $ring, $ctx, $seqno) = parse_req_wait_begin($_);
+	if (defined $s) {
+		my %rw;
+
+		next if exists $ignore_ring{$ring};
+
+		$ctx = sanitize_ctx($ctx);
+		$key = db_key($ring, $ctx, $seqno);
+
+		next if exists $reqwait{$key};
+
+		$rw{'key'} = $key;
+		$rw{'ring'} = $ring;
+		$rw{'seqno'} = $seqno;
+		$rw{'ctx'} = $ctx;
+		$rw{'start'} = us($s, $us);
+		$reqwait{$key} = \%rw;
+		next;
+	}
+
+	($s, $us, $ring, $ctx, $seqno, $global_seqno) = parse_req($_, 'i915:i915_gem_request_wait_end');
+	if (defined $s) {
+		next if exists $ignore_ring{$ring};
+
+		$ctx = sanitize_ctx($ctx);
+		$key = db_key($ring, $ctx, $seqno);
+
+		next unless exists $reqwait{$key};
+
+		$reqwait{$key}->{'end'} = us($s, $us);
+		next;
+	}
+
+	($s, $us, $ring, $ctx, $seqno, $global_seqno) = parse_req($_, 'i915:i915_gem_request_add');
+	if (defined $s) {
+		my $orig_ctx = $ctx;
+
+		next if exists $ignore_ring{$ring};
+
+		$ctx = sanitize_ctx($ctx);
+		$key = db_key($ring, $ctx, $seqno);
+
+		if (exists $queue{$key}) {
+			$ctxdb{$orig_ctx}++;
+			$ctx = sanitize_ctx($orig_ctx);
+			$key = db_key($ring, $ctx, $seqno);
+		}
+
+		$queue{$key} = us($s, $us);
+		next;
+	}
+
+	($s, $us, $ring, $ctx, $seqno, $global_seqno) = parse_req($_, 'i915:i915_gem_request_submit');
+	if (defined $s) {
+		next if exists $ignore_ring{$ring};
+
+		$ctx = sanitize_ctx($ctx);
+		$key = db_key($ring, $ctx, $seqno);
+
+		die if exists $submit{$key};
+		die unless exists $queue{$key};
+
+		$submit{$key} = us($s, $us);
+		next;
+	}
+
+	($s, $us, $ring, $ctx, $seqno, $global_seqno, $port) = parse_req_hw($_, 'i915:i915_gem_request_in');
+	if (defined $s) {
+		my %req;
+
+		next if exists $ignore_ring{$ring};
+
+		$ctx = sanitize_ctx($ctx);
+		$key = db_key($ring, $ctx, $seqno);
+
+		die if exists $db{$key};
+		if (not exists $queue{$key} and $oldkernelwa) {
+			$no_queue++;
+			next;
+		}
+		die unless exists $queue{$key};
+		die unless exists $submit{$key};
+
+		$req{'start'} = us($s, $us);
+		$req{'ring'} = $ring;
+		$req{'seqno'} = $seqno;
+		$req{'ctx'} = $ctx;
+		$req{'name'} = $ctx . '/' . $seqno;
+		$req{'global'} = $global_seqno;
+		$req{'port'} = $port;
+		$req{'queue'} = $queue{$key};
+		$req{'submit-delay'} = $submit{$key} - $queue{$key};
+		$req{'execute-delay'} = $req{'start'} - $submit{$key};
+		$rings{$ring} = $gid++ unless exists $rings{$ring};
+		$ringmap{$rings{$ring}} = $ring;
+		$db{$key} = \%req;
+		next;
+	}
+
+	($s, $us, $ring, $ctx, $seqno, $global_seqno, $port) = parse_req($_, 'i915:i915_gem_request_out');
+	if (defined $s) {
+		my $gkey = global_key($ring, $global_seqno);
+
+		next if exists $ignore_ring{$ring};
+
+		$ctx = sanitize_ctx($ctx);
+		$key = db_key($ring, $ctx, $seqno);
+
+		if (not exists $db{$key} and $oldkernelwa) {
+			$no_in++;
+			next;
+		}
+		die unless exists $db{$key};
+		die unless exists $db{$key}->{'start'};
+		die if exists $db{$key}->{'end'};
+
+		$db{$key}->{'end'} = us($s, $us);
+		if (exists $notify{$gkey}) {
+			$db{$key}->{'notify'} = $notify{$gkey};
+		} else {
+			# No notify so far. Maybe it will arrive later which
+			# will be handled in the sanitation loop below.
+			$db{$key}->{'notify'} = $db{$key}->{'end'};
+			$db{$key}->{'no-notify'} = 1;
+		}
+		$db{$key}->{'duration'} = $db{$key}->{'notify'} - $db{$key}->{'start'};
+		$db{$key}->{'context-complete-delay'} = $db{$key}->{'end'} - $db{$key}->{'notify'};
+		next;
+	}
+
+	($s, $us, $ring, $seqno) = parse_notify($_);
+	if (defined $s) {
+		next if exists $ignore_ring{$ring};
+		$notify{global_key($ring, $seqno)} = us($s, $us);
+		next;
+	}
+
+	($s, $us, $freq) = parse_freq($_);
+	if (defined $s) {
+		my $cur = us($s, $us);
+
+		push @freqs, [$prev_freq_ts, $cur, $prev_freq] if $prev_freq;
+		$prev_freq_ts = $cur;
+		$prev_freq = $freq;
+		next;
+	}
+}
+
+# Sanitation pass to fix up out-of-order notify and context complete events,
+# and to find the largest seqno to be used for timeline sorting purposes.
+my $max_seqno = 0;
+foreach my $key (keys %db) {
+	my $gkey = global_key($db{$key}->{'ring'}, $db{$key}->{'global'});
+
+	die unless exists $db{$key}->{'start'};
+
+	$max_seqno = $db{$key}->{'seqno'} if $db{$key}->{'seqno'} > $max_seqno;
+
+	unless (exists $db{$key}->{'end'}) {
+		# Context complete not received.
+		if (exists $notify{$gkey}) {
+			# No context complete due req merging - use notify.
+			$db{$key}->{'notify'} = $notify{$gkey};
+			$db{$key}->{'end'} = $db{$key}->{'notify'};
+			$db{$key}->{'no-end'} = 1;
+		} else {
+			# No notify and no context complete - mark it.
+			$db{$key}->{'end'} = $db{$key}->{'start'} + 999;
+			$db{$key}->{'notify'} = $db{$key}->{'end'};
+			$db{$key}->{'incomplete'} = 1;
+		}
+
+		$db{$key}->{'duration'} = $db{$key}->{'notify'} - $db{$key}->{'start'};
+		$db{$key}->{'context-complete-delay'} = $db{$key}->{'end'} - $db{$key}->{'notify'};
+	} else {
+		# Notify arrived after context complete.
+		if (exists $db{$key}->{'no-notify'} and exists $notify{$gkey}) {
+			delete $db{$key}->{'no-notify'};
+			$db{$key}->{'notify'} = $notify{$gkey};
+			$db{$key}->{'duration'} = $db{$key}->{'notify'} - $db{$key}->{'start'};
+			$db{$key}->{'context-complete-delay'} = $db{$key}->{'end'} - $db{$key}->{'notify'};
+		}
+	}
+}
+
+# GPU time accounting
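+# Per engine: %running accumulates the time requests spent executing on the
+# GPU (start to end), %runnable the time between submit and start, and
+# %queued the time between queuing and submit.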
+my (%running, %runnable, %queued, %batch_avg, %batch_total_avg, %batch_count);
+my (%submit_avg, %execute_avg, %ctxsave_avg);
+my $last_ts = 0;
+my $first_ts;
+
+foreach my $key (sort {$db{$a}->{'start'} <=> $db{$b}->{'start'}} keys %db) {
+	my $ring = $db{$key}->{'ring'};
+	my $end = $db{$key}->{'end'};
+
+	$first_ts = $db{$key}->{'queue'} if not defined $first_ts or $db{$key}->{'queue'} < $first_ts;
+	$last_ts = $end if $end > $last_ts;
+
+	# Adjust batch start with legacy execlists.
+	# Port == 2 means the batch was merged during queuing and hasn't actually
+	# been submitted to the GPU until the batch with port < 2 is found.
+	if ($correct_durations and $oldkernelwa and $db{$key}->{'port'} == 2) {
+		my $ctx = $db{$key}->{'ctx'};
+		my $seqno = $db{$key}->{'seqno'};
+		my $next_key;
+		my $i = 1;
+
+		do {
+			$next_key = db_key($ring, $ctx, $seqno + $i);
+			$i++;
+		} until ((exists $db{$next_key} and $db{$next_key}->{'port'} < 2) or $i > scalar(keys(%db)));  # ugly stop hack
+
+		if (exists $db{$next_key}) {
+			$db{$key}->{'start'} = $db{$next_key}->{'start'};
+			$db{$key}->{'end'} = $db{$next_key}->{'end'};
+			die if $db{$key}->{'start'} > $db{$key}->{'end'};
+		}
+	}
+
+	$running{$ring} += $end - $db{$key}->{'start'} unless exists $db{$key}->{'no-end'};
+	$runnable{$ring} += $db{$key}->{'execute-delay'};
+	$queued{$ring} += $db{$key}->{'start'} - $db{$key}->{'execute-delay'} - $db{$key}->{'queue'};
+
+	$batch_count{$ring}++;
+
+	# correct duration of merged batches
+	if ($correct_durations and exists $db{$key}->{'no-end'}) {
+		my $start = $db{$key}->{'start'};
+		my $ctx = $db{$key}->{'ctx'};
+		my $seqno = $db{$key}->{'seqno'};
+		my $next_key;
+		my $i = 1;
+
+		do {
+			$next_key = db_key($ring, $ctx, $seqno + $i);
+			$i++;
+		} until (exists $db{$next_key} or $i > scalar(keys(%db)));  # ugly stop hack
+
+		# 20us tolerance
+		if (exists $db{$next_key} and $db{$next_key}->{'start'} < $start + 20) {
+			$db{$next_key}->{'start'} = $start + $db{$key}->{'duration'};
+			$db{$next_key}->{'start'} = $db{$next_key}->{'end'} if $db{$next_key}->{'start'} > $db{$next_key}->{'end'};
+			$db{$next_key}->{'duration'} = $db{$next_key}->{'notify'} - $db{$next_key}->{'start'};
+			$end = $db{$key}->{'notify'};
+			die if $db{$next_key}->{'start'} > $db{$next_key}->{'end'};
+		}
+		die if $db{$key}->{'start'} > $db{$key}->{'end'};
+	}
+	$batch_avg{$ring} += $db{$key}->{'duration'};
+	$batch_total_avg{$ring} += $end - $db{$key}->{'start'};
+
+	$submit_avg{$ring} += $db{$key}->{'submit-delay'};
+	$execute_avg{$ring} += $db{$key}->{'execute-delay'};
+	$ctxsave_avg{$ring} += $db{$key}->{'end'} - $db{$key}->{'notify'};
+}
+
+foreach my $ring (keys %batch_avg) {
+	$batch_avg{$ring} /= $batch_count{$ring};
+	$batch_total_avg{$ring} /= $batch_count{$ring};
+	$submit_avg{$ring} /= $batch_count{$ring};
+	$execute_avg{$ring} /= $batch_count{$ring};
+	$ctxsave_avg{$ring} /= $batch_count{$ring};
+}
+
+# Calculate engine idle time
+my %flat_busy;
+foreach my $gid (sort keys %rings) {
+	my $ring = $ringmap{$rings{$gid}};
+	my (@s_, @e_);
+
+	# Extract all GPU busy intervals and sort them.
+	foreach my $key (sort {$db{$a}->{'start'} <=> $db{$b}->{'start'}} keys %db) {
+		next unless $db{$key}->{'ring'} == $ring;
+		push @s_, $db{$key}->{'start'};
+		push @e_, $db{$key}->{'end'};
+		die if $db{$key}->{'start'} > $db{$key}->{'end'};
+	}
+
+	die unless $#s_ == $#e_;
+
+	# Flatten the intervals.
+	for my $i (1..$#s_) {
+		last if $i >= @s_; # End of array.
+		die if $e_[$i] < $s_[$i];
+		if ($s_[$i] <= $e_[$i - 1]) {
+			# Current entry overlaps with the previous one, so
+			# merge them by dropping the previous interval's end
+			# and the current interval's start.
+			splice @e_, $i - 1, 1;
+			splice @s_, $i, 1;
+			# Continue with the same element when the list got squashed.
+			redo;
+		}
+	}
+
+	# Add up all busy times.
+	my $total = 0;
+	for my $i (0..$#s_) {
+		die if $e_[$i] < $s_[$i];
+
+		$total = $total + ($e_[$i] - $s_[$i]);
+	}
+
+	$flat_busy{$ring} = $total;
+}
+
+my %reqw;
+foreach my $key (keys %reqwait) {
+	$reqw{$reqwait{$key}->{'ring'}} += $reqwait{$key}->{'end'} - $reqwait{$key}->{'start'};
+}
+
+print <<ENDHTML if $html;
+<!DOCTYPE HTML>
+<html>
+<head>
+  <title>i915 GT timeline</title>
+
+  <style type="text/css">
+    body, html {
+      font-family: sans-serif;
+    }
+  </style>
+
+  <script src="node_modules/vis/dist/vis.js"></script>
+  <link href="node_modules/vis//dist/vis.css" rel="stylesheet" type="text/css" />
+</head>
+<body>
+
+<button onclick="toggleStackSubgroups()">Toggle stacking</button>
+
+<p>
+pink = requests executing on the GPU<br>
+grey = runnable requests waiting for a slot on GPU<br>
+blue = requests waiting on fences and dependencies before they are runnable<br>
+</p>
+<p>
+Boxes are in format 'ctx-id/seqno'.
+</p>
+<p>
+Use Ctrl+scroll-action to zoom-in/out and scroll-action or dragging to move around the timeline.
+</p>
+
+<div id="visualization"></div>
+
+<script type="text/javascript">
+  var container = document.getElementById('visualization');
+
+  var groups = new vis.DataSet([
+ENDHTML
+
+#   var groups = new vis.DataSet([
+# 	{id: 1, content: 'g0'},
+# 	{id: 2, content: 'g1'}
+#   ]);
+
+sub html_stats
+{
+	my ($stats, $group, $id) = @_;
+	my $name;
+
+	$name = 'Ring' . $group;
+	$name .= '<br><small><br>';
+	$name .= sprintf('%2.2f', $stats->{'idle'}) . '% idle<br><br>';
+	$name .= sprintf('%2.2f', $stats->{'busy'}) . '% busy<br>';
+	$name .= sprintf('%2.2f', $stats->{'runnable'}) . '% runnable<br>';
+	$name .= sprintf('%2.2f', $stats->{'queued'}) . '% queued<br><br>';
+	$name .= sprintf('%2.2f', $stats->{'wait'}) . '% wait<br><br>';
+	$name .= $stats->{'count'} . ' batches<br>';
+	$name .= sprintf('%2.2f', $stats->{'avg'}) . 'us avg batch<br>';
+	$name .= sprintf('%2.2f', $stats->{'total-avg'}) . 'us avg engine batch<br>';
+	$name .= '</small>';
+
+	print "\t{id: $id, content: '$name'},\n";
+}
+
+sub stdio_stats
+{
+	my ($stats, $group, $id) = @_;
+	my $str;
+
+	$str = 'Ring' . $group . ': ';
+	$str .= $stats->{'count'} . ' batches, ';
+	$str .= sprintf('%2.2f (%2.2f) avg batch us, ', $stats->{'avg'}, $stats->{'total-avg'});
+	$str .= sprintf('%2.2f', $stats->{'idle'}) . '% idle, ';
+	$str .= sprintf('%2.2f', $stats->{'busy'}) . '% busy, ';
+	$str .= sprintf('%2.2f', $stats->{'runnable'}) . '% runnable, ';
+	$str .= sprintf('%2.2f', $stats->{'queued'}) . '% queued, ';
+	$str .= sprintf('%2.2f', $stats->{'wait'}) . '% wait';
+	if ($avg_delay_stats) {
+		$str .= ', submit/execute/save-avg=(';
+		$str .= sprintf('%2.2f/%2.2f/%2.2f', $stats->{'submit'}, $stats->{'execute'}, $stats->{'save'});
+		$str .= ')';
+	}
+
+	say $str;
+}
+
+print "\t{id: 0, content: 'Freq'},\n" if $html;
+foreach my $group (sort keys %rings) {
+	my $name;
+	my $ring = $ringmap{$rings{$group}};
+	my $id = 1 + $rings{$group};
+	my $elapsed = $last_ts - $first_ts;
+	my %stats;
+
+	$stats{'idle'} = (1.0 - $flat_busy{$ring} / $elapsed) * 100.0;
+	$stats{'busy'} = $running{$ring} / $elapsed * 100.0;
+	$stats{'runnable'} = $runnable{$ring} / $elapsed * 100.0;
+	$stats{'queued'} = $queued{$ring} / $elapsed * 100.0;
+	$reqw{$ring} = 0 unless exists $reqw{$ring};
+	$stats{'wait'} = $reqw{$ring} / $elapsed * 100.0;
+	$stats{'count'} = $batch_count{$ring};
+	$stats{'avg'} = $batch_avg{$ring};
+	$stats{'total-avg'} = $batch_total_avg{$ring};
+	$stats{'submit'} = $submit_avg{$ring};
+	$stats{'execute'} = $execute_avg{$ring};
+	$stats{'save'} = $ctxsave_avg{$ring};
+
+	if ($html) {
+		html_stats(\%stats, $group, $id);
+	} else {
+		stdio_stats(\%stats, $group, $id);
+	}
+}
+
+exit 0 unless $html;
+
+print <<ENDHTML;
+  ]);
+
+  var items = new vis.DataSet([
+ENDHTML
+
+my $i = 0;
+foreach my $key (sort {$db{$a}->{'queue'} <=> $db{$b}->{'queue'}} keys %db) {
+	my ($name, $ctx, $seqno) = ($db{$key}->{'name'}, $db{$key}->{'ctx'}, $db{$key}->{'seqno'});
+	my ($queue, $start, $notify, $end) = ($db{$key}->{'queue'}, $db{$key}->{'start'}, $db{$key}->{'notify'}, $db{$key}->{'end'});
+	my $submit = $queue + $db{$key}->{'submit-delay'};
+	my ($content, $style);
+	my $group = 1 + $rings{$db{$key}->{'ring'}};
+	my $type = ' type: \'range\',';
+	my $startend;
+	my $skey;
+
+	# submit to execute
+	unless (exists $skip_box{'queue'}) {
+		$skey = 2 * $max_seqno * $ctx + 2 * $seqno;
+		$style = 'color: black; background-color: lightblue;';
+		$content = "$name<br>$db{$key}->{'submit-delay'}us <small>($db{$key}->{'execute-delay'}us)</small>";
+		$startend = 'start: \'' . ts($queue) . '\', end: \'' . ts($submit) . '\'';
+		print "\t{id: $i, key: $skey, $type group: $group, subgroup: 1, subgroupOrder: 1, content: '$content', $startend, style: \'$style\'},\n";
+		$i++;
+	}
+
+	# execute to start
+	unless (exists $skip_box{'ready'}) {
+		$skey = 2 * $max_seqno * $ctx + 2 * $seqno + 1;
+		$style = 'color: black; background-color: lightgrey;';
+		$content = "<small>$name<br>$db{$key}->{'execute-delay'}us</small>";
+		$startend = 'start: \'' . ts($submit) . '\', end: \'' . ts($start) . '\'';
+		print "\t{id: $i, key: $skey, $type group: $group, subgroup: 1, subgroupOrder: 2, content: '$content', $startend, style: \'$style\'},\n";
+		$i++;
+	}
+
+	# start to user interrupt
+	unless (exists $skip_box{'execute'}) {
+		$skey = -2 * $max_seqno * $ctx - 2 * $seqno - 1;
+		if (exists $db{$key}->{'incomplete'}) {
+			$style = 'color: white; background-color: red;';
+		} else {
+			$style = 'color: black; background-color: pink;';
+		}
+		$content = "$name <small>$db{$key}->{'port'}</small>";
+		$content .= ' <small><i>???</i></small> ' if exists $db{$key}->{'incomplete'};
+		$content .= ' <small><i>++</i></small> ' if exists $db{$key}->{'no-end'};
+		$content .= ' <small><i>+</i></small> ' if exists $db{$key}->{'no-notify'};
+		$content .= "<br>$db{$key}->{'duration'}us <small>($db{$key}->{'context-complete-delay'}us)</small>";
+		$startend = 'start: \'' . ts($start) . '\', end: \'' . ts($notify) . '\'';
+		print "\t{id: $i, key: $skey, $type group: $group, subgroup: 2, subgroupOrder: 3, content: '$content', $startend, style: \'$style\'},\n";
+		$i++;
+	}
+
+	# user interrupt to context complete
+	unless (exists $skip_box{'ctxsave'}) {
+		$skey = -2 * $max_seqno * $ctx - 2 * $seqno;
+		$style = 'color: black; background-color: orange;';
+		my $ctxsave = $db{$key}->{'end'} - $db{$key}->{'notify'};
+		$content = "<small>$name<br>${ctxsave}us</small>";
+		$content .= ' <small><i>???</i></small> ' if exists $db{$key}->{'incomplete'};
+		$content .= ' <small><i>++</i></small> ' if exists $db{$key}->{'no-end'};
+		$content .= ' <small><i>+</i></small> ' if exists $db{$key}->{'no-notify'};
+		$startend = 'start: \'' . ts($notify) . '\', end: \'' . ts($end) . '\'';
+		print "\t{id: $i, key: $skey, $type group: $group, subgroup: 2, subgroupOrder: 4, content: '$content', $startend, style: \'$style\'},\n";
+		$i++;
+	}
+
+	$last_ts = $end;
+
+	last if $i > $max_items;
+}
+
+foreach my $item (@freqs) {
+	my ($start, $end, $freq) = @$item;
+	my $startend;
+
+	next if $start > $last_ts;
+
+	$start = $first_ts if $start < $first_ts;
+	$end = $last_ts if $end > $last_ts;
+	$startend = 'start: \'' . ts($start) . '\', end: \'' . ts($end) . '\'';
+	print "\t{id: $i, type: 'range', group: 0, content: '$freq', $startend},\n";
+	$i++;
+}
+
+my $end_ts = ts($first_ts + $width_us);
+$first_ts = ts($first_ts);
+
+print <<ENDHTML;
+  ]);
+
+  function customOrder (a, b) {
+  // order by id
+    return a.subgroupOrder - b.subgroupOrder;
+  }
+
+  // Configuration for the Timeline
+  var options = { groupOrder: 'content',
+		  horizontalScroll: true,
+		  stack: true,
+		  stackSubgroups: false,
+		  zoomKey: 'ctrlKey',
+		  orientation: 'top',
+		  order: customOrder,
+		  start: '$first_ts',
+		  end: '$end_ts'};
+
+  // Create a Timeline
+  var timeline = new vis.Timeline(container, items, groups, options);
+
+    function toggleStackSubgroups() {
+        options.stackSubgroups = !options.stackSubgroups;
+        timeline.setOptions(options);
+    }
+ENDHTML
+
+print <<ENDHTML;
+</script>
+</body>
+</html>
+ENDHTML
-- 
2.9.3

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH i-g-t 1/2] benchmarks/gem_wsim: Command submission workload simulator
  2017-03-31 14:58 ` [PATCH i-g-t 1/2] benchmarks/gem_wsim: Command submission workload simulator Tvrtko Ursulin
@ 2017-03-31 15:19   ` Chris Wilson
  2017-04-05 16:14   ` [PATCH i-g-t v3] " Tvrtko Ursulin
  2017-04-20 12:29   ` [PATCH i-g-t v4] " Tvrtko Ursulin
  2 siblings, 0 replies; 26+ messages in thread
From: Chris Wilson @ 2017-03-31 15:19 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Fri, Mar 31, 2017 at 03:58:25PM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Tool which emits batch buffers to engines with configurable
> sequences, durations, contexts, dependencies and userspace waits.
> 
> Unfinished but shows promise so sending out for early feedback.
> 
> v2:
>  * Load workload descriptors from files. (also -w)
>  * Help text.
>  * Calibration control if needed. (-t)
>  * NORELOC | LUT to eb flags.
>  * Added sample workload to wsim/workload1.
> 
> TODO list:
> 
>  * Better error handling.
>  * Multi-context support for individual clients.

I think that will also want multiple dependencies.

>  * Random/variable batch length.
>  * Load balancing plug-in.
>  * ... ?

Waits and delayed execution cycles.

Multiple clients (as threads).

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator
  2017-03-31 14:58 ` [PATCH i-g-t 1/2] benchmarks/gem_wsim: Command submission workload simulator Tvrtko Ursulin
  2017-03-31 15:19   ` Chris Wilson
@ 2017-04-05 16:14   ` Tvrtko Ursulin
  2017-04-05 16:48     ` Chris Wilson
  2017-04-20 12:29   ` [PATCH i-g-t v4] " Tvrtko Ursulin
  2 siblings, 1 reply; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-04-05 16:14 UTC (permalink / raw)
  To: Intel-gfx

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Tool which emits batch buffers to engines with configurable
sequences, durations, contexts, dependencies and userspace waits.

Unfinished but shows promise so sending out for early feedback.

v2:
 * Load workload descriptors from files. (also -w)
 * Help text.
 * Calibration control if needed. (-t)
 * NORELOC | LUT to eb flags.
 * Added sample workload to wsim/workload1.

v3:
 * Multiple parallel different workloads (-w -w ...).
 * Multi-context workloads.
 * Variable (random) batch length.
 * Load balancing (round robin and queue depth estimation).
 * Workloads delays and explicit sync steps.
 * Workload frequency (period) control.

TODO list:

 * Fence support.
 * Move majority of help text to README.
 * Better error handling.
 * Less 1980's workload parsing.
 * Proper workloads.
 * Explicit waits?
 * Threads?
 * ... ?

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
---
 benchmarks/Makefile.sources |    1 +
 benchmarks/gem_wsim.c       | 1053 +++++++++++++++++++++++++++++++++++++++++++
 benchmarks/wsim/workload1   |    7 +
 benchmarks/wsim/workload2   |    7 +
 benchmarks/wsim/workload3   |    7 +
 benchmarks/wsim/workload4   |    8 +
 benchmarks/wsim/workload5   |    8 +
 benchmarks/wsim/workload6   |    8 +
 8 files changed, 1099 insertions(+)
 create mode 100644 benchmarks/gem_wsim.c
 create mode 100644 benchmarks/wsim/workload1
 create mode 100644 benchmarks/wsim/workload2
 create mode 100644 benchmarks/wsim/workload3
 create mode 100644 benchmarks/wsim/workload4
 create mode 100644 benchmarks/wsim/workload5
 create mode 100644 benchmarks/wsim/workload6

diff --git a/benchmarks/Makefile.sources b/benchmarks/Makefile.sources
index 3af54ebe36f2..3a941150abb3 100644
--- a/benchmarks/Makefile.sources
+++ b/benchmarks/Makefile.sources
@@ -14,6 +14,7 @@ benchmarks_prog_list =			\
 	gem_prw				\
 	gem_set_domain			\
 	gem_syslatency			\
+	gem_wsim			\
 	kms_vblank			\
 	prime_lookup			\
 	vgem_mmap			\
diff --git a/benchmarks/gem_wsim.c b/benchmarks/gem_wsim.c
new file mode 100644
index 000000000000..38041da1f6e3
--- /dev/null
+++ b/benchmarks/gem_wsim.c
@@ -0,0 +1,1053 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <errno.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <time.h>
+#include <assert.h>
+#include <limits.h>
+
+
+#include "intel_chipset.h"
+#include "drm.h"
+#include "ioctl_wrappers.h"
+#include "drmtest.h"
+#include "intel_io.h"
+
+enum intel_engine_id {
+	RCS,
+	BCS,
+	VCS,
+	VCS1,
+	VCS2,
+	VECS,
+	NUM_ENGINES
+};
+
+struct duration {
+	unsigned int min, max, cur;
+};
+
+enum w_type
+{
+	BATCH,
+	SYNC,
+	DELAY,
+	PERIOD
+};
+
+struct w_step
+{
+	/* Workload step metadata */
+	enum w_type type;
+	unsigned int context;
+	unsigned int engine;
+	struct duration duration;
+	int dependency;
+	int wait;
+
+	/* Implementation details */
+	struct drm_i915_gem_execbuffer2 eb;
+	struct drm_i915_gem_exec_object2 obj[4];
+	struct drm_i915_gem_relocation_entry reloc;
+	unsigned long bb_sz;
+	uint32_t bb_handle;
+	uint64_t seqno_offset;
+};
+
+struct workload
+{
+	unsigned int nr_steps;
+	struct w_step *steps;
+
+	struct timespec repeat_start;
+
+	unsigned int nr_ctxs;
+	uint32_t *ctx_id;
+
+	unsigned long seqno[NUM_ENGINES];
+	uint32_t status_page_handle[NUM_ENGINES];
+	uint32_t *status_page[NUM_ENGINES];
+	unsigned int vcs_rr;
+
+	unsigned long qd_sum[NUM_ENGINES];
+	unsigned long nr_bb[NUM_ENGINES];
+};
+
+static const unsigned int eb_engine_map[NUM_ENGINES] = {
+	[RCS] = I915_EXEC_RENDER,
+	[BCS] = I915_EXEC_BLT,
+	[VCS] = I915_EXEC_BSD,
+	[VCS1] = I915_EXEC_BSD | I915_EXEC_BSD_RING1,
+	[VCS2] = I915_EXEC_BSD | I915_EXEC_BSD_RING2,
+	[VECS] = I915_EXEC_VEBOX
+};
+
+static const unsigned int nop_calibration_us = 1000;
+static unsigned long nop_calibration;
+
+static bool quiet;
+static int fd;
+
+/*
+ * Workload descriptor:
+ *
+ * ctx.engine.duration.dependency.wait,...
+ * <uint>.<str>.<uint>.<int <= 0>.<0|1>,...
+ *
+ * Engine ids: RCS, BCS, VCS, VCS1, VCS2, VECS
+ *
+ * "1.VCS1.3000.0.1,1.RCS.1000.-1.0,1.RCS.3700.0.0,1.RCS.1000.-2.0,1.VCS2.2300.-2.0,1.RCS.4700.-1.0,1.VCS2.600.-1.1"
+ */
+
+static const char *ring_str_map[NUM_ENGINES] = {
+	[RCS] = "RCS",
+	[BCS] = "BCS",
+	[VCS] = "VCS",
+	[VCS1] = "VCS1",
+	[VCS2] = "VCS2",
+	[VECS] = "VECS",
+};
+
+static struct workload *parse_workload(char *_desc)
+{
+	struct workload *wrk;
+	unsigned int nr_steps = 0;
+	char *desc = strdup(_desc);
+	char *_token, *token, *tctx = NULL, *tstart = desc;
+	char *field, *fctx = NULL, *fstart;
+	struct w_step step = { }, *steps = NULL;
+	unsigned int valid;
+	int tmp;
+
+	while ((_token = strtok_r(tstart, ",", &tctx)) != NULL) {
+		tstart = NULL;
+		token = strdup(_token);
+		fstart = token;
+		valid = 0;
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			if (!strcasecmp(field, "d")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp <= 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid delay at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = DELAY;
+					step.wait = tmp;
+					goto add_step;
+				}
+			} else if (!strcasecmp(field, "p")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp <= 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid period at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = PERIOD;
+					step.wait = tmp;
+					goto add_step;
+				}
+			} else if (!strcasecmp(field, "s")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp >= 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid sync target at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = SYNC;
+					step.wait = tmp;
+					goto add_step;
+				}
+			}
+
+			tmp = atoi(field);
+			if (tmp < 0) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid ctx id at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.context = tmp;
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			unsigned int i, old_valid = valid;
+
+			fstart = NULL;
+
+			for (i = 0; i < ARRAY_SIZE(ring_str_map); i++) {
+				if (!strcasecmp(field, ring_str_map[i])) {
+					step.engine = i;
+					valid++;
+					break;
+				}
+			}
+
+			if (old_valid == valid) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid engine id at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			char *sep = NULL;
+			long int tmpl;
+
+			fstart = NULL;
+
+			tmpl = strtol(field, &sep, 10);
+			if (tmpl == LONG_MIN || tmpl == LONG_MAX) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid duration at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.duration.min = tmpl;
+
+			if (sep && *sep == '-') {
+				tmpl = strtol(sep + 1, NULL, 10);
+				if (tmpl == LONG_MIN || tmpl == LONG_MAX) {
+					if (!quiet)
+						fprintf(stderr,
+							"Invalid duration range at step %u!\n",
+							nr_steps);
+					return NULL;
+				}
+				step.duration.max = tmpl;
+			} else {
+				step.duration.max = step.duration.min;
+			}
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			tmp = atoi(field);
+			if (tmp > 0) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid forward dependency at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.dependency = tmp;
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			tmp = atoi(field);
+			if (tmp != 0 && tmp != 1) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid wait boolean at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.wait = tmp;
+
+			valid++;
+		}
+
+		if (valid != 5) {
+			if (!quiet)
+				fprintf(stderr, "Invalid record at step %u!\n",
+					nr_steps);
+			return NULL;
+		}
+
+		step.type = BATCH;
+
+add_step:
+		nr_steps++;
+		steps = realloc(steps, sizeof(step) * nr_steps);
+		igt_assert(steps);
+
+		memcpy(&steps[nr_steps - 1], &step, sizeof(step));
+
+		free(token);
+	}
+
+	wrk = malloc(sizeof(*wrk));
+	igt_assert(wrk);
+
+	wrk->nr_steps = nr_steps;
+	wrk->steps = steps;
+
+	free(desc);
+
+	return wrk;
+}
+
+static struct workload *
+clone_workload(struct workload *_wrk)
+{
+	struct workload *wrk;
+
+	wrk = malloc(sizeof(*wrk));
+	igt_assert(wrk);
+	memset(wrk, 0, sizeof(*wrk));
+
+	wrk->nr_steps = _wrk->nr_steps;
+	wrk->steps = calloc(wrk->nr_steps, sizeof(struct w_step));
+	igt_assert(wrk->steps);
+
+	memcpy(wrk->steps, _wrk->steps, sizeof(struct w_step) * wrk->nr_steps);
+
+	return wrk;
+}
+
+#define rounddown(x, y) ((x) - ((x) % (y)))
+#ifndef PAGE_SIZE
+#define PAGE_SIZE (4096)
+#endif
+
+static unsigned int get_duration(struct duration *dur)
+{
+	if (dur->min == dur->max)
+		return dur->min;
+	else
+		return dur->min + rand() % (dur->max + 1 - dur->min);
+}
+
+static unsigned long __get_bb_sz(unsigned int duration)
+{
+	return ALIGN(duration * nop_calibration * sizeof(uint32_t) /
+		     nop_calibration_us, sizeof(uint32_t));
+}
+
+static unsigned long get_bb_sz(struct duration *dur)
+{
+	return __get_bb_sz(dur->cur);
+}
+
+static void
+__emit_bb_end(struct w_step *w, bool terminate, bool seqnos, uint32_t seqno)
+{
+	const uint32_t bbe = 0xa << 23;
+	unsigned long bb_sz = get_bb_sz(&w->duration);
+	unsigned long mmap_start, cmd_offset, mmap_len;
+	uint32_t *ptr, *cs;
+
+	mmap_len = (seqnos ? 5 : 1) * sizeof(uint32_t);
+	cmd_offset = bb_sz - mmap_len;
+	mmap_start = rounddown(cmd_offset, PAGE_SIZE);
+	mmap_len += cmd_offset - mmap_start;
+
+	gem_set_domain(fd, w->bb_handle,
+		       I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
+
+	ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
+	cs = (uint32_t *)((char *)ptr + cmd_offset - mmap_start);
+
+	if (seqnos) {
+		const int gen = intel_gen(intel_get_drm_devid(fd));
+
+		igt_assert(gen >= 8);
+
+		w->reloc.offset = bb_sz - 4 * sizeof(uint32_t);
+		w->seqno_offset = bb_sz - 2 * sizeof(uint32_t);
+
+		*cs++ = terminate ? MI_STORE_DWORD_IMM : 0;
+		*cs++ = 0;
+		*cs++ = 0;
+		*cs++ = seqno;
+	}
+
+	*cs = terminate ? bbe : 0;
+
+	munmap(ptr, mmap_len);
+}
+
+static void terminate_bb(struct w_step *w, bool seqnos, uint32_t seqno)
+{
+	__emit_bb_end(w, true, seqnos, seqno);
+}
+
+static void unterminate_bb(struct w_step *w, bool seqnos)
+{
+	__emit_bb_end(w, false, seqnos, 0);
+}
+
+static void
+prepare_workload(struct workload *wrk, bool swap_vcs, bool seqnos)
+{
+	int max_ctx = -1;
+	struct w_step *w;
+	int i;
+
+	if (seqnos) {
+		const unsigned int status_sz = sizeof(uint32_t);
+
+		for (i = 0; i < NUM_ENGINES; i++) {
+			wrk->status_page_handle[i] = gem_create(fd, status_sz);
+			wrk->status_page[i] =
+				gem_mmap__cpu(fd, wrk->status_page_handle[i],
+					      0, status_sz, PROT_READ);
+		}
+	}
+
+	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+		if ((int)w->context > max_ctx) {
+			int delta = w->context + 1 - wrk->nr_ctxs;
+
+			wrk->nr_ctxs += delta;
+			wrk->ctx_id = realloc(wrk->ctx_id,
+					      wrk->nr_ctxs * sizeof(uint32_t));
+			memset(&wrk->ctx_id[wrk->nr_ctxs - delta], 0,
+			       delta * sizeof(uint32_t));
+
+			max_ctx = w->context;
+		}
+
+		if (!wrk->ctx_id[w->context]) {
+			struct drm_i915_gem_context_create arg = {};
+
+			drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &arg);
+			igt_assert(arg.ctx_id);
+
+			wrk->ctx_id[w->context] = arg.ctx_id;
+		}
+	}
+
+	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+		enum intel_engine_id engine = w->engine;
+		unsigned int bb_i, j = 0;
+
+		if (w->type != BATCH)
+			continue;
+
+		w->obj[j].handle = gem_create(fd, 4096);
+		w->obj[j].flags = EXEC_OBJECT_WRITE;
+		j++;
+
+		if (seqnos) {
+			w->obj[j].handle = wrk->status_page_handle[engine];
+			w->obj[j].flags = EXEC_OBJECT_WRITE;
+			j++;
+		}
+
+		bb_i = j++;
+		w->duration.cur = w->duration.max;
+		w->bb_sz = get_bb_sz(&w->duration);
+		w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
+		terminate_bb(w, seqnos, 0);
+		if (seqnos) {
+			w->reloc.presumed_offset = -1;
+			w->reloc.target_handle = 1;
+			w->reloc.read_domains = I915_GEM_DOMAIN_INSTRUCTION;
+			w->reloc.write_domain = I915_GEM_DOMAIN_INSTRUCTION;
+		}
+
+		igt_assert(w->dependency <= 0);
+		if (w->dependency) {
+			int dep_idx = i + w->dependency;
+
+			igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
+			igt_assert(wrk->steps[dep_idx].type == BATCH);
+
+			w->obj[j].handle = w->obj[bb_i].handle;
+			bb_i = j;
+			w->obj[j - 1].handle =
+					wrk->steps[dep_idx].obj[0].handle;
+			j++;
+		}
+
+		if (seqnos) {
+			w->obj[bb_i].relocs_ptr = to_user_pointer(&w->reloc);
+			w->obj[bb_i].relocation_count = 1;
+		}
+
+		w->eb.buffers_ptr = to_user_pointer(w->obj);
+		w->eb.buffer_count = j;
+		w->eb.rsvd1 = wrk->ctx_id[w->context];
+
+		if (swap_vcs && engine == VCS1)
+			engine = VCS2;
+		else if (swap_vcs && engine == VCS2)
+			engine = VCS1;
+		w->eb.flags = eb_engine_map[engine];
+		w->eb.flags |= I915_EXEC_HANDLE_LUT;
+		if (!seqnos)
+			w->eb.flags |= I915_EXEC_NO_RELOC;
+#ifdef DEBUG
+		printf("%u: %u:%x|%x|%x|%x %10lu flags=%llx bb=%x[%u] ctx[%u]=%u\n",
+		       i, w->eb.buffer_count, w->obj[0].handle,
+		       w->obj[1].handle, w->obj[2].handle, w->obj[3].handle,
+		       w->bb_sz, w->eb.flags, w->bb_handle, bb_i,
+		       w->context, wrk->ctx_id[w->context]);
+#endif
+	}
+}
+
+static double elapsed(const struct timespec *start, const struct timespec *end)
+{
+	return (end->tv_sec - start->tv_sec) +
+	       (end->tv_nsec - start->tv_nsec) / 1e9;
+}
+
+static int elapsed_us(const struct timespec *start, const struct timespec *end)
+{
+	return (1e9 * (end->tv_sec - start->tv_sec) +
+	       (end->tv_nsec - start->tv_nsec)) / 1e3;
+}
+
+static enum intel_engine_id
+rr_balance(struct workload *wrk, struct w_step *w)
+{
+	unsigned int engine;
+
+	if (wrk->vcs_rr)
+		engine = VCS2;
+	else
+		engine = VCS1;
+
+	wrk->vcs_rr ^= 1;
+
+	return engine;
+}
+
+static enum intel_engine_id
+qd_balance(struct workload *wrk, struct w_step *w)
+{
+	unsigned long qd[NUM_ENGINES];
+	enum intel_engine_id engine = w->engine;
+
+	igt_assert(engine == VCS);
+
+	qd[VCS1] = wrk->seqno[VCS1] - wrk->status_page[VCS1][0];
+	wrk->qd_sum[VCS1] += qd[VCS1];
+
+	qd[VCS2] = wrk->seqno[VCS2] - wrk->status_page[VCS2][0];
+	wrk->qd_sum[VCS2] += qd[VCS2];
+
+	if (qd[VCS1] < qd[VCS2]) {
+		engine = VCS1;
+		wrk->vcs_rr = 0;
+	} else if (qd[VCS2] < qd[VCS1]) {
+		engine = VCS2;
+		wrk->vcs_rr = 1;
+	} else {
+		unsigned int vcs = wrk->vcs_rr ^ 1;
+
+		wrk->vcs_rr = vcs;
+
+		if (vcs == 0)
+			engine = VCS1;
+		else
+			engine = VCS2;
+	}
+
+// printf("qd_balance: 1:%lu 2:%lu rr:%u = %u\n", qd[VCS1], qd[VCS2], wrk->vcs_rr, engine);
+
+	return engine;
+}
+
+static void update_bb_seqno(struct w_step *w, uint32_t seqno)
+{
+	unsigned long mmap_start, mmap_offset, mmap_len;
+	void *ptr;
+
+	mmap_start = rounddown(w->seqno_offset, PAGE_SIZE);
+	mmap_offset = w->seqno_offset - mmap_start;
+	mmap_len = sizeof(uint32_t) + mmap_offset;
+
+	gem_set_domain(fd, w->bb_handle,
+		       I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
+
+	ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
+
+	*(uint32_t *)((char *)ptr + mmap_offset) = seqno;
+
+	munmap(ptr, mmap_len);
+}
+
+static void
+run_workload(unsigned int id, struct workload *wrk, unsigned int repeat,
+	     enum intel_engine_id (*balance)(struct workload *wrk,
+					     struct w_step *w), bool seqnos)
+{
+	struct timespec t_start, t_end;
+	struct w_step *w;
+	double t;
+	int i, j;
+
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+
+	srand(t_start.tv_nsec);
+
+	for (j = 0; j < repeat; j++) {
+		for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+			enum intel_engine_id engine = w->engine;
+			uint32_t seqno;
+			bool seqno_updated = false;
+			int do_sleep = 0;
+
+			if (i == 0)
+				clock_gettime(CLOCK_MONOTONIC,
+					      &wrk->repeat_start);
+
+			if (w->type == DELAY) {
+				do_sleep = w->wait;
+			} else if (w->type == PERIOD) {
+				struct timespec now;
+
+				clock_gettime(CLOCK_MONOTONIC, &now);
+				do_sleep = w->wait -
+					   elapsed_us(&wrk->repeat_start, &now);
+				if (do_sleep < 0) {
+					if (!quiet)
+						printf("%u: Dropped period @ %u/%u (%dus late)!\n",
+						       id, j, i, do_sleep);
+					continue;
+				}
+			} else if (w->type == SYNC) {
+				int s_idx = i + w->wait;
+
+				igt_assert(s_idx >= 0 && s_idx < i);
+				igt_assert(wrk->steps[s_idx].type == BATCH);
+				gem_sync(fd, wrk->steps[s_idx].obj[0].handle);
+				continue;
+			}
+
+			if (do_sleep) {
+				usleep(do_sleep);
+				continue;
+			}
+
+			wrk->nr_bb[engine]++;
+
+			if (engine == VCS && balance) {
+				engine = balance(wrk, w);
+				wrk->nr_bb[engine]++;
+
+				w->obj[1].handle = wrk->status_page_handle[engine];
+
+				w->eb.flags = eb_engine_map[engine];
+				w->eb.flags |= I915_EXEC_HANDLE_LUT;
+			}
+
+			seqno = ++wrk->seqno[engine];
+
+			if (w->duration.min != w->duration.max) {
+				unsigned int cur = get_duration(&w->duration);
+
+				if (cur != w->duration.cur) {
+					unterminate_bb(w, seqnos);
+					w->duration.cur = cur;
+					terminate_bb(w, seqnos, seqno);
+					seqno_updated = true;
+				}
+			}
+
+			if (seqnos && !seqno_updated)
+				update_bb_seqno(w, seqno);
+
+			gem_execbuf(fd, &w->eb);
+
+			if (w->wait)
+				gem_sync(fd, w->obj[0].handle);
+		}
+	}
+
+	clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+	t = elapsed(&t_start, &t_end);
+	if (!quiet && !balance)
+		printf("%u: %.3fs elapsed (%.3f workloads/s)\n", id, t, repeat / t);
+	if (!quiet && balance == rr_balance)
+		printf("%u: %.3fs elapsed (%.3f workloads/s). %lu (%lu + %lu) total VCS batches.\n",
+		       id, t, repeat / t,
+		       wrk->nr_bb[VCS], wrk->nr_bb[VCS1], wrk->nr_bb[VCS2]);
+	if (!quiet && balance == qd_balance)
+		printf("%u: %.3fs elapsed (%.3f workloads/s). %lu (%lu + %lu) total VCS batches. Average queue depths %.3f, %.3f.\n",
+		       id, t, repeat / t,
+		       wrk->nr_bb[VCS], wrk->nr_bb[VCS1], wrk->nr_bb[VCS2],
+		       (double)wrk->qd_sum[VCS1] / wrk->nr_bb[VCS],
+		       (double)wrk->qd_sum[VCS2] / wrk->nr_bb[VCS]);
+}
+
+static void fini_workload(struct workload *wrk)
+{
+	free(wrk->steps);
+	free(wrk);
+}
+
+static unsigned long calibrate_nop(unsigned int tolerance_pct)
+{
+	const uint32_t bbe = 0xa << 23;
+	unsigned int loops = 17;
+	unsigned int usecs = nop_calibration_us;
+	struct drm_i915_gem_exec_object2 obj = {};
+	struct drm_i915_gem_execbuffer2 eb =
+		{ .buffer_count = 1, .buffers_ptr = (uintptr_t)&obj};
+	long size, last_size;
+	struct timespec t_0, t_end;
+
+	clock_gettime(CLOCK_MONOTONIC, &t_0);
+
+	size = 256 * 1024;
+	do {
+		struct timespec t_start;
+
+		obj.handle = gem_create(fd, size);
+		gem_write(fd, obj.handle, size - sizeof(bbe), &bbe,
+			  sizeof(bbe));
+		gem_execbuf(fd, &eb);
+		gem_sync(fd, obj.handle);
+
+		clock_gettime(CLOCK_MONOTONIC, &t_start);
+		for (int loop = 0; loop < loops; loop++)
+			gem_execbuf(fd, &eb);
+		gem_sync(fd, obj.handle);
+		clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+		gem_close(fd, obj.handle);
+
+		last_size = size;
+		size = loops * size / elapsed(&t_start, &t_end) / 1e6 * usecs;
+		size = ALIGN(size, sizeof(uint32_t));
+	} while (elapsed(&t_0, &t_end) < 5 ||
+		 abs(size - last_size) > (size * tolerance_pct / 100));
+
+	return size / sizeof(uint32_t);
+}
+
+static void print_help(void)
+{
+	puts(
+"Usage: gem_wsim [OPTIONS]\n"
+"\n"
+"Runs a simulated workload on the GPU.\n"
+"When ran without arguments performs a GPU calibration result of which needs\n"
+"to be provided when running the simulation in subsequent invocations.\n"
+"\n"
+"Options:\n"
+"	-h		This text.\n"
+"	-q		Be quiet - do not output anything to stdout.\n"
+"	-n <n>		Nop calibration value.\n"
+"	-t <n>		Nop calibration tolerance percentage.\n"
+"			Use when there is a difficuly obtaining calibration\n"
+"			with the default settings.\n"
+"	-w <desc|path>	Filename or a workload descriptor.\n"
+"			Can be given multiple times.\n"
+"	-r <n>		How many times to emit the workload.\n"
+"	-c <n>		Fork n clients emitting the workload simultaneously.\n"
+"	-x		Swap VCS1 and VCS2 engines in every other client.\n"
+"	-s		Track batch sequence numbers.\n"
+"	-b <n>		Load balancing to use. (0: rr, 1: qd)\n"
+"\n"
+"Workload descriptor format:\n"
+"\n"
+"	ctx.engine.duration_us.dependency.wait,...\n"
+"	<uint>.<str>.<uint>[-<uint>].<int <= 0>.<0|1>,...\n"
+"	d|p.<uiny>,...\n"
+"\n"
+"	For duration a range can be given from which a random value will be\n"
+"	picked before every submit. Since this and seqno management requirea\n"
+"	CPU access to objects care needs to be taken in order to ensure the\n"
+"	submit queue is deep enough these operations do not affect the\n"
+"	execution speed unless that is desired.\n"
+"\n"
+"	Additional workload steps are also supported:\n"
+"	  * 'd' - adds a delay (in microseconds).\n"
+"	  * 'p' - adds a delay relative to the start of previous loop so that\n"
+"		  the each loop starts execution with a given period.\n"
+"	  * 's' - synchronises the pipeline to a batch relative to the step.\n"
+"\n"
+"	Engine ids: RCS, BCS, VCS, VCS1, VCS2, VECS\n"
+"\n"
+"Example:\n"
+"	1.VCS1.3000.0.1\n"
+"	1.RCS.500-1000.-1.0\n"
+"	d.1000\n"
+"	1.RCS.3700.0.0\n"
+"	1.RCS.1000.-2.0\n"
+"	1.VCS2.2300.-2.0\n"
+"	1.RCS.4700.-1.0\n"
+"	1.VCS2.600.-1.1\n"
+"	p.16000\n"
+"\n"
+"The above workload described in human language works like this:\n"
+"A batch is sent to the VCS1 engine which will be executing for 3ms on the\n"
+"GPU and userspace will wait until it is finished before proceeding.\n"
+"Now three batches are sent to RCS with durations of 0.5-1.5ms (random, 3.7ms\n"
+"and 1ms respectively. The first batch has a data dependency on the preceding\n"
+"VCS1 batch, and the last of the group depends on the first from the group.\n"
+"Now a 2.3ms batch is sent to VCS2, with a data dependency on the 3.7ms RCS\n"
+"batch, followed by a 4.7ms RCS batch with a data dependency on the 2.3ms\n"
+"VCS2 batch, and finally a 0.6ms VCS2 batch depending on the previous RCS one.\n"
+"The tool is then told to wait for the last one to complete before optionally\n"
+"starting the next iteration (-r).\n"
+"\n"
+"When workload descriptors are provided on the command line, commas must be\n"
+"used instead of newlines.\n"
+	);
+}
+
+static char *load_workload_descriptor(char *filename)
+{
+	struct stat sbuf;
+	char *buf;
+	int infd, ret, i;
+	ssize_t len;
+
+	ret = stat(filename, &sbuf);
+	if (ret || !S_ISREG(sbuf.st_mode))
+		return filename;
+
+	igt_assert(sbuf.st_size < 1024 * 1024); /* Just so. */
+	buf = malloc(sbuf.st_size);
+	igt_assert(buf);
+
+	infd = open(filename, O_RDONLY);
+	igt_assert(infd >= 0);
+	len = read(infd, buf, sbuf.st_size);
+	igt_assert(len == sbuf.st_size);
+	close(infd);
+
+	for (i = 0; i < len; i++) {
+		if (buf[i] == '\n')
+			buf[i] = ',';
+	}
+
+	len--;
+	while (buf[len] == ',')
+		buf[len--] = 0;
+
+	return buf;
+}
+
+static char **
+add_workload_arg(char **w_args, unsigned int nr_args, char *w_arg)
+{
+	w_args = realloc(w_args, sizeof(char *) * nr_args);
+	igt_assert(w_args);
+	w_args[nr_args - 1] = w_arg;
+
+	return w_args;
+}
+
+int main(int argc, char **argv)
+{
+	unsigned int repeat = 1;
+	unsigned int clients = 1;
+	bool seqnos = false;
+	bool swap_vcs = false;
+	struct timespec t_start, t_end;
+	struct workload **w, **wrk = NULL;
+	unsigned int nr_w_args = 0;
+	char **w_args = NULL;
+	unsigned int tolerance_pct = 1;
+	enum intel_engine_id (*balance)(struct workload *, struct w_step *) = NULL;
+	double t;
+	int i, c;
+
+	fd = drm_open_driver(DRIVER_INTEL);
+
+	while ((c = getopt(argc, argv, "c:n:r:qxw:t:sb:h")) != -1) {
+		switch (c) {
+		case 'w':
+			w_args = add_workload_arg(w_args, ++nr_w_args, optarg);
+			break;
+		case 'c':
+			clients = strtol(optarg, NULL, 0);
+			break;
+		case 't':
+			tolerance_pct = strtol(optarg, NULL, 0);
+			break;
+		case 'n':
+			nop_calibration = strtol(optarg, NULL, 0);
+			break;
+		case 'r':
+			repeat = strtol(optarg, NULL, 0);
+			break;
+		case 'q':
+			quiet = true;
+			break;
+		case 'x':
+			swap_vcs = true;
+			break;
+		case 's':
+			seqnos = true;
+			break;
+		case 'b':
+			switch (strtol(optarg, NULL, 0)) {
+			case 0:
+				balance = rr_balance;
+				break;
+			case 1:
+				balance = qd_balance;
+				break;
+			default:
+				if (!quiet)
+					fprintf(stderr,
+						"Unknown balancing mode '%s'!\n",
+						optarg);
+				return 1;
+			}
+			break;
+		case 'h':
+			print_help();
+			return 0;
+		default:
+			return 1;
+		}
+	}
+
+	if (!nop_calibration) {
+		if (!quiet)
+			printf("Calibrating nop delay with %u%% tolerance...\n",
+				tolerance_pct);
+		nop_calibration = calibrate_nop(tolerance_pct);
+		if (!quiet)
+			printf("Nop calibration for %uus delay is %lu.\n",
+			       nop_calibration_us, nop_calibration);
+
+		return 0;
+	}
+
+	if (!nr_w_args) {
+		if (!quiet)
+			fprintf(stderr, "No workload descriptor(s)!\n");
+		return 1;
+	}
+
+	if (nr_w_args > 1 && clients > 1) {
+		if (!quiet)
+			fprintf(stderr,
+				"Cloned clients cannot be combined with multiple workloads!\n");
+		return 1;
+	}
+
+	wrk = calloc(nr_w_args, sizeof(*wrk));
+	igt_assert(wrk);
+
+	for (i = 0; i < nr_w_args; i++) {
+		w_args[i] = load_workload_descriptor(w_args[i]);
+		if (!w_args[i]) {
+			if (!quiet)
+				fprintf(stderr,
+					"Failed to load workload descriptor %u!\n",
+					i);
+			return 1;
+		}
+
+		wrk[i] = parse_workload(w_args[i]);
+		if (!wrk[i]) {
+			if (!quiet)
+				fprintf(stderr,
+					"Failed to parse workload %u!\n", i);
+			return 1;
+		}
+	}
+
+	if (nr_w_args > 1)
+		clients = nr_w_args;
+
+	if (!quiet) {
+		printf("Using %lu nop calibration for %uus delay.\n",
+		       nop_calibration, nop_calibration_us);
+		printf("%u client%s.\n", clients, clients > 1 ? "s" : "");
+		if (swap_vcs)
+			printf("Swapping VCS rings between clients.\n");
+	}
+
+	if (balance && !seqnos) {
+		if (!quiet)
+			fprintf(stderr, "Seqnos are required for load-balancing!\n");
+		return 1;
+	}
+
+	w = calloc(clients, sizeof(struct workload *));
+	igt_assert(w);
+
+	for (i = 0; i < clients; i++) {
+		w[i] = clone_workload(wrk[nr_w_args > 1 ? i : 0]);
+		prepare_workload(w[i], swap_vcs && (i & 1), seqnos);
+	}
+
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+
+	igt_fork(child, clients)
+		run_workload(child, w[child], repeat, balance, seqnos);
+
+	igt_waitchildren();
+
+	clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+	t = elapsed(&t_start, &t_end);
+	if (!quiet)
+		printf("%.3fs elapsed (%.3f workloads/s)\n",
+		       t, clients * repeat / t);
+
+	for (i = 0; i < clients; i++)
+		fini_workload(w[i]);
+	free(w);
+	for (i = 0; i < nr_w_args; i++)
+		fini_workload(wrk[i]);
+	free(w_args);
+
+	return 0;
+}
diff --git a/benchmarks/wsim/workload1 b/benchmarks/wsim/workload1
new file mode 100644
index 000000000000..5f533d8e168b
--- /dev/null
+++ b/benchmarks/wsim/workload1
@@ -0,0 +1,7 @@
+1.VCS1.3000.0.1
+1.RCS.1000.-1.0
+1.RCS.3700.0.0
+1.RCS.1000.-2.0
+1.VCS2.2300.-2.0
+1.RCS.4700.-1.0
+1.VCS2.600.-1.1
diff --git a/benchmarks/wsim/workload2 b/benchmarks/wsim/workload2
new file mode 100644
index 000000000000..25a692032eae
--- /dev/null
+++ b/benchmarks/wsim/workload2
@@ -0,0 +1,7 @@
+1.VCS.3000.0.1
+1.RCS.1000.-1.0
+1.RCS.3700.0.0
+1.RCS.1000.-2.0
+1.VCS.2300.-2.0
+1.RCS.4700.-1.0
+1.VCS.600.-1.1
diff --git a/benchmarks/wsim/workload3 b/benchmarks/wsim/workload3
new file mode 100644
index 000000000000..bc9f6df52775
--- /dev/null
+++ b/benchmarks/wsim/workload3
@@ -0,0 +1,7 @@
+1.VCS.3000.0.0
+1.RCS.500-1500.-1.0
+0.RCS.3700.0.0
+1.RCS.1000.-2.0
+1.VCS.2300.-2.0
+2.RCS.4700.-1.0
+1.VCS.600.-1.0
diff --git a/benchmarks/wsim/workload4 b/benchmarks/wsim/workload4
new file mode 100644
index 000000000000..3e4720a6949c
--- /dev/null
+++ b/benchmarks/wsim/workload4
@@ -0,0 +1,8 @@
+1.VCS.3000.0.0
+1.RCS.500-1500.-1.0
+d.1000
+0.RCS.3700.0.0
+1.RCS.1000.-3.0
+1.VCS.2300.-2.0
+2.RCS.4700.-1.0
+1.VCS.600.-1.0
diff --git a/benchmarks/wsim/workload5 b/benchmarks/wsim/workload5
new file mode 100644
index 000000000000..65440a8264ef
--- /dev/null
+++ b/benchmarks/wsim/workload5
@@ -0,0 +1,8 @@
+1.VCS.3000.0.0
+1.RCS.500-1500.-1.0
+0.RCS.3700.0.0
+1.RCS.1000.-2.0
+1.VCS.2300.-2.0
+2.RCS.4700.-1.0
+1.VCS.600.-1.0
+p.16000
diff --git a/benchmarks/wsim/workload6 b/benchmarks/wsim/workload6
new file mode 100644
index 000000000000..d5b7141dfdd0
--- /dev/null
+++ b/benchmarks/wsim/workload6
@@ -0,0 +1,8 @@
+1.VCS.3000.0.0
+1.RCS.500-1500.-1.0
+s.-1
+0.RCS.3700.0.0
+1.RCS.1000.-3.0
+1.VCS.2300.-2.0
+2.RCS.4700.-1.0
+1.VCS.600.-1.0
-- 
2.9.3


* Re: [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-05 16:14   ` [PATCH i-g-t v3] " Tvrtko Ursulin
@ 2017-04-05 16:48     ` Chris Wilson
  2017-04-06  8:18       ` Tvrtko Ursulin
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Wilson @ 2017-04-05 16:48 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Wed, Apr 05, 2017 at 05:14:01PM +0100, Tvrtko Ursulin wrote:
> +static void
> +__emit_bb_end(struct w_step *w, bool terminate, bool seqnos, uint32_t seqno)
> +{
> +	const uint32_t bbe = 0xa << 23;
> +	unsigned long bb_sz = get_bb_sz(&w->duration);
> +	unsigned long mmap_start, cmd_offset, mmap_len;
> +	uint32_t *ptr, *cs;
> +
> +	mmap_len = (seqnos ? 5 : 1) * sizeof(uint32_t);
> +	cmd_offset = bb_sz - mmap_len;
> +	mmap_start = rounddown(cmd_offset, PAGE_SIZE);
> +	mmap_len += cmd_offset - mmap_start;
> +
> +	gem_set_domain(fd, w->bb_handle,
> +		       I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
> +
> +	ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
> +	cs = (uint32_t *)((char *)ptr + cmd_offset - mmap_start);
> +
> +	if (seqnos) {
> +		const int gen = intel_gen(intel_get_drm_devid(fd));
> +
> +		igt_assert(gen >= 8);
> +
> +		w->reloc.offset = bb_sz - 4 * sizeof(uint32_t);
> +		w->seqno_offset = bb_sz - 2 * sizeof(uint32_t);
> +
> +		*cs++ = terminate ? MI_STORE_DWORD_IMM : 0;
> +		*cs++ = 0;
> +		*cs++ = 0;
> +		*cs++ = seqno;
> +	}
> +
> +	*cs = terminate ? bbe : 0;
> +
> +	munmap(ptr, mmap_len);
> +}
> +
> +static void terminate_bb(struct w_step *w, bool seqnos, uint32_t seqno)
> +{
> +	__emit_bb_end(w, true, seqnos, seqno);
> +}
> +
> +static void unterminate_bb(struct w_step *w, bool seqnos)
> +{
> +	__emit_bb_end(w, false, seqnos, 0);
> +}
> +
> +static void
> +prepare_workload(struct workload *wrk, bool swap_vcs, bool seqnos)
> +{
> +	int max_ctx = -1;
> +	struct w_step *w;
> +	int i;
> +
> +	if (seqnos) {
> +		const unsigned int status_sz = sizeof(uint32_t);
> +
> +		for (i = 0; i < NUM_ENGINES; i++) {
> +			wrk->status_page_handle[i] = gem_create(fd, status_sz);

Need to set_cache_level(CACHED) for llc.

You can use one page for all engines. Just use a different cacheline
for each, for safety.
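
Roughly like this (sketch only, names invented; reusing the tool's global fd
and IGT's gem_set_caching()/gem_mmap__cpu() helpers):

	#define STATUS_CACHELINE 64

	static uint32_t status_handle;
	static uint32_t *status_page;

	static void init_status_page(void)
	{
		status_handle = gem_create(fd, 4096);
		/* Mark as cacheable so CPU reads are cheap on LLC parts. */
		gem_set_caching(fd, status_handle, I915_CACHING_CACHED);
		status_page = gem_mmap__cpu(fd, status_handle, 0, 4096,
					    PROT_READ);
	}

	static uint32_t read_status(enum intel_engine_id engine)
	{
		/* One cacheline per engine, as suggested above. */
		return status_page[engine * STATUS_CACHELINE / sizeof(uint32_t)];
	}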

> +			wrk->status_page[i] =
> +				gem_mmap__cpu(fd, wrk->status_page_handle[i],
> +					      0, status_sz, PROT_READ);
> +		}
> +	}
> +
> +	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
> +		if ((int)w->context > max_ctx) {
> +			int delta = w->context + 1 - wrk->nr_ctxs;
> +
> +			wrk->nr_ctxs += delta;
> +			wrk->ctx_id = realloc(wrk->ctx_id,
> +					      wrk->nr_ctxs * sizeof(uint32_t));
> +			memset(&wrk->ctx_id[wrk->nr_ctxs - delta], 0,
> +			       delta * sizeof(uint32_t));
> +
> +			max_ctx = w->context;
> +		}
> +
> +		if (!wrk->ctx_id[w->context]) {
> +			struct drm_i915_gem_context_create arg = {};
> +
> +			drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &arg);
> +			igt_assert(arg.ctx_id);
> +
> +			wrk->ctx_id[w->context] = arg.ctx_id;
> +		}
> +	}
> +
> +	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
> +		enum intel_engine_id engine = w->engine;
> +		unsigned int bb_i, j = 0;
> +
> +		if (w->type != BATCH)
> +			continue;
> +
> +		w->obj[j].handle = gem_create(fd, 4096);
> +		w->obj[j].flags = EXEC_OBJECT_WRITE;
> +		j++;
> +
> +		if (seqnos) {
> +			w->obj[j].handle = wrk->status_page_handle[engine];
> +			w->obj[j].flags = EXEC_OBJECT_WRITE;

The trick for sharing between engines is to not mark this as a WRITE.
Fun little lies.
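
In code terms, a sketch (helper name made up) of adding the shared status
page to the object list without declaring the write:

	static unsigned int add_status_obj(struct w_step *w, unsigned int j,
					   uint32_t status_handle)
	{
		w->obj[j].handle = status_handle;
		/* Deliberately no EXEC_OBJECT_WRITE, so batches on different
		 * engines do not serialise on the shared page.
		 */
		w->obj[j].flags = 0;

		return j + 1;
	}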

> +			j++;
> +		}
> +
> +		bb_i = j++;
> +		w->duration.cur = w->duration.max;
> +		w->bb_sz = get_bb_sz(&w->duration);
> +		w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
> +		terminate_bb(w, seqnos, 0);
> +		if (seqnos) {
> +			w->reloc.presumed_offset = -1;
> +			w->reloc.target_handle = 1;
> +			w->reloc.read_domains = I915_GEM_DOMAIN_INSTRUCTION;
> +			w->reloc.write_domain = I915_GEM_DOMAIN_INSTRUCTION;

Ugh. That's a magic w/a value for pipecontrols. Fortunately we don't want
to set write_domain here anyway.

> +		}
> +
> +		igt_assert(w->dependency <= 0);
> +		if (w->dependency) {
> +			int dep_idx = i + w->dependency;
> +
> +			igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
> +			igt_assert(wrk->steps[dep_idx].type == BATCH);
> +
> +			w->obj[j].handle = w->obj[bb_i].handle;
> +			bb_i = j;
> +			w->obj[j - 1].handle =
> +					wrk->steps[dep_idx].obj[0].handle;
> +			j++;
> +		}
> +
> +		if (seqnos) {
> +			w->obj[bb_i].relocs_ptr = to_user_pointer(&w->reloc);
> +			w->obj[bb_i].relocation_count = 1;
> +		}
> +
> +		w->eb.buffers_ptr = to_user_pointer(w->obj);
> +		w->eb.buffer_count = j;
> +		w->eb.rsvd1 = wrk->ctx_id[w->context];
> +
> +		if (swap_vcs && engine == VCS1)
> +			engine = VCS2;
> +		else if (swap_vcs && engine == VCS2)
> +			engine = VCS1;
> +		w->eb.flags = eb_engine_map[engine];
> +		w->eb.flags |= I915_EXEC_HANDLE_LUT;
> +		if (!seqnos)
> +			w->eb.flags |= I915_EXEC_NO_RELOC;

Doesn't look too hard to get the relocation right. Forcing relocations
between batches is probably a good one to check (just to say don't do
that)

> +#ifdef DEBUG
> +		printf("%u: %u:%x|%x|%x|%x %10lu flags=%llx bb=%x[%u] ctx[%u]=%u\n",
> +		       i, w->eb.buffer_count, w->obj[0].handle,
> +		       w->obj[1].handle, w->obj[2].handle, w->obj[3].handle,
> +		       w->bb_sz, w->eb.flags, w->bb_handle, bb_i,
> +		       w->context, wrk->ctx_id[w->context]);
> +#endif
> +	}
> +}
> +
> +static double elapsed(const struct timespec *start, const struct timespec *end)
> +{
> +	return (end->tv_sec - start->tv_sec) +
> +	       (end->tv_nsec - start->tv_nsec) / 1e9;
> +}
> +
> +static int elapsed_us(const struct timespec *start, const struct timespec *end)
> +{

return 1e6 * elapsed(); might as well use gcc for something!
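
I.e. a sketch of the simplification:

	static int elapsed_us(const struct timespec *start,
			      const struct timespec *end)
	{
		return 1e6 * elapsed(start, end);
	}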

> +	return (1e9 * (end->tv_sec - start->tv_sec) +
> +	       (end->tv_nsec - start->tv_nsec)) / 1e3;
> +}
> +
> +static enum intel_engine_id
> +rr_balance(struct workload *wrk, struct w_step *w)
> +{
> +	unsigned int engine;
> +
> +	if (wrk->vcs_rr)
> +		engine = VCS2;
> +	else
> +		engine = VCS1;
> +
> +	wrk->vcs_rr ^= 1;
> +
> +	return engine;
> +}
> +
> +static enum intel_engine_id
> +qd_balance(struct workload *wrk, struct w_step *w)
> +{
> +	unsigned long qd[NUM_ENGINES];
> +	enum intel_engine_id engine = w->engine;
> +
> +	igt_assert(engine == VCS);
> +
> +	qd[VCS1] = wrk->seqno[VCS1] - wrk->status_page[VCS1][0];
> +	wrk->qd_sum[VCS1] += qd[VCS1];
> +
> +	qd[VCS2] = wrk->seqno[VCS2] - wrk->status_page[VCS2][0];
> +	wrk->qd_sum[VCS2] += qd[VCS2];
> +
> +	if (qd[VCS1] < qd[VCS2]) {
> +		engine = VCS1;
> +		wrk->vcs_rr = 0;
> +	} else if (qd[VCS2] < qd[VCS1]) {
> +		engine = VCS2;
> +		wrk->vcs_rr = 1;
> +	} else {
> +		unsigned int vcs = wrk->vcs_rr ^ 1;
> +
> +		wrk->vcs_rr = vcs;
> +
> +		if (vcs == 0)
> +			engine = VCS1;
> +		else
> +			engine = VCS2;
> +	}

Hmm. Just thinking we don't even need hw to simulate a load-balancer,
but that would be boring!

> +// printf("qd_balance: 1:%lu 2:%lu rr:%u = %u\n", qd[VCS1], qd[VCS2], wrk->vcs_rr, engine);
> +
> +	return engine;
> +}
> +
> +static void update_bb_seqno(struct w_step *w, uint32_t seqno)
> +{
> +	unsigned long mmap_start, mmap_offset, mmap_len;
> +	void *ptr;
> +
> +	mmap_start = rounddown(w->seqno_offset, PAGE_SIZE);
> +	mmap_offset = w->seqno_offset - mmap_start;
> +	mmap_len = sizeof(uint32_t) + mmap_offset;
> +
> +	gem_set_domain(fd, w->bb_handle,
> +		       I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
> +
> +	ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
> +
> +	*(uint32_t *)((char *)ptr + mmap_offset) = seqno;

Uh oh. I hope this isn't called inside any loop. Note this is
unsynchronized to the gpu so I wonder what this is for.

> +
> +	munmap(ptr, mmap_len);
> +}
> +
> +static void
> +run_workload(unsigned int id, struct workload *wrk, unsigned int repeat,
> +	     enum intel_engine_id (*balance)(struct workload *wrk,
> +					     struct w_step *w), bool seqnos)
> +{
> +	struct timespec t_start, t_end;
> +	struct w_step *w;
> +	double t;
> +	int i, j;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &t_start);
> +
> +	srand(t_start.tv_nsec);
> +
> +	for (j = 0; j < repeat; j++) {
> +		for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
> +			enum intel_engine_id engine = w->engine;
> +			uint32_t seqno;
> +			bool seqno_updated = false;
> +			int do_sleep = 0;
> +
> +			if (i == 0)
> +				clock_gettime(CLOCK_MONOTONIC,
> +					      &wrk->repeat_start);
> +
> +			if (w->type == DELAY) {
> +				do_sleep = w->wait;
> +			} else if (w->type == PERIOD) {
> +				struct timespec now;
> +
> +				clock_gettime(CLOCK_MONOTONIC, &now);
> +				do_sleep = w->wait -
> +					   elapsed_us(&wrk->repeat_start, &now);
> +				if (do_sleep < 0) {
> +					if (!quiet) {
> +						printf("%u: Dropped period @ %u/%u (%dus late)!\n",
> +						       id, j, i, do_sleep);
> +						continue;
> +					}
> +				}
> +			} else if (w->type == SYNC) {
> +				unsigned int s_idx = i + w->wait;
> +
> +				igt_assert(i > 0 && i < wrk->nr_steps);
> +				igt_assert(wrk->steps[s_idx].type == BATCH);
> +				gem_sync(fd, wrk->steps[s_idx].obj[0].handle);
> +				continue;
> +			}
> +
> +			if (do_sleep) {
> +				usleep(do_sleep);
> +				continue;
> +			}
> +
> +			wrk->nr_bb[engine]++;
> +
> +			if (engine == VCS && balance) {
> +				engine = balance(wrk, w);
> +				wrk->nr_bb[engine]++;
> +
> +				w->obj[1].handle = wrk->status_page_handle[engine];
> +
> +				w->eb.flags = eb_engine_map[engine];
> +				w->eb.flags |= I915_EXEC_HANDLE_LUT;
> +			}
> +
> +			seqno = ++wrk->seqno[engine];
> +
> +			if (w->duration.min != w->duration.max) {
> +				unsigned int cur = get_duration(&w->duration);
> +
> +				if (cur != w->duration.cur) {
> +					unterminate_bb(w, seqnos);

Ah, you said this was for adjusting runlength of the batches. I suggest
using batch_start_offset to change the number of nops rather than
rewrite the batch.
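
A rough sketch of that idea (function name made up, building on the existing
__get_bb_sz() and on w->bb_sz having been allocated for duration.max):

	static void set_batch_duration(struct w_step *w, unsigned int duration_us)
	{
		unsigned long sz = __get_bb_sz(duration_us);

		igt_assert(sz <= w->bb_sz);

		/* Skip leading nops instead of rewriting the terminator. */
		w->eb.batch_start_offset =
			ALIGN(w->bb_sz - sz, 2 * sizeof(uint32_t));
	}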

I need to study this a bit more...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-05 16:48     ` Chris Wilson
@ 2017-04-06  8:18       ` Tvrtko Ursulin
  2017-04-06  8:55         ` Chris Wilson
  0 siblings, 1 reply; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-04-06  8:18 UTC (permalink / raw)
  To: Chris Wilson, Tvrtko Ursulin, Intel-gfx, Tvrtko Ursulin,
	Rogozhkin, Dmitry V


On 05/04/2017 17:48, Chris Wilson wrote:
> On Wed, Apr 05, 2017 at 05:14:01PM +0100, Tvrtko Ursulin wrote:
>> +static void
>> +__emit_bb_end(struct w_step *w, bool terminate, bool seqnos, uint32_t seqno)
>> +{
>> +	const uint32_t bbe = 0xa << 23;
>> +	unsigned long bb_sz = get_bb_sz(&w->duration);
>> +	unsigned long mmap_start, cmd_offset, mmap_len;
>> +	uint32_t *ptr, *cs;
>> +
>> +	mmap_len = (seqnos ? 5 : 1) * sizeof(uint32_t);
>> +	cmd_offset = bb_sz - mmap_len;
>> +	mmap_start = rounddown(cmd_offset, PAGE_SIZE);
>> +	mmap_len += cmd_offset - mmap_start;
>> +
>> +	gem_set_domain(fd, w->bb_handle,
>> +		       I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
>> +
>> +	ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
>> +	cs = (uint32_t *)((char *)ptr + cmd_offset - mmap_start);
>> +
>> +	if (seqnos) {
>> +		const int gen = intel_gen(intel_get_drm_devid(fd));
>> +
>> +		igt_assert(gen >= 8);
>> +
>> +		w->reloc.offset = bb_sz - 4 * sizeof(uint32_t);
>> +		w->seqno_offset = bb_sz - 2 * sizeof(uint32_t);
>> +
>> +		*cs++ = terminate ? MI_STORE_DWORD_IMM : 0;
>> +		*cs++ = 0;
>> +		*cs++ = 0;
>> +		*cs++ = seqno;
>> +	}
>> +
>> +	*cs = terminate ? bbe : 0;
>> +
>> +	munmap(ptr, mmap_len);
>> +}
>> +
>> +static void terminate_bb(struct w_step *w, bool seqnos, uint32_t seqno)
>> +{
>> +	__emit_bb_end(w, true, seqnos, seqno);
>> +}
>> +
>> +static void unterminate_bb(struct w_step *w, bool seqnos)
>> +{
>> +	__emit_bb_end(w, false, seqnos, 0);
>> +}
>> +
>> +static void
>> +prepare_workload(struct workload *wrk, bool swap_vcs, bool seqnos)
>> +{
>> +	int max_ctx = -1;
>> +	struct w_step *w;
>> +	int i;
>> +
>> +	if (seqnos) {
>> +		const unsigned int status_sz = sizeof(uint32_t);
>> +
>> +		for (i = 0; i < NUM_ENGINES; i++) {
>> +			wrk->status_page_handle[i] = gem_create(fd, status_sz);
>
> Need to set_cache_level(CACHED) for llc.
>
> You can use one page for all engines. Just use a different cacheline
> for each, for safety.
>
>> +			wrk->status_page[i] =
>> +				gem_mmap__cpu(fd, wrk->status_page_handle[i],
>> +					      0, status_sz, PROT_READ);
>> +		}
>> +	}
>> +
>> +	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
>> +		if ((int)w->context > max_ctx) {
>> +			int delta = w->context + 1 - wrk->nr_ctxs;
>> +
>> +			wrk->nr_ctxs += delta;
>> +			wrk->ctx_id = realloc(wrk->ctx_id,
>> +					      wrk->nr_ctxs * sizeof(uint32_t));
>> +			memset(&wrk->ctx_id[wrk->nr_ctxs - delta], 0,
>> +			       delta * sizeof(uint32_t));
>> +
>> +			max_ctx = w->context;
>> +		}
>> +
>> +		if (!wrk->ctx_id[w->context]) {
>> +			struct drm_i915_gem_context_create arg = {};
>> +
>> +			drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &arg);
>> +			igt_assert(arg.ctx_id);
>> +
>> +			wrk->ctx_id[w->context] = arg.ctx_id;
>> +		}
>> +	}
>> +
>> +	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
>> +		enum intel_engine_id engine = w->engine;
>> +		unsigned int bb_i, j = 0;
>> +
>> +		if (w->type != BATCH)
>> +			continue;
>> +
>> +		w->obj[j].handle = gem_create(fd, 4096);
>> +		w->obj[j].flags = EXEC_OBJECT_WRITE;
>> +		j++;
>> +
>> +		if (seqnos) {
>> +			w->obj[j].handle = wrk->status_page_handle[engine];
>> +			w->obj[j].flags = EXEC_OBJECT_WRITE;
>
> The trick for sharing between engines is to not mark this as a WRITE.
> Fun little lies.

Yeah, that's why I have per-engine objects. Which I don't mind since it is
not like they are wasting any resources compared to everything else. But
not admitting the write still sounds interesting. What would the
repercussions of that be - limit us to llc platforms or something?

>> +			j++;
>> +		}
>> +
>> +		bb_i = j++;
>> +		w->duration.cur = w->duration.max;
>> +		w->bb_sz = get_bb_sz(&w->duration);
>> +		w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
>> +		terminate_bb(w, seqnos, 0);
>> +		if (seqnos) {
>> +			w->reloc.presumed_offset = -1;
>> +			w->reloc.target_handle = 1;
>> +			w->reloc.read_domains = I915_GEM_DOMAIN_INSTRUCTION;
>> +			w->reloc.write_domain = I915_GEM_DOMAIN_INSTRUCTION;
>
> Ugh. That's a magic w/a value for pipecontrols. Fortunately we don't want
> to set write_domain here anyway.

I think I copy-pasted this from another IGT. So you say cheat here as 
well and set zero for both domains?

>
>> +		}
>> +
>> +		igt_assert(w->dependency <= 0);
>> +		if (w->dependency) {
>> +			int dep_idx = i + w->dependency;
>> +
>> +			igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
>> +			igt_assert(wrk->steps[dep_idx].type == BATCH);
>> +
>> +			w->obj[j].handle = w->obj[bb_i].handle;
>> +			bb_i = j;
>> +			w->obj[j - 1].handle =
>> +					wrk->steps[dep_idx].obj[0].handle;
>> +			j++;
>> +		}
>> +
>> +		if (seqnos) {
>> +			w->obj[bb_i].relocs_ptr = to_user_pointer(&w->reloc);
>> +			w->obj[bb_i].relocation_count = 1;
>> +		}
>> +
>> +		w->eb.buffers_ptr = to_user_pointer(w->obj);
>> +		w->eb.buffer_count = j;
>> +		w->eb.rsvd1 = wrk->ctx_id[w->context];
>> +
>> +		if (swap_vcs && engine == VCS1)
>> +			engine = VCS2;
>> +		else if (swap_vcs && engine == VCS2)
>> +			engine = VCS1;
>> +		w->eb.flags = eb_engine_map[engine];
>> +		w->eb.flags |= I915_EXEC_HANDLE_LUT;
>> +		if (!seqnos)
>> +			w->eb.flags |= I915_EXEC_NO_RELOC;
>
> Doesn't look too hard to get the relocation right. Forcing relocations
> between batches is probably a good one to check (just to say don't do
> that)

I am not following here? You are saying don't do relocations at all? How 
do I make sure things stay fixed and even how to find out where they are 
in the first pass?

>> +#ifdef DEBUG
>> +		printf("%u: %u:%x|%x|%x|%x %10lu flags=%llx bb=%x[%u] ctx[%u]=%u\n",
>> +		       i, w->eb.buffer_count, w->obj[0].handle,
>> +		       w->obj[1].handle, w->obj[2].handle, w->obj[3].handle,
>> +		       w->bb_sz, w->eb.flags, w->bb_handle, bb_i,
>> +		       w->context, wrk->ctx_id[w->context]);
>> +#endif
>> +	}
>> +}
>> +
>> +static double elapsed(const struct timespec *start, const struct timespec *end)
>> +{
>> +	return (end->tv_sec - start->tv_sec) +
>> +	       (end->tv_nsec - start->tv_nsec) / 1e9;
>> +}
>> +
>> +static int elapsed_us(const struct timespec *start, const struct timespec *end)
>> +{
>
> return 1e6 * elapsed(); might as well use gcc for something!
>
>> +	return (1e9 * (end->tv_sec - start->tv_sec) +
>> +	       (end->tv_nsec - start->tv_nsec)) / 1e3;
>> +}
>> +
>> +static enum intel_engine_id
>> +rr_balance(struct workload *wrk, struct w_step *w)
>> +{
>> +	unsigned int engine;
>> +
>> +	if (wrk->vcs_rr)
>> +		engine = VCS2;
>> +	else
>> +		engine = VCS1;
>> +
>> +	wrk->vcs_rr ^= 1;
>> +
>> +	return engine;
>> +}
>> +
>> +static enum intel_engine_id
>> +qd_balance(struct workload *wrk, struct w_step *w)
>> +{
>> +	unsigned long qd[NUM_ENGINES];
>> +	enum intel_engine_id engine = w->engine;
>> +
>> +	igt_assert(engine == VCS);
>> +
>> +	qd[VCS1] = wrk->seqno[VCS1] - wrk->status_page[VCS1][0];
>> +	wrk->qd_sum[VCS1] += qd[VCS1];
>> +
>> +	qd[VCS2] = wrk->seqno[VCS2] - wrk->status_page[VCS2][0];
>> +	wrk->qd_sum[VCS2] += qd[VCS2];
>> +
>> +	if (qd[VCS1] < qd[VCS2]) {
>> +		engine = VCS1;
>> +		wrk->vcs_rr = 0;
>> +	} else if (qd[VCS2] < qd[VCS1]) {
>> +		engine = VCS2;
>> +		wrk->vcs_rr = 1;
>> +	} else {
>> +		unsigned int vcs = wrk->vcs_rr ^ 1;
>> +
>> +		wrk->vcs_rr = vcs;
>> +
>> +		if (vcs == 0)
>> +			engine = VCS1;
>> +		else
>> +			engine = VCS2;
>> +	}
>
> Hmm. Just thinking we don't even need hw to simulate a load-balancer,
> but that would be boring!

Definitely not as exciting. :) Perhaps there will be other benefits from 
this tool than the load balancing, but we'll see.

>> +// printf("qd_balance: 1:%lu 2:%lu rr:%u = %u\n", qd[VCS1], qd[VCS2], wrk->vcs_rr, engine);
>> +
>> +	return engine;
>> +}
>> +
>> +static void update_bb_seqno(struct w_step *w, uint32_t seqno)
>> +{
>> +	unsigned long mmap_start, mmap_offset, mmap_len;
>> +	void *ptr;
>> +
>> +	mmap_start = rounddown(w->seqno_offset, PAGE_SIZE);
>> +	mmap_offset = w->seqno_offset - mmap_start;
>> +	mmap_len = sizeof(uint32_t) + mmap_offset;
>> +
>> +	gem_set_domain(fd, w->bb_handle,
>> +		       I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
>> +
>> +	ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
>> +
>> +	*(uint32_t *)((char *)ptr + mmap_offset) = seqno;
>
> Uh oh. I hope this isn't called inside any loop. Note this is
> unsynchronized to the gpu so I wonder what this is for.

To update the seqno inside the store_dword_imm. It is called every time 
before a batch is executed so I was thinking whether a gem_sync should 
be preceding it. But then I was thinking it is problematic in general if 
we queue up multiple same batches before they get executed. :( Sounds 
like I would need a separate batch for every iteration for this to work 
correctly. But that sounds too costly. So I don't know at the moment.

>> +
>> +	munmap(ptr, mmap_len);
>> +}
>> +
>> +static void
>> +run_workload(unsigned int id, struct workload *wrk, unsigned int repeat,
>> +	     enum intel_engine_id (*balance)(struct workload *wrk,
>> +					     struct w_step *w), bool seqnos)
>> +{
>> +	struct timespec t_start, t_end;
>> +	struct w_step *w;
>> +	double t;
>> +	int i, j;
>> +
>> +	clock_gettime(CLOCK_MONOTONIC, &t_start);
>> +
>> +	srand(t_start.tv_nsec);
>> +
>> +	for (j = 0; j < repeat; j++) {
>> +		for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
>> +			enum intel_engine_id engine = w->engine;
>> +			uint32_t seqno;
>> +			bool seqno_updated = false;
>> +			int do_sleep = 0;
>> +
>> +			if (i == 0)
>> +				clock_gettime(CLOCK_MONOTONIC,
>> +					      &wrk->repeat_start);
>> +
>> +			if (w->type == DELAY) {
>> +				do_sleep = w->wait;
>> +			} else if (w->type == PERIOD) {
>> +				struct timespec now;
>> +
>> +				clock_gettime(CLOCK_MONOTONIC, &now);
>> +				do_sleep = w->wait -
>> +					   elapsed_us(&wrk->repeat_start, &now);
>> +				if (do_sleep < 0) {
>> +					if (!quiet) {
>> +						printf("%u: Dropped period @ %u/%u (%dus late)!\n",
>> +						       id, j, i, do_sleep);
>> +						continue;
>> +					}
>> +				}
>> +			} else if (w->type == SYNC) {
>> +				unsigned int s_idx = i + w->wait;
>> +
>> +				igt_assert(i > 0 && i < wrk->nr_steps);
>> +				igt_assert(wrk->steps[s_idx].type == BATCH);
>> +				gem_sync(fd, wrk->steps[s_idx].obj[0].handle);
>> +				continue;
>> +			}
>> +
>> +			if (do_sleep) {
>> +				usleep(do_sleep);
>> +				continue;
>> +			}
>> +
>> +			wrk->nr_bb[engine]++;
>> +
>> +			if (engine == VCS && balance) {
>> +				engine = balance(wrk, w);
>> +				wrk->nr_bb[engine]++;
>> +
>> +				w->obj[1].handle = wrk->status_page_handle[engine];
>> +
>> +				w->eb.flags = eb_engine_map[engine];
>> +				w->eb.flags |= I915_EXEC_HANDLE_LUT;
>> +			}
>> +
>> +			seqno = ++wrk->seqno[engine];
>> +
>> +			if (w->duration.min != w->duration.max) {
>> +				unsigned int cur = get_duration(&w->duration);
>> +
>> +				if (cur != w->duration.cur) {
>> +					unterminate_bb(w, seqnos);
>
> Ah, you said this was for adjusting runlength of the batches. I suggest
> using batch_start_offset to change the number of nops rather than
> rewrite the batch.

Yeah thanks for that suggestion, I did not think of that.

> I need to study this a bit more...

Yes please, especially the bit about how to get accurate seqnos written 
out in each step without needing separate execbuf batches.

I've heard recursive batches mentioned in the past so maybe each 
iteration could have its own small batch which would jump to the 
nop/delay one (shared between all iterations) and write the unique 
seqno. No idea if that is possible/supported at the moment - I'll go and 
dig a bit.

Regards,

Tvrtko

* Re: [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-06  8:18       ` Tvrtko Ursulin
@ 2017-04-06  8:55         ` Chris Wilson
  2017-04-07  8:53           ` Tvrtko Ursulin
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Wilson @ 2017-04-06  8:55 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Thu, Apr 06, 2017 at 09:18:36AM +0100, Tvrtko Ursulin wrote:
> 
> On 05/04/2017 17:48, Chris Wilson wrote:
> >On Wed, Apr 05, 2017 at 05:14:01PM +0100, Tvrtko Ursulin wrote:
> >>+static void
> >>+__emit_bb_end(struct w_step *w, bool terminate, bool seqnos, uint32_t seqno)
> >>+{
> >>+	const uint32_t bbe = 0xa << 23;
> >>+	unsigned long bb_sz = get_bb_sz(&w->duration);
> >>+	unsigned long mmap_start, cmd_offset, mmap_len;
> >>+	uint32_t *ptr, *cs;
> >>+
> >>+	mmap_len = (seqnos ? 5 : 1) * sizeof(uint32_t);
> >>+	cmd_offset = bb_sz - mmap_len;
> >>+	mmap_start = rounddown(cmd_offset, PAGE_SIZE);
> >>+	mmap_len += cmd_offset - mmap_start;
> >>+
> >>+	gem_set_domain(fd, w->bb_handle,
> >>+		       I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
> >>+
> >>+	ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
> >>+	cs = (uint32_t *)((char *)ptr + cmd_offset - mmap_start);
> >>+
> >>+	if (seqnos) {
> >>+		const int gen = intel_gen(intel_get_drm_devid(fd));
> >>+
> >>+		igt_assert(gen >= 8);
> >>+
> >>+		w->reloc.offset = bb_sz - 4 * sizeof(uint32_t);
> >>+		w->seqno_offset = bb_sz - 2 * sizeof(uint32_t);
> >>+
> >>+		*cs++ = terminate ? MI_STORE_DWORD_IMM : 0;
> >>+		*cs++ = 0;
> >>+		*cs++ = 0;
> >>+		*cs++ = seqno;
> >>+	}
> >>+
> >>+	*cs = terminate ? bbe : 0;
> >>+
> >>+	munmap(ptr, mmap_len);
> >>+}
> >>+
> >>+static void terminate_bb(struct w_step *w, bool seqnos, uint32_t seqno)
> >>+{
> >>+	__emit_bb_end(w, true, seqnos, seqno);
> >>+}
> >>+
> >>+static void unterminate_bb(struct w_step *w, bool seqnos)
> >>+{
> >>+	__emit_bb_end(w, false, seqnos, 0);
> >>+}
> >>+
> >>+static void
> >>+prepare_workload(struct workload *wrk, bool swap_vcs, bool seqnos)
> >>+{
> >>+	int max_ctx = -1;
> >>+	struct w_step *w;
> >>+	int i;
> >>+
> >>+	if (seqnos) {
> >>+		const unsigned int status_sz = sizeof(uint32_t);
> >>+
> >>+		for (i = 0; i < NUM_ENGINES; i++) {
> >>+			wrk->status_page_handle[i] = gem_create(fd, status_sz);
> >
> >Need to set_cache_level(CACHED) for llc.
> >
> >You can use one page for all engines. Just use a different cacheline
> >for each, for safety.
> >
> >>+			wrk->status_page[i] =
> >>+				gem_mmap__cpu(fd, wrk->status_page_handle[i],
> >>+					      0, status_sz, PROT_READ);
> >>+		}
> >>+	}
> >>+
> >>+	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
> >>+		if ((int)w->context > max_ctx) {
> >>+			int delta = w->context + 1 - wrk->nr_ctxs;
> >>+
> >>+			wrk->nr_ctxs += delta;
> >>+			wrk->ctx_id = realloc(wrk->ctx_id,
> >>+					      wrk->nr_ctxs * sizeof(uint32_t));
> >>+			memset(&wrk->ctx_id[wrk->nr_ctxs - delta], 0,
> >>+			       delta * sizeof(uint32_t));
> >>+
> >>+			max_ctx = w->context;
> >>+		}
> >>+
> >>+		if (!wrk->ctx_id[w->context]) {
> >>+			struct drm_i915_gem_context_create arg = {};
> >>+
> >>+			drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &arg);
> >>+			igt_assert(arg.ctx_id);
> >>+
> >>+			wrk->ctx_id[w->context] = arg.ctx_id;
> >>+		}
> >>+	}
> >>+
> >>+	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
> >>+		enum intel_engine_id engine = w->engine;
> >>+		unsigned int bb_i, j = 0;
> >>+
> >>+		if (w->type != BATCH)
> >>+			continue;
> >>+
> >>+		w->obj[j].handle = gem_create(fd, 4096);
> >>+		w->obj[j].flags = EXEC_OBJECT_WRITE;
> >>+		j++;
> >>+
> >>+		if (seqnos) {
> >>+			w->obj[j].handle = wrk->status_page_handle[engine];
> >>+			w->obj[j].flags = EXEC_OBJECT_WRITE;
> >
> >The trick for sharing between engines is to not mark this as a WRITE.
> >Fun little lies.
> 
> Yeah, that's why I have per-engine objects. Which I don't mind since
> it is not like they are wasting any resources compared to everything
> else. But not admitting the write still sounds interesting. What
> would the repercussions of that be - limit us to llc platforms or
> something?

It used to be that if we evicted the object (e.g. mempressure/suspend),
then we would not save the contents since it was never marked as dirty.
However, between libva being buggy and Vk deliberately eschewing write
hazards, we had to always mark GPU usage as dirtying the buffers. So
nowadays, EXEC_OBJECT_WRITE only means to track the implicit write
hazard.

> 
> >>+			j++;
> >>+		}
> >>+
> >>+		bb_i = j++;
> >>+		w->duration.cur = w->duration.max;
> >>+		w->bb_sz = get_bb_sz(&w->duration);
> >>+		w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
> >>+		terminate_bb(w, seqnos, 0);
> >>+		if (seqnos) {
> >>+			w->reloc.presumed_offset = -1;
> >>+			w->reloc.target_handle = 1;
> >>+			w->reloc.read_domains = I915_GEM_DOMAIN_INSTRUCTION;
> >>+			w->reloc.write_domain = I915_GEM_DOMAIN_INSTRUCTION;
> >
> >Ugh. That's a magic w/a value for pipecontrols. Fortunately we don't want
> >to set write_domain here anyway.
> 
> I think I copy-pasted this from another IGT. So you say cheat here
> as well and set zero for both domains?

Technically the MI is outside of all the GPU cache domains we have :)
Which you pick is immaterial, aside from understanding that
(INSTRUCTION, INSTRUCTION) is special ;)

If you were to drop EXEC_OBJECT_WRITE, you would also drop
reloc.write_domain.
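
So the relocation setup could be reduced to something like (sketch):

	static void init_seqno_reloc(struct w_step *w)
	{
		w->reloc.presumed_offset = -1;
		w->reloc.target_handle = 1; /* LUT index of the status page */
		w->reloc.read_domains = I915_GEM_DOMAIN_INSTRUCTION;
		w->reloc.write_domain = 0; /* dropped with EXEC_OBJECT_WRITE */
	}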

> >>+		}
> >>+
> >>+		igt_assert(w->dependency <= 0);
> >>+		if (w->dependency) {
> >>+			int dep_idx = i + w->dependency;
> >>+
> >>+			igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
> >>+			igt_assert(wrk->steps[dep_idx].type == BATCH);
> >>+
> >>+			w->obj[j].handle = w->obj[bb_i].handle;
> >>+			bb_i = j;
> >>+			w->obj[j - 1].handle =
> >>+					wrk->steps[dep_idx].obj[0].handle;
> >>+			j++;
> >>+		}
> >>+
> >>+		if (seqnos) {
> >>+			w->obj[bb_i].relocs_ptr = to_user_pointer(&w->reloc);
> >>+			w->obj[bb_i].relocation_count = 1;
> >>+		}
> >>+
> >>+		w->eb.buffers_ptr = to_user_pointer(w->obj);
> >>+		w->eb.buffer_count = j;
> >>+		w->eb.rsvd1 = wrk->ctx_id[w->context];
> >>+
> >>+		if (swap_vcs && engine == VCS1)
> >>+			engine = VCS2;
> >>+		else if (swap_vcs && engine == VCS2)
> >>+			engine = VCS1;
> >>+		w->eb.flags = eb_engine_map[engine];
> >>+		w->eb.flags |= I915_EXEC_HANDLE_LUT;
> >>+		if (!seqnos)
> >>+			w->eb.flags |= I915_EXEC_NO_RELOC;
> >
> >Doesn't look too hard to get the relocation right. Forcing relocations
> >between batches is probably a good one to check (just to say don't do
> >that)
> 
> I am not following here? You are saying don't do relocations at all?
> How do I make sure things stay fixed and even how to find out where
> they are in the first pass?

Depending on the workload, it may be informative to also do comparisons
between NORELOC and always RELOC. Personally I would make sure we were
using NORELOC as this should be a simulator/example.
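
As a sketch of the usual NORELOC flow (assuming the status page stays at
obj[1] and nothing forces the kernel to move it):

	static void submit_with_noreloc(struct w_step *w)
	{
		/* First submission processes the relocation and writes the
		 * assigned offsets back into the exec objects.
		 */
		gem_execbuf(fd, &w->eb);

		/* The presumed offset is now known, so relocation processing
		 * can be skipped on subsequent submissions.
		 */
		w->reloc.presumed_offset = w->obj[1].offset;
		w->eb.flags |= I915_EXEC_NO_RELOC;

		gem_execbuf(fd, &w->eb);
	}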

> >>+static void update_bb_seqno(struct w_step *w, uint32_t seqno)
> >>+{
> >>+	unsigned long mmap_start, mmap_offset, mmap_len;
> >>+	void *ptr;
> >>+
> >>+	mmap_start = rounddown(w->seqno_offset, PAGE_SIZE);
> >>+	mmap_offset = w->seqno_offset - mmap_start;
> >>+	mmap_len = sizeof(uint32_t) + mmap_offset;
> >>+
> >>+	gem_set_domain(fd, w->bb_handle,
> >>+		       I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
> >>+
> >>+	ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
> >>+
> >>+	*(uint32_t *)((char *)ptr + mmap_offset) = seqno;
> >
> >Uh oh. I hope this isn't called inside any loop. Note this is
> >unsynchronized to the gpu so I wonder what this is for.
> 
> To update the seqno inside the store_dword_imm. It is called every
> time before a batch is executed so I was thinking whether a gem_sync
> should be preceding it. But then I was thinking it is problematic in
> general if we queue up multiple same batches before they get
> executed. :( Sounds like I would need a separate batch for every
> iteration for this to work correctly. But that sounds too costly. So
> I don't know at the moment.

mmap/munmap, especially munmap, is not free. The munmap will do a
tlb_flush across all cores -- though maybe that's batched and the
munmaps I do all tend to be large enough to trigger every time.

Since you are using a CPU write, on !llc this will be clflushing every
time. I would suggest stashing the gem_mmap__wc for updating the
seqno between repeats.
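
For example (sketch; bb_ptr/seqno_ptr would be new fields in struct w_step):

	static void stash_seqno_ptr(struct w_step *w)
	{
		/* Map the batch once with a WC mapping kept for the whole run. */
		w->bb_ptr = gem_mmap__wc(fd, w->bb_handle, 0, w->bb_sz,
					 PROT_WRITE);
		w->seqno_ptr = (uint32_t *)((char *)w->bb_ptr + w->seqno_offset);
	}

	static void write_seqno(struct w_step *w, uint32_t seqno)
	{
		*w->seqno_ptr = seqno; /* no mmap/munmap per repeat */
	}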

[snip]

> >I need to study this a bit more...
> 
> Yes please, especially the bit about how to get accurate seqnos
> written out in each step without needing separate execbuf batches.
> 
> I've heard recursive batches mentioned in the past so maybe each
> iteration could have it's own small batch which would jump to the
> nop/delay one (shared between all iterations) and write the unique
> seqno. No idea if that is possible/supported at the moment - I'll go
> and dig a bit.

You end up with the same problem of having the reloc change and need to
update every cycle. You could use a fresh batch to rewrite the seqno
values... However, now that you explained what you want, just keep the
WC mmap.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-06  8:55         ` Chris Wilson
@ 2017-04-07  8:53           ` Tvrtko Ursulin
  2017-04-07  9:51             ` Chris Wilson
  0 siblings, 1 reply; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-04-07  8:53 UTC (permalink / raw)
  To: Chris Wilson, Tvrtko Ursulin, Intel-gfx, Tvrtko Ursulin,
	Rogozhkin, Dmitry V


On 06/04/2017 09:55, Chris Wilson wrote:
> On Thu, Apr 06, 2017 at 09:18:36AM +0100, Tvrtko Ursulin wrote:

[snip]

>>>> +			j++;
>>>> +		}
>>>> +
>>>> +		bb_i = j++;
>>>> +		w->duration.cur = w->duration.max;
>>>> +		w->bb_sz = get_bb_sz(&w->duration);
>>>> +		w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
>>>> +		terminate_bb(w, seqnos, 0);
>>>> +		if (seqnos) {
>>>> +			w->reloc.presumed_offset = -1;
>>>> +			w->reloc.target_handle = 1;
>>>> +			w->reloc.read_domains = I915_GEM_DOMAIN_INSTRUCTION;
>>>> +			w->reloc.write_domain = I915_GEM_DOMAIN_INSTRUCTION;
>>>
>>> Ugh. That's a magic w/a value for pipecontrols. Fortunately we don't want
>>> to set write_domain here anyway.
>>
>> I think I copy-pasted this from another IGT. So you say cheat here
>> as well and set zero for both domains?
>
> Technically the MI is outside of all the GPU cache domains we have :)
> Which you pick is immaterial, aside from understanding that
> (INSTRUCTION, INSTRUCTION) is special ;)
>
> If you were to drop EXEC_OBJECT_WRITE, you would also drop
> reloc.write_domain.

Okay, I will try the cheating approach then.

>>>> +		}
>>>> +
>>>> +		igt_assert(w->dependency <= 0);
>>>> +		if (w->dependency) {
>>>> +			int dep_idx = i + w->dependency;
>>>> +
>>>> +			igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
>>>> +			igt_assert(wrk->steps[dep_idx].type == BATCH);
>>>> +
>>>> +			w->obj[j].handle = w->obj[bb_i].handle;
>>>> +			bb_i = j;
>>>> +			w->obj[j - 1].handle =
>>>> +					wrk->steps[dep_idx].obj[0].handle;
>>>> +			j++;
>>>> +		}
>>>> +
>>>> +		if (seqnos) {
>>>> +			w->obj[bb_i].relocs_ptr = to_user_pointer(&w->reloc);
>>>> +			w->obj[bb_i].relocation_count = 1;
>>>> +		}
>>>> +
>>>> +		w->eb.buffers_ptr = to_user_pointer(w->obj);
>>>> +		w->eb.buffer_count = j;
>>>> +		w->eb.rsvd1 = wrk->ctx_id[w->context];
>>>> +
>>>> +		if (swap_vcs && engine == VCS1)
>>>> +			engine = VCS2;
>>>> +		else if (swap_vcs && engine == VCS2)
>>>> +			engine = VCS1;
>>>> +		w->eb.flags = eb_engine_map[engine];
>>>> +		w->eb.flags |= I915_EXEC_HANDLE_LUT;
>>>> +		if (!seqnos)
>>>> +			w->eb.flags |= I915_EXEC_NO_RELOC;
>>>
>>> Doesn't look too hard to get the relocation right. Forcing relocations
>>> between batches is probably a good one to check (just to say don't do
>>> that)
>>
>> I am not following here? You are saying don't do relocations at all?
>> How do I make sure things stay fixed and even how to find out where
>> they are in the first pass?
>
> Depending on the workload, it may be informative to also do comparisons
> between NORELOC and always RELOC. Personally I would make sure we were
> using NORELOC as this should be a simulator/example.

How do I use NORELOC? I mean, I have to know where the objects will be 
pinned, or be able to pin them first and know they will remain put. What 
am I not understanding here?

>>>> +static void update_bb_seqno(struct w_step *w, uint32_t seqno)
>>>> +{
>>>> +	unsigned long mmap_start, mmap_offset, mmap_len;
>>>> +	void *ptr;
>>>> +
>>>> +	mmap_start = rounddown(w->seqno_offset, PAGE_SIZE);
>>>> +	mmap_offset = w->seqno_offset - mmap_start;
>>>> +	mmap_len = sizeof(uint32_t) + mmap_offset;
>>>> +
>>>> +	gem_set_domain(fd, w->bb_handle,
>>>> +		       I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
>>>> +
>>>> +	ptr = gem_mmap__cpu(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
>>>> +
>>>> +	*(uint32_t *)((char *)ptr + mmap_offset) = seqno;
>>>
>>> Uh oh. I hope this isn't called inside any loop. Note this is
>>> unsynchronized to the gpu so I wonder what this is for.
>>
>> To update the seqno inside the store_dword_imm. It is called every
>> time before a batch is executed so I was thinking whether a gem_sync
>> should be preceding it. But then I was thinking it is problematic in
>> general if we queue up multiple same batches before they get
>> executed. :( Sounds like I would need a separate batch for every
>> iteration for this to work correctly. But that sounds too costly. So
>> I don't know at the moment.
>
> mmap/munmap, especially munmap, is not free. The munmap will do a
> tlb_flush across all cores -- though maybe that's batched and the
> munmaps I do all tend to be large enough to trigger every time.
>
> Since you are using a CPU write, on !llc this will be clflushing
> every time. I would suggest stashing the gem_mmap__wc for updating the
> seqno between repeats.

Ok, I can try that approach.

> [snip]
>
>>> I need to study this a bit more...
>>
>> Yes please, especially the bit about how to get accurate seqnos
>> written out in each step without needing separate execbuf batches.
>>
>> I've heard recursive batches mentioned in the past so maybe each
>> iteration could have its own small batch which would jump to the
>> nop/delay one (shared between all iterations) and write the unique
>> seqno. No idea if that is possible/supported at the moment - I'll go
>> and dig a bit.
>
> You end up with the same problem of having the reloc change and need to
> update every cycle. You could use a fresh batch to rewrite the seqno
> values... However, now that you explained what you want, just keep the
> WC mmap.

Hm not sure without researching that approach first.

But in general is this correctly implementing your idea for queue depth 
estimation?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH i-g-t v3] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-07  8:53           ` Tvrtko Ursulin
@ 2017-04-07  9:51             ` Chris Wilson
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Wilson @ 2017-04-07  9:51 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Fri, Apr 07, 2017 at 09:53:05AM +0100, Tvrtko Ursulin wrote:
> 
> On 06/04/2017 09:55, Chris Wilson wrote:
> >On Thu, Apr 06, 2017 at 09:18:36AM +0100, Tvrtko Ursulin wrote:
> 
> [snip]
[snip]

> >>>>+		if (swap_vcs && engine == VCS1)
> >>>>+			engine = VCS2;
> >>>>+		else if (swap_vcs && engine == VCS2)
> >>>>+			engine = VCS1;
> >>>>+		w->eb.flags = eb_engine_map[engine];
> >>>>+		w->eb.flags |= I915_EXEC_HANDLE_LUT;
> >>>>+		if (!seqnos)
> >>>>+			w->eb.flags |= I915_EXEC_NO_RELOC;
> >>>
> >>>Doesn't look too hard to get the relocation right. Forcing relocations
> >>>between batches is probably a good one to check (just to say don't do
> >>>that)
> >>
> >>I am not following here? You are saying don't do relocations at all?
> >>How do I make sure things stay fixed and even how to find out where
> >>they are in the first pass?
> >
> >Depending on the workload, it may be informative to also do comparisons
> >between NORELOC and always RELOC. Personally I would make sure we were
> >using NORELOC as this should be a simulator/example.
> 
> How do I use NORELOC? I mean, I have to know where the objects will
> be pinned, or be able to pin them first and know they will remain
> put. What am I not understanding here?

It will be assigned an address on first execution. Can I quote the spiel
I wrote for i915_gem_execbuffer.c and see if that answers how to use
NORELOC:

 * Reserving resources for the execbuf is the most complicated phase. We
 * neither want to have to migrate the object in the address space, nor do
 * we want to have to update any relocations pointing to this object. Ideally,
 * we want to leave the object where it is and for all the existing relocations
 * to match. If the object is given a new address, or if userspace thinks the
 * object is elsewhere, we have to parse all the relocation entries and update
 * the addresses. Userspace can set the I915_EXEC_NORELOC flag to hint that
 * all the target addresses in all of its objects match the value in the
 * relocation entries and that they all match the presumed offsets given by the
 * list of execbuffer objects. Using this knowledge, we know that if we haven't
 * moved any buffers, all the relocation entries are valid and we can skip
 * the update. (If userspace is wrong, the likely outcome is an impromptu GPU
 * hang.) The requirement for using I915_EXEC_NO_RELOC are:
 *
 *      The addresses written in the objects must match the corresponding
 *      reloc.presumed_offset which in turn must match the corresponding
 *      execobject.offset.
 *
 *      Any render targets written to in the batch must be flagged with
 *      EXEC_OBJECT_WRITE.
 *
 *      To avoid stalling, execobject.offset should match the current
 *      address of that object within the active context.
 *

Does that make sense? What questions remain unanswered?

Hmm, I usually sum it up as

	batch[reloc.offset] == reloc.presumed_offset + reloc.delta;

and

	execobj.offset == reloc.presumed_offset

must be true at the time of execbuf. Note that upon relocation,
batch[reloc.offset], reloc.presumed_offset and execobj.offset are
updated. This is important to remember if you are prerecording the
reloc/execobj arrays, and not feeding back the results of execbuf
between phases.
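
In code the two invariants come down to something like this sketch (names
are illustrative, not from the patch; it assumes a gen8+ style 64-bit
address in the batch, pre-gen8 it would be a single dword):

static void
refresh_for_noreloc(uint32_t *batch,
		    const struct drm_i915_gem_exec_object2 *target,
		    struct drm_i915_gem_relocation_entry *reloc)
{
	uint64_t address = target->offset + reloc->delta;
	uint32_t *slot = batch + reloc->offset / sizeof(uint32_t);

	/* execobj.offset == reloc.presumed_offset ... */
	reloc->presumed_offset = target->offset;

	/* ... and batch[reloc.offset] == reloc.presumed_offset + reloc.delta
	 * (two dwords on gen8+, a single dword before that). */
	*slot++ = address;
	*slot = address >> 32;
}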

> But in general is this correctly implementing your idea for queue
> depth estimation?

From my rough checklist:

	* writes engine->next_seqno++ after each op (in this case end of batch)
	* qlen[engine] = engine->next_seqno - *engine->current_seqno;
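
Or, spelled out as a purely illustrative sketch (field names assume a
per-engine seqno counter and a GPU-visible status page, roughly as
gem_wsim keeps them; this is not a quote from the patch):

static long queue_depth(const struct workload *wrk,
			enum intel_engine_id engine)
{
	/* seqno[] counts what userspace has submitted; status_page[] holds
	 * the last seqno the GPU has written back for that engine. */
	return (long)(wrk->seqno[engine] -
		      wrk->status_page[engine - VCS1]);
}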

Design looks right. Implementation requires checking... I'll be back.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
  2017-03-31 14:58 ` [PATCH i-g-t 1/2] benchmarks/gem_wsim: Command submission workload simulator Tvrtko Ursulin
  2017-03-31 15:19   ` Chris Wilson
  2017-04-05 16:14   ` [PATCH i-g-t v3] " Tvrtko Ursulin
@ 2017-04-20 12:29   ` Tvrtko Ursulin
  2017-04-20 14:23     ` Chris Wilson
                       ` (3 more replies)
  2 siblings, 4 replies; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-04-20 12:29 UTC (permalink / raw)
  To: Intel-gfx

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Tool which emits batch buffers to engines with configurable
sequences, durations, contexts, dependencies and userspace waits.

Unfinished but shows promise so sending out for early feedback.

v2:
 * Load workload descriptors from files. (also -w)
 * Help text.
 * Calibration control if needed. (-t)
 * NORELOC | LUT to eb flags.
 * Added sample workload to wsim/workload1.

v3:
 * Multiple parallel different workloads (-w -w ...).
 * Multi-context workloads.
 * Variable (random) batch length.
 * Load balancing (round robin and queue depth estimation).
 * Workloads delays and explicit sync steps.
 * Workload frequency (period) control.

v4:
 * Fixed queue-depth estimation by creating separate batches
   per engine when qd load balancing is on.
 * Dropped separate -s cmd line option. It can turn itself on
   automatically when needed.
 * Keep a single status page and lie about the write hazard
   as suggested by Chris.
 * Use batch_start_offset for controlling the batch duration.
   (Chris)
 * Set status page object cache level. (Chris)
 * Moved workload description to a README.
 * Tidied example workloads.
 * Some other cleanups and refactorings.

TODO list:

 * Fence support.
 * Better error handling.
 * Less 1980's workload parsing.
 * Proper workloads.
 * Threads?
 * ... ?

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
---

Comparing some test workloads under load balancing it seems that it is starting
to work, but it still needs more thorough verification. For example, round-
robin balancing:

# benchmarks/gem_wsim -n 585341 \
		      -w benchmarks/wsim/vcs1.wsim \
		      -w benchmarks/wsim/vcs_balanced.wsim \
		      -r 100 -b 0
Using 585341 nop calibration for 1000us delay.
2 clients.
1: 3.008s elapsed (33.243 workloads/s). 2500 (1250 + 1250) total VCS batches.
0: 4.455s elapsed (22.449 workloads/s). 0 (2500 + 0) total VCS batches.
4.455s elapsed (44.889 workloads/s)


Versus the queue-depth estimation:

# benchmarks/gem_wsim -n 585341 \
		      -w benchmarks/wsim/vcs1.wsim \
		      -w benchmarks/wsim/vcs_balanced.wsim \
		      -r 100 -b 1
Using 585341 nop calibration for 1000us delay.
2 clients.
1: 2.239s elapsed (44.659 workloads/s). 2500 (837 + 1663) total VCS batches. Average queue depths 27.575, 19.285.
0: 4.012s elapsed (24.928 workloads/s). 0 (2500 + 0) total VCS batches. Average queue depths -nan, -nan.
4.012s elapsed (49.845 workloads/s)

In both cases we run two workloads, one which only submits to VCS1 and one which
can be load-balanced. The latter gets a ~33% boost with queue-depth estimation,
and the non-balancing workload ~10%.

---
 benchmarks/Makefile.sources                  |    1 +
 benchmarks/gem_wsim.c                        | 1014 ++++++++++++++++++++++++++
 benchmarks/wsim/README                       |   54 ++
 benchmarks/wsim/media_17i7.wsim              |    7 +
 benchmarks/wsim/media_load_balance_17i7.wsim |    7 +
 benchmarks/wsim/vcs1.wsim                    |   25 +
 benchmarks/wsim/vcs_balanced.wsim            |   25 +
 7 files changed, 1133 insertions(+)
 create mode 100644 benchmarks/gem_wsim.c
 create mode 100644 benchmarks/wsim/README
 create mode 100644 benchmarks/wsim/media_17i7.wsim
 create mode 100644 benchmarks/wsim/media_load_balance_17i7.wsim
 create mode 100644 benchmarks/wsim/vcs1.wsim
 create mode 100644 benchmarks/wsim/vcs_balanced.wsim

diff --git a/benchmarks/Makefile.sources b/benchmarks/Makefile.sources
index 3af54ebe36f2..3a941150abb3 100644
--- a/benchmarks/Makefile.sources
+++ b/benchmarks/Makefile.sources
@@ -14,6 +14,7 @@ benchmarks_prog_list =			\
 	gem_prw				\
 	gem_set_domain			\
 	gem_syslatency			\
+	gem_wsim			\
 	kms_vblank			\
 	prime_lookup			\
 	vgem_mmap			\
diff --git a/benchmarks/gem_wsim.c b/benchmarks/gem_wsim.c
new file mode 100644
index 000000000000..adf2d6decf12
--- /dev/null
+++ b/benchmarks/gem_wsim.c
@@ -0,0 +1,1014 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <errno.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <time.h>
+#include <assert.h>
+#include <limits.h>
+
+
+#include "intel_chipset.h"
+#include "drm.h"
+#include "ioctl_wrappers.h"
+#include "drmtest.h"
+#include "intel_io.h"
+
+enum intel_engine_id {
+	RCS,
+	BCS,
+	VCS,
+	VCS1,
+	VCS2,
+	VECS,
+	NUM_ENGINES
+};
+
+struct duration {
+	unsigned int min, max;
+};
+
+enum w_type
+{
+	BATCH,
+	SYNC,
+	DELAY,
+	PERIOD
+};
+
+struct w_step
+{
+	/* Workload step metadata */
+	enum w_type type;
+	unsigned int context;
+	unsigned int engine;
+	struct duration duration;
+	int dependency;
+	int wait;
+
+	/* Implementation details */
+	unsigned int idx;
+
+	struct w_step_eb {
+		struct drm_i915_gem_execbuffer2 eb;
+		struct drm_i915_gem_exec_object2 obj[4];
+		struct drm_i915_gem_relocation_entry reloc;
+		unsigned long bb_sz;
+		uint32_t bb_handle;
+		uint32_t *mapped_batch, *mapped_seqno;
+		unsigned int mapped_len;
+	} b[2]; /* One for each VCS when load balancing */
+};
+
+struct workload
+{
+	unsigned int nr_steps;
+	struct w_step *steps;
+
+	struct timespec repeat_start;
+
+	unsigned int nr_ctxs;
+	uint32_t *ctx_id;
+
+	unsigned long seqno[NUM_ENGINES];
+	uint32_t status_page_handle;
+	uint32_t *status_page;
+	unsigned int vcs_rr;
+
+	unsigned long qd_sum[NUM_ENGINES];
+	unsigned long nr_bb[NUM_ENGINES];
+};
+
+static const unsigned int eb_engine_map[NUM_ENGINES] = {
+	[RCS] = I915_EXEC_RENDER,
+	[BCS] = I915_EXEC_BLT,
+	[VCS] = I915_EXEC_BSD,
+	[VCS1] = I915_EXEC_BSD | I915_EXEC_BSD_RING1,
+	[VCS2] = I915_EXEC_BSD | I915_EXEC_BSD_RING2,
+	[VECS] = I915_EXEC_VEBOX
+};
+
+static const unsigned int nop_calibration_us = 1000;
+static unsigned long nop_calibration;
+
+static bool quiet;
+static int fd;
+
+#define SWAPVCS	(1<<0)
+#define SEQNO	(1<<1)
+#define BALANCE	(1<<2)
+
+/*
+ * Workload descriptor:
+ *
+ * ctx.engine.duration.dependency.wait,...
+ * <uint>.<str>.<uint>.<int <= 0>.<0|1>,...
+ *
+ * Engine ids: RCS, BCS, VCS, VCS1, VCS2, VECS
+ *
+ * "1.VCS1.3000.0.1,1.RCS.1000.-1.0,1.RCS.3700.0.0,1.RCS.1000.-2.0,1.VCS2.2300.-2.0,1.RCS.4700.-1.0,1.VCS2.600.-1.1"
+ */
+
+static const char *ring_str_map[NUM_ENGINES] = {
+	[RCS] = "RCS",
+	[BCS] = "BCS",
+	[VCS] = "VCS",
+	[VCS1] = "VCS1",
+	[VCS2] = "VCS2",
+	[VECS] = "VECS",
+};
+
+static struct workload *parse_workload(char *_desc)
+{
+	struct workload *wrk;
+	unsigned int nr_steps = 0;
+	char *desc = strdup(_desc);
+	char *_token, *token, *tctx = NULL, *tstart = desc;
+	char *field, *fctx = NULL, *fstart;
+	struct w_step step, *steps = NULL;
+	unsigned int valid;
+	int tmp;
+
+	while ((_token = strtok_r(tstart, ",", &tctx)) != NULL) {
+		tstart = NULL;
+		token = strdup(_token);
+		fstart = token;
+		valid = 0;
+		memset(&step, 0, sizeof(step));
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			if (!strcasecmp(field, "d")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp <= 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid delay at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = DELAY;
+					step.wait = tmp;
+					goto add_step;
+				}
+			} else if (!strcasecmp(field, "p")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp <= 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid period at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = PERIOD;
+					step.wait = tmp;
+					goto add_step;
+				}
+			} else if (!strcasecmp(field, "s")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp >= 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid sync target at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = SYNC;
+					step.wait = tmp;
+					goto add_step;
+				}
+			}
+
+			tmp = atoi(field);
+			if (tmp < 0) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid ctx id at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.context = tmp;
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			unsigned int i, old_valid = valid;
+
+			fstart = NULL;
+
+			for (i = 0; i < ARRAY_SIZE(ring_str_map); i++) {
+				if (!strcasecmp(field, ring_str_map[i])) {
+					step.engine = i;
+					valid++;
+					break;
+				}
+			}
+
+			if (old_valid == valid) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid engine id at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			char *sep = NULL;
+			long int tmpl;
+
+			fstart = NULL;
+
+			tmpl = strtol(field, &sep, 10);
+			if (tmpl == LONG_MIN || tmpl == LONG_MAX) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid duration at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.duration.min = tmpl;
+
+			if (sep && *sep == '-') {
+				tmpl = strtol(sep + 1, NULL, 10);
+				if (tmpl == LONG_MIN || tmpl == LONG_MAX) {
+					if (!quiet)
+						fprintf(stderr,
+							"Invalid duration range at step %u!\n",
+							nr_steps);
+					return NULL;
+				}
+				step.duration.max = tmpl;
+			} else {
+				step.duration.max = step.duration.min;
+			}
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			tmp = atoi(field);
+			if (tmp > 0) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid forward dependency at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.dependency = tmp;
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			tmp = atoi(field);
+			if (tmp != 0 && tmp != 1) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid wait boolean at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.wait = tmp;
+
+			valid++;
+		}
+
+		if (valid != 5) {
+			if (!quiet)
+				fprintf(stderr, "Invalid record at step %u!\n",
+					nr_steps);
+			return NULL;
+		}
+
+		step.type = BATCH;
+
+add_step:
+		step.idx = nr_steps++;
+		steps = realloc(steps, sizeof(step) * nr_steps);
+		igt_assert(steps);
+
+		memcpy(&steps[nr_steps - 1], &step, sizeof(step));
+
+		free(token);
+	}
+
+	wrk = malloc(sizeof(*wrk));
+	igt_assert(wrk);
+
+	wrk->nr_steps = nr_steps;
+	wrk->steps = steps;
+
+	free(desc);
+
+	return wrk;
+}
+
+static struct workload *
+clone_workload(struct workload *_wrk)
+{
+	struct workload *wrk;
+
+	wrk = malloc(sizeof(*wrk));
+	igt_assert(wrk);
+	memset(wrk, 0, sizeof(*wrk));
+
+	wrk->nr_steps = _wrk->nr_steps;
+	wrk->steps = calloc(wrk->nr_steps, sizeof(struct w_step));
+	igt_assert(wrk->steps);
+
+	memcpy(wrk->steps, _wrk->steps, sizeof(struct w_step) * wrk->nr_steps);
+
+	return wrk;
+}
+
+#define rounddown(x, y) (x - (x%y))
+#ifndef PAGE_SIZE
+#define PAGE_SIZE (4096)
+#endif
+
+static unsigned int get_duration(struct duration *dur)
+{
+	if (dur->min == dur->max)
+		return dur->min;
+	else
+		return dur->min + rand() % (dur->max + 1 - dur->min);
+}
+
+static unsigned long get_bb_sz(unsigned int duration)
+{
+	return ALIGN(duration * nop_calibration * sizeof(uint32_t) /
+		     nop_calibration_us, sizeof(uint32_t));
+}
+
+static void
+terminate_bb(struct w_step *w, struct w_step_eb *b, enum intel_engine_id engine,
+	     unsigned int flags)
+{
+	const uint32_t bbe = 0xa << 23;
+	unsigned long bb_sz = get_bb_sz(w->duration.max);
+	unsigned long mmap_start, cmd_offset, mmap_len;
+	uint32_t *ptr, *cs;
+
+	mmap_len = 1;
+	if (flags & SEQNO)
+		mmap_len += 4;
+	mmap_len *= sizeof(uint32_t);
+	cmd_offset = bb_sz - mmap_len;
+	mmap_start = rounddown(cmd_offset, PAGE_SIZE);
+	mmap_len += cmd_offset - mmap_start;
+
+	gem_set_domain(fd, b->bb_handle,
+		       I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
+
+	ptr = gem_mmap__wc(fd, b->bb_handle, mmap_start, mmap_len, PROT_WRITE);
+	cs = (uint32_t *)((char *)ptr + cmd_offset - mmap_start);
+
+	if (flags & SEQNO) {
+		b->reloc.offset = bb_sz - 4 * sizeof(uint32_t);
+		b->reloc.delta = (engine - VCS1) * sizeof(uint32_t);
+
+		*cs++ = MI_STORE_DWORD_IMM;
+		*cs++ = 0;
+		*cs++ = 0;
+		b->mapped_seqno = cs;
+		*cs++ = 0;
+	}
+
+	*cs = bbe;
+
+	b->mapped_batch = ptr;
+	b->mapped_len = mmap_len;
+}
+
+static void
+alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b,
+		 enum intel_engine_id engine, unsigned int flags)
+{
+	unsigned int bb_i, j = 0;
+
+	b->obj[j].handle = gem_create(fd, 4096);
+	b->obj[j].flags = EXEC_OBJECT_WRITE;
+	j++;
+
+	if (flags & SEQNO) {
+		b->obj[j].handle = wrk->status_page_handle;
+		j++;
+	}
+
+	bb_i = j++;
+	b->bb_sz = get_bb_sz(w->duration.max);
+	b->bb_handle = b->obj[bb_i].handle = gem_create(fd, b->bb_sz);
+	terminate_bb(w, b, engine, flags);
+
+	igt_assert(w->dependency <= 0);
+	if (w->dependency) {
+		int dep_idx = w->idx + w->dependency;
+
+		igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
+		igt_assert(wrk->steps[dep_idx].type == BATCH);
+
+		b->obj[j].handle = b->obj[bb_i].handle;
+		bb_i = j;
+		b->obj[j - 1].handle = wrk->steps[dep_idx].b[0].obj[0].handle;
+		j++;
+
+		if (wrk->steps[dep_idx].b[1].obj[0].handle) {
+			b->obj[j].handle = b->obj[bb_i].handle;
+			bb_i = j;
+			b->obj[j - 1].handle =
+					wrk->steps[dep_idx].b[1].obj[0].handle;
+			j++;
+		}
+	}
+
+	if (flags & SEQNO) {
+		b->reloc.presumed_offset = -1;
+		b->reloc.target_handle = 1;
+		b->obj[bb_i].relocs_ptr = to_user_pointer(&b->reloc);
+		b->obj[bb_i].relocation_count = 1;
+	}
+
+	b->eb.buffers_ptr = to_user_pointer(b->obj);
+	b->eb.buffer_count = j;
+	b->eb.rsvd1 = wrk->ctx_id[w->context];
+
+	if (flags & SWAPVCS && engine == VCS1)
+		engine = VCS2;
+	else if (flags & SWAPVCS && engine == VCS2)
+		engine = VCS1;
+	b->eb.flags = eb_engine_map[engine];
+	b->eb.flags |= I915_EXEC_HANDLE_LUT;
+	if (!(flags & SEQNO))
+		b->eb.flags |= I915_EXEC_NO_RELOC;
+#ifdef DEBUG
+	printf("%u: %u:%x|%x|%x|%x %10lu flags=%llx bb=%x[%u] ctx[%u]=%u\n",
+		w->idx, b->eb.buffer_count, b->obj[0].handle,
+		b->obj[1].handle, b->obj[2].handle, b->obj[3].handle,
+		b->bb_sz, b->eb.flags, b->bb_handle, bb_i,
+		w->context, wrk->ctx_id[w->context]);
+#endif
+}
+
+static void
+prepare_workload(struct workload *wrk, unsigned int flags)
+{
+	int max_ctx = -1;
+	struct w_step *w;
+	int i;
+
+	if (flags & SEQNO) {
+		const unsigned int status_sz = sizeof(uint32_t);
+		uint32_t handle = gem_create(fd, status_sz);
+
+		gem_set_caching(fd, handle, I915_CACHING_CACHED);
+		wrk->status_page_handle = handle;
+		wrk->status_page = gem_mmap__cpu(fd, handle, 0, status_sz,
+						 PROT_READ);
+	}
+
+	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+		if ((int)w->context > max_ctx) {
+			int delta = w->context + 1 - wrk->nr_ctxs;
+
+			wrk->nr_ctxs += delta;
+			wrk->ctx_id = realloc(wrk->ctx_id,
+					      wrk->nr_ctxs * sizeof(uint32_t));
+			memset(&wrk->ctx_id[wrk->nr_ctxs - delta], 0,
+			       delta * sizeof(uint32_t));
+
+			max_ctx = w->context;
+		}
+
+		if (!wrk->ctx_id[w->context]) {
+			struct drm_i915_gem_context_create arg = {};
+
+			drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &arg);
+			igt_assert(arg.ctx_id);
+
+			wrk->ctx_id[w->context] = arg.ctx_id;
+		}
+	}
+
+	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+		unsigned int _flags = flags;
+		enum intel_engine_id engine = w->engine;
+
+		if (w->type != BATCH)
+			continue;
+
+		if (engine != VCS && engine != VCS1 && engine != VCS2)
+			_flags &= ~SEQNO;
+
+		if (engine == VCS)
+			_flags &= ~SWAPVCS;
+
+		if (engine == VCS && flags & BALANCE) {
+			alloc_step_batch(wrk, w, &w->b[0], VCS1, _flags);
+			alloc_step_batch(wrk, w, &w->b[1], VCS2, _flags);
+		} else {
+			alloc_step_batch(wrk, w, &w->b[0], engine, _flags);
+		}
+	}
+}
+
+static double elapsed(const struct timespec *start, const struct timespec *end)
+{
+	return (end->tv_sec - start->tv_sec) +
+	       (end->tv_nsec - start->tv_nsec) / 1e9;
+}
+
+static int elapsed_us(const struct timespec *start, const struct timespec *end)
+{
+	return elapsed(start, end) * 1e6;
+}
+
+static enum intel_engine_id get_vcs_engine(unsigned int n)
+{
+	const enum intel_engine_id vcs_engines[2] = { VCS1, VCS2 };
+
+	igt_assert(n < ARRAY_SIZE(vcs_engines));
+
+	return vcs_engines[n];
+}
+
+
+static enum intel_engine_id
+rr_balance(struct workload *wrk, struct w_step *w)
+{
+	unsigned int engine;
+
+	engine = get_vcs_engine(wrk->vcs_rr);
+	wrk->vcs_rr ^= 1;
+
+	return engine;
+}
+
+static enum intel_engine_id
+qd_balance(struct workload *wrk, struct w_step *w)
+{
+	enum intel_engine_id engine = w->engine;
+	long qd[NUM_ENGINES];
+	unsigned int n;
+
+	igt_assert(engine == VCS);
+
+	qd[VCS1] = wrk->seqno[VCS1] - wrk->status_page[0];
+	wrk->qd_sum[VCS1] += qd[VCS1];
+
+	qd[VCS2] = wrk->seqno[VCS2] - wrk->status_page[1];
+	wrk->qd_sum[VCS2] += qd[VCS2];
+
+	if (qd[VCS1] < qd[VCS2])
+		n = 0;
+	else if (qd[VCS2] < qd[VCS1])
+		n = 1;
+	else
+		n = wrk->vcs_rr;
+
+	engine = get_vcs_engine(n);
+	wrk->vcs_rr = n ^ 1;
+
+#ifdef DEBUG
+	printf("qd_balance: 1:%ld 2:%ld rr:%u = %u\t(%lu - %u) (%lu - %u)\n",
+	       qd[VCS1], qd[VCS2], wrk->vcs_rr, engine,
+	       wrk->seqno[VCS1], wrk->status_page[0],
+	       wrk->seqno[VCS2], wrk->status_page[1]);
+#endif
+	return engine;
+}
+
+static void
+update_bb_seqno(struct w_step_eb *b, enum intel_engine_id engine,
+		uint32_t seqno)
+{
+	*b->mapped_seqno = seqno;
+	b->reloc.delta = (engine - VCS1) * sizeof(uint32_t);
+}
+
+static void
+run_workload(unsigned int id, struct workload *wrk, unsigned int repeat,
+	     enum intel_engine_id (*balance)(struct workload *wrk,
+					     struct w_step *w),
+	     unsigned int flags)
+{
+	struct timespec t_start, t_end;
+	struct w_step *w;
+	double t;
+	int i, j;
+
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+
+	srand(t_start.tv_nsec);
+
+	for (j = 0; j < repeat; j++) {
+		for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+			enum intel_engine_id engine = w->engine;
+			struct w_step_eb *b = &w->b[0];
+			int do_sleep = 0;
+
+			if (i == 0)
+				clock_gettime(CLOCK_MONOTONIC,
+					      &wrk->repeat_start);
+
+			if (w->type == DELAY) {
+				do_sleep = w->wait;
+			} else if (w->type == PERIOD) {
+				struct timespec now;
+
+				clock_gettime(CLOCK_MONOTONIC, &now);
+				do_sleep = w->wait -
+					   elapsed_us(&wrk->repeat_start, &now);
+				if (do_sleep < 0) {
+					if (!quiet) {
+						printf("%u: Dropped period @ %u/%u (%dus late)!\n",
+						       id, j, i, do_sleep);
+						continue;
+					}
+				}
+			} else if (w->type == SYNC) {
+				unsigned int s_idx = i + w->wait;
+
+				igt_assert(i > 0 && i < wrk->nr_steps);
+				igt_assert(wrk->steps[s_idx].type == BATCH);
+				gem_sync(fd, wrk->steps[s_idx].b[0].obj[0].handle);
+				if (wrk->steps[s_idx].b[1].obj[0].handle)
+					gem_sync(fd, wrk->steps[s_idx].b[1].obj[0].handle);
+				continue;
+			}
+
+			if (do_sleep) {
+				usleep(do_sleep);
+				continue;
+			}
+
+			wrk->nr_bb[engine]++;
+
+			if (engine == VCS && balance) {
+				engine = balance(wrk, w);
+				wrk->nr_bb[engine]++;
+				b = &w->b[engine - VCS1];
+
+				if (flags & SEQNO)
+					update_bb_seqno(b, engine,
+							++wrk->seqno[engine]);
+			}
+
+			if (w->duration.min != w->duration.max) {
+				unsigned int d = get_duration(&w->duration);
+				unsigned long offset;
+
+				offset = ALIGN(b->bb_sz - get_bb_sz(d),
+					       2 * sizeof(uint32_t));
+				b->eb.batch_start_offset = offset;
+			}
+
+			gem_execbuf(fd, &b->eb);
+
+			if (w->wait)
+				gem_sync(fd, b->obj[0].handle);
+		}
+	}
+
+	gem_sync(fd, wrk->steps[wrk->nr_steps - 1].b[0].obj[0].handle);
+	if (wrk->steps[wrk->nr_steps - 1].b[1].obj[0].handle)
+		gem_sync(fd, wrk->steps[wrk->nr_steps - 1].b[1].obj[0].handle);
+
+	clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+	t = elapsed(&t_start, &t_end);
+	if (!quiet && !balance)
+		printf("%u: %.3fs elapsed (%.3f workloads/s)\n", id, t, repeat / t);
+	if (!quiet && balance == rr_balance)
+		printf("%u: %.3fs elapsed (%.3f workloads/s). %lu (%lu + %lu) total VCS batches.\n",
+		       id, t, repeat / t,
+		       wrk->nr_bb[VCS], wrk->nr_bb[VCS1], wrk->nr_bb[VCS2]);
+	if (!quiet && balance == qd_balance)
+		printf("%u: %.3fs elapsed (%.3f workloads/s). %lu (%lu + %lu) total VCS batches. Average queue depths %.3f, %.3f.\n",
+		       id, t, repeat / t,
+		       wrk->nr_bb[VCS], wrk->nr_bb[VCS1], wrk->nr_bb[VCS2],
+		       (double)wrk->qd_sum[VCS1] / wrk->nr_bb[VCS],
+		       (double)wrk->qd_sum[VCS2] / wrk->nr_bb[VCS]);
+}
+
+static void fini_workload(struct workload *wrk)
+{
+	free(wrk->steps);
+	free(wrk);
+}
+
+static unsigned long calibrate_nop(unsigned int tolerance_pct)
+{
+	const uint32_t bbe = 0xa << 23;
+	unsigned int loops = 17;
+	unsigned int usecs = nop_calibration_us;
+	struct drm_i915_gem_exec_object2 obj = {};
+	struct drm_i915_gem_execbuffer2 eb =
+		{ .buffer_count = 1, .buffers_ptr = (uintptr_t)&obj};
+	long size, last_size;
+	struct timespec t_0, t_end;
+
+	clock_gettime(CLOCK_MONOTONIC, &t_0);
+
+	size = 256 * 1024;
+	do {
+		struct timespec t_start;
+
+		obj.handle = gem_create(fd, size);
+		gem_write(fd, obj.handle, size - sizeof(bbe), &bbe,
+			  sizeof(bbe));
+		gem_execbuf(fd, &eb);
+		gem_sync(fd, obj.handle);
+
+		clock_gettime(CLOCK_MONOTONIC, &t_start);
+		for (int loop = 0; loop < loops; loop++)
+			gem_execbuf(fd, &eb);
+		gem_sync(fd, obj.handle);
+		clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+		gem_close(fd, obj.handle);
+
+		last_size = size;
+		size = loops * size / elapsed(&t_start, &t_end) / 1e6 * usecs;
+		size = ALIGN(size, sizeof(uint32_t));
+	} while (elapsed(&t_0, &t_end) < 5 ||
+		 abs(size - last_size) > (size * tolerance_pct / 100));
+
+	return size / sizeof(uint32_t);
+}
+
+static void print_help(void)
+{
+	puts(
+"Usage: gem_wsim [OPTIONS]\n"
+"\n"
+"Runs a simulated workload on the GPU.\n"
+"When ran without arguments performs a GPU calibration result of which needs\n"
+"to be provided when running the simulation in subsequent invocations.\n"
+"\n"
+"Options:\n"
+"	-h		This text.\n"
+"	-q		Be quiet - do not output anything to stdout.\n"
+"	-n <n>		Nop calibration value.\n"
+"	-t <n>		Nop calibration tolerance percentage.\n"
+"			Use when there is a difficuly obtaining calibration\n"
+"			with the default settings.\n"
+"	-w <desc|path>	Filename or a workload descriptor.\n"
+"			Can be given multiple times.\n"
+"	-r <n>		How many times to emit the workload.\n"
+"	-c <n>		Fork n clients emitting the workload simultaneously.\n"
+"	-x		Swap VCS1 and VCS2 engines in every other client.\n"
+"	-b <n>		Load balancing to use. (0: rr, 1: qd)\n"
+	);
+}
+
+static char *load_workload_descriptor(char *filename)
+{
+	struct stat sbuf;
+	char *buf;
+	int infd, ret, i;
+	ssize_t len;
+
+	ret = stat(filename, &sbuf);
+	if (ret || !S_ISREG(sbuf.st_mode))
+		return filename;
+
+	igt_assert(sbuf.st_size < 1024 * 1024); /* Just so. */
+	buf = malloc(sbuf.st_size);
+	igt_assert(buf);
+
+	infd = open(filename, O_RDONLY);
+	igt_assert(infd >= 0);
+	len = read(infd, buf, sbuf.st_size);
+	igt_assert(len == sbuf.st_size);
+	close(infd);
+
+	for (i = 0; i < len; i++) {
+		if (buf[i] == '\n')
+			buf[i] = ',';
+	}
+
+	len--;
+	while (buf[len] == ',')
+		buf[len--] = 0;
+
+	return buf;
+}
+
+static char **
+add_workload_arg(char **w_args, unsigned int nr_args, char *w_arg)
+{
+	w_args = realloc(w_args, sizeof(char *) * nr_args);
+	igt_assert(w_args);
+	w_args[nr_args - 1] = w_arg;
+
+	return w_args;
+}
+
+int main(int argc, char **argv)
+{
+	unsigned int repeat = 1;
+	unsigned int clients = 1;
+	unsigned int flags = 0;
+	struct timespec t_start, t_end;
+	struct workload **w, **wrk = NULL;
+	unsigned int nr_w_args = 0;
+	char **w_args = NULL;
+	unsigned int tolerance_pct = 1;
+	enum intel_engine_id (*balance)(struct workload *, struct w_step *) = NULL;
+	double t;
+	int i, c;
+
+	fd = drm_open_driver(DRIVER_INTEL);
+
+	while ((c = getopt(argc, argv, "c:n:r:qxw:t:b:h")) != -1) {
+		switch (c) {
+		case 'w':
+			w_args = add_workload_arg(w_args, ++nr_w_args, optarg);
+			break;
+		case 'c':
+			clients = strtol(optarg, NULL, 0);
+			break;
+		case 't':
+			tolerance_pct = strtol(optarg, NULL, 0);
+			break;
+		case 'n':
+			nop_calibration = strtol(optarg, NULL, 0);
+			break;
+		case 'r':
+			repeat = strtol(optarg, NULL, 0);
+			break;
+		case 'q':
+			quiet = true;
+			break;
+		case 'x':
+			flags |= SWAPVCS;
+			break;
+		case 'b':
+			switch (strtol(optarg, NULL, 0)) {
+			case 0:
+				balance = rr_balance;
+				flags |= BALANCE;
+				break;
+			case 1:
+				igt_assert(intel_gen(intel_get_drm_devid(fd)) >=
+					   8);
+				balance = qd_balance;
+				flags |= SEQNO | BALANCE;
+				break;
+			default:
+				if (!quiet)
+					fprintf(stderr,
+						"Unknown balancing mode '%s'!\n",
+						optarg);
+				return 1;
+			}
+			break;
+		case 'h':
+			print_help();
+			return 0;
+		default:
+			return 1;
+		}
+	}
+
+	if (!nop_calibration) {
+		if (!quiet)
+			printf("Calibrating nop delay with %u%% tolerance...\n",
+				tolerance_pct);
+		nop_calibration = calibrate_nop(tolerance_pct);
+		if (!quiet)
+			printf("Nop calibration for %uus delay is %lu.\n",
+			       nop_calibration_us, nop_calibration);
+
+		return 0;
+	}
+
+	if (!nr_w_args) {
+		if (!quiet)
+			fprintf(stderr, "No workload descriptor(s)!\n");
+		return 1;
+	}
+
+	if (nr_w_args > 1 && clients > 1) {
+		if (!quiet)
+			fprintf(stderr,
+				"Cloned clients cannot be combined with multiple workloads!\n");
+		return 1;
+	}
+
+	wrk = calloc(nr_w_args, sizeof(*wrk));
+	igt_assert(wrk);
+
+	for (i = 0; i < nr_w_args; i++) {
+		w_args[i] = load_workload_descriptor(w_args[i]);
+		if (!w_args[i]) {
+			if (!quiet)
+				fprintf(stderr,
+					"Failed to load workload descriptor %u!\n",
+					i);
+			return 1;
+		}
+
+		wrk[i] = parse_workload(w_args[i]);
+		if (!wrk[i]) {
+			if (!quiet)
+				fprintf(stderr,
+					"Failed to parse workload %u!\n", i);
+			return 1;
+		}
+	}
+
+	if (!quiet) {
+		printf("Using %lu nop calibration for %uus delay.\n",
+		       nop_calibration, nop_calibration_us);
+		if (nr_w_args > 1)
+			clients = nr_w_args;
+		printf("%u client%s.\n", clients, clients > 1 ? "s" : "");
+		if (flags & SWAPVCS)
+			printf("Swapping VCS rings between clients.\n");
+	}
+
+	w = calloc(clients, sizeof(struct workload *));
+	igt_assert(w);
+
+	for (i = 0; i < clients; i++) {
+		unsigned int flags_ = flags;
+
+		w[i] = clone_workload(wrk[nr_w_args > 1 ? i : 0]);
+
+		if (flags & SWAPVCS && i & 1)
+			flags_ &= ~SWAPVCS;
+
+		prepare_workload(w[i], flags_);
+	}
+
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+
+	igt_fork(child, clients)
+		run_workload(child, w[child], repeat, balance, flags);
+
+	igt_waitchildren();
+
+	clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+	t = elapsed(&t_start, &t_end);
+	if (!quiet)
+		printf("%.3fs elapsed (%.3f workloads/s)\n",
+		       t, clients * repeat / t);
+
+	for (i = 0; i < clients; i++)
+		fini_workload(w[i]);
+	free(w);
+	for (i = 0; i < nr_w_args; i++)
+		fini_workload(wrk[i]);
+	free(w_args);
+
+	return 0;
+}
diff --git a/benchmarks/wsim/README b/benchmarks/wsim/README
new file mode 100644
index 000000000000..b55e620c61c2
--- /dev/null
+++ b/benchmarks/wsim/README
@@ -0,0 +1,54 @@
+Workload descriptor format
+==========================
+
+ctx.engine.duration_us.dependency.wait,...
+<uint>.<str>.<uint>[-<uint>].<int <= 0>.<0|1>,...
+d|p|s.<uint>,...
+
+For duration a range can be given from which a random value will be picked
+before every submit. Since this and seqno management requires CPU access to
+objects, care needs to be taken in order to ensure the submit queue is deep
+enough these operations do not affect the execution speed unless that is
+desired.
+
+Additional workload steps are also supported:
+
+ 'd' - Adds a delay (in microseconds).
+ 'p' - Adds a delay relative to the start of the previous loop so that each loop
+       starts execution with a given period.
+ 's' - Synchronises the pipeline to a batch relative to the step.
+
+Engine ids: RCS, BCS, VCS, VCS1, VCS2, VECS
+
+Example (leading spaces must not be present in the actual file):
+----------------------------------------------------------------
+
+  1.VCS1.3000.0.1
+  1.RCS.500-1000.-1.0
+  1.RCS.3700.0.0
+  1.RCS.1000.-2.0
+  1.VCS2.2300.-2.0
+  1.RCS.4700.-1.0
+  1.VCS2.600.-1.1
+  p.16000
+
+The above workload described in human language works like this:
+
+  1.   A batch is sent to the VCS1 engine which will be executing for 3ms on the
+       GPU and userspace will wait until it is finished before proceeding.
+  2-4. Now three batches are sent to RCS with durations of 0.5-1ms (random
+       duration range), 3.7ms and 1ms respectively. The first batch has a data
+       dependency on the preceding VCS1 batch, and the last of the group depends
+       on the first from the group.
+  5.   Now a 2.3ms batch is sent to VCS2, with a data dependency on the 3.7ms
+       RCS batch.
+  6.   This is followed by a 4.7ms RCS batch with a data dependency on the 2.3ms
+       VCS2 batch.
+  7.   Then a 0.6ms VCS2 batch is sent depending on the previous RCS one. In the
+       same step the tool is told to wait until the batch completes before
+       proceeding.
+  8.   Finally the tool is told to wait long enough to ensure the next iteration
+       starts 16ms after the previous one has started.
+
+When workload descriptors are provided on the command line, commas must be used
+instead of new lines.
diff --git a/benchmarks/wsim/media_17i7.wsim b/benchmarks/wsim/media_17i7.wsim
new file mode 100644
index 000000000000..5f533d8e168b
--- /dev/null
+++ b/benchmarks/wsim/media_17i7.wsim
@@ -0,0 +1,7 @@
+1.VCS1.3000.0.1
+1.RCS.1000.-1.0
+1.RCS.3700.0.0
+1.RCS.1000.-2.0
+1.VCS2.2300.-2.0
+1.RCS.4700.-1.0
+1.VCS2.600.-1.1
diff --git a/benchmarks/wsim/media_load_balance_17i7.wsim b/benchmarks/wsim/media_load_balance_17i7.wsim
new file mode 100644
index 000000000000..25a692032eae
--- /dev/null
+++ b/benchmarks/wsim/media_load_balance_17i7.wsim
@@ -0,0 +1,7 @@
+1.VCS.3000.0.1
+1.RCS.1000.-1.0
+1.RCS.3700.0.0
+1.RCS.1000.-2.0
+1.VCS.2300.-2.0
+1.RCS.4700.-1.0
+1.VCS.600.-1.1
diff --git a/benchmarks/wsim/vcs1.wsim b/benchmarks/wsim/vcs1.wsim
new file mode 100644
index 000000000000..e1986aadd65c
--- /dev/null
+++ b/benchmarks/wsim/vcs1.wsim
@@ -0,0 +1,25 @@
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
diff --git a/benchmarks/wsim/vcs_balanced.wsim b/benchmarks/wsim/vcs_balanced.wsim
new file mode 100644
index 000000000000..9a4b3d785db1
--- /dev/null
+++ b/benchmarks/wsim/vcs_balanced.wsim
@@ -0,0 +1,25 @@
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
-- 
2.9.3

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-20 12:29   ` [PATCH i-g-t v4] " Tvrtko Ursulin
@ 2017-04-20 14:23     ` Chris Wilson
  2017-04-20 14:33       ` Chris Wilson
  2017-04-20 14:34       ` Tvrtko Ursulin
  2017-04-20 14:52     ` Chris Wilson
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 26+ messages in thread
From: Chris Wilson @ 2017-04-20 14:23 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote:
> +static void
> +alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b,
> +		 enum intel_engine_id engine, unsigned int flags)
> +{
> +	unsigned int bb_i, j = 0;
> +
> +	b->obj[j].handle = gem_create(fd, 4096);
> +	b->obj[j].flags = EXEC_OBJECT_WRITE;
> +	j++;
> +
> +	if (flags & SEQNO) {
> +		b->obj[j].handle = wrk->status_page_handle;
> +		j++;
> +	}
> +
> +	bb_i = j++;
> +	b->bb_sz = get_bb_sz(w->duration.max);
> +	b->bb_handle = b->obj[bb_i].handle = gem_create(fd, b->bb_sz);
> +	terminate_bb(w, b, engine, flags);
> +
> +	igt_assert(w->dependency <= 0);
> +	if (w->dependency) {
> +		int dep_idx = w->idx + w->dependency;
> +
> +		igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
> +		igt_assert(wrk->steps[dep_idx].type == BATCH);
> +
> +		b->obj[j].handle = b->obj[bb_i].handle;
> +		bb_i = j;
> +		b->obj[j - 1].handle = wrk->steps[dep_idx].b[0].obj[0].handle;
> +		j++;
> +
> +		if (wrk->steps[dep_idx].b[1].obj[0].handle) {
> +			b->obj[j].handle = b->obj[bb_i].handle;
> +			bb_i = j;
> +			b->obj[j - 1].handle =
> +					wrk->steps[dep_idx].b[1].obj[0].handle;
> +			j++;
> +		}
> +	}
> +
> +	if (flags & SEQNO) {
> +		b->reloc.presumed_offset = -1;

So as I understand it, you are caching the execbuf/obj/reloc for the
workload and then may reissue later with different seqno on different
rings? In which case we have a problem as the kernel will write back the
updated offsets to b->reloc.presumed_offset and b->obj[].offset and in
future passes they will match and the seqno write will go into the wrong
slot (if it swaps rings).

You either want to reset presumed_offset=-1 each time, or better for all
concerned write the correct address alongside the seqno (which also
enables NORELOC).
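
The first option is just this (a sketch, not what the attached delta does):

static void submit_with_forced_reloc(struct w_step_eb *b)
{
	/* Never matches obj.offset, so the kernel reprocesses the
	 * relocation and rewrites the address in the batch - at the cost
	 * of losing the NO_RELOC fast path. */
	b->reloc.presumed_offset = -1;
	gem_execbuf(fd, &b->eb);
}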

Delta incoming.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-20 14:23     ` Chris Wilson
@ 2017-04-20 14:33       ` Chris Wilson
  2017-04-20 14:45         ` Tvrtko Ursulin
  2017-04-20 14:34       ` Tvrtko Ursulin
  1 sibling, 1 reply; 26+ messages in thread
From: Chris Wilson @ 2017-04-20 14:33 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-gfx, Tvrtko Ursulin, Rogozhkin, Dmitry V

[-- Attachment #1: Type: text/plain, Size: 545 bytes --]

On Thu, Apr 20, 2017 at 03:23:27PM +0100, Chris Wilson wrote:
> You either want to reset presumed_offset=-1 each time, or better for all
> concerned write the correct address alongside the seqno (which also
> enables NORELOC).
> 
> Delta incoming.

See attached.

Next concern is that I have full rings which implies that we are not
waiting on each batch before resubmitting with a new seqno?

If I throw an assert(!busy(batch_bo)) before the *b->mapped_seqno write, am I
going to be upset?
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

[-- Attachment #2: 0001-seqno-reloc.patch --]
[-- Type: text/x-diff, Size: 2677 bytes --]

>From 5bf424c2719e81f926b74f4136610cbdfd26a4d8 Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu, 20 Apr 2017 15:30:07 +0100
Subject: [PATCH] seqno-reloc

---
 benchmarks/gem_wsim.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/benchmarks/gem_wsim.c b/benchmarks/gem_wsim.c
index adf2d6de..e616335b 100644
--- a/benchmarks/gem_wsim.c
+++ b/benchmarks/gem_wsim.c
@@ -45,6 +45,8 @@
 #include "drmtest.h"
 #include "intel_io.h"
 
+#define LOCAL_I915_GEM_DOMAIN_WC 0x80
+
 enum intel_engine_id {
 	RCS,
 	BCS,
@@ -86,7 +88,9 @@ struct w_step
 		struct drm_i915_gem_relocation_entry reloc;
 		unsigned long bb_sz;
 		uint32_t bb_handle;
-		uint32_t *mapped_batch, *mapped_seqno;
+		uint32_t *mapped_batch;
+		uint64_t *mapped_address;
+		uint32_t *mapped_seqno;
 		unsigned int mapped_len;
 	} b[2]; /* One for each VCS when load balancing */
 };
@@ -405,7 +409,8 @@ terminate_bb(struct w_step *w, struct w_step_eb *b, enum intel_engine_id engine,
 	mmap_len += cmd_offset - mmap_start;
 
 	gem_set_domain(fd, b->bb_handle,
-		       I915_GEM_DOMAIN_CPU, I915_GEM_DOMAIN_CPU);
+		       LOCAL_I915_GEM_DOMAIN_WC,
+		       LOCAL_I915_GEM_DOMAIN_WC);
 
 	ptr = gem_mmap__wc(fd, b->bb_handle, mmap_start, mmap_len, PROT_WRITE);
 	cs = (uint32_t *)((char *)ptr + cmd_offset - mmap_start);
@@ -415,6 +420,7 @@ terminate_bb(struct w_step *w, struct w_step_eb *b, enum intel_engine_id engine,
 		b->reloc.delta = (engine - VCS1) * sizeof(uint32_t);
 
 		*cs++ = MI_STORE_DWORD_IMM;
+		b->mapped_address = (uint64_t *)cs;
 		*cs++ = 0;
 		*cs++ = 0;
 		b->mapped_seqno = cs;
@@ -469,7 +475,6 @@ alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b,
 	}
 
 	if (flags & SEQNO) {
-		b->reloc.presumed_offset = -1;
 		b->reloc.target_handle = 1;
 		b->obj[bb_i].relocs_ptr = to_user_pointer(&b->reloc);
 		b->obj[bb_i].relocation_count = 1;
@@ -485,8 +490,7 @@ alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b,
 		engine = VCS1;
 	b->eb.flags = eb_engine_map[engine];
 	b->eb.flags |= I915_EXEC_HANDLE_LUT;
-	if (!(flags & SEQNO))
-		b->eb.flags |= I915_EXEC_NO_RELOC;
+	b->eb.flags |= I915_EXEC_NO_RELOC;
 #ifdef DEBUG
 	printf("%u: %u:%x|%x|%x|%x %10lu flags=%llx bb=%x[%u] ctx[%u]=%u\n",
 		w->idx, b->eb.buffer_count, b->obj[0].handle,
@@ -628,8 +632,9 @@ static void
 update_bb_seqno(struct w_step_eb *b, enum intel_engine_id engine,
 		uint32_t seqno)
 {
-	*b->mapped_seqno = seqno;
 	b->reloc.delta = (engine - VCS1) * sizeof(uint32_t);
+	*b->mapped_address = b->reloc.presumed_offset + b->reloc.delta;
+	*b->mapped_seqno = seqno;
 }
 
 static void
-- 
2.11.0


[-- Attachment #3: Type: text/plain, Size: 160 bytes --]

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-20 14:23     ` Chris Wilson
  2017-04-20 14:33       ` Chris Wilson
@ 2017-04-20 14:34       ` Tvrtko Ursulin
  2017-04-20 15:11         ` Chris Wilson
  1 sibling, 1 reply; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-04-20 14:34 UTC (permalink / raw)
  To: Chris Wilson, Tvrtko Ursulin, Intel-gfx, Tvrtko Ursulin,
	Rogozhkin, Dmitry V


On 20/04/2017 15:23, Chris Wilson wrote:
> On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote:
>> +static void
>> +alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b,
>> +		 enum intel_engine_id engine, unsigned int flags)
>> +{
>> +	unsigned int bb_i, j = 0;
>> +
>> +	b->obj[j].handle = gem_create(fd, 4096);
>> +	b->obj[j].flags = EXEC_OBJECT_WRITE;
>> +	j++;
>> +
>> +	if (flags & SEQNO) {
>> +		b->obj[j].handle = wrk->status_page_handle;
>> +		j++;
>> +	}
>> +
>> +	bb_i = j++;
>> +	b->bb_sz = get_bb_sz(w->duration.max);
>> +	b->bb_handle = b->obj[bb_i].handle = gem_create(fd, b->bb_sz);
>> +	terminate_bb(w, b, engine, flags);
>> +
>> +	igt_assert(w->dependency <= 0);
>> +	if (w->dependency) {
>> +		int dep_idx = w->idx + w->dependency;
>> +
>> +		igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
>> +		igt_assert(wrk->steps[dep_idx].type == BATCH);
>> +
>> +		b->obj[j].handle = b->obj[bb_i].handle;
>> +		bb_i = j;
>> +		b->obj[j - 1].handle = wrk->steps[dep_idx].b[0].obj[0].handle;
>> +		j++;
>> +
>> +		if (wrk->steps[dep_idx].b[1].obj[0].handle) {
>> +			b->obj[j].handle = b->obj[bb_i].handle;
>> +			bb_i = j;
>> +			b->obj[j - 1].handle =
>> +					wrk->steps[dep_idx].b[1].obj[0].handle;
>> +			j++;
>> +		}
>> +	}
>> +
>> +	if (flags & SEQNO) {
>> +		b->reloc.presumed_offset = -1;
>
> So as I understand it, you are caching the execbuf/obj/reloc for the
> workload and then may reissue later with different seqno on different
> rings? In which case we have a problem as the kernel will write back the
> updated offsets to b->reloc.presumed_offset and b->obj[].offset and in
> future passes they will match and the seqno write will go into the wrong
> slot (if it swaps rings).
>
> You either want to reset presumed_offset=-1 each time, or better for all
> concerned write the correct address alongside the seqno (which also
> enables NORELOC).
>
> Delta incoming.

Only the seqno changes, but each engine has its own eb/obj/reloc. So I 
think there is no problem. Or is there still?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-20 14:33       ` Chris Wilson
@ 2017-04-20 14:45         ` Tvrtko Ursulin
  0 siblings, 0 replies; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-04-20 14:45 UTC (permalink / raw)
  To: Chris Wilson, Tvrtko Ursulin, Intel-gfx, Tvrtko Ursulin,
	Rogozhkin, Dmitry V


On 20/04/2017 15:33, Chris Wilson wrote:
> On Thu, Apr 20, 2017 at 03:23:27PM +0100, Chris Wilson wrote:
>> You either want to reset presumed_offset=-1 each time, or better for all
>> concerned write the correct address alongside the seqno (which also
>> enables NORELOC).
>>
>> Delta incoming.
>
> See attached.
>
> Next concern is that I have full rings which implies that we are not
> waiting on each batch before resubmitting with a new seqno?
>
> If I throw a assert(!busy(batch_bo)) before the *b->mapped_seqno am I
> going to be upset?

Yes you would. :) I had a sync (as a move to cpu domain) before seqno 
update in the last version but it disappeared as I was fixing the whole 
area of seqno tracking. So the balancing results in the patch are bogus 
since the seqno can jump to the latest value ahead of time...
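
A minimal sketch of putting that sync back, on top of the v4 helpers (not
what the posted patch does):

static void
update_bb_seqno_synced(struct workload *wrk, struct w_step_eb *b,
		       enum intel_engine_id engine)
{
	/* Wait for the previous submission of this batch to retire before
	 * rewriting its seqno, so an in-flight batch cannot pick up the
	 * newer value. */
	gem_sync(fd, b->bb_handle);
	update_bb_seqno(b, engine, ++wrk->seqno[engine]);
}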

Regards,

Tvrtko

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-20 12:29   ` [PATCH i-g-t v4] " Tvrtko Ursulin
  2017-04-20 14:23     ` Chris Wilson
@ 2017-04-20 14:52     ` Chris Wilson
  2017-04-20 15:06       ` Tvrtko Ursulin
  2017-04-20 16:20     ` Chris Wilson
  2017-04-21 15:21     ` [PATCH i-g-t v5] " Tvrtko Ursulin
  3 siblings, 1 reply; 26+ messages in thread
From: Chris Wilson @ 2017-04-20 14:52 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote:
> +			wrk->nr_bb[engine]++;
> +
> +			if (engine == VCS && balance) {
> +				engine = balance(wrk, w);
> +				wrk->nr_bb[engine]++;
> +				b = &w->b[engine - VCS1];
> +
> +				if (flags & SEQNO)
> +					update_bb_seqno(b, engine,
> +							++wrk->seqno[engine]);
> +			}
> +
> +			if (w->duration.min != w->duration.max) {
> +				unsigned int d = get_duration(&w->duration);
> +				unsigned long offset;
> +
> +				offset = ALIGN(b->bb_sz - get_bb_sz(d),
> +					       2 * sizeof(uint32_t));
> +				b->eb.batch_start_offset = offset;
> +			}
> +
> +			gem_execbuf(fd, &b->eb);

Likely double counting wrk->nr_bb. I suggest placing it next to the
gem_execbuf().
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-20 14:52     ` Chris Wilson
@ 2017-04-20 15:06       ` Tvrtko Ursulin
  0 siblings, 0 replies; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-04-20 15:06 UTC (permalink / raw)
  To: Chris Wilson, Tvrtko Ursulin, Intel-gfx, Tvrtko Ursulin,
	Rogozhkin, Dmitry V


On 20/04/2017 15:52, Chris Wilson wrote:
> On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote:
>> +			wrk->nr_bb[engine]++;
>> +
>> +			if (engine == VCS && balance) {
>> +				engine = balance(wrk, w);
>> +				wrk->nr_bb[engine]++;
>> +				b = &w->b[engine - VCS1];
>> +
>> +				if (flags & SEQNO)
>> +					update_bb_seqno(b, engine,
>> +							++wrk->seqno[engine]);
>> +			}
>> +
>> +			if (w->duration.min != w->duration.max) {
>> +				unsigned int d = get_duration(&w->duration);
>> +				unsigned long offset;
>> +
>> +				offset = ALIGN(b->bb_sz - get_bb_sz(d),
>> +					       2 * sizeof(uint32_t));
>> +				b->eb.batch_start_offset = offset;
>> +			}
>> +
>> +			gem_execbuf(fd, &b->eb);
>
> Likely double counting wrk->nr_bb. I suggest placing it next to the
> gem_execbuf().

It's just a convenience in balancing mode so that nr(VCS) = nr(VCS1) + 
nr(VCS2). Also, from a different angle, if the sum does not hold, that 
means the workload had both auto-balancing and explicit VCS1/2 batches. It's only 
used to print out the stats at the end of the run.

Regards,

Tvrtko

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-20 14:34       ` Tvrtko Ursulin
@ 2017-04-20 15:11         ` Chris Wilson
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Wilson @ 2017-04-20 15:11 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Thu, Apr 20, 2017 at 03:34:56PM +0100, Tvrtko Ursulin wrote:
> 
> On 20/04/2017 15:23, Chris Wilson wrote:
> >On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote:
> >>+static void
> >>+alloc_step_batch(struct workload *wrk, struct w_step *w, struct w_step_eb *b,
> >>+		 enum intel_engine_id engine, unsigned int flags)
> >>+{
> >>+	unsigned int bb_i, j = 0;
> >>+
> >>+	b->obj[j].handle = gem_create(fd, 4096);
> >>+	b->obj[j].flags = EXEC_OBJECT_WRITE;
> >>+	j++;
> >>+
> >>+	if (flags & SEQNO) {
> >>+		b->obj[j].handle = wrk->status_page_handle;
> >>+		j++;
> >>+	}
> >>+
> >>+	bb_i = j++;
> >>+	b->bb_sz = get_bb_sz(w->duration.max);
> >>+	b->bb_handle = b->obj[bb_i].handle = gem_create(fd, b->bb_sz);
> >>+	terminate_bb(w, b, engine, flags);
> >>+
> >>+	igt_assert(w->dependency <= 0);
> >>+	if (w->dependency) {
> >>+		int dep_idx = w->idx + w->dependency;
> >>+
> >>+		igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
> >>+		igt_assert(wrk->steps[dep_idx].type == BATCH);
> >>+
> >>+		b->obj[j].handle = b->obj[bb_i].handle;
> >>+		bb_i = j;
> >>+		b->obj[j - 1].handle = wrk->steps[dep_idx].b[0].obj[0].handle;
> >>+		j++;
> >>+
> >>+		if (wrk->steps[dep_idx].b[1].obj[0].handle) {
> >>+			b->obj[j].handle = b->obj[bb_i].handle;
> >>+			bb_i = j;
> >>+			b->obj[j - 1].handle =
> >>+					wrk->steps[dep_idx].b[1].obj[0].handle;
> >>+			j++;
> >>+		}
> >>+	}
> >>+
> >>+	if (flags & SEQNO) {
> >>+		b->reloc.presumed_offset = -1;
> >
> >So as I understand it, you are caching the execbuf/obj/reloc for the
> >workload and then may reissue later with different seqno on different
> >rings? In which case we have a problem as the kernel will write back the >
> >
> >updated offsets to b->reloc.presumed_offset and b->obj[].offset and in
> >future passes they will match and the seqno write will go into the wrong
> >slot (if it swaps rings).
> >
> >You either want to reset presumed_offset=-1 each time, or better for all
> >concerned write the correct address alongside the seqno (which also
> >enables NORELOC).
> >
> >Delta incoming.
> 
> Only the seqno changes, but each engine has its own eb/obj/reloc. So
> I think there is no problem. Or is there still?

bb_handle is per engine as well. Ugh. No, that seems like you are
self-consistent, you just need to remove the -1 and your code is NORELOC
correct.

I wouldn't go so far as having entirely separate batches for each engine
though :)
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH i-g-t v4] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-20 12:29   ` [PATCH i-g-t v4] " Tvrtko Ursulin
  2017-04-20 14:23     ` Chris Wilson
  2017-04-20 14:52     ` Chris Wilson
@ 2017-04-20 16:20     ` Chris Wilson
  2017-04-21 15:21     ` [PATCH i-g-t v5] " Tvrtko Ursulin
  3 siblings, 0 replies; 26+ messages in thread
From: Chris Wilson @ 2017-04-20 16:20 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Thu, Apr 20, 2017 at 01:29:11PM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> +static void
> +run_workload(unsigned int id, struct workload *wrk, unsigned int repeat,
> +	     enum intel_engine_id (*balance)(struct workload *wrk,
> +					     struct w_step *w),
> +	     unsigned int flags)
> +{
> +	struct timespec t_start, t_end;
> +	struct w_step *w;
> +	double t;
> +	int i, j;
> +
> +	clock_gettime(CLOCK_MONOTONIC, &t_start);
> +
> +	srand(t_start.tv_nsec);

Let's supply the seed with the workload specification. And use a
portable prng so we can be sure we can reproduce results from one system
to the next.
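
For reference, a minimal sketch of one portable option (plain xorshift32;
the series ends up using IGT's hars_petruska_f54_1_random helpers instead):

	/* Sketch only: a tiny generator with no libc dependence, so the
	 * same seed reproduces the same batch durations on any system. */
	static uint32_t prng_state = 1; /* would come from the workload spec */

	static uint32_t prng_next(void)
	{
		uint32_t x = prng_state;

		x ^= x << 13;
		x ^= x >> 17;
		x ^= x << 5;
		return prng_state = x;
	}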
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH i-g-t v5] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-20 12:29   ` [PATCH i-g-t v4] " Tvrtko Ursulin
                       ` (2 preceding siblings ...)
  2017-04-20 16:20     ` Chris Wilson
@ 2017-04-21 15:21     ` Tvrtko Ursulin
  2017-04-25 11:13       ` [PATCH i-g-t v6] " Tvrtko Ursulin
  3 siblings, 1 reply; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-04-21 15:21 UTC (permalink / raw)
  To: Intel-gfx

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Tool which emits batch buffers to engines with configurable
sequences, durations, contexts, dependencies and userspace waits.

Unfinished but shows promise so sending out for early feedback.

v2:
 * Load workload descriptors from files. (also -w)
 * Help text.
 * Calibration control if needed. (-t)
 * NORELOC | LUT to eb flags.
 * Added sample workload to wsim/workload1.

v3:
 * Multiple parallel different workloads (-w -w ...).
 * Multi-context workloads.
 * Variable (random) batch length.
 * Load balancing (round robin and queue depth estimation).
 * Workloads delays and explicit sync steps.
 * Workload frequency (period) control.

v4:
 * Fixed queue-depth estimation by creating separate batches
   per engine when qd load balancing is on.
 * Dropped separate -s cmd line option. It can turn itself on
   automatically when needed.
 * Keep a single status page and lie about the write hazard
   as suggested by Chris.
 * Use batch_start_offset for controlling the batch duration.
   (Chris)
 * Set status page object cache level. (Chris)
 * Moved workload description to a README.
 * Tidied example workloads.
 * Some other cleanups and refactorings.

v5:
 * Master and background workloads (-W / -w).
 * Single batch per step is enough even when balancing. (Chris)
 * Use hars_petruska_f54_1_random IGT functions and seed to zero
   at start. (Chris)
 * Use WC cache domain when WC mapping. (Chris)
 * Keep seqnos 64-bytes apart in the status page. (Chris)
 * Add workload throttling and queue-depth throttling commands.
   (Chris)

TODO list:

 * Fence support.
 * Better error handling.
 * Less 1980's workload parsing.
 * Proper workloads.
 * Threads?
 * ... ?

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
---
 benchmarks/Makefile.sources                  |    1 +
 benchmarks/gem_wsim.c                        | 1189 ++++++++++++++++++++++++++
 benchmarks/wsim/README                       |   56 ++
 benchmarks/wsim/media_17i7.wsim              |    7 +
 benchmarks/wsim/media_load_balance_17i7.wsim |    7 +
 benchmarks/wsim/vcs1.wsim                    |   26 +
 benchmarks/wsim/vcs_balanced.wsim            |   26 +
 lib/igt_core.c                               |   26 +
 lib/igt_core.h                               |    1 +
 9 files changed, 1339 insertions(+)
 create mode 100644 benchmarks/gem_wsim.c
 create mode 100644 benchmarks/wsim/README
 create mode 100644 benchmarks/wsim/media_17i7.wsim
 create mode 100644 benchmarks/wsim/media_load_balance_17i7.wsim
 create mode 100644 benchmarks/wsim/vcs1.wsim
 create mode 100644 benchmarks/wsim/vcs_balanced.wsim

diff --git a/benchmarks/Makefile.sources b/benchmarks/Makefile.sources
index 3af54ebe36f2..3a941150abb3 100644
--- a/benchmarks/Makefile.sources
+++ b/benchmarks/Makefile.sources
@@ -14,6 +14,7 @@ benchmarks_prog_list =			\
 	gem_prw				\
 	gem_set_domain			\
 	gem_syslatency			\
+	gem_wsim			\
 	kms_vblank			\
 	prime_lookup			\
 	vgem_mmap			\
diff --git a/benchmarks/gem_wsim.c b/benchmarks/gem_wsim.c
new file mode 100644
index 000000000000..3d6670fdb815
--- /dev/null
+++ b/benchmarks/gem_wsim.c
@@ -0,0 +1,1189 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <errno.h>
+#include <poll.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <assert.h>
+#include <limits.h>
+
+
+#include "intel_chipset.h"
+#include "drm.h"
+#include "ioctl_wrappers.h"
+#include "drmtest.h"
+#include "intel_io.h"
+#include "igt_rand.h"
+
+enum intel_engine_id {
+	RCS,
+	BCS,
+	VCS,
+	VCS1,
+	VCS2,
+	VECS,
+	NUM_ENGINES
+};
+
+struct duration {
+	unsigned int min, max;
+};
+
+enum w_type
+{
+	BATCH,
+	SYNC,
+	DELAY,
+	PERIOD,
+	THROTTLE,
+	QD_THROTTLE
+};
+
+struct w_step
+{
+	/* Workload step metadata */
+	enum w_type type;
+	unsigned int context;
+	unsigned int engine;
+	struct duration duration;
+	int dependency;
+	int wait;
+
+	/* Implementation details */
+	unsigned int idx;
+
+	struct drm_i915_gem_execbuffer2 eb;
+	struct drm_i915_gem_exec_object2 obj[4];
+	struct drm_i915_gem_relocation_entry reloc;
+	unsigned long bb_sz;
+	uint32_t bb_handle;
+	uint32_t *mapped_batch, *mapped_seqno;
+	unsigned int mapped_len;
+};
+
+struct workload
+{
+	unsigned int nr_steps;
+	struct w_step *steps;
+
+	struct timespec repeat_start;
+
+	int pipe[2];
+
+	unsigned int nr_ctxs;
+	uint32_t *ctx_id;
+
+	unsigned long seqno[NUM_ENGINES];
+	uint32_t status_page_handle;
+	uint32_t *status_page;
+	unsigned int vcs_rr;
+
+	unsigned long qd_sum[NUM_ENGINES];
+	unsigned long nr_bb[NUM_ENGINES];
+};
+
+static const unsigned int eb_engine_map[NUM_ENGINES] = {
+	[RCS] = I915_EXEC_RENDER,
+	[BCS] = I915_EXEC_BLT,
+	[VCS] = I915_EXEC_BSD,
+	[VCS1] = I915_EXEC_BSD | I915_EXEC_BSD_RING1,
+	[VCS2] = I915_EXEC_BSD | I915_EXEC_BSD_RING2,
+	[VECS] = I915_EXEC_VEBOX
+};
+
+static const unsigned int nop_calibration_us = 1000;
+static unsigned long nop_calibration;
+
+static bool quiet;
+static int fd;
+
+#define SWAPVCS	(1<<0)
+#define SEQNO	(1<<1)
+#define BALANCE	(1<<2)
+
+#define VCS_SEQNO_IDX(vcs_instance) ((vcs_instance) * 16)
+
+/*
+ * Workload descriptor:
+ *
+ * ctx.engine.duration.dependency.wait,...
+ * <uint>.<str>.<uint>.<int <= 0>.<0|1>,...
+ *
+ * Engine ids: RCS, BCS, VCS, VCS1, VCS2, VECS
+ *
+ * "1.VCS1.3000.0.1,1.RCS.1000.-1.0,1.RCS.3700.0.0,1.RCS.1000.-2.0,1.VCS2.2300.-2.0,1.RCS.4700.-1.0,1.VCS2.600.-1.1"
+ */
+
+static const char *ring_str_map[NUM_ENGINES] = {
+	[RCS] = "RCS",
+	[BCS] = "BCS",
+	[VCS] = "VCS",
+	[VCS1] = "VCS1",
+	[VCS2] = "VCS2",
+	[VECS] = "VECS",
+};
+
+static struct workload *parse_workload(char *_desc)
+{
+	struct workload *wrk;
+	unsigned int nr_steps = 0;
+	char *desc = strdup(_desc);
+	char *_token, *token, *tctx = NULL, *tstart = desc;
+	char *field, *fctx = NULL, *fstart;
+	struct w_step step, *steps = NULL;
+	unsigned int valid;
+	int tmp;
+
+	while ((_token = strtok_r(tstart, ",", &tctx)) != NULL) {
+		tstart = NULL;
+		token = strdup(_token);
+		fstart = token;
+		valid = 0;
+		memset(&step, 0, sizeof(step));
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			if (!strcasecmp(field, "d")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp <= 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid delay at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = DELAY;
+					step.wait = tmp;
+					goto add_step;
+				}
+			} else if (!strcasecmp(field, "p")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp <= 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid period at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = PERIOD;
+					step.wait = tmp;
+					goto add_step;
+				}
+			} else if (!strcasecmp(field, "s")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp >= 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid sync target at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = SYNC;
+					step.wait = tmp;
+					goto add_step;
+				}
+			} else if (!strcasecmp(field, "t")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp < 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid throttle at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = THROTTLE;
+					step.wait = tmp;
+					goto add_step;
+				}
+			} else if (!strcasecmp(field, "q")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp < 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid qd throttle at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = QD_THROTTLE;
+					step.wait = tmp;
+					goto add_step;
+				}
+			}
+
+			tmp = atoi(field);
+			if (tmp < 0) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid ctx id at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.context = tmp;
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			unsigned int i, old_valid = valid;
+
+			fstart = NULL;
+
+			for (i = 0; i < ARRAY_SIZE(ring_str_map); i++) {
+				if (!strcasecmp(field, ring_str_map[i])) {
+					step.engine = i;
+					valid++;
+					break;
+				}
+			}
+
+			if (old_valid == valid) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid engine id at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			char *sep = NULL;
+			long int tmpl;
+
+			fstart = NULL;
+
+			tmpl = strtol(field, &sep, 10);
+			if (tmpl == LONG_MIN || tmpl == LONG_MAX) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid duration at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.duration.min = tmpl;
+
+			if (sep && *sep == '-') {
+				tmpl = strtol(sep + 1, NULL, 10);
+				if (tmpl == LONG_MIN || tmpl == LONG_MAX) {
+					if (!quiet)
+						fprintf(stderr,
+							"Invalid duration range at step %u!\n",
+							nr_steps);
+					return NULL;
+				}
+				step.duration.max = tmpl;
+			} else {
+				step.duration.max = step.duration.min;
+			}
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			tmp = atoi(field);
+			if (tmp > 0) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid forward dependency at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.dependency = tmp;
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			tmp = atoi(field);
+			if (tmp != 0 && tmp != 1) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid wait boolean at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.wait = tmp;
+
+			valid++;
+		}
+
+		if (valid != 5) {
+			if (!quiet)
+				fprintf(stderr, "Invalid record at step %u!\n",
+					nr_steps);
+			return NULL;
+		}
+
+		step.type = BATCH;
+
+add_step:
+		step.idx = nr_steps++;
+		steps = realloc(steps, sizeof(step) * nr_steps);
+		igt_assert(steps);
+
+		memcpy(&steps[nr_steps - 1], &step, sizeof(step));
+
+		free(token);
+	}
+
+	wrk = malloc(sizeof(*wrk));
+	igt_assert(wrk);
+
+	wrk->nr_steps = nr_steps;
+	wrk->steps = steps;
+
+	free(desc);
+
+	return wrk;
+}
+
+static struct workload *
+clone_workload(struct workload *_wrk)
+{
+	struct workload *wrk;
+
+	wrk = malloc(sizeof(*wrk));
+	igt_assert(wrk);
+	memset(wrk, 0, sizeof(*wrk));
+
+	wrk->nr_steps = _wrk->nr_steps;
+	wrk->steps = calloc(wrk->nr_steps, sizeof(struct w_step));
+	igt_assert(wrk->steps);
+
+	memcpy(wrk->steps, _wrk->steps, sizeof(struct w_step) * wrk->nr_steps);
+
+	return wrk;
+}
+
+#define rounddown(x, y) (x - (x%y))
+#ifndef PAGE_SIZE
+#define PAGE_SIZE (4096)
+#endif
+
+static unsigned int get_duration(struct duration *dur)
+{
+	if (dur->min == dur->max)
+		return dur->min;
+	else
+		return dur->min + hars_petruska_f54_1_random_unsafe() %
+		       (dur->max + 1 - dur->min);
+}
+
+static unsigned long get_bb_sz(unsigned int duration)
+{
+	return ALIGN(duration * nop_calibration * sizeof(uint32_t) /
+		     nop_calibration_us, sizeof(uint32_t));
+}
+
+static void
+terminate_bb(struct w_step *w, unsigned int flags)
+{
+	const uint32_t bbe = 0xa << 23;
+	unsigned long mmap_start, cmd_offset, mmap_len;
+	uint32_t *ptr, *cs;
+
+	mmap_len = 1;
+	if (flags & SEQNO)
+		mmap_len += 4;
+	mmap_len *= sizeof(uint32_t);
+	cmd_offset = w->bb_sz - mmap_len;
+	mmap_start = rounddown(cmd_offset, PAGE_SIZE);
+	mmap_len += cmd_offset - mmap_start;
+
+	gem_set_domain(fd, w->bb_handle,
+		       I915_GEM_DOMAIN_WC, I915_GEM_DOMAIN_WC);
+
+	ptr = gem_mmap__wc(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
+	cs = (uint32_t *)((char *)ptr + cmd_offset - mmap_start);
+
+	if (flags & SEQNO) {
+		w->reloc.offset = w->bb_sz - 4 * sizeof(uint32_t);
+
+		*cs++ = MI_STORE_DWORD_IMM;
+		*cs++ = 0;
+		*cs++ = 0;
+		w->mapped_seqno = cs;
+		*cs++ = 0;
+	}
+
+	*cs = bbe;
+
+	w->mapped_batch = ptr;
+	w->mapped_len = mmap_len;
+}
+
+static void
+eb_update_flags(struct w_step *w, enum intel_engine_id engine,
+		unsigned int flags)
+{
+	w->eb.flags = eb_engine_map[engine];
+	w->eb.flags |= I915_EXEC_HANDLE_LUT;
+	if (!(flags & SEQNO))
+		w->eb.flags |= I915_EXEC_NO_RELOC;
+}
+
+static void
+alloc_step_batch(struct workload *wrk, struct w_step *w, unsigned int flags)
+{
+	enum intel_engine_id engine = w->engine;
+	unsigned int bb_i, j = 0;
+
+	w->obj[j].handle = gem_create(fd, 4096);
+	w->obj[j].flags = EXEC_OBJECT_WRITE;
+	j++;
+
+	if (flags & SEQNO) {
+		w->obj[j].handle = wrk->status_page_handle;
+		j++;
+	}
+
+	bb_i = j++;
+	w->bb_sz = get_bb_sz(w->duration.max);
+	w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
+	terminate_bb(w, flags);
+
+	igt_assert(w->dependency <= 0);
+	if (w->dependency) {
+		int dep_idx = w->idx + w->dependency;
+
+		igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
+		igt_assert(wrk->steps[dep_idx].type == BATCH);
+
+		w->obj[j].handle = w->obj[bb_i].handle;
+		bb_i = j;
+		w->obj[j - 1].handle = wrk->steps[dep_idx].obj[0].handle;
+		j++;
+	}
+
+	if (flags & SEQNO) {
+		w->reloc.presumed_offset = -1;
+		w->reloc.target_handle = 1;
+		w->obj[bb_i].relocs_ptr = to_user_pointer(&w->reloc);
+		w->obj[bb_i].relocation_count = 1;
+	}
+
+	w->eb.buffers_ptr = to_user_pointer(w->obj);
+	w->eb.buffer_count = j;
+	w->eb.rsvd1 = wrk->ctx_id[w->context];
+
+	if (flags & SWAPVCS && engine == VCS1)
+		engine = VCS2;
+	else if (flags & SWAPVCS && engine == VCS2)
+		engine = VCS1;
+	eb_update_flags(w, engine, flags);
+#ifdef DEBUG
+	printf("%u: %u:%x|%x|%x|%x %10lu flags=%llx bb=%x[%u] ctx[%u]=%u\n",
+		w->idx, w->eb.buffer_count, w->obj[0].handle,
+		w->obj[1].handle, w->obj[2].handle, w->obj[3].handle,
+		w->bb_sz, w->eb.flags, w->bb_handle, bb_i,
+		w->context, wrk->ctx_id[w->context]);
+#endif
+}
+
+static void
+prepare_workload(struct workload *wrk, unsigned int flags)
+{
+	int max_ctx = -1;
+	struct w_step *w;
+	int i;
+
+	if (flags & SEQNO) {
+		const unsigned int status_sz = sizeof(uint32_t);
+		uint32_t handle = gem_create(fd, status_sz);
+
+		gem_set_caching(fd, handle, I915_CACHING_CACHED);
+		wrk->status_page_handle = handle;
+		wrk->status_page = gem_mmap__cpu(fd, handle, 0, status_sz,
+						 PROT_READ);
+	}
+
+	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+		if ((int)w->context > max_ctx) {
+			int delta = w->context + 1 - wrk->nr_ctxs;
+
+			wrk->nr_ctxs += delta;
+			wrk->ctx_id = realloc(wrk->ctx_id,
+					      wrk->nr_ctxs * sizeof(uint32_t));
+			memset(&wrk->ctx_id[wrk->nr_ctxs - delta], 0,
+			       delta * sizeof(uint32_t));
+
+			max_ctx = w->context;
+		}
+
+		if (!wrk->ctx_id[w->context]) {
+			struct drm_i915_gem_context_create arg = {};
+
+			drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &arg);
+			igt_assert(arg.ctx_id);
+
+			wrk->ctx_id[w->context] = arg.ctx_id;
+		}
+	}
+
+	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+		unsigned int _flags = flags;
+		enum intel_engine_id engine = w->engine;
+
+		if (w->type != BATCH)
+			continue;
+
+		if (engine != VCS && engine != VCS1 && engine != VCS2)
+			_flags &= ~SEQNO;
+
+		if (engine == VCS)
+			_flags &= ~SWAPVCS;
+
+		alloc_step_batch(wrk, w, _flags);
+	}
+}
+
+static double elapsed(const struct timespec *start, const struct timespec *end)
+{
+	return (end->tv_sec - start->tv_sec) +
+	       (end->tv_nsec - start->tv_nsec) / 1e9;
+}
+
+static int elapsed_us(const struct timespec *start, const struct timespec *end)
+{
+	return elapsed(start, end) * 1e6;
+}
+
+static enum intel_engine_id get_vcs_engine(unsigned int n)
+{
+	const enum intel_engine_id vcs_engines[2] = { VCS1, VCS2 };
+
+	igt_assert(n < ARRAY_SIZE(vcs_engines));
+
+	return vcs_engines[n];
+}
+
+struct workload_balancer {
+	unsigned int (*get_qd)(const struct workload_balancer *balancer,
+			       struct workload *wrk,
+			       enum intel_engine_id engine);
+	enum intel_engine_id (*balance)(const struct workload_balancer *balancer,
+					struct workload *wrk, struct w_step *w);
+};
+
+static enum intel_engine_id
+rr_balance(const struct workload_balancer *balancer,
+	   struct workload *wrk, struct w_step *w)
+{
+	unsigned int engine;
+
+	engine = get_vcs_engine(wrk->vcs_rr);
+	wrk->vcs_rr ^= 1;
+
+	return engine;
+}
+
+static const struct workload_balancer rr_balancer = {
+	.balance = rr_balance,
+};
+
+static unsigned int
+get_qd_depth(const struct workload_balancer *balancer,
+	     struct workload *wrk, enum intel_engine_id engine)
+{
+	return wrk->seqno[engine] -
+	       wrk->status_page[VCS_SEQNO_IDX(engine - VCS1)];
+}
+
+static enum intel_engine_id
+qd_balance(const struct workload_balancer *balancer,
+	   struct workload *wrk, struct w_step *w)
+{
+	enum intel_engine_id engine;
+	long qd[NUM_ENGINES];
+	unsigned int n;
+
+	igt_assert(w->engine == VCS);
+
+	qd[VCS1] = balancer->get_qd(balancer, wrk, VCS1);
+	wrk->qd_sum[VCS1] += qd[VCS1];
+
+	qd[VCS2] = balancer->get_qd(balancer, wrk, VCS2);
+	wrk->qd_sum[VCS2] += qd[VCS2];
+
+	if (qd[VCS1] < qd[VCS2])
+		n = 0;
+	else if (qd[VCS2] < qd[VCS1])
+		n = 1;
+	else
+		n = wrk->vcs_rr;
+
+	engine = get_vcs_engine(n);
+	wrk->vcs_rr = n ^ 1;
+
+#ifdef DEBUG
+	printf("qd_balance: 1:%ld 2:%ld rr:%u = %u\t(%lu - %u) (%lu - %u)\n",
+	       qd[VCS1], qd[VCS2], wrk->vcs_rr, engine,
+	       wrk->seqno[VCS1], wrk->status_page[VCS_SEQNO_IDX(0)],
+	       wrk->seqno[VCS2], wrk->status_page[VCS_SEQNO_IDX(1)]);
+#endif
+	return engine;
+}
+
+static const struct workload_balancer qd_balancer = {
+	.get_qd = get_qd_depth,
+	.balance = qd_balance,
+};
+
+static void
+update_bb_seqno(struct w_step *w, enum intel_engine_id engine, uint32_t seqno)
+{
+	igt_assert(engine == VCS1 || engine == VCS2);
+
+	gem_set_domain(fd, w->bb_handle,
+		       I915_GEM_DOMAIN_WC, I915_GEM_DOMAIN_WC);
+
+	*w->mapped_seqno = seqno;
+	w->reloc.presumed_offset = -1;
+	w->reloc.delta = VCS_SEQNO_IDX(engine - VCS1) * sizeof(uint32_t);
+}
+
+static void w_sync_to(struct workload *wrk, struct w_step *w, int target)
+{
+	if (target < 0)
+		target = wrk->nr_steps + target;
+
+	igt_assert(target < wrk->nr_steps);
+
+	while (wrk->steps[target].type != BATCH) {
+		if (--target < 0)
+			target = wrk->nr_steps + target;
+	}
+
+	igt_assert(target < wrk->nr_steps);
+	igt_assert(wrk->steps[target].type == BATCH);
+
+	gem_sync(fd, wrk->steps[target].obj[0].handle);
+}
+
+static void
+run_workload(unsigned int id, struct workload *wrk,
+	     bool background, int pipe_fd,
+	     const struct workload_balancer *balancer,
+	     unsigned int repeat,
+	     unsigned int flags)
+{
+	struct timespec t_start, t_end;
+	struct w_step *w;
+	bool run = true;
+	int throttle = -1;
+	int qd_throttle = -1;
+	double t;
+	int i, j;
+
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+
+	hars_petruska_f54_1_random_seed(0);
+
+	for (j = 0; run && (background || j < repeat); j++) {
+		for (i = 0, w = wrk->steps; run && (i < wrk->nr_steps);
+		     i++, w++) {
+			enum intel_engine_id engine = w->engine;
+			int do_sleep = 0;
+
+			if (i == 0)
+				clock_gettime(CLOCK_MONOTONIC,
+					      &wrk->repeat_start);
+
+			if (w->type == DELAY) {
+				do_sleep = w->wait;
+			} else if (w->type == PERIOD) {
+				struct timespec now;
+
+				clock_gettime(CLOCK_MONOTONIC, &now);
+				do_sleep = w->wait -
+					   elapsed_us(&wrk->repeat_start, &now);
+				if (do_sleep < 0) {
+					if (!quiet)
+						printf("%u: Dropped period @ %u/%u (%dus late)!\n",
+						       id, j, i, do_sleep);
+					/* Period already missed - move on. */
+					continue;
+				}
+			} else if (w->type == SYNC) {
+				unsigned int s_idx = i + w->wait;
+
+				igt_assert(i > 0 && i < wrk->nr_steps);
+				igt_assert(wrk->steps[s_idx].type == BATCH);
+				gem_sync(fd, wrk->steps[s_idx].obj[0].handle);
+				continue;
+			} else if (w->type == THROTTLE) {
+				throttle = w->wait;
+				continue;
+			} else if (w->type == QD_THROTTLE) {
+				qd_throttle = w->wait;
+				continue;
+			}
+
+			if (do_sleep) {
+				usleep(do_sleep);
+				continue;
+			}
+
+			wrk->nr_bb[engine]++;
+
+			if (engine == VCS && balancer) {
+				engine = balancer->balance(balancer, wrk, w);
+				wrk->nr_bb[engine]++;
+
+				eb_update_flags(w, engine, flags);
+
+				if (flags & SEQNO)
+					update_bb_seqno(w, engine,
+							++wrk->seqno[engine]);
+			}
+
+			if (w->duration.min != w->duration.max) {
+				unsigned int d = get_duration(&w->duration);
+				unsigned long offset;
+
+				offset = ALIGN(w->bb_sz - get_bb_sz(d),
+					       2 * sizeof(uint32_t));
+				w->eb.batch_start_offset = offset;
+			}
+
+			/* If the workload wants qd throttling but qd info is
+			 * not available, approximate with normal throttling. */
+			if (qd_throttle > 0 && throttle < 0 &&
+			    !(balancer && balancer->get_qd))
+				throttle = qd_throttle;
+
+			if (throttle > 0)
+				w_sync_to(wrk, w, i - throttle);
+
+			if (qd_throttle > 0 && balancer && balancer->get_qd) {
+				unsigned int target;
+
+				for (target = wrk->nr_steps - 1; target > 0;
+				     target--) {
+					if (balancer->get_qd(balancer, wrk,
+							     engine) <
+					    qd_throttle)
+						break;
+					w_sync_to(wrk, w, i - target);
+				}
+			}
+
+			gem_execbuf(fd, &w->eb);
+
+			if (pipe_fd >= 0) {
+				struct pollfd fds;
+
+				fds.fd = pipe_fd;
+				fds.events = POLLHUP;
+				if (poll(&fds, 1, 0)) {
+					run = false;
+					break;
+				}
+			}
+
+			if (w->wait)
+				gem_sync(fd, w->obj[0].handle);
+		}
+	}
+
+	if (run)
+		gem_sync(fd, wrk->steps[wrk->nr_steps - 1].obj[0].handle);
+
+	clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+	t = elapsed(&t_start, &t_end);
+	if (!quiet && !balancer)
+		printf("%c%u: %.3fs elapsed (%.3f workloads/s)\n",
+		       background ? ' ' : '*', id, t, repeat / t);
+	else if (!quiet && !balancer->get_qd)
+		printf("%c%u: %.3fs elapsed (%.3f workloads/s). %lu (%lu + %lu) total VCS batches.\n",
+		       background ? ' ' : '*', id, t, repeat / t,
+		       wrk->nr_bb[VCS], wrk->nr_bb[VCS1], wrk->nr_bb[VCS2]);
+	else if (!quiet && balancer)
+		printf("%c%u: %.3fs elapsed (%.3f workloads/s). %lu (%lu + %lu) total VCS batches. Average queue depths %.3f, %.3f.\n",
+		       background ? ' ' : '*', id, t, repeat / t,
+		       wrk->nr_bb[VCS], wrk->nr_bb[VCS1], wrk->nr_bb[VCS2],
+		       (double)wrk->qd_sum[VCS1] / wrk->nr_bb[VCS],
+		       (double)wrk->qd_sum[VCS2] / wrk->nr_bb[VCS]);
+}
+
+static void fini_workload(struct workload *wrk)
+{
+	free(wrk->steps);
+	free(wrk);
+}
+
+static unsigned long calibrate_nop(unsigned int tolerance_pct)
+{
+	const uint32_t bbe = 0xa << 23;
+	unsigned int loops = 17;
+	unsigned int usecs = nop_calibration_us;
+	struct drm_i915_gem_exec_object2 obj = {};
+	struct drm_i915_gem_execbuffer2 eb =
+		{ .buffer_count = 1, .buffers_ptr = (uintptr_t)&obj};
+	long size, last_size;
+	struct timespec t_0, t_end;
+
+	clock_gettime(CLOCK_MONOTONIC, &t_0);
+
+	size = 256 * 1024;
+	do {
+		struct timespec t_start;
+
+		obj.handle = gem_create(fd, size);
+		gem_write(fd, obj.handle, size - sizeof(bbe), &bbe,
+			  sizeof(bbe));
+		gem_execbuf(fd, &eb);
+		gem_sync(fd, obj.handle);
+
+		clock_gettime(CLOCK_MONOTONIC, &t_start);
+		for (int loop = 0; loop < loops; loop++)
+			gem_execbuf(fd, &eb);
+		gem_sync(fd, obj.handle);
+		clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+		gem_close(fd, obj.handle);
+
+		last_size = size;
+		size = loops * size / elapsed(&t_start, &t_end) / 1e6 * usecs;
+		size = ALIGN(size, sizeof(uint32_t));
+	} while (elapsed(&t_0, &t_end) < 5 ||
+		 abs(size - last_size) > (size * tolerance_pct / 100));
+
+	return size / sizeof(uint32_t);
+}
+
+static void print_help(void)
+{
+	puts(
+"Usage: gem_wsim [OPTIONS]\n"
+"\n"
+"Runs a simulated workload on the GPU.\n"
+"When ran without arguments performs a GPU calibration result of which needs\n"
+"to be provided when running the simulation in subsequent invocations.\n"
+"\n"
+"Options:\n"
+"	-h		This text.\n"
+"	-q		Be quiet - do not output anything to stdout.\n"
+"	-n <n>		Nop calibration value.\n"
+"	-t <n>		Nop calibration tolerance percentage.\n"
+"			Use when there is a difficulty obtaining calibration\n"
+"			with the default settings.\n"
+"	-w <desc|path>	Filename or a workload descriptor.\n"
+"			Can be given multiple times.\n"
+"	-W <desc|path>	Filename or a master workload descriptor.\n"
+"			Only one master workload can be optinally specified\n"
+"			in which case all other workloads become background\n"
+"			ones and run as long as the master.\n"
+"	-r <n>		How many times to emit the workload.\n"
+"	-c <n>		Fork N clients emitting the workload simultaneously.\n"
+"	-x		Swap VCS1 and VCS2 engines in every other client.\n"
+"	-b <n>		Load balancing to use. (0: rr, 1: qd)\n"
+	);
+}
+
+static char *load_workload_descriptor(char *filename)
+{
+	struct stat sbuf;
+	char *buf;
+	int infd, ret, i;
+	ssize_t len;
+
+	ret = stat(filename, &sbuf);
+	if (ret || !S_ISREG(sbuf.st_mode))
+		return filename;
+
+	igt_assert(sbuf.st_size < 1024 * 1024); /* Just so. */
+	buf = malloc(sbuf.st_size);
+	igt_assert(buf);
+
+	infd = open(filename, O_RDONLY);
+	igt_assert(infd >= 0);
+	len = read(infd, buf, sbuf.st_size);
+	igt_assert(len == sbuf.st_size);
+	close(infd);
+
+	for (i = 0; i < len; i++) {
+		if (buf[i] == '\n')
+			buf[i] = ',';
+	}
+
+	len--;
+	while (buf[len] == ',')
+		buf[len--] = 0;
+
+	return buf;
+}
+
+static char **
+add_workload_arg(char **w_args, unsigned int nr_args, char *w_arg)
+{
+	w_args = realloc(w_args, sizeof(char *) * nr_args);
+	igt_assert(w_args);
+	w_args[nr_args - 1] = w_arg;
+
+	return w_args;
+}
+
+int main(int argc, char **argv)
+{
+	unsigned int repeat = 1;
+	unsigned int clients = 1;
+	unsigned int flags = 0;
+	struct timespec t_start, t_end;
+	struct workload **w, **wrk = NULL;
+	unsigned int nr_w_args = 0;
+	int master_workload = -1;
+	char **w_args = NULL;
+	unsigned int tolerance_pct = 1;
+	const struct workload_balancer *balancer = NULL;
+	double t;
+	int i, c;
+
+	fd = drm_open_driver(DRIVER_INTEL);
+
+	while ((c = getopt(argc, argv, "qc:n:r:xw:W:t:b:h")) != -1) {
+		switch (c) {
+		case 'W':
+			if (master_workload >= 0) {
+				if (!quiet)
+					fprintf(stderr,
+						"Only one master workload can be given!\n");
+				return 1;
+			}
+			master_workload = nr_w_args;
+			/* Fall through */
+		case 'w':
+			w_args = add_workload_arg(w_args, ++nr_w_args, optarg);
+			break;
+		case 'c':
+			clients = strtol(optarg, NULL, 0);
+			break;
+		case 't':
+			tolerance_pct = strtol(optarg, NULL, 0);
+			break;
+		case 'n':
+			nop_calibration = strtol(optarg, NULL, 0);
+			break;
+		case 'r':
+			repeat = strtol(optarg, NULL, 0);
+			break;
+		case 'q':
+			quiet = true;
+			break;
+		case 'x':
+			flags |= SWAPVCS;
+			break;
+		case 'b':
+			switch (strtol(optarg, NULL, 0)) {
+			case 0:
+				balancer = &rr_balancer;
+				flags |= BALANCE;
+				break;
+			case 1:
+				igt_assert(intel_gen(intel_get_drm_devid(fd)) >=
+					   8);
+				balancer = &qd_balancer;
+				flags |= SEQNO | BALANCE;
+				break;
+			default:
+				if (!quiet)
+					fprintf(stderr,
+						"Unknown balancing mode '%s'!\n",
+						optarg);
+				return 1;
+			}
+			break;
+		case 'h':
+			print_help();
+			return 0;
+		default:
+			return 1;
+		}
+	}
+
+	if (!nop_calibration) {
+		if (!quiet)
+			printf("Calibrating nop delay with %u%% tolerance...\n",
+				tolerance_pct);
+		nop_calibration = calibrate_nop(tolerance_pct);
+		if (!quiet)
+			printf("Nop calibration for %uus delay is %lu.\n",
+			       nop_calibration_us, nop_calibration);
+
+		return 0;
+	}
+
+	if (!nr_w_args) {
+		if (!quiet)
+			fprintf(stderr, "No workload descriptor(s)!\n");
+		return 1;
+	}
+
+	if (nr_w_args > 1 && clients > 1) {
+		if (!quiet)
+			fprintf(stderr,
+				"Cloned clients cannot be combined with multiple workloads!\n");
+		return 1;
+	}
+
+	wrk = calloc(nr_w_args, sizeof(*wrk));
+	igt_assert(wrk);
+
+	for (i = 0; i < nr_w_args; i++) {
+		w_args[i] = load_workload_descriptor(w_args[i]);
+		if (!w_args[i]) {
+			if (!quiet)
+				fprintf(stderr,
+					"Failed to load workload descriptor %u!\n",
+					i);
+			return 1;
+		}
+
+		wrk[i] = parse_workload(w_args[i]);
+		if (!wrk[i]) {
+			if (!quiet)
+				fprintf(stderr,
+					"Failed to parse workload %u!\n", i);
+			return 1;
+		}
+	}
+
+	if (!quiet) {
+		printf("Using %lu nop calibration for %uus delay.\n",
+		       nop_calibration, nop_calibration_us);
+		if (nr_w_args > 1)
+			clients = nr_w_args;
+		printf("%u client%s.\n", clients, clients > 1 ? "s" : "");
+		if (flags & SWAPVCS)
+			printf("Swapping VCS rings between clients.\n");
+	}
+
+	if (master_workload >= 0 && clients == 1)
+		master_workload = -1;
+
+	w = calloc(clients, sizeof(struct workload *));
+	igt_assert(w);
+
+	for (i = 0; i < clients; i++) {
+		unsigned int flags_ = flags;
+
+		w[i] = clone_workload(wrk[nr_w_args > 1 ? i : 0]);
+
+		if (master_workload >= 0) {
+			int ret = pipe(w[i]->pipe);
+
+			igt_assert(ret == 0);
+		}
+
+		if (flags & SWAPVCS && i & 1)
+			flags_ &= ~SWAPVCS;
+
+		prepare_workload(w[i], flags_);
+	}
+
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+
+	igt_fork(child, clients) {
+		int pipe_fd = -1;
+		bool background = false;
+
+		if (master_workload >= 0) {
+			close(w[child]->pipe[0]);
+			if (child != master_workload) {
+				pipe_fd = w[child]->pipe[1];
+				background = true;
+			} else {
+				close(w[child]->pipe[1]);
+			}
+		}
+
+		run_workload(child, w[child], background, pipe_fd, balancer,
+			     repeat, flags);
+	}
+
+	if (master_workload >= 0) {
+		int status = -1;
+		pid_t pid;
+
+		for (i = 0; i < clients; i++)
+			close(w[i]->pipe[1]);
+
+		pid = wait(&status);
+		if (pid >= 0)
+			igt_child_done(pid);
+
+		for (i = 0; i < clients; i++)
+			close(w[i]->pipe[0]);
+	}
+
+	igt_waitchildren();
+
+	clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+	t = elapsed(&t_start, &t_end);
+	if (!quiet)
+		printf("%.3fs elapsed (%.3f workloads/s)\n",
+		       t, clients * repeat / t);
+
+	for (i = 0; i < clients; i++)
+		fini_workload(w[i]);
+	free(w);
+	for (i = 0; i < nr_w_args; i++)
+		fini_workload(wrk[i]);
+	free(w_args);
+
+	return 0;
+}
diff --git a/benchmarks/wsim/README b/benchmarks/wsim/README
new file mode 100644
index 000000000000..7aa0694aa834
--- /dev/null
+++ b/benchmarks/wsim/README
@@ -0,0 +1,56 @@
+Workload descriptor format
+==========================
+
+ctx.engine.duration_us.dependency.wait,...
+<uint>.<str>.<uint>[-<uint>].<int <= 0>.<0|1>,...
+d|p|s|t|q.<uint>,...
+
+For duration a range can be given from which a random value will be picked
+before every submit. Since this and seqno management require CPU access to
+objects, care needs to be taken to ensure the submit queue is deep enough
+that these operations do not affect the execution speed, unless that is
+desired.
+
+Additional workload steps are also supported:
+
+ 'd' - Adds a delay (in microseconds).
+ 'p' - Adds a delay relative to the start of the previous loop so that each loop
+       starts execution with a given period.
+ 's' - Synchronises the pipeline to a batch relative to the step.
+ 't' - Throttles to at most n outstanding batches.
+ 'q' - Throttles to a maximum queue depth of n.
+
+Engine ids: RCS, BCS, VCS, VCS1, VCS2, VECS
+
+Example (leading spaces must not be present in the actual file):
+----------------------------------------------------------------
+
+  1.VCS1.3000.0.1
+  1.RCS.500-1000.-1.0
+  1.RCS.3700.0.0
+  1.RCS.1000.-2.0
+  1.VCS2.2300.-2.0
+  1.RCS.4700.-1.0
+  1.VCS2.600.-1.1
+  p.16000
+
+The above workload described in human language works like this:
+
+  1.   A batch is sent to the VCS1 engine which will be executing for 3ms on the
+       GPU and userspace will wait until it is finished before proceeding.
+  2-4. Now three batches are sent to RCS with durations of 0.5-1ms (random
+       duration range), 3.7ms and 1ms respectively. The first batch has a data
+       dependency on the preceding VCS1 batch, and the last of the group depends
+       on the first from the group.
+  5.   Now a 2.3ms batch is sent to VCS2, with a data dependency on the 3.7ms
+       RCS batch.
+  6.   This is followed by a 4.7ms RCS batch with a data dependency on the 2.3ms
+       VCS2 batch.
+  7.   Then a 0.6ms VCS2 batch is sent depending on the previous RCS one. In the
+       same step the tool is told to wait until the batch completes before
+       proceeding.
+  8.   Finally the tool is told to wait long enough to ensure the next iteration
+       starts 16ms after the previous one has started.
+
+When workload descriptors are provided on the command line, commas must be used
+instead of new lines.
diff --git a/benchmarks/wsim/media_17i7.wsim b/benchmarks/wsim/media_17i7.wsim
new file mode 100644
index 000000000000..5f533d8e168b
--- /dev/null
+++ b/benchmarks/wsim/media_17i7.wsim
@@ -0,0 +1,7 @@
+1.VCS1.3000.0.1
+1.RCS.1000.-1.0
+1.RCS.3700.0.0
+1.RCS.1000.-2.0
+1.VCS2.2300.-2.0
+1.RCS.4700.-1.0
+1.VCS2.600.-1.1
diff --git a/benchmarks/wsim/media_load_balance_17i7.wsim b/benchmarks/wsim/media_load_balance_17i7.wsim
new file mode 100644
index 000000000000..25a692032eae
--- /dev/null
+++ b/benchmarks/wsim/media_load_balance_17i7.wsim
@@ -0,0 +1,7 @@
+1.VCS.3000.0.1
+1.RCS.1000.-1.0
+1.RCS.3700.0.0
+1.RCS.1000.-2.0
+1.VCS.2300.-2.0
+1.RCS.4700.-1.0
+1.VCS.600.-1.1
diff --git a/benchmarks/wsim/vcs1.wsim b/benchmarks/wsim/vcs1.wsim
new file mode 100644
index 000000000000..9d3e682b5ce8
--- /dev/null
+++ b/benchmarks/wsim/vcs1.wsim
@@ -0,0 +1,26 @@
+t.5
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
diff --git a/benchmarks/wsim/vcs_balanced.wsim b/benchmarks/wsim/vcs_balanced.wsim
new file mode 100644
index 000000000000..e8958b8f7f43
--- /dev/null
+++ b/benchmarks/wsim/vcs_balanced.wsim
@@ -0,0 +1,26 @@
+q.5
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
diff --git a/lib/igt_core.c b/lib/igt_core.c
index 403b9423fa9f..9c3b37fe3d63 100644
--- a/lib/igt_core.c
+++ b/lib/igt_core.c
@@ -1558,6 +1558,32 @@ bool __igt_fork(void)
 }
 
 /**
+ * igt_child_done:
+ *
+ * Lets the IGT core know that one of the children has exited.
+ */
+void igt_child_done(pid_t pid)
+{
+	int i = 0;
+	int found = -1;
+
+	igt_assert(num_test_children > 1);
+
+	for (i = 0; i < num_test_children; i++) {
+		if (pid == test_children[i]) {
+			found = i;
+			break;
+		}
+	}
+
+	igt_assert(found >= 0);
+
+	num_test_children--;
+	for (i = found; i < num_test_children; i++)
+		test_children[i] = test_children[i + 1];
+}
+
+/**
  * igt_waitchildren:
  *
  * Wait for all children forked with igt_fork.
diff --git a/lib/igt_core.h b/lib/igt_core.h
index 51b98d82ef7f..4a125af1d6a5 100644
--- a/lib/igt_core.h
+++ b/lib/igt_core.h
@@ -688,6 +688,7 @@ bool __igt_fork(void);
 #define igt_fork(child, num_children) \
 	for (int child = 0; child < (num_children); child++) \
 		for (; __igt_fork(); exit(0))
+void igt_child_done(pid_t pid);
 void igt_waitchildren(void);
 void igt_waitchildren_timeout(int seconds, const char *reason);
 
-- 
2.9.3

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH i-g-t v4] igt/scripts: trace.pl to parse the i915 tracepoints
  2017-03-31 14:58 [PATCH i-g-t 0/2] Workload simulation and tracing Tvrtko Ursulin
  2017-03-31 14:58 ` [PATCH i-g-t 1/2] benchmarks/gem_wsim: Command submission workload simulator Tvrtko Ursulin
  2017-03-31 14:58 ` [PATCH i-g-t 2/2] igt/scripts: trace.pl to parse the i915 tracepoints Tvrtko Ursulin
@ 2017-04-24 14:42 ` Tvrtko Ursulin
  2 siblings, 0 replies; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-04-24 14:42 UTC (permalink / raw)
  To: Intel-gfx; +Cc: Harri Syrja

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Given a log file created via perf with some interesting trace
events enabled, this tool can generate the timeline graph of
requests getting queued, their dependencies resolved, sent to
the GPU for executing and finally completed.

This can be useful when analyzing certain classes of performance
issues. More help is available in the tool itself.

The tool will also calculate some overall per engine statistics,
like total time engine was idle and similar.

v2:
 * Address missing git add.
 * Make html output optional (--html switch) and by default
   just output aggregated per engine stats to stdout.

v3:
 * Added --trace option which invokes perf with the correct
   options automatically.
 * Added --avg-delay-stats which prints averages for things
   like waiting on ready, waiting on GPU and context save
   duration.
 * Fix warnings when no waits on an engine.
 * Correct help text.

v4:
 * Add --squash-ctx-id to subtract engine id from ctx id
   when parsing to make it easier to identify which context
   is which with new i915 ctx id allocation scheme.
 * Reconstruct request_out events where they are missing.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Harri Syrja <harri.syrja@intel.com>
Cc: Krzysztof E Olinski <krzysztof.e.olinski@intel.com>
---
 scripts/Makefile.am |   2 +-
 scripts/trace.pl    | 990 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 991 insertions(+), 1 deletion(-)
 create mode 100755 scripts/trace.pl

diff --git a/scripts/Makefile.am b/scripts/Makefile.am
index 85d4a5cf4e5c..641715294936 100644
--- a/scripts/Makefile.am
+++ b/scripts/Makefile.am
@@ -1,2 +1,2 @@
-dist_noinst_SCRIPTS = intel-gfx-trybot who.sh run-tests.sh
+dist_noinst_SCRIPTS = intel-gfx-trybot who.sh run-tests.sh trace.pl
 noinst_PYTHON = throttle.py
diff --git a/scripts/trace.pl b/scripts/trace.pl
new file mode 100755
index 000000000000..1f524aaa0f89
--- /dev/null
+++ b/scripts/trace.pl
@@ -0,0 +1,990 @@
+#! /usr/bin/perl
+#
+# Copyright © 2017 Intel Corporation
+#
+# Permission is hereby granted, free of charge, to any person obtaining a
+# copy of this software and associated documentation files (the "Software"),
+# to deal in the Software without restriction, including without limitation
+# the rights to use, copy, modify, merge, publish, distribute, sublicense,
+# and/or sell copies of the Software, and to permit persons to whom the
+# Software is furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice (including the next
+# paragraph) shall be included in all copies or substantial portions of the
+# Software.
+#
+# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+# IN THE SOFTWARE.
+#
+
+use strict;
+use warnings;
+use 5.010;
+
+my $gid = 0;
+my (%db, %queue, %submit, %notify, %rings, %ctxdb, %ringmap, %reqwait);
+my @freqs;
+
+my $max_items = 3000;
+my $width_us = 32000;
+my $correct_durations = 0;
+my %ignore_ring;
+my %skip_box;
+my $html = 0;
+my $trace = 0;
+my $avg_delay_stats = 0;
+my $squash_context_id = 0;
+
+my @args;
+
+sub arg_help
+{
+	return unless scalar(@_);
+
+	if ($_[0] eq '--help' or $_[0] eq '-h') {
+		shift @_;
+print <<ENDHELP;
+Notes:
+
+   The tool parses the output generated by the 'perf script' command after the
+   correct set of i915 tracepoints has been collected via perf record.
+
+   To collect the data:
+
+	./trace.pl --trace [command-to-be-profiled]
+
+   The above will invoke perf record, or alternatively it can be done directly:
+
+	perf record -a -c 1 -e i915:intel_gpu_freq_change, \
+			       i915:i915_gem_request_add, \
+			       i915:i915_gem_request_submit, \
+			       i915:i915_gem_request_in, \
+			       i915:i915_gem_request_out, \
+			       i915:intel_engine_notify, \
+			       i915:i915_gem_request_wait_begin, \
+			       i915:i915_gem_request_wait_end \
+			       [command-to-be-profiled]
+
+   Then create the log file with:
+
+	perf script >trace.log
+
+   This file in turn should be piped into this tool which will generate some
+   statistics out of it or, if --html was given, HTML output.
+
+   HTML can be viewed from a directory containing the 'vis' JavaScript module.
+   On Ubuntu this can be installed like this:
+
+	apt-get install npm
+	npm install vis
+
+Usage:
+   trace.pl <options> <input-file >output-file
+
+      --help / -h			This help text
+      --max-items=num / -m num		Maximum number of boxes to put on the
+					timeline. More boxes means more work for
+					the JavaScript engine in the browser.
+      --zoom-width-ms=ms / -z ms	Width of the initial timeline zoom
+      --split-requests / -s		Try to split out requests which were
+					submitted together due to coalescing in the
+					driver. May not be 100% accurate and may
+					influence the per-engine statistics so
+					use with care.
+      --ignore-ring=id / -i id		Ignore ring with the numerical id when
+					parsing the log (enum intel_engine_id).
+					Can be given multiple times.
+      --skip-box=name / -x name		Do not put a certain type of a box on
+					the timeline. One of: queue, ready,
+					execute and ctxsave.
+					Can be given multiple times.
+      --html				Generate HTML output.
+      --trace cmd			Trace the following command.
+      --avg-delay-stats			Print average delay stats.
+      --squash-ctx-id			Squash context id by subtracting engine
+					id from ctx id.
+ENDHELP
+
+		exit 0;
+	}
+
+	return @_;
+}
+
+sub arg_html
+{
+	return unless scalar(@_);
+
+	if ($_[0] eq '--html') {
+		shift @_;
+		$html = 1;
+	}
+
+	return @_;
+}
+
+sub arg_avg_delay_stats
+{
+	return unless scalar(@_);
+
+	if ($_[0] eq '--avg-delay-stats') {
+		shift @_;
+		$avg_delay_stats = 1;
+	}
+
+	return @_;
+}
+
+sub arg_squash_ctx_id
+{
+	return unless scalar(@_);
+
+	if ($_[0] eq '--squash-ctx-id') {
+		shift @_;
+		$squash_context_id = 1;
+	}
+
+	return @_;
+}
+
+sub arg_trace
+{
+	my @events = ( 'i915:intel_gpu_freq_change',
+		       'i915:i915_gem_request_add',
+		       'i915:i915_gem_request_submit',
+		       'i915:i915_gem_request_in',
+		       'i915:i915_gem_request_out',
+		       'i915:intel_engine_notify',
+		       'i915:i915_gem_request_wait_begin',
+		       'i915:i915_gem_request_wait_end' );
+
+	return unless scalar(@_);
+
+	if ($_[0] eq '--trace') {
+		shift @_;
+
+		unshift @_, join(',', @events);
+		unshift @_, ('perf', 'record', '-a', '-c', '1', '-e');
+
+		exec @_;
+	}
+
+	return @_;
+}
+
+sub arg_max_items
+{
+	my $val;
+
+	return unless scalar(@_);
+
+	if ($_[0] eq '--max-items' or $_[0] eq '-m') {
+		shift @_;
+		$val = shift @_;
+	} elsif ($_[0] =~ /--max-items=(\d+)/) {
+		shift @_;
+		$val = $1;
+	}
+
+	$max_items = int($val) if defined $val;
+
+	return @_;
+}
+
+sub arg_zoom_width
+{
+	my $val;
+
+	return unless scalar(@_);
+
+	if ($_[0] eq '--zoom-width-ms' or $_[0] eq '-z') {
+		shift @_;
+		$val = shift @_;
+	} elsif ($_[0] =~ /--zoom-width-ms=(\d+)/) {
+		shift @_;
+		$val = $1;
+	}
+
+	$width_us = int($val) * 1000 if defined $val;
+
+	return @_;
+}
+
+sub arg_split_requests
+{
+	return unless scalar(@_);
+
+	if ($_[0] eq '--split-requests' or $_[0] eq '-s') {
+		shift @_;
+		$correct_durations = 1;
+	}
+
+	return @_;
+}
+
+sub arg_ignore_ring
+{
+	my $val;
+
+	return unless scalar(@_);
+
+	if ($_[0] eq '--ignore-ring' or $_[0] eq '-i') {
+		shift @_;
+		$val = shift @_;
+	} elsif ($_[0] =~ /--ignore-ring=(\d+)/) {
+		shift @_;
+		$val = $1;
+	}
+
+	$ignore_ring{$val} = 1 if defined $val;
+
+	return @_;
+}
+
+sub arg_skip_box
+{
+	my $val;
+
+	return unless scalar(@_);
+
+	if ($_[0] eq '--skip-box' or $_[0] eq '-x') {
+		shift @_;
+		$val = shift @_;
+	} elsif ($_[0] =~ /--skip-box=(\w+)/) {
+		shift @_;
+		$val = $1;
+	}
+
+	$skip_box{$val} = 1 if defined $val;
+
+	return @_;
+}
+
+@args = @ARGV;
+while (@args) {
+	my $left = scalar(@args);
+
+	@args = arg_help(@args);
+	@args = arg_html(@args);
+	@args = arg_avg_delay_stats(@args);
+	@args = arg_squash_ctx_id(@args);
+	@args = arg_trace(@args);
+	@args = arg_max_items(@args);
+	@args = arg_zoom_width(@args);
+	@args = arg_split_requests(@args);
+	@args = arg_ignore_ring(@args);
+	@args = arg_skip_box(@args);
+
+	last if $left == scalar(@args);
+}
+
+die if scalar(@args);
+
+@ARGV = @args;
+
+sub parse_req
+{
+	my ($line, $tp) = @_;
+	state %cache;
+
+	$cache{$tp} = qr/(\d+)\.(\d+):.*$tp.*ring=(\d+), ctx=(\d+), seqno=(\d+), global(?:_seqno)?=(\d+)/ unless exists $cache{$tp};
+
+	if ($line =~ $cache{$tp}) {
+		return ($1, $2, $3, $4, $5, $6);
+	} else {
+		return undef;
+	}
+}
+
+sub parse_req_hw
+{
+	my ($line, $tp) = @_;
+	state %cache;
+
+	$cache{$tp} = qr/(\d+)\.(\d+):.*$tp.*ring=(\d+), ctx=(\d+), seqno=(\d+), global(?:_seqno)?=(\d+), port=(\d+)/ unless exists $cache{$tp};
+
+	if ($line =~ $cache{$tp}) {
+		return ($1, $2, $3, $4, $5, $6, $7);
+	} else {
+		return undef;
+	}
+}
+
+sub parse_req_wait_begin
+{
+	my ($line, $tp) = @_;
+
+	if ($line =~ /(\d+)\.(\d+):.*i915_gem_request_wait_begin.*ring=(\d+), ctx=(\d+), seqno=(\d+)/) {
+		return ($1, $2, $3, $4, $5);
+	} else {
+		return undef;
+	}
+}
+
+sub parse_notify
+{
+	my ($line) = @_;
+
+	if ($line =~ /(\d+)\.(\d+):.*intel_engine_notify.*ring=(\d+), seqno=(\d+)/) {
+		return ($1, $2, $3, $4);
+	} else {
+		return undef;
+	}
+}
+
+sub parse_freq
+{
+	my ($line) = @_;
+
+	if ($line =~ /(\d+)\.(\d+):.*intel_gpu_freq_change.*new_freq=(\d+)/) {
+		return ($1, $2, $3);
+	} else {
+		return undef;
+	}
+}
+
+sub us
+{
+	my ($s, $us) = @_;
+
+	return $s * 1000000 + $us;
+}
+
+sub db_key
+{
+	my ($ring, $ctx, $seqno) = @_;
+
+	return $ring . '/' . $ctx . '/' . $seqno;
+}
+
+sub global_key
+{
+	my ($ring, $seqno) = @_;
+
+	return $ring . '/' . $seqno;
+}
+
+sub sanitize_ctx
+{
+	my ($ctx, $ring) = @_;
+
+	$ctx = $ctx - $ring if $squash_context_id;
+
+	if (exists $ctxdb{$ctx}) {
+		return $ctx . '.' . $ctxdb{$ctx};
+	} else {
+		return $ctx;
+	}
+}
+
+sub ts
+{
+	my ($us) = @_;
+	my ($h, $m, $s);
+
+	$s = int($us / 1000000);
+	$us = $us % 1000000;
+
+	$m = int($s / 60);
+	$s = $s % 60;
+
+	$h = int($m / 60);
+	$m = $m % 60;
+
+	return sprintf('2017-01-01 %02u:%02u:%02u.%06u', int($h), int($m), int($s), int($us));
+}
+
+# Main input loop - parse lines and build the internal representation of the
+# trace using a hash of requests and some auxiliary data structures.
+my $prev_freq = 0;
+my $prev_freq_ts = 0;
+my $oldkernelwa = 0;
+my ($no_queue, $no_in);
+while (<>) {
+	my ($s, $us, $ring, $ctx, $seqno, $global_seqno, $port);
+	my $freq;
+	my $key;
+
+	chomp;
+
+	($s, $us, $ring, $ctx, $seqno) = parse_req_wait_begin($_);
+	if (defined $s) {
+		my %rw;
+
+		next if exists $ignore_ring{$ring};
+
+		$ctx = sanitize_ctx($ctx, $ring);
+		$key = db_key($ring, $ctx, $seqno);
+
+		next if exists $reqwait{$key};
+
+		$rw{'key'} = $key;
+		$rw{'ring'} = $ring;
+		$rw{'seqno'} = $seqno;
+		$rw{'ctx'} = $ctx;
+		$rw{'start'} = us($s, $us);
+		$reqwait{$key} = \%rw;
+		next;
+	}
+
+	($s, $us, $ring, $ctx, $seqno, $global_seqno) = parse_req($_, 'i915:i915_gem_request_wait_end');
+	if (defined $s) {
+		next if exists $ignore_ring{$ring};
+
+		$ctx = sanitize_ctx($ctx, $ring);
+		$key = db_key($ring, $ctx, $seqno);
+
+		next unless exists $reqwait{$key};
+
+		$reqwait{$key}->{'end'} = us($s, $us);
+		next;
+	}
+
+	($s, $us, $ring, $ctx, $seqno, $global_seqno) = parse_req($_, 'i915:i915_gem_request_add');
+	if (defined $s) {
+		my $orig_ctx = $ctx;
+
+		next if exists $ignore_ring{$ring};
+
+		$ctx = sanitize_ctx($ctx, $ring);
+		$key = db_key($ring, $ctx, $seqno);
+
+		if (exists $queue{$key}) {
+			$ctxdb{$orig_ctx}++;
+			$ctx = sanitize_ctx($orig_ctx, $ring);
+			$key = db_key($ring, $ctx, $seqno);
+		}
+
+		$queue{$key} = us($s, $us);
+		next;
+	}
+
+	($s, $us, $ring, $ctx, $seqno, $global_seqno) = parse_req($_, 'i915:i915_gem_request_submit');
+	if (defined $s) {
+		next if exists $ignore_ring{$ring};
+
+		$ctx = sanitize_ctx($ctx, $ring);
+		$key = db_key($ring, $ctx, $seqno);
+
+		die if exists $submit{$key};
+		die unless exists $queue{$key};
+
+		$submit{$key} = us($s, $us);
+		next;
+	}
+
+	($s, $us, $ring, $ctx, $seqno, $global_seqno, $port) = parse_req_hw($_, 'i915:i915_gem_request_in');
+	if (defined $s) {
+		my %req;
+
+		next if exists $ignore_ring{$ring};
+
+		$ctx = sanitize_ctx($ctx, $ring);
+		$key = db_key($ring, $ctx, $seqno);
+
+		die if exists $db{$key};
+		if (not exists $queue{$key} and $oldkernelwa) {
+			$no_queue++;
+			next;
+		}
+		die unless exists $queue{$key};
+		die unless exists $submit{$key};
+
+		$req{'start'} = us($s, $us);
+		$req{'ring'} = $ring;
+		$req{'seqno'} = $seqno;
+		$req{'ctx'} = $ctx;
+		$req{'name'} = $ctx . '/' . $seqno;
+		$req{'global'} = $global_seqno;
+		$req{'port'} = $port;
+		$req{'queue'} = $queue{$key};
+		$req{'submit-delay'} = $submit{$key} - $queue{$key};
+		$req{'execute-delay'} = $req{'start'} - $submit{$key};
+		$rings{$ring} = $gid++ unless exists $rings{$ring};
+		$ringmap{$rings{$ring}} = $ring;
+		$db{$key} = \%req;
+		next;
+	}
+
+	($s, $us, $ring, $ctx, $seqno, $global_seqno, $port) = parse_req($_, 'i915:i915_gem_request_out');
+	if (defined $s) {
+		my $gkey = global_key($ring, $global_seqno);
+
+		next if exists $ignore_ring{$ring};
+
+		$ctx = sanitize_ctx($ctx, $ring);
+		$key = db_key($ring, $ctx, $seqno);
+
+		if (not exists $db{$key} and $oldkernelwa) {
+			$no_in++;
+			next;
+		}
+		die unless exists $db{$key};
+		die unless exists $db{$key}->{'start'};
+		die if exists $db{$key}->{'end'};
+
+		$db{$key}->{'end'} = us($s, $us);
+		if (exists $notify{$gkey}) {
+			$db{$key}->{'notify'} = $notify{$gkey};
+		} else {
+			# No notify so far. Maybe it will arrive later which
+			# will be handled in the sanitation loop below.
+			$db{$key}->{'notify'} = $db{$key}->{'end'};
+			$db{$key}->{'no-notify'} = 1;
+		}
+		$db{$key}->{'duration'} = $db{$key}->{'notify'} - $db{$key}->{'start'};
+		$db{$key}->{'context-complete-delay'} = $db{$key}->{'end'} - $db{$key}->{'notify'};
+		next;
+	}
+
+	($s, $us, $ring, $seqno) = parse_notify($_);
+	if (defined $s) {
+		next if exists $ignore_ring{$ring};
+		$notify{global_key($ring, $seqno)} = us($s, $us);
+		next;
+	}
+
+	($s, $us, $freq) = parse_freq($_);
+	if (defined $s) {
+		my $cur = us($s, $us);
+
+		push @freqs, [$prev_freq_ts, $cur, $prev_freq] if $prev_freq;
+		$prev_freq_ts = $cur;
+		$prev_freq = $freq;
+		next;
+	}
+}
+
+# Sanitisation pass to fix up out-of-order notify and context complete events,
+# and to find the largest seqno to be used for timeline sorting purposes.
+my $max_seqno = 0;
+foreach my $key (keys %db) {
+	my $gkey = global_key($db{$key}->{'ring'}, $db{$key}->{'global'});
+
+	die unless exists $db{$key}->{'start'};
+
+	$max_seqno = $db{$key}->{'seqno'} if $db{$key}->{'seqno'} > $max_seqno;
+
+	unless (exists $db{$key}->{'end'}) {
+		# Context complete not received.
+		if (exists $notify{$gkey}) {
+			# No context complete due to request merging - use notify.
+			$db{$key}->{'notify'} = $notify{$gkey};
+			$db{$key}->{'end'} = $db{$key}->{'notify'};
+			$db{$key}->{'no-end'} = 1;
+		} else {
+			# No notify and no context complete - mark it.
+			$db{$key}->{'no-end'} = 1;
+			$db{$key}->{'end'} = $db{$key}->{'start'} + 999;
+			$db{$key}->{'notify'} = $db{$key}->{'end'};
+			$db{$key}->{'incomplete'} = 1;
+		}
+
+		$db{$key}->{'duration'} = $db{$key}->{'notify'} - $db{$key}->{'start'};
+		$db{$key}->{'context-complete-delay'} = $db{$key}->{'end'} - $db{$key}->{'notify'};
+	} else {
+		# Notify arrived after context complete.
+		if (exists $db{$key}->{'no-notify'} and exists $notify{$gkey}) {
+			delete $db{$key}->{'no-notify'};
+			$db{$key}->{'notify'} = $notify{$gkey};
+			$db{$key}->{'duration'} = $db{$key}->{'notify'} - $db{$key}->{'start'};
+			$db{$key}->{'context-complete-delay'} = $db{$key}->{'end'} - $db{$key}->{'notify'};
+		}
+	}
+}
+
+# Fix up incompletes
+foreach my $key (keys %db) {
+	next unless exists $db{$key}->{'incomplete'};
+
+	# End the incomplete batch at the time next one starts
+	my $ring = $db{$key}->{'ring'};
+	my $ctx = $db{$key}->{'ctx'};
+	my $seqno = $db{$key}->{'seqno'};
+	my $next_key;
+	my $i = 1;
+
+	do {
+		$next_key = db_key($ring, $ctx, $seqno + $i);
+		$i++;
+	} until ((exists $db{$next_key} and not exists $db{$next_key}->{'incomplete'})
+		 or $i > scalar(keys(%db)));  # ugly stop hack
+
+	if (exists $db{$next_key}) {
+		$db{$key}->{'notify'} = $db{$next_key}->{'end'};
+		$db{$key}->{'end'} = $db{$key}->{'notify'};
+		$db{$key}->{'duration'} = $db{$key}->{'notify'} - $db{$key}->{'start'};
+		$db{$key}->{'context-complete-delay'} = $db{$key}->{'end'} - $db{$key}->{'notify'};
+	}
+}
+
+# GPU time accounting
+my (%running, %runnable, %queued, %batch_avg, %batch_total_avg, %batch_count);
+my (%submit_avg, %execute_avg, %ctxsave_avg);
+my $last_ts = 0;
+my $first_ts;
+
+foreach my $key (sort {$db{$a}->{'start'} <=> $db{$b}->{'start'}} keys %db) {
+	my $ring = $db{$key}->{'ring'};
+	my $end = $db{$key}->{'end'};
+
+	$first_ts = $db{$key}->{'queue'} if not defined $first_ts or $db{$key}->{'queue'} < $first_ts;
+	$last_ts = $end if $end > $last_ts;
+
+	# Adjust batch start with legacy execlists.
+	# Port == 2 means the batch was merged during queuing and hasn't
+	# actually been submitted to the GPU until a batch with port < 2 is
+	# found.
+	if ($correct_durations and $oldkernelwa and $db{$key}->{'port'} == 2) {
+		my $ctx = $db{$key}->{'ctx'};
+		my $seqno = $db{$key}->{'seqno'};
+		my $next_key;
+		my $i = 1;
+
+		do {
+			$next_key = db_key($ring, $ctx, $seqno + $i);
+			$i++;
+		} until ((exists $db{$next_key} and $db{$next_key}->{'port'} < 2) or $i > scalar(keys(%db)));  # ugly stop hack
+
+		if (exists $db{$next_key}) {
+			$db{$key}->{'start'} = $db{$next_key}->{'start'};
+			$db{$key}->{'end'} = $db{$next_key}->{'end'};
+			die if $db{$key}->{'start'} > $db{$key}->{'end'};
+		}
+	}
+
+	$running{$ring} += $end - $db{$key}->{'start'} unless exists $db{$key}->{'no-end'};
+	$runnable{$ring} += $db{$key}->{'execute-delay'};
+	$queued{$ring} += $db{$key}->{'start'} - $db{$key}->{'execute-delay'} - $db{$key}->{'queue'};
+
+	$batch_count{$ring}++;
+
+	# correct duration of merged batches
+	if ($correct_durations and exists $db{$key}->{'no-end'}) {
+		my $start = $db{$key}->{'start'};
+		my $ctx = $db{$key}->{'ctx'};
+		my $seqno = $db{$key}->{'seqno'};
+		my $next_key;
+		my $i = 1;
+
+		do {
+			$next_key = db_key($ring, $ctx, $seqno + $i);
+			$i++;
+		} until (exists $db{$next_key} or $i > scalar(keys(%db)));  # ugly stop hack
+
+		# 20us tolerance
+		if (exists $db{$next_key} and $db{$next_key}->{'start'} < $start + 20) {
+			$db{$next_key}->{'start'} = $start + $db{$key}->{'duration'};
+			$db{$next_key}->{'start'} = $db{$next_key}->{'end'} if $db{$next_key}->{'start'} > $db{$next_key}->{'end'};
+			$db{$next_key}->{'duration'} = $db{$next_key}->{'notify'} - $db{$next_key}->{'start'};
+			$end = $db{$key}->{'notify'};
+			die if $db{$next_key}->{'start'} > $db{$next_key}->{'end'};
+		}
+		die if $db{$key}->{'start'} > $db{$key}->{'end'};
+	}
+	$batch_avg{$ring} += $db{$key}->{'duration'};
+	$batch_total_avg{$ring} += $end - $db{$key}->{'start'};
+
+	$submit_avg{$ring} += $db{$key}->{'submit-delay'};
+	$execute_avg{$ring} += $db{$key}->{'execute-delay'};
+	$ctxsave_avg{$ring} += $db{$key}->{'end'} - $db{$key}->{'notify'};
+}
+
+foreach my $ring (keys %batch_avg) {
+	$batch_avg{$ring} /= $batch_count{$ring};
+	$batch_total_avg{$ring} /= $batch_count{$ring};
+	$submit_avg{$ring} /= $batch_count{$ring};
+	$execute_avg{$ring} /= $batch_count{$ring};
+	$ctxsave_avg{$ring} /= $batch_count{$ring};
+}
+
+# Calculate engine idle time
+my %flat_busy;
+foreach my $gid (sort keys %rings) {
+	my $ring = $ringmap{$rings{$gid}};
+	my (@s_, @e_);
+
+	# Extract all GPU busy intervals and sort them.
+	foreach my $key (sort {$db{$a}->{'start'} <=> $db{$b}->{'start'}} keys %db) {
+		next unless $db{$key}->{'ring'} == $ring;
+		push @s_, $db{$key}->{'start'};
+		push @e_, $db{$key}->{'end'};
+		die if $db{$key}->{'start'} > $db{$key}->{'end'};
+	}
+
+	die unless $#s_ == $#e_;
+
+	# Flatten the intervals.
+	for my $i (1..$#s_) {
+		last if $i >= @s_; # End of array.
+		die if $e_[$i] < $s_[$i];
+		if ($s_[$i] <= $e_[$i - 1]) {
+			# Current entry overlaps with the previous one. Merge
+			# them by dropping the end of the previous interval
+			# and the start of the current one.
+			splice @e_, $i - 1, 1;
+			splice @s_, $i, 1;
+			# Re-examine the same index since the list has shrunk.
+			redo;
+		}
+	}
+
+	# Add up all busy times.
+	my $total = 0;
+	for my $i (0..$#s_) {
+		die if $e_[$i] < $s_[$i];
+
+		$total = $total + ($e_[$i] - $s_[$i]);
+	}
+
+	$flat_busy{$ring} = $total;
+}
+
+my %reqw;
+foreach my $key (keys %reqwait) {
+	$reqw{$reqwait{$key}->{'ring'}} += $reqwait{$key}->{'end'} - $reqwait{$key}->{'start'};
+}
+
+print <<ENDHTML if $html;
+<!DOCTYPE HTML>
+<html>
+<head>
+  <title>i915 GT timeline</title>
+
+  <style type="text/css">
+    body, html {
+      font-family: sans-serif;
+    }
+  </style>
+
+  <script src="node_modules/vis/dist/vis.js"></script>
+  <link href="node_modules/vis/dist/vis.css" rel="stylesheet" type="text/css" />
+</head>
+<body>
+
+<button onclick="toggleStackSubgroups()">Toggle stacking</button>
+
+<p>
+pink = requests executing on the GPU<br>
+grey = runnable requests waiting for a slot on GPU<br>
+blue = requests waiting on fences and dependencies before they are runnable<br>
+</p>
+<p>
+Boxes are in the format 'ctx-id/seqno'.
+</p>
+<p>
+Use Ctrl+scroll-action to zoom-in/out and scroll-action or dragging to move around the timeline.
+</p>
+
+<div id="visualization"></div>
+
+<script type="text/javascript">
+  var container = document.getElementById('visualization');
+
+  var groups = new vis.DataSet([
+ENDHTML
+
+#   var groups = new vis.DataSet([
+# 	{id: 1, content: 'g0'},
+# 	{id: 2, content: 'g1'}
+#   ]);
+
+sub html_stats
+{
+	my ($stats, $group, $id) = @_;
+	my $name;
+
+	$name = 'Ring' . $group;
+	$name .= '<br><small><br>';
+	$name .= sprintf('%2.2f', $stats->{'idle'}) . '% idle<br><br>';
+	$name .= sprintf('%2.2f', $stats->{'busy'}) . '% busy<br>';
+	$name .= sprintf('%2.2f', $stats->{'runnable'}) . '% runnable<br>';
+	$name .= sprintf('%2.2f', $stats->{'queued'}) . '% queued<br><br>';
+	$name .= sprintf('%2.2f', $stats->{'wait'}) . '% wait<br><br>';
+	$name .= $stats->{'count'} . ' batches<br>';
+	$name .= sprintf('%2.2f', $stats->{'avg'}) . 'us avg batch<br>';
+	$name .= sprintf('%2.2f', $stats->{'total-avg'}) . 'us avg engine batch<br>';
+	$name .= '</small>';
+
+	print "\t{id: $id, content: '$name'},\n";
+}
+
+sub stdio_stats
+{
+	my ($stats, $group, $id) = @_;
+	my $str;
+
+	$str = 'Ring' . $group . ': ';
+	$str .= $stats->{'count'} . ' batches, ';
+	$str .= sprintf('%2.2f (%2.2f) avg batch us, ', $stats->{'avg'}, $stats->{'total-avg'});
+	$str .= sprintf('%2.2f', $stats->{'idle'}) . '% idle, ';
+	$str .= sprintf('%2.2f', $stats->{'busy'}) . '% busy, ';
+	$str .= sprintf('%2.2f', $stats->{'runnable'}) . '% runnable, ';
+	$str .= sprintf('%2.2f', $stats->{'queued'}) . '% queued, ';
+	$str .= sprintf('%2.2f', $stats->{'wait'}) . '% wait';
+	if ($avg_delay_stats) {
+		$str .= ', submit/execute/save-avg=(';
+		$str .= sprintf('%2.2f/%2.2f/%2.2f', $stats->{'submit'}, $stats->{'execute'}, $stats->{'save'});
+		$str .= ')';
+	}
+
+	say $str;
+}
+
+print "\t{id: 0, content: 'Freq'},\n" if $html;
+foreach my $group (sort keys %rings) {
+	my $name;
+	my $ring = $ringmap{$rings{$group}};
+	my $id = 1 + $rings{$group};
+	my $elapsed = $last_ts - $first_ts;
+	my %stats;
+
+	$stats{'idle'} = (1.0 - $flat_busy{$ring} / $elapsed) * 100.0;
+	$stats{'busy'} = $running{$ring} / $elapsed * 100.0;
+	$stats{'runnable'} = $runnable{$ring} / $elapsed * 100.0;
+	$stats{'queued'} = $queued{$ring} / $elapsed * 100.0;
+	$reqw{$ring} = 0 unless exists $reqw{$ring};
+	$stats{'wait'} = $reqw{$ring} / $elapsed * 100.0;
+	$stats{'count'} = $batch_count{$ring};
+	$stats{'avg'} = $batch_avg{$ring};
+	$stats{'total-avg'} = $batch_total_avg{$ring};
+	$stats{'submit'} = $submit_avg{$ring};
+	$stats{'execute'} = $execute_avg{$ring};
+	$stats{'save'} = $ctxsave_avg{$ring};
+
+	if ($html) {
+		html_stats(\%stats, $group, $id);
+	} else {
+		stdio_stats(\%stats, $group, $id);
+	}
+}
+
+exit 0 unless $html;
+
+print <<ENDHTML;
+  ]);
+
+  var items = new vis.DataSet([
+ENDHTML
+
+my $i = 0;
+foreach my $key (sort {$db{$a}->{'queue'} <=> $db{$b}->{'queue'}} keys %db) {
+	my ($name, $ctx, $seqno) = ($db{$key}->{'name'}, $db{$key}->{'ctx'}, $db{$key}->{'seqno'});
+	my ($queue, $start, $notify, $end) = ($db{$key}->{'queue'}, $db{$key}->{'start'}, $db{$key}->{'notify'}, $db{$key}->{'end'});
+	my $submit = $queue + $db{$key}->{'submit-delay'};
+	my ($content, $style);
+	my $group = 1 + $rings{$db{$key}->{'ring'}};
+	my $type = ' type: \'range\',';
+	my $startend;
+	my $skey;
+
+	# submit to execute
+	unless (exists $skip_box{'queue'}) {
+		$skey = 2 * $max_seqno * $ctx + 2 * $seqno;
+		$style = 'color: black; background-color: lightblue;';
+		$content = "$name<br>$db{$key}->{'submit-delay'}us <small>($db{$key}->{'execute-delay'}us)</small>";
+		$startend = 'start: \'' . ts($queue) . '\', end: \'' . ts($submit) . '\'';
+		print "\t{id: $i, key: $skey, $type group: $group, subgroup: 1, subgroupOrder: 1, content: '$content', $startend, style: \'$style\'},\n";
+		$i++;
+	}
+
+	# execute to start
+	unless (exists $skip_box{'ready'}) {
+		$skey = 2 * $max_seqno * $ctx + 2 * $seqno + 1;
+		$style = 'color: black; background-color: lightgrey;';
+		$content = "<small>$name<br>$db{$key}->{'execute-delay'}us</small>";
+		$startend = 'start: \'' . ts($submit) . '\', end: \'' . ts($start) . '\'';
+		print "\t{id: $i, key: $skey, $type group: $group, subgroup: 1, subgroupOrder: 2, content: '$content', $startend, style: \'$style\'},\n";
+		$i++;
+	}
+
+	# start to user interrupt
+	unless (exists $skip_box{'execute'}) {
+		$skey = -2 * $max_seqno * $ctx - 2 * $seqno - 1;
+		if (exists $db{$key}->{'incomplete'}) {
+			$style = 'color: white; background-color: red;';
+		} else {
+			$style = 'color: black; background-color: pink;';
+		}
+		$content = "$name <small>$db{$key}->{'port'}</small>";
+		$content .= ' <small><i>???</i></small> ' if exists $db{$key}->{'incomplete'};
+		$content .= ' <small><i>++</i></small> ' if exists $db{$key}->{'no-end'};
+		$content .= ' <small><i>+</i></small> ' if exists $db{$key}->{'no-notify'};
+		$content .= "<br>$db{$key}->{'duration'}us <small>($db{$key}->{'context-complete-delay'}us)</small>";
+		$startend = 'start: \'' . ts($start) . '\', end: \'' . ts($notify) . '\'';
+		print "\t{id: $i, key: $skey, $type group: $group, subgroup: 2, subgroupOrder: 3, content: '$content', $startend, style: \'$style\'},\n";
+		$i++;
+	}
+
+	# user interrupt to context complete
+	unless (exists $skip_box{'ctxsave'}) {
+		$skey = -2 * $max_seqno * $ctx - 2 * $seqno;
+		$style = 'color: black; background-color: orange;';
+		my $ctxsave = $db{$key}->{'end'} - $db{$key}->{'notify'};
+		$content = "<small>$name<br>${ctxsave}us</small>";
+		$content .= ' <small><i>???</i></small> ' if exists $db{$key}->{'incomplete'};
+		$content .= ' <small><i>++</i></small> ' if exists $db{$key}->{'no-end'};
+		$content .= ' <small><i>+</i></small> ' if exists $db{$key}->{'no-notify'};
+		$startend = 'start: \'' . ts($notify) . '\', end: \'' . ts($end) . '\'';
+		print "\t{id: $i, key: $skey, $type group: $group, subgroup: 2, subgroupOrder: 4, content: '$content', $startend, style: \'$style\'},\n";
+		$i++;
+	}
+
+	$last_ts = $end;
+
+	last if $i > $max_items;
+}
+
+foreach my $item (@freqs) {
+	my ($start, $end, $freq) = @$item;
+	my $startend;
+
+	next if $start > $last_ts;
+
+	$start = $first_ts if $start < $first_ts;
+	$end = $last_ts if $end > $last_ts;
+	$startend = 'start: \'' . ts($start) . '\', end: \'' . ts($end) . '\'';
+	print "\t{id: $i, type: 'range', group: 0, content: '$freq', $startend},\n";
+	$i++;
+}
+
+my $end_ts = ts($first_ts + $width_us);
+$first_ts = ts($first_ts);
+
+print <<ENDHTML;
+  ]);
+
+  function customOrder (a, b) {
+  // order by subgroup order
+    return a.subgroupOrder - b.subgroupOrder;
+  }
+
+  // Configuration for the Timeline
+  var options = { groupOrder: 'content',
+		  horizontalScroll: true,
+		  stack: true,
+		  stackSubgroups: false,
+		  zoomKey: 'ctrlKey',
+		  orientation: 'top',
+		  order: customOrder,
+		  start: '$first_ts',
+		  end: '$end_ts'};
+
+  // Create a Timeline
+  var timeline = new vis.Timeline(container, items, groups, options);
+
+    function toggleStackSubgroups() {
+        options.stackSubgroups = !options.stackSubgroups;
+        timeline.setOptions(options);
+    }
+ENDHTML
+
+print <<ENDHTML;
+</script>
+</body>
+</html>
+ENDHTML
-- 
2.9.3

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH i-g-t v6] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-21 15:21     ` [PATCH i-g-t v5] " Tvrtko Ursulin
@ 2017-04-25 11:13       ` Tvrtko Ursulin
  2017-04-25 11:35         ` Chris Wilson
  0 siblings, 1 reply; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-04-25 11:13 UTC (permalink / raw)
  To: Intel-gfx

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Tool which emits batch buffers to engines with configurable
sequences, durations, contexts, dependencies and userspace waits.

Unfinished but shows promise so sending out for early feedback.

v2:
 * Load workload descriptors from files. (also -w)
 * Help text.
 * Calibration control if needed. (-t)
 * NORELOC | LUT to eb flags.
 * Added sample workload to wsim/workload1.

v3:
 * Multiple parallel different workloads (-w -w ...).
 * Multi-context workloads.
 * Variable (random) batch length.
 * Load balancing (round robin and queue depth estimation).
 * Workloads delays and explicit sync steps.
 * Workload frequency (period) control.

v4:
 * Fixed queue-depth estimation by creating separate batches
   per engine when qd load balancing is on.
 * Dropped separate -s cmd line option. It can turn itself on
   automatically when needed.
 * Keep a single status page and lie about the write hazard
   as suggested by Chris.
 * Use batch_start_offset for controlling the batch duration.
   (Chris)
 * Set status page object cache level. (Chris)
 * Moved workload description to a README.
 * Tidied example workloads.
 * Some other cleanups and refactorings.

v5:
 * Master and background workloads (-W / -w).
 * Single batch per step is enough even when balancing. (Chris)
 * Use hars_petruska_f54_1_random IGT functions and seed to zero
   at start. (Chris)
 * Use WC cache domain when WC mapping. (Chris)
 * Keep seqnos 64-bytes apart in the status page. (Chris)
 * Add workload throttling and queue-depth throttling commands.
   (Chris)

v6:
 * Added two more workloads.
 * Merged RT balancer from Chris.

TODO list:

 * Fence support.
 * Better error handling.
 * Less 1980's workload parsing.
 * More workloads.
 * Threads?
 * ... ?

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
---
 benchmarks/Makefile.sources                  |    1 +
 benchmarks/gem_wsim.c                        | 1307 ++++++++++++++++++++++++++
 benchmarks/wsim/README                       |   56 ++
 benchmarks/wsim/media_17i7.wsim              |    7 +
 benchmarks/wsim/media_19.wsim                |   10 +
 benchmarks/wsim/media_load_balance_17i7.wsim |    7 +
 benchmarks/wsim/media_load_balance_19.wsim   |   10 +
 benchmarks/wsim/vcs1.wsim                    |   26 +
 benchmarks/wsim/vcs_balanced.wsim            |   26 +
 lib/igt_core.c                               |   26 +
 lib/igt_core.h                               |    1 +
 11 files changed, 1477 insertions(+)
 create mode 100644 benchmarks/gem_wsim.c
 create mode 100644 benchmarks/wsim/README
 create mode 100644 benchmarks/wsim/media_17i7.wsim
 create mode 100644 benchmarks/wsim/media_19.wsim
 create mode 100644 benchmarks/wsim/media_load_balance_17i7.wsim
 create mode 100644 benchmarks/wsim/media_load_balance_19.wsim
 create mode 100644 benchmarks/wsim/vcs1.wsim
 create mode 100644 benchmarks/wsim/vcs_balanced.wsim

diff --git a/benchmarks/Makefile.sources b/benchmarks/Makefile.sources
index 3af54ebe36f2..3a941150abb3 100644
--- a/benchmarks/Makefile.sources
+++ b/benchmarks/Makefile.sources
@@ -14,6 +14,7 @@ benchmarks_prog_list =			\
 	gem_prw				\
 	gem_set_domain			\
 	gem_syslatency			\
+	gem_wsim			\
 	kms_vblank			\
 	prime_lookup			\
 	vgem_mmap			\
diff --git a/benchmarks/gem_wsim.c b/benchmarks/gem_wsim.c
new file mode 100644
index 000000000000..1f491a53ca83
--- /dev/null
+++ b/benchmarks/gem_wsim.c
@@ -0,0 +1,1307 @@
+/*
+ * Copyright © 2017 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the next
+ * paragraph) shall be included in all copies or substantial portions of the
+ * Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ *
+ */
+
+#include <unistd.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <errno.h>
+#include <poll.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <sys/ioctl.h>
+#include <sys/time.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <assert.h>
+#include <limits.h>
+
+
+#include "intel_chipset.h"
+#include "drm.h"
+#include "ioctl_wrappers.h"
+#include "drmtest.h"
+#include "intel_io.h"
+#include "igt_rand.h"
+
+enum intel_engine_id {
+	RCS,
+	BCS,
+	VCS,
+	VCS1,
+	VCS2,
+	VECS,
+	NUM_ENGINES
+};
+
+struct duration {
+	unsigned int min, max;
+};
+
+enum w_type
+{
+	BATCH,
+	SYNC,
+	DELAY,
+	PERIOD,
+	THROTTLE,
+	QD_THROTTLE
+};
+
+struct w_step
+{
+	/* Workload step metadata */
+	enum w_type type;
+	unsigned int context;
+	unsigned int engine;
+	struct duration duration;
+	int dependency;
+	int wait;
+
+	/* Implementation details */
+	unsigned int idx;
+
+	struct drm_i915_gem_execbuffer2 eb;
+	struct drm_i915_gem_exec_object2 obj[4];
+	struct drm_i915_gem_relocation_entry reloc[3];
+	unsigned long bb_sz;
+	uint32_t bb_handle;
+	uint32_t *mapped_batch, *mapped_seqno;
+	unsigned int mapped_len;
+	uint32_t *rt0_value;
+};
+
+struct workload
+{
+	unsigned int nr_steps;
+	struct w_step *steps;
+
+	struct timespec repeat_start;
+
+	int pipe[2];
+
+	unsigned int nr_ctxs;
+	uint32_t *ctx_id;
+
+	uint32_t seqno[NUM_ENGINES];
+	uint32_t status_page_handle;
+	uint32_t *status_page;
+	unsigned int vcs_rr;
+
+	unsigned long qd_sum[NUM_ENGINES];
+	unsigned long nr_bb[NUM_ENGINES];
+};
+
+static const unsigned int eb_engine_map[NUM_ENGINES] = {
+	[RCS] = I915_EXEC_RENDER,
+	[BCS] = I915_EXEC_BLT,
+	[VCS] = I915_EXEC_BSD,
+	[VCS1] = I915_EXEC_BSD | I915_EXEC_BSD_RING1,
+	[VCS2] = I915_EXEC_BSD | I915_EXEC_BSD_RING2,
+	[VECS] = I915_EXEC_VEBOX
+};
+
+static const unsigned int nop_calibration_us = 1000;
+static unsigned long nop_calibration;
+
+static bool quiet;
+static int fd;
+
+#define SWAPVCS	(1<<0)
+#define SEQNO	(1<<1)
+#define BALANCE	(1<<2)
+#define RT	(1<<3)
+
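+/*
+ * Per-engine slots in the status page are spaced 16 dwords (64 bytes) apart,
+ * presumably so each engine's writes land in a separate cache line.
+ */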
+#define VCS_SEQNO_IDX(engine) (((engine) - VCS1) * 16)
+#define VCS_SEQNO_OFFSET(engine) (VCS_SEQNO_IDX(engine) * sizeof(uint32_t))
+
+#define RCS_TIMESTAMP (0x2000 + 0x358)
+#define REG(x) (volatile uint32_t *)((volatile char *)igt_global_mmio + x)
+
+/*
+ * Workload descriptor:
+ *
+ * ctx.engine.duration.dependency.wait,...
+ * <uint>.<str>.<uint>.<int <= 0>.<0|1>,...
+ *
+ * Engine ids: RCS, BCS, VCS, VCS1, VCS2, VECS
+ *
+ * "1.VCS1.3000.0.1,1.RCS.1000.-1.0,1.RCS.3700.0.0,1.RCS.1000.-2.0,1.VCS2.2300.-2.0,1.RCS.4700.-1.0,1.VCS2.600.-1.1"
+ */
+
+static const char *ring_str_map[NUM_ENGINES] = {
+	[RCS] = "RCS",
+	[BCS] = "BCS",
+	[VCS] = "VCS",
+	[VCS1] = "VCS1",
+	[VCS2] = "VCS2",
+	[VECS] = "VECS",
+};
+
+static struct workload *parse_workload(char *_desc)
+{
+	struct workload *wrk;
+	unsigned int nr_steps = 0;
+	char *desc = strdup(_desc);
+	char *_token, *token, *tctx = NULL, *tstart = desc;
+	char *field, *fctx = NULL, *fstart;
+	struct w_step step, *steps = NULL;
+	unsigned int valid;
+	int tmp;
+
+	while ((_token = strtok_r(tstart, ",", &tctx)) != NULL) {
+		tstart = NULL;
+		token = strdup(_token);
+		fstart = token;
+		valid = 0;
+		memset(&step, 0, sizeof(step));
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			if (!strcasecmp(field, "d")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp <= 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid delay at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = DELAY;
+					step.wait = tmp;
+					goto add_step;
+				}
+			} else if (!strcasecmp(field, "p")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp <= 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid period at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = PERIOD;
+					step.wait = tmp;
+					goto add_step;
+				}
+			} else if (!strcasecmp(field, "s")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp >= 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid sync target at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = SYNC;
+					step.wait = tmp;
+					goto add_step;
+				}
+			} else if (!strcasecmp(field, "t")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp < 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid throttle at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = THROTTLE;
+					step.wait = tmp;
+					goto add_step;
+				}
+			} else if (!strcasecmp(field, "q")) {
+				if ((field = strtok_r(fstart, ".", &fctx)) !=
+				    NULL) {
+					tmp = atoi(field);
+					if (tmp < 0) {
+						if (!quiet)
+							fprintf(stderr,
+								"Invalid qd throttle at step %u!\n",
+								nr_steps);
+						return NULL;
+					}
+
+					step.type = QD_THROTTLE;
+					step.wait = tmp;
+					goto add_step;
+				}
+			}
+
+			tmp = atoi(field);
+			if (tmp < 0) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid ctx id at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.context = tmp;
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			unsigned int i, old_valid = valid;
+
+			fstart = NULL;
+
+			for (i = 0; i < ARRAY_SIZE(ring_str_map); i++) {
+				if (!strcasecmp(field, ring_str_map[i])) {
+					step.engine = i;
+					valid++;
+					break;
+				}
+			}
+
+			if (old_valid == valid) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid engine id at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			char *sep = NULL;
+			long int tmpl;
+
+			fstart = NULL;
+
+			tmpl = strtol(field, &sep, 10);
+			if (tmpl == LONG_MIN || tmpl == LONG_MAX) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid duration at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.duration.min = tmpl;
+
+			if (sep && *sep == '-') {
+				tmpl = strtol(sep + 1, NULL, 10);
+				if (tmpl == LONG_MIN || tmpl == LONG_MAX) {
+					if (!quiet)
+						fprintf(stderr,
+							"Invalid duration range at step %u!\n",
+							nr_steps);
+					return NULL;
+				}
+				step.duration.max = tmpl;
+			} else {
+				step.duration.max = step.duration.min;
+			}
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			tmp = atoi(field);
+			if (tmp > 0) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid forward dependency at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.dependency = tmp;
+
+			valid++;
+		}
+
+		if ((field = strtok_r(fstart, ".", &fctx)) != NULL) {
+			fstart = NULL;
+
+			tmp = atoi(field);
+			if (tmp != 0 && tmp != 1) {
+				if (!quiet)
+					fprintf(stderr,
+						"Invalid wait boolean at step %u!\n",
+						nr_steps);
+				return NULL;
+			}
+			step.wait = tmp;
+
+			valid++;
+		}
+
+		if (valid != 5) {
+			if (!quiet)
+				fprintf(stderr, "Invalid record at step %u!\n",
+					nr_steps);
+			return NULL;
+		}
+
+		step.type = BATCH;
+
+add_step:
+		step.idx = nr_steps++;
+		steps = realloc(steps, sizeof(step) * nr_steps);
+		igt_assert(steps);
+
+		memcpy(&steps[nr_steps - 1], &step, sizeof(step));
+
+		free(token);
+	}
+
+	wrk = malloc(sizeof(*wrk));
+	igt_assert(wrk);
+
+	wrk->nr_steps = nr_steps;
+	wrk->steps = steps;
+
+	free(desc);
+
+	return wrk;
+}
+
+static struct workload *
+clone_workload(struct workload *_wrk)
+{
+	struct workload *wrk;
+
+	wrk = malloc(sizeof(*wrk));
+	igt_assert(wrk);
+	memset(wrk, 0, sizeof(*wrk));
+
+	wrk->nr_steps = _wrk->nr_steps;
+	wrk->steps = calloc(wrk->nr_steps, sizeof(struct w_step));
+	igt_assert(wrk->steps);
+
+	memcpy(wrk->steps, _wrk->steps, sizeof(struct w_step) * wrk->nr_steps);
+
+	return wrk;
+}
+
+#define rounddown(x, y) ((x) - ((x) % (y)))
+#ifndef PAGE_SIZE
+#define PAGE_SIZE (4096)
+#endif
+
+static unsigned int get_duration(struct duration *dur)
+{
+	if (dur->min == dur->max)
+		return dur->min;
+	else
+		return dur->min + hars_petruska_f54_1_random_unsafe() %
+		       (dur->max + 1 - dur->min);
+}
+
+static unsigned long get_bb_sz(unsigned int duration)
+{
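+	/*
+	 * Convert a requested batch duration in microseconds into a batch
+	 * buffer size in bytes, based on the calibrated nop execution rate
+	 * (nop_calibration nop dwords per nop_calibration_us microseconds).
+	 */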
+	return ALIGN(duration * nop_calibration * sizeof(uint32_t) /
+		     nop_calibration_us, sizeof(uint32_t));
+}
+
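+/*
+ * Write the tail of a batch buffer: an optional seqno write, optional
+ * timestamp writes for the RT balancer, and finally the batch buffer end
+ * command. Only the pages containing the tail are mapped for this.
+ */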
+static void
+terminate_bb(struct w_step *w, unsigned int flags)
+{
+	const uint32_t bbe = 0xa << 23;
+	unsigned long mmap_start, mmap_len;
+	unsigned long batch_start = w->bb_sz;
+	uint32_t *ptr, *cs;
+
+	igt_assert(((flags & RT) && (flags & SEQNO)) || !(flags & RT));
+
+	batch_start -= sizeof(uint32_t); /* bbend */
+	if (flags & SEQNO)
+		batch_start -= 4 * sizeof(uint32_t);
+	if (flags & RT)
+		batch_start -= 8 * sizeof(uint32_t);
+
+	mmap_start = rounddown(batch_start, PAGE_SIZE);
+	mmap_len = w->bb_sz - mmap_start;
+
+	gem_set_domain(fd, w->bb_handle,
+		       I915_GEM_DOMAIN_WC, I915_GEM_DOMAIN_WC);
+
+	ptr = gem_mmap__wc(fd, w->bb_handle, mmap_start, mmap_len, PROT_WRITE);
+	cs = (uint32_t *)((char *)ptr + batch_start - mmap_start);
+
+	if (flags & SEQNO) {
+		w->reloc[0].offset = batch_start + sizeof(uint32_t);
+		batch_start += 4 * sizeof(uint32_t);
+
+		*cs++ = MI_STORE_DWORD_IMM;
+		*cs++ = 0;
+		*cs++ = 0;
+		w->mapped_seqno = cs;
+		*cs++ = 0;
+	}
+
+	if (flags & RT) {
+		w->reloc[1].offset = batch_start + sizeof(uint32_t);
+		batch_start += 4 * sizeof(uint32_t);
+
+		*cs++ = MI_STORE_DWORD_IMM;
+		*cs++ = 0;
+		*cs++ = 0;
+		w->rt0_value = cs;
+		*cs++ = 0;
+
+		w->reloc[2].offset = batch_start + 2 * sizeof(uint32_t);
+		batch_start += 4 * sizeof(uint32_t);
+
+		*cs++ = 0x24 << 23 | 2; /* MI_STORE_REG_MEM */
+		*cs++ = RCS_TIMESTAMP;
+		*cs++ = 0;
+		*cs++ = 0;
+	}
+
+	*cs = bbe;
+
+	w->mapped_batch = ptr;
+	w->mapped_len = mmap_len;
+}
+
+static void
+eb_update_flags(struct w_step *w, enum intel_engine_id engine,
+		unsigned int flags)
+{
+	w->eb.flags = eb_engine_map[engine];
+	w->eb.flags |= I915_EXEC_HANDLE_LUT;
+	if (!(flags & SEQNO))
+		w->eb.flags |= I915_EXEC_NO_RELOC;
+}
+
+static void
+alloc_step_batch(struct workload *wrk, struct w_step *w, unsigned int flags)
+{
+	enum intel_engine_id engine = w->engine;
+	unsigned int bb_i, j = 0;
+
+	w->obj[j].handle = gem_create(fd, 4096);
+	w->obj[j].flags = EXEC_OBJECT_WRITE;
+	j++;
+
+	if (flags & SEQNO) {
+		w->obj[j].handle = wrk->status_page_handle;
+		j++;
+	}
+
+	bb_i = j++;
+	w->bb_sz = get_bb_sz(w->duration.max);
+	w->bb_handle = w->obj[bb_i].handle = gem_create(fd, w->bb_sz);
+	terminate_bb(w, flags);
+
+	igt_assert(w->dependency <= 0);
+	if (w->dependency) {
+		int dep_idx = w->idx + w->dependency;
+
+		igt_assert(dep_idx >= 0 && dep_idx < wrk->nr_steps);
+		igt_assert(wrk->steps[dep_idx].type == BATCH);
+
+		w->obj[j].handle = w->obj[bb_i].handle;
+		bb_i = j;
+		w->obj[j - 1].handle = wrk->steps[dep_idx].obj[0].handle;
+		j++;
+	}
+
+	if (flags & SEQNO) {
+		w->obj[bb_i].relocs_ptr = to_user_pointer(&w->reloc);
+		if (flags & RT)
+			w->obj[bb_i].relocation_count = 3;
+		else
+			w->obj[bb_i].relocation_count = 1;
+		for (int i = 0; i < w->obj[bb_i].relocation_count; i++) {
+			w->reloc[i].presumed_offset = -1;
+			w->reloc[i].target_handle = 1;
+		}
+	}
+
+	w->eb.buffers_ptr = to_user_pointer(w->obj);
+	w->eb.buffer_count = j;
+	w->eb.rsvd1 = wrk->ctx_id[w->context];
+
+	if (flags & SWAPVCS && engine == VCS1)
+		engine = VCS2;
+	else if (flags & SWAPVCS && engine == VCS2)
+		engine = VCS1;
+	eb_update_flags(w, engine, flags);
+#ifdef DEBUG
+	printf("%u: %u:%x|%x|%x|%x %10lu flags=%llx bb=%x[%u] ctx[%u]=%u\n",
+		w->idx, w->eb.buffer_count, w->obj[0].handle,
+		w->obj[1].handle, w->obj[2].handle, w->obj[3].handle,
+		w->bb_sz, w->eb.flags, w->bb_handle, bb_i,
+		w->context, wrk->ctx_id[w->context]);
+#endif
+}
+
+static void
+prepare_workload(struct workload *wrk, unsigned int flags)
+{
+	int max_ctx = -1;
+	struct w_step *w;
+	int i;
+
+	if (flags & SEQNO) {
+		const unsigned int status_sz = sizeof(uint32_t);
+		uint32_t handle = gem_create(fd, status_sz);
+
+		gem_set_caching(fd, handle, I915_CACHING_CACHED);
+		wrk->status_page_handle = handle;
+		wrk->status_page = gem_mmap__cpu(fd, handle, 0, status_sz,
+						 PROT_READ);
+	}
+
+	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+		if ((int)w->context > max_ctx) {
+			int delta = w->context + 1 - wrk->nr_ctxs;
+
+			wrk->nr_ctxs += delta;
+			wrk->ctx_id = realloc(wrk->ctx_id,
+					      wrk->nr_ctxs * sizeof(uint32_t));
+			memset(&wrk->ctx_id[wrk->nr_ctxs - delta], 0,
+			       delta * sizeof(uint32_t));
+
+			max_ctx = w->context;
+		}
+
+		if (!wrk->ctx_id[w->context]) {
+			struct drm_i915_gem_context_create arg = {};
+
+			drmIoctl(fd, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, &arg);
+			igt_assert(arg.ctx_id);
+
+			wrk->ctx_id[w->context] = arg.ctx_id;
+		}
+	}
+
+	for (i = 0, w = wrk->steps; i < wrk->nr_steps; i++, w++) {
+		unsigned int _flags = flags;
+		enum intel_engine_id engine = w->engine;
+
+		if (w->type != BATCH)
+			continue;
+
+		if (engine != VCS && engine != VCS1 && engine != VCS2)
+			_flags &= ~(SEQNO | RT);
+
+		if (engine == VCS)
+			_flags &= ~SWAPVCS;
+
+		alloc_step_batch(wrk, w, _flags);
+	}
+}
+
+static double elapsed(const struct timespec *start, const struct timespec *end)
+{
+	return (end->tv_sec - start->tv_sec) +
+	       (end->tv_nsec - start->tv_nsec) / 1e9;
+}
+
+static int elapsed_us(const struct timespec *start, const struct timespec *end)
+{
+	return elapsed(start, end) * 1e6;
+}
+
+static enum intel_engine_id get_vcs_engine(unsigned int n)
+{
+	const enum intel_engine_id vcs_engines[2] = { VCS1, VCS2 };
+
+	igt_assert(n < ARRAY_SIZE(vcs_engines));
+
+	return vcs_engines[n];
+}
+
+struct workload_balancer {
+	unsigned int (*get_qd)(const struct workload_balancer *balancer,
+			       struct workload *wrk,
+			       enum intel_engine_id engine);
+	enum intel_engine_id (*balance)(const struct workload_balancer *balancer,
+					struct workload *wrk, struct w_step *w);
+};
+
+static enum intel_engine_id
+rr_balance(const struct workload_balancer *balancer,
+	   struct workload *wrk, struct w_step *w)
+{
+	unsigned int engine;
+
+	engine = get_vcs_engine(wrk->vcs_rr);
+	wrk->vcs_rr ^= 1;
+
+	return engine;
+}
+
+static const struct workload_balancer rr_balancer = {
+	.balance = rr_balance,
+};
+
+static unsigned int
+get_qd_depth(const struct workload_balancer *balancer,
+	     struct workload *wrk, enum intel_engine_id engine)
+{
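+	/*
+	 * Estimated queue depth: number of batches emitted to this engine
+	 * (wrk->seqno) minus the last seqno the GPU has written back to the
+	 * status page, ie. batches submitted but not yet completed.
+	 */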
+	return wrk->seqno[engine] -
+	       wrk->status_page[VCS_SEQNO_IDX(engine)];
+}
+
+static enum intel_engine_id
+qd_balance(const struct workload_balancer *balancer,
+	   struct workload *wrk, struct w_step *w)
+{
+	enum intel_engine_id engine;
+	long qd[NUM_ENGINES];
+	unsigned int n;
+
+	igt_assert(w->engine == VCS);
+
+	qd[VCS1] = balancer->get_qd(balancer, wrk, VCS1);
+	wrk->qd_sum[VCS1] += qd[VCS1];
+
+	qd[VCS2] = balancer->get_qd(balancer, wrk, VCS2);
+	wrk->qd_sum[VCS2] += qd[VCS2];
+
+	if (qd[VCS1] < qd[VCS2])
+		n = 0;
+	else if (qd[VCS2] < qd[VCS1])
+		n = 1;
+	else
+		n = wrk->vcs_rr;
+
+	engine = get_vcs_engine(n);
+	wrk->vcs_rr = n ^ 1;
+
+#ifdef DEBUG
+	printf("qd_balance: 1:%ld 2:%ld rr:%u = %u\t(%lu - %u) (%lu - %u)\n",
+	       qd[VCS1], qd[VCS2], wrk->vcs_rr, engine,
+	       wrk->seqno[VCS1], wrk->status_page[VCS_SEQNO_IDX(VCS1)],
+	       wrk->seqno[VCS2], wrk->status_page[VCS_SEQNO_IDX(VCS2)]);
+#endif
+	return engine;
+}
+
+static const struct workload_balancer qd_balancer = {
+	.get_qd = get_qd_depth,
+	.balance = qd_balance,
+};
+
+static enum intel_engine_id
+rt_balance(const struct workload_balancer *balancer,
+	   struct workload *wrk, struct w_step *w)
+{
+	enum intel_engine_id engine;
+	long qd[NUM_ENGINES];
+	unsigned int n;
+
+	igt_assert(w->engine == VCS);
+
+	/* Estimate the "speed" of the most recent batch
+	 *    (finish time - submit time)
+	 * and use that as an approximation of the total remaining time for
+	 * all batches on that engine. We try to keep the total remaining
+	 * time balanced between the engines.
+	 */
+	qd[VCS1] = balancer->get_qd(balancer, wrk, VCS1);
+	wrk->qd_sum[VCS1] += qd[VCS1];
+	qd[VCS1] *= wrk->status_page[2] - wrk->status_page[1];
+#ifdef DEBUG
+	printf("qd[0] = %d (%d - %d) x %d (%d - %d) = %ld\n",
+	       wrk->seqno[VCS1] - wrk->status_page[0],
+	       wrk->seqno[VCS1], wrk->status_page[0],
+	       wrk->status_page[2] - wrk->status_page[1],
+	       wrk->status_page[2], wrk->status_page[1],
+	       qd[VCS1]);
+#endif
+
+	qd[VCS2] = balancer->get_qd(balancer, wrk, VCS2);
+	wrk->qd_sum[VCS2] += qd[VCS2];
+	qd[VCS2] *= wrk->status_page[2 + 16] - wrk->status_page[1 + 16];
+#ifdef DEBUG
+	printf("qd[1] = %d (%d - %d) x %d (%d - %d) = %ld\n",
+	       wrk->seqno[VCS2] - wrk->status_page[16],
+	       wrk->seqno[VCS2], wrk->status_page[16],
+	       wrk->status_page[18] - wrk->status_page[17],
+	       wrk->status_page[18], wrk->status_page[17],
+	       qd[VCS2]);
+#endif
+
+	if (qd[VCS1] < qd[VCS2])
+		n = 0;
+	else if (qd[VCS2] < qd[VCS1])
+		n = 1;
+	else
+		n = wrk->vcs_rr;
+
+	engine = get_vcs_engine(n);
+	wrk->vcs_rr = n ^ 1;
+
+	return engine;
+}
+
+static const struct workload_balancer rt_balancer = {
+	.get_qd = get_qd_depth,
+	.balance = rt_balance,
+};
+
+static void
+update_bb_seqno(struct w_step *w, enum intel_engine_id engine, uint32_t seqno)
+{
+	igt_assert(engine == VCS1 || engine == VCS2);
+
+	gem_set_domain(fd, w->bb_handle,
+		       I915_GEM_DOMAIN_WC, I915_GEM_DOMAIN_WC);
+
+	*w->mapped_seqno = seqno;
+
+	w->reloc[0].presumed_offset = -1;
+	w->reloc[0].delta = VCS_SEQNO_OFFSET(engine);
+}
+
+static void
+update_bb_rt(struct w_step *w, enum intel_engine_id engine)
+{
+	igt_assert(engine == VCS1 || engine == VCS2);
+
+	gem_set_domain(fd, w->bb_handle,
+		       I915_GEM_DOMAIN_WC, I915_GEM_DOMAIN_WC);
+
+	*w->rt0_value = *REG(RCS_TIMESTAMP);
+
+	w->reloc[1].presumed_offset = -1;
+	w->reloc[1].delta = VCS_SEQNO_OFFSET(engine) + sizeof(uint32_t);
+
+	w->reloc[2].presumed_offset = -1;
+	w->reloc[2].delta = VCS_SEQNO_OFFSET(engine) + 2 * sizeof(uint32_t);
+}
+
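+/*
+ * Wait for the batch at the given step index to complete, walking backwards
+ * to the nearest batch step and treating negative indices as relative to the
+ * end of the workload.
+ */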
+static void w_sync_to(struct workload *wrk, struct w_step *w, int target)
+{
+	if (target < 0)
+		target = wrk->nr_steps + target;
+
+	igt_assert(target < wrk->nr_steps);
+
+	while (wrk->steps[target].type != BATCH) {
+		if (--target < 0)
+			target = wrk->nr_steps + target;
+	}
+
+	igt_assert(target < wrk->nr_steps);
+	igt_assert(wrk->steps[target].type == BATCH);
+
+	gem_sync(fd, wrk->steps[target].obj[0].handle);
+}
+
+static void
+run_workload(unsigned int id, struct workload *wrk,
+	     bool background, int pipe_fd,
+	     const struct workload_balancer *balancer,
+	     unsigned int repeat,
+	     unsigned int flags)
+{
+	struct timespec t_start, t_end;
+	struct w_step *w;
+	bool run = true;
+	int throttle = -1;
+	int qd_throttle = -1;
+	double t;
+	int i, j;
+
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+
+	hars_petruska_f54_1_random_seed(0);
+
+	for (j = 0; run && (background || j < repeat); j++) {
+		clock_gettime(CLOCK_MONOTONIC, &wrk->repeat_start);
+
+		for (i = 0, w = wrk->steps; run && (i < wrk->nr_steps);
+		     i++, w++) {
+			enum intel_engine_id engine = w->engine;
+			int do_sleep = 0;
+
+			if (w->type == DELAY) {
+				do_sleep = w->wait;
+			} else if (w->type == PERIOD) {
+				struct timespec now;
+
+				clock_gettime(CLOCK_MONOTONIC, &now);
+				do_sleep = w->wait -
+					   elapsed_us(&wrk->repeat_start, &now);
+				if (do_sleep < 0) {
+					if (!quiet)
+						printf("%u: Dropped period @ %u/%u (%dus late)!\n",
+						       id, j, i, do_sleep);
+					continue;
+				}
+			} else if (w->type == SYNC) {
+				unsigned int s_idx = i + w->wait;
+
+				igt_assert(i > 0 && s_idx < i);
+				igt_assert(wrk->steps[s_idx].type == BATCH);
+				gem_sync(fd, wrk->steps[s_idx].obj[0].handle);
+				continue;
+			} else if (w->type == THROTTLE) {
+				throttle = w->wait;
+				continue;
+			} else if (w->type == QD_THROTTLE) {
+				qd_throttle = w->wait;
+				continue;
+			}
+
+			if (do_sleep) {
+				usleep(do_sleep);
+				continue;
+			}
+
+			wrk->nr_bb[engine]++;
+
+			if (engine == VCS && balancer) {
+				engine = balancer->balance(balancer, wrk, w);
+				wrk->nr_bb[engine]++;
+
+				eb_update_flags(w, engine, flags);
+
+				if (flags & SEQNO)
+					update_bb_seqno(w, engine,
+							++wrk->seqno[engine]);
+				if (flags & RT)
+					update_bb_rt(w, engine);
+			}
+
+			if (w->duration.min != w->duration.max) {
+				unsigned int d = get_duration(&w->duration);
+				unsigned long offset;
+
+				offset = ALIGN(w->bb_sz - get_bb_sz(d),
+					       2 * sizeof(uint32_t));
+				w->eb.batch_start_offset = offset;
+			}
+
+			/* If the workload wants queue-depth throttling but no
+			 * queue depth info is available, approximate it with
+			 * normal throttling. */
+			if (qd_throttle > 0 && throttle < 0 &&
+			    !(balancer && balancer->get_qd))
+				throttle = qd_throttle;
+
+			if (throttle > 0)
+				w_sync_to(wrk, w, i - throttle);
+
+			if (qd_throttle > 0 && balancer && balancer->get_qd) {
+				unsigned int target;
+
+				for (target = wrk->nr_steps - 1; target > 0;
+				     target--) {
+					if (balancer->get_qd(balancer, wrk,
+							     engine) <
+					    qd_throttle)
+						break;
+					w_sync_to(wrk, w, i - target);
+				}
+			}
+
+			gem_execbuf(fd, &w->eb);
+
+			if (pipe_fd >= 0) {
+				struct pollfd fds;
+
+				fds.fd = pipe_fd;
+				fds.events = POLLHUP;
+				if (poll(&fds, 1, 0)) {
+					run = false;
+					break;
+				}
+			}
+
+			if (w->wait)
+				gem_sync(fd, w->obj[0].handle);
+		}
+	}
+
+	if (run)
+		gem_sync(fd, wrk->steps[wrk->nr_steps - 1].obj[0].handle);
+
+	clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+	t = elapsed(&t_start, &t_end);
+	if (!quiet && !balancer)
+		printf("%c%u: %.3fs elapsed (%.3f workloads/s)\n",
+		       background ? ' ' : '*', id, t, repeat / t);
+	else if (!quiet && !balancer->get_qd)
+		printf("%c%u: %.3fs elapsed (%.3f workloads/s). %lu (%lu + %lu) total VCS batches.\n",
+		       background ? ' ' : '*', id, t, repeat / t,
+		       wrk->nr_bb[VCS], wrk->nr_bb[VCS1], wrk->nr_bb[VCS2]);
+	else if (!quiet && balancer)
+		printf("%c%u: %.3fs elapsed (%.3f workloads/s). %lu (%lu + %lu) total VCS batches. Average queue depths %.3f, %.3f.\n",
+		       background ? ' ' : '*', id, t, repeat / t,
+		       wrk->nr_bb[VCS], wrk->nr_bb[VCS1], wrk->nr_bb[VCS2],
+		       (double)wrk->qd_sum[VCS1] / wrk->nr_bb[VCS],
+		       (double)wrk->qd_sum[VCS2] / wrk->nr_bb[VCS]);
+}
+
+static void fini_workload(struct workload *wrk)
+{
+	free(wrk->steps);
+	free(wrk);
+}
+
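+/*
+ * Measure how many nop dwords the GPU executes in nop_calibration_us
+ * microseconds by repeatedly timing nop-filled batches and resizing them
+ * until the estimate is stable within the requested tolerance.
+ */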
+static unsigned long calibrate_nop(unsigned int tolerance_pct)
+{
+	const uint32_t bbe = 0xa << 23;
+	unsigned int loops = 17;
+	unsigned int usecs = nop_calibration_us;
+	struct drm_i915_gem_exec_object2 obj = {};
+	struct drm_i915_gem_execbuffer2 eb =
+		{ .buffer_count = 1, .buffers_ptr = (uintptr_t)&obj};
+	long size, last_size;
+	struct timespec t_0, t_end;
+
+	clock_gettime(CLOCK_MONOTONIC, &t_0);
+
+	size = 256 * 1024;
+	do {
+		struct timespec t_start;
+
+		obj.handle = gem_create(fd, size);
+		gem_write(fd, obj.handle, size - sizeof(bbe), &bbe,
+			  sizeof(bbe));
+		gem_execbuf(fd, &eb);
+		gem_sync(fd, obj.handle);
+
+		clock_gettime(CLOCK_MONOTONIC, &t_start);
+		for (int loop = 0; loop < loops; loop++)
+			gem_execbuf(fd, &eb);
+		gem_sync(fd, obj.handle);
+		clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+		gem_close(fd, obj.handle);
+
+		last_size = size;
+		size = loops * size / elapsed(&t_start, &t_end) / 1e6 * usecs;
+		size = ALIGN(size, sizeof(uint32_t));
+	} while (elapsed(&t_0, &t_end) < 5 ||
+		 abs(size - last_size) > (size * tolerance_pct / 100));
+
+	return size / sizeof(uint32_t);
+}
+
+static void print_help(void)
+{
+	puts(
+"Usage: gem_wsim [OPTIONS]\n"
+"\n"
+"Runs a simulated workload on the GPU.\n"
+"When ran without arguments performs a GPU calibration result of which needs\n"
+"to be provided when running the simulation in subsequent invocations.\n"
+"\n"
+"Options:\n"
+"	-h		This text.\n"
+"	-q		Be quiet - do not output anything to stdout.\n"
+"	-n <n>		Nop calibration value.\n"
+"	-t <n>		Nop calibration tolerance percentage.\n"
+"			Use when there is a difficulty obtaining calibration\n"
+"			with the default settings.\n"
+"	-w <desc|path>	Filename or a workload descriptor.\n"
+"			Can be given multiple times.\n"
+"	-W <desc|path>	Filename or a master workload descriptor.\n"
+"			Only one master workload can be optinally specified\n"
+"			in which case all other workloads become background\n"
+"			ones and run as long as the master.\n"
+"	-r <n>		How many times to emit the workload.\n"
+"	-c <n>		Fork N clients emitting the workload simultaneously.\n"
+"	-x		Swap VCS1 and VCS2 engines in every other client.\n"
+"	-b <n>		Load balancing to use. (0: rr, 1: qd)\n"
+	);
+}
+
+static char *load_workload_descriptor(char *filename)
+{
+	struct stat sbuf;
+	char *buf;
+	int infd, ret, i;
+	ssize_t len;
+
+	ret = stat(filename, &sbuf);
+	if (ret || !S_ISREG(sbuf.st_mode))
+		return filename;
+
+	igt_assert(sbuf.st_size < 1024 * 1024); /* Just so. */
+	buf = malloc(sbuf.st_size);
+	igt_assert(buf);
+
+	infd = open(filename, O_RDONLY);
+	igt_assert(infd >= 0);
+	len = read(infd, buf, sbuf.st_size);
+	igt_assert(len == sbuf.st_size);
+	close(infd);
+
+	for (i = 0; i < len; i++) {
+		if (buf[i] == '\n')
+			buf[i] = ',';
+	}
+
+	len--;
+	while (buf[len] == ',')
+		buf[len--] = 0;
+
+	return buf;
+}
+
+static char **
+add_workload_arg(char **w_args, unsigned int nr_args, char *w_arg)
+{
+	w_args = realloc(w_args, sizeof(char *) * nr_args);
+	igt_assert(w_args);
+	w_args[nr_args - 1] = w_arg;
+
+	return w_args;
+}
+
+int main(int argc, char **argv)
+{
+	unsigned int repeat = 1;
+	unsigned int clients = 1;
+	unsigned int flags = 0;
+	struct timespec t_start, t_end;
+	struct workload **w, **wrk = NULL;
+	unsigned int nr_w_args = 0;
+	int master_workload = -1;
+	char **w_args = NULL;
+	unsigned int tolerance_pct = 1;
+	const struct workload_balancer *balancer = NULL;
+	double t;
+	int i, c;
+
+	fd = drm_open_driver(DRIVER_INTEL);
+	intel_register_access_init(intel_get_pci_device(), false, fd);
+
+	while ((c = getopt(argc, argv, "qc:n:r:xw:W:t:b:h")) != -1) {
+		switch (c) {
+		case 'W':
+			if (master_workload >= 0) {
+				if (!quiet)
+					fprintf(stderr,
+						"Only one master workload can be given!\n");
+				return 1;
+			}
+			master_workload = nr_w_args;
+			/* Fall through */
+		case 'w':
+			w_args = add_workload_arg(w_args, ++nr_w_args, optarg);
+			break;
+		case 'c':
+			clients = strtol(optarg, NULL, 0);
+			break;
+		case 't':
+			tolerance_pct = strtol(optarg, NULL, 0);
+			break;
+		case 'n':
+			nop_calibration = strtol(optarg, NULL, 0);
+			break;
+		case 'r':
+			repeat = strtol(optarg, NULL, 0);
+			break;
+		case 'q':
+			quiet = true;
+			break;
+		case 'x':
+			flags |= SWAPVCS;
+			break;
+		case 'b':
+			switch (strtol(optarg, NULL, 0)) {
+			case 0:
+				balancer = &rr_balancer;
+				flags |= BALANCE;
+				break;
+			case 1:
+				igt_assert(intel_gen(intel_get_drm_devid(fd)) >=
+					   8);
+				balancer = &qd_balancer;
+				flags |= SEQNO | BALANCE;
+				break;
+			case 2:
+				igt_assert(intel_gen(intel_get_drm_devid(fd)) >=
+					   8);
+				balancer = &rt_balancer;
+				flags |= SEQNO | BALANCE | RT;
+				break;
+			default:
+				if (!quiet)
+					fprintf(stderr,
+						"Unknown balancing mode '%s'!\n",
+						optarg);
+				return 1;
+			}
+			break;
+		case 'h':
+			print_help();
+			return 0;
+		default:
+			return 1;
+		}
+	}
+
+	if (!nop_calibration) {
+		if (!quiet)
+			printf("Calibrating nop delay with %u%% tolerance...\n",
+				tolerance_pct);
+		nop_calibration = calibrate_nop(tolerance_pct);
+		if (!quiet)
+			printf("Nop calibration for %uus delay is %lu.\n",
+			       nop_calibration_us, nop_calibration);
+
+		return 0;
+	}
+
+	if (!nr_w_args) {
+		if (!quiet)
+			fprintf(stderr, "No workload descriptor(s)!\n");
+		return 1;
+	}
+
+	if (nr_w_args > 1 && clients > 1) {
+		if (!quiet)
+			fprintf(stderr,
+				"Cloned clients cannot be combined with multiple workloads!\n");
+		return 1;
+	}
+
+	wrk = calloc(nr_w_args, sizeof(*wrk));
+	igt_assert(wrk);
+
+	for (i = 0; i < nr_w_args; i++) {
+		w_args[i] = load_workload_descriptor(w_args[i]);
+		if (!w_args[i]) {
+			if (!quiet)
+				fprintf(stderr,
+					"Failed to load workload descriptor %u!\n",
+					i);
+			return 1;
+		}
+
+		wrk[i] = parse_workload(w_args[i]);
+		if (!wrk[i]) {
+			if (!quiet)
+				fprintf(stderr,
+					"Failed to parse workload %u!\n", i);
+			return 1;
+		}
+	}
+
+	if (nr_w_args > 1)
+		clients = nr_w_args;
+
+	if (!quiet) {
+		printf("Using %lu nop calibration for %uus delay.\n",
+		       nop_calibration, nop_calibration_us);
+		printf("%u client%s.\n", clients, clients > 1 ? "s" : "");
+		if (flags & SWAPVCS)
+			printf("Swapping VCS rings between clients.\n");
+	}
+
+	if (master_workload >= 0 && clients == 1)
+		master_workload = -1;
+
+	w = calloc(clients, sizeof(struct workload *));
+	igt_assert(w);
+
+	for (i = 0; i < clients; i++) {
+		unsigned int flags_ = flags;
+
+		w[i] = clone_workload(wrk[nr_w_args > 1 ? i : 0]);
+
+		if (master_workload >= 0) {
+			int ret = pipe(w[i]->pipe);
+
+			igt_assert(ret == 0);
+		}
+
+		if (flags & SWAPVCS && i & 1)
+			flags_ &= ~SWAPVCS;
+
+		prepare_workload(w[i], flags_);
+	}
+
+	clock_gettime(CLOCK_MONOTONIC, &t_start);
+
+	igt_fork(child, clients) {
+		int pipe_fd = -1;
+		bool background = false;
+
+		if (master_workload >= 0) {
+			close(w[child]->pipe[0]);
+			if (child != master_workload) {
+				pipe_fd = w[child]->pipe[1];
+				background = true;
+			} else {
+				close(w[child]->pipe[1]);
+			}
+		}
+
+		run_workload(child, w[child], background, pipe_fd, balancer,
+			     repeat, flags);
+	}
+
+	if (master_workload >= 0) {
+		int status = -1;
+		pid_t pid;
+
+		for (i = 0; i < clients; i++)
+			close(w[i]->pipe[1]);
+
+		pid = wait(&status);
+		if (pid >= 0)
+			igt_child_done(pid);
+
+		for (i = 0; i < clients; i++)
+			close(w[i]->pipe[0]);
+	}
+
+	igt_waitchildren();
+
+	clock_gettime(CLOCK_MONOTONIC, &t_end);
+
+	t = elapsed(&t_start, &t_end);
+	if (!quiet)
+		printf("%.3fs elapsed (%.3f workloads/s)\n",
+		       t, clients * repeat / t);
+
+	for (i = 0; i < clients; i++)
+		fini_workload(w[i]);
+	free(w);
+	for (i = 0; i < nr_w_args; i++)
+		fini_workload(wrk[i]);
+	free(w_args);
+
+	return 0;
+}
diff --git a/benchmarks/wsim/README b/benchmarks/wsim/README
new file mode 100644
index 000000000000..7aa0694aa834
--- /dev/null
+++ b/benchmarks/wsim/README
@@ -0,0 +1,56 @@
+Workload descriptor format
+==========================
+
+ctx.engine.duration_us.dependency.wait,...
+<uint>.<str>.<uint>[-<uint>].<int <= 0>.<0|1>,...
+d|p|s|t|q.<uint>,...
+
+For duration a range can be given from which a random value will be picked
+before every submit. Since this and seqno management require CPU access to
+objects, care needs to be taken to ensure the submit queue is deep enough
+that these operations do not affect the execution speed unless that is
+desired.
+
+Additional workload steps are also supported:
+
+ 'd' - Adds a delay (in microseconds).
+ 'p' - Adds a delay relative to the start of the previous loop so that each loop
+       starts execution with a given period.
+ 's' - Synchronises the pipeline to a batch relative to the step.
+ 't' - Throttles every n batches.
+ 'q' - Throttles to a maximum queue depth of n. (A short example using these
+       steps can be found after the main example below.)
+
+Engine ids: RCS, BCS, VCS, VCS1, VCS2, VECS
+
+Example (leading spaces must not be present in the actual file):
+----------------------------------------------------------------
+
+  1.VCS1.3000.0.1
+  1.RCS.500-1000.-1.0
+  1.RCS.3700.0.0
+  1.RCS.1000.-2.0
+  1.VCS2.2300.-2.0
+  1.RCS.4700.-1.0
+  1.VCS2.600.-1.1
+  p.16000
+
+The above workload described in human language works like this:
+
+  1.   A batch is sent to the VCS1 engine which will be executing for 3ms on the
+       GPU and userspace will wait until it is finished before proceeding.
+  2-4. Now three batches are sent to RCS with durations of 0.5-1ms (random
+       duration range), 3.7ms and 1ms respectively. The first batch has a data
+       dependency on the preceding VCS1 batch, and the last of the group depends
+       on the first from the group.
+  5.   Now a 2.3ms batch is sent to VCS2, with a data dependency on the 3.7ms
+       RCS batch.
+  6.   This is followed by a 4.7ms RCS batch with a data dependency on the 2.3ms
+       VCS2 batch.
+  7.   Then a 0.6ms VCS2 batch is sent depending on the previous RCS one. In the
+       same step the tool is told to wait for the batch to complete before
+       proceeding.
+  8.   Finally the tool is told to wait long enough to ensure the next iteration
+       starts 16ms after the previous one has started.
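+
+The 'd', 't' and 'q' steps are not used in the example above. For illustration,
+a short hypothetical snippet combining them could look like this:
+
+  q.3
+  d.500
+  0.VCS.1000-2000.0.0
+  0.RCS.1000.-1.1
+
+This requests queue-depth throttling at a depth of three, adds a 0.5ms delay at
+the start of each iteration, then submits a VCS batch with a random 1-2ms
+duration followed by an RCS batch which depends on it and is waited upon before
+the next iteration.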
+
+When workload descriptors are provided on the command line, commas must be used
+instead of new lines.
diff --git a/benchmarks/wsim/media_17i7.wsim b/benchmarks/wsim/media_17i7.wsim
new file mode 100644
index 000000000000..5f533d8e168b
--- /dev/null
+++ b/benchmarks/wsim/media_17i7.wsim
@@ -0,0 +1,7 @@
+1.VCS1.3000.0.1
+1.RCS.1000.-1.0
+1.RCS.3700.0.0
+1.RCS.1000.-2.0
+1.VCS2.2300.-2.0
+1.RCS.4700.-1.0
+1.VCS2.600.-1.1
diff --git a/benchmarks/wsim/media_19.wsim b/benchmarks/wsim/media_19.wsim
new file mode 100644
index 000000000000..f210d7940800
--- /dev/null
+++ b/benchmarks/wsim/media_19.wsim
@@ -0,0 +1,10 @@
+0.VECS.1400-1500.0.0
+0.RCS.1000-1500.-1.0
+s.-2
+2.VCS2.50-350.0.1
+1.VCS1.1300-1400.0.1
+0.VECS.1400-1500.0.0
+0.RCS.100-300.-1.1
+2.RCS.1300-1500.0.0
+2.VCS2.100-300.-1.1
+1.VCS1.900-1400.0.1
diff --git a/benchmarks/wsim/media_load_balance_17i7.wsim b/benchmarks/wsim/media_load_balance_17i7.wsim
new file mode 100644
index 000000000000..25a692032eae
--- /dev/null
+++ b/benchmarks/wsim/media_load_balance_17i7.wsim
@@ -0,0 +1,7 @@
+1.VCS.3000.0.1
+1.RCS.1000.-1.0
+1.RCS.3700.0.0
+1.RCS.1000.-2.0
+1.VCS.2300.-2.0
+1.RCS.4700.-1.0
+1.VCS.600.-1.1
diff --git a/benchmarks/wsim/media_load_balance_19.wsim b/benchmarks/wsim/media_load_balance_19.wsim
new file mode 100644
index 000000000000..03890776fda3
--- /dev/null
+++ b/benchmarks/wsim/media_load_balance_19.wsim
@@ -0,0 +1,10 @@
+0.VECS.1400-1500.0.0
+0.RCS.1000-1500.-1.0
+s.-2
+1.VCS.50-350.0.1
+1.VCS.1300-1400.0.1
+0.VECS.1400-1500.0.0
+0.RCS.100-300.-1.1
+1.RCS.1300-1500.0.0
+1.VCS.100-300.-1.1
+1.VCS.900-1400.0.1
diff --git a/benchmarks/wsim/vcs1.wsim b/benchmarks/wsim/vcs1.wsim
new file mode 100644
index 000000000000..9d3e682b5ce8
--- /dev/null
+++ b/benchmarks/wsim/vcs1.wsim
@@ -0,0 +1,26 @@
+t.5
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
+0.VCS1.500-2000.0.0
diff --git a/benchmarks/wsim/vcs_balanced.wsim b/benchmarks/wsim/vcs_balanced.wsim
new file mode 100644
index 000000000000..e8958b8f7f43
--- /dev/null
+++ b/benchmarks/wsim/vcs_balanced.wsim
@@ -0,0 +1,26 @@
+q.5
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
+0.VCS.500-2000.0.0
diff --git a/lib/igt_core.c b/lib/igt_core.c
index 403b9423fa9f..9c3b37fe3d63 100644
--- a/lib/igt_core.c
+++ b/lib/igt_core.c
@@ -1558,6 +1558,32 @@ bool __igt_fork(void)
 }
 
 /**
+ * igt_child_done:
+ *
+ * Lets the IGT core know that one of the children has exited.
+ */
+void igt_child_done(pid_t pid)
+{
+	int i = 0;
+	int found = -1;
+
+	igt_assert(num_test_children > 1);
+
+	for (i = 0; i < num_test_children; i++) {
+		if (pid == test_children[i]) {
+			found = i;
+			break;
+		}
+	}
+
+	igt_assert(found >= 0);
+
+	num_test_children--;
+	for (i = found; i < num_test_children; i++)
+		test_children[i] = test_children[i + 1];
+}
+
+/**
  * igt_waitchildren:
  *
  * Wait for all children forked with igt_fork.
diff --git a/lib/igt_core.h b/lib/igt_core.h
index 51b98d82ef7f..4a125af1d6a5 100644
--- a/lib/igt_core.h
+++ b/lib/igt_core.h
@@ -688,6 +688,7 @@ bool __igt_fork(void);
 #define igt_fork(child, num_children) \
 	for (int child = 0; child < (num_children); child++) \
 		for (; __igt_fork(); exit(0))
+void igt_child_done(pid_t pid);
 void igt_waitchildren(void);
 void igt_waitchildren_timeout(int seconds, const char *reason);
 
-- 
2.9.3


* Re: [PATCH i-g-t v6] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-25 11:13       ` [PATCH i-g-t v6] " Tvrtko Ursulin
@ 2017-04-25 11:35         ` Chris Wilson
  2017-04-25 12:10           ` Tvrtko Ursulin
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Wilson @ 2017-04-25 11:35 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Tue, Apr 25, 2017 at 12:13:04PM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Tool which emits batch buffers to engines with configurable
> sequences, durations, contexts, dependencies and userspace waits.
> 
> Unfinished but shows promise so sending out for early feedback.
> 
> v2:
>  * Load workload descriptors from files. (also -w)
>  * Help text.
>  * Calibration control if needed. (-t)
>  * NORELOC | LUT to eb flags.
>  * Added sample workload to wsim/workload1.
> 
> v3:
>  * Multiple parallel different workloads (-w -w ...).
>  * Multi-context workloads.
>  * Variable (random) batch length.
>  * Load balancing (round robin and queue depth estimation).
>  * Workloads delays and explicit sync steps.
>  * Workload frequency (period) control.
> 
> v4:
>  * Fixed queue-depth estimation by creating separate batches
>    per engine when qd load balancing is on.
>  * Dropped separate -s cmd line option. It can turn itself on
>    automatically when needed.
>  * Keep a single status page and lie about the write hazard
>    as suggested by Chris.
>  * Use batch_start_offset for controlling the batch duration.
>    (Chris)
>  * Set status page object cache level. (Chris)
>  * Moved workload description to a README.
>  * Tidied example workloads.
>  * Some other cleanups and refactorings.
> 
> v5:
>  * Master and background workloads (-W / -w).
>  * Single batch per step is enough even when balancing. (Chris)
>  * Use hars_petruska_f54_1_random IGT functions and seed to zero
>    at start. (Chris)
>  * Use WC cache domain when WC mapping. (Chris)
>  * Keep seqnos 64-bytes apart in the status page. (Chris)
>  * Add workload throttling and queue-depth throttling commands.
>    (Chris)
> 
> v6:
>  * Added two more workloads.
>  * Merged RT balancer from Chris.
> 
> TODO list:

* No reloc!
* bb caching/reuse
 
>  * Fence support.
>  * Better error handling.
>  * Less 1980's workload parsing.
>  * More workloads.
>  * Threads?
>  * ... ?
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
> ---

> +static enum intel_engine_id
> +rt_balance(const struct workload_balancer *balancer,
> +	   struct workload *wrk, struct w_step *w)
> +{
> +	enum intel_engine_id engine;
> +	long qd[NUM_ENGINES];
> +	unsigned int n;
> +
> +	igt_assert(w->engine == VCS);
> +
> +	/* Estimate the "speed" of the most recent batch
> +	 *    (finish time - submit time)
> +	 * and use that as an approximate for the total remaining time for
> +	 * all batches on that engine. We try to keep the total remaining
> +	 * balanced between the engines.
> +	 */

Next steps for this would be to move from an instantaneous speed to an
average. I'm thinking something like an exponential decay moving average
just to make the estimation more robust.
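
For instance (untested sketch, the weight is plucked out of thin air):

	/* Fold each new batch duration sample into a decaying average. */
	static double update_avg_us(double avg_us, double sample_us)
	{
		return (3 * avg_us + sample_us) / 4; /* alpha = 1/4 */
	}

i.e. keep a quarter of the weight on the newest sample and decay the
rest exponentially, rather than trusting a single batch.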

> +			if (qd_throttle > 0 && balancer && balancer->get_qd) {
> +				unsigned int target;
> +
> +				for (target = wrk->nr_steps - 1; target > 0;
> +				     target--) {

I think this should skip other engines.

if (target->engine != engine)
	continue;

> +					if (balancer->get_qd(balancer, wrk,
> +							     engine) <
> +					    qd_throttle)
> +						break;
> +					w_sync_to(wrk, w, i - target);
> +				}

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH i-g-t v6] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-25 11:35         ` Chris Wilson
@ 2017-04-25 12:10           ` Tvrtko Ursulin
  2017-04-25 12:25             ` Chris Wilson
  0 siblings, 1 reply; 26+ messages in thread
From: Tvrtko Ursulin @ 2017-04-25 12:10 UTC (permalink / raw)
  To: Chris Wilson, Tvrtko Ursulin, Intel-gfx, Tvrtko Ursulin,
	Rogozhkin, Dmitry V


On 25/04/2017 12:35, Chris Wilson wrote:
> On Tue, Apr 25, 2017 at 12:13:04PM +0100, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Tool which emits batch buffers to engines with configurable
>> sequences, durations, contexts, dependencies and userspace waits.
>>
>> Unfinished but shows promise so sending out for early feedback.
>>
>> v2:
>>  * Load workload descriptors from files. (also -w)
>>  * Help text.
>>  * Calibration control if needed. (-t)
>>  * NORELOC | LUT to eb flags.
>>  * Added sample workload to wsim/workload1.
>>
>> v3:
>>  * Multiple parallel different workloads (-w -w ...).
>>  * Multi-context workloads.
>>  * Variable (random) batch length.
>>  * Load balancing (round robin and queue depth estimation).
>>  * Workloads delays and explicit sync steps.
>>  * Workload frequency (period) control.
>>
>> v4:
>>  * Fixed queue-depth estimation by creating separate batches
>>    per engine when qd load balancing is on.
>>  * Dropped separate -s cmd line option. It can turn itself on
>>    automatically when needed.
>>  * Keep a single status page and lie about the write hazard
>>    as suggested by Chris.
>>  * Use batch_start_offset for controlling the batch duration.
>>    (Chris)
>>  * Set status page object cache level. (Chris)
>>  * Moved workload description to a README.
>>  * Tidied example workloads.
>>  * Some other cleanups and refactorings.
>>
>> v5:
>>  * Master and background workloads (-W / -w).
>>  * Single batch per step is enough even when balancing. (Chris)
>>  * Use hars_petruska_f54_1_random IGT functions and seed to zero
>>    at start. (Chris)
>>  * Use WC cache domain when WC mapping. (Chris)
>>  * Keep seqnos 64-bytes apart in the status page. (Chris)
>>  * Add workload throttling and queue-depth throttling commands.
>>    (Chris)
>>
>> v6:
>>  * Added two more workloads.
>>  * Merged RT balancer from Chris.
>>
>> TODO list:
>
> * No reloc!
> * bb caching/reuse

Yeah I know, but I have to make progress on the overall case as well and I
think it is getting close to good enough now. So now is the time to think
of interesting workloads and workload combinations.

>>  * Fence support.
>>  * Better error handling.
>>  * Less 1980's workload parsing.
>>  * More workloads.
>>  * Threads?
>>  * ... ?
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
>> ---
>
>> +static enum intel_engine_id
>> +rt_balance(const struct workload_balancer *balancer,
>> +	   struct workload *wrk, struct w_step *w)
>> +{
>> +	enum intel_engine_id engine;
>> +	long qd[NUM_ENGINES];
>> +	unsigned int n;
>> +
>> +	igt_assert(w->engine == VCS);
>> +
>> +	/* Estimate the "speed" of the most recent batch
>> +	 *    (finish time - submit time)
>> +	 * and use that as an approximate for the total remaining time for
>> +	 * all batches on that engine. We try to keep the total remaining
>> +	 * balanced between the engines.
>> +	 */
>
> Next steps for this would be to move from an instantaneous speed to an
> average. I'm thinking something like an exponential decay moving average
> just to make the estimation more robust.

Do you think it would be OK to merge these two tools at this point and 
continue improving them in place?

Your balancer already looks like a solid step up from the queue-depth one. I
checked today myself and, in what looks like a worst case of a VCS1 hog and a
balancing workload running together, it gets the VCS2 utilisation to an
impressive 85%.

As mentioned before those stats can now be collected easily with:

   trace.pl --trace gem_wsim ...; perf script | trace.pl

I need to start pinging the relevant people for help with creating 
relevant workloads and am also entertaining the idea of trying balancing 
via exporting the stats from i915 directly. Just to see if true vs 
estimated numbers would make a difference here.

>> +			if (qd_throttle > 0 && balancer && balancer->get_qd) {
>> +				unsigned int target;
>> +
>> +				for (target = wrk->nr_steps - 1; target > 0;
>> +				     target--) {
>
> I think this should skip other engines.
>
> if (target->engine != engine)
> 	continue;

If you say so. I don't have an opinion on it. Would it be useful to 
perhaps have both options - to throttle globally and per-engine? I could 
easily add two different workload commands for that.

Regards,

Tvrtko

* Re: [PATCH i-g-t v6] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-25 12:10           ` Tvrtko Ursulin
@ 2017-04-25 12:25             ` Chris Wilson
  2017-04-25 12:31               ` Chris Wilson
  0 siblings, 1 reply; 26+ messages in thread
From: Chris Wilson @ 2017-04-25 12:25 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: Intel-gfx

On Tue, Apr 25, 2017 at 01:10:34PM +0100, Tvrtko Ursulin wrote:
> 
> On 25/04/2017 12:35, Chris Wilson wrote:
> >On Tue, Apr 25, 2017 at 12:13:04PM +0100, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
[snip]
> >>+static enum intel_engine_id
> >>+rt_balance(const struct workload_balancer *balancer,
> >>+	   struct workload *wrk, struct w_step *w)
> >>+{
> >>+	enum intel_engine_id engine;
> >>+	long qd[NUM_ENGINES];
> >>+	unsigned int n;
> >>+
> >>+	igt_assert(w->engine == VCS);
> >>+
> >>+	/* Estimate the "speed" of the most recent batch
> >>+	 *    (finish time - submit time)
> >>+	 * and use that as an approximate for the total remaining time for
> >>+	 * all batches on that engine. We try to keep the total remaining
> >>+	 * balanced between the engines.
> >>+	 */
> >
> >Next steps for this would be to move from an instantaneous speed to an
> >average. I'm thinking something like an exponential decay moving average
> >just to make the estimation more robust.
> 
> Do you think it would be OK to merge these two tools at this point
> and continue improving them in place?

Yes. Although there's no excuse not to make this NO_RELOC from the start,
especially if we want to demonstrate how it should be done! Hopefully
I've attached the delta.

> Your balancer already looks like a solid step up from the queue-depth
> one. I checked today myself and, in what looks like a worst case of a
> VCS1 hog and a balancing workload running together, it gets the
> VCS2 utilisation to an impressive 85%.

Yup. Just thinking of the danger in using an instantaneous value as our
estimate and how to improve on it.

> As mentioned before those stats can now be collected easily with:
> 
>   trace.pl --trace gem_wsim ...; perf script | trace.pl
> 
> I need to start pinging the relevant people for help with creating
> relevant workloads and am also entertaining the idea of trying
> balancing via exporting the stats from i915 directly. Just to see if
> true vs estimated numbers would make a difference here.

Aye.
 
> >>+			if (qd_throttle > 0 && balancer && balancer->get_qd) {
> >>+				unsigned int target;
> >>+
> >>+				for (target = wrk->nr_steps - 1; target > 0;
> >>+				     target--) {
> >
> >I think this should skip other engines.
> >
> >if (target->engine != engine)
> >	continue;
> 
> If you say so. I don't have an opinion on it. Would it be useful to
> perhaps have both options - to throttle globally and per-engine? I
> could easily add two different workload commands for that.

I'm leaning towards making it a balancer->callback().
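
Roughly (signatures approximate, and the throttle hook below is made up):

	struct workload_balancer {
		/* ...existing fields... */
		unsigned int (*get_qd)(const struct workload_balancer *balancer,
				       struct workload *wrk,
				       enum intel_engine_id engine);
		/* per-balancer throttling policy, global or per-engine */
		void (*throttle)(const struct workload_balancer *balancer,
				 struct workload *wrk, struct w_step *w,
				 unsigned int qd_throttle);
	};

so each balancer can decide for itself how the queue depth is counted and
when to stall submission.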
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH i-g-t v6] benchmarks/gem_wsim: Command submission workload simulator
  2017-04-25 12:25             ` Chris Wilson
@ 2017-04-25 12:31               ` Chris Wilson
  0 siblings, 0 replies; 26+ messages in thread
From: Chris Wilson @ 2017-04-25 12:31 UTC (permalink / raw)
  To: Tvrtko Ursulin, Tvrtko Ursulin, Intel-gfx, Tvrtko Ursulin,
	Rogozhkin, Dmitry V

[-- Attachment #1: Type: text/plain, Size: 1505 bytes --]

On Tue, Apr 25, 2017 at 01:25:48PM +0100, Chris Wilson wrote:
> On Tue, Apr 25, 2017 at 01:10:34PM +0100, Tvrtko Ursulin wrote:
> > 
> > On 25/04/2017 12:35, Chris Wilson wrote:
> > >On Tue, Apr 25, 2017 at 12:13:04PM +0100, Tvrtko Ursulin wrote:
> > >>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> [snip]
> > >>+static enum intel_engine_id
> > >>+rt_balance(const struct workload_balancer *balancer,
> > >>+	   struct workload *wrk, struct w_step *w)
> > >>+{
> > >>+	enum intel_engine_id engine;
> > >>+	long qd[NUM_ENGINES];
> > >>+	unsigned int n;
> > >>+
> > >>+	igt_assert(w->engine == VCS);
> > >>+
> > >>+	/* Estimate the "speed" of the most recent batch
> > >>+	 *    (finish time - submit time)
> > >>+	 * and use that as an approximate for the total remaining time for
> > >>+	 * all batches on that engine. We try to keep the total remaining
> > >>+	 * balanced between the engines.
> > >>+	 */
> > >
> > >Next steps for this would be to move from an instantaneous speed to an
> > >average. I'm thinking something like an exponential decay moving average
> > >just to make the estimation more robust.
> > 
> > Do you think it would be OK to merge these two tools at this point
> > and continue improving them in place?
> 
> Yes. Although there's no excuse not to make this NO_RELOC from the start,
> especially if we want to demonstrate how it should be done! Hopefully
> I've attached the delta.

Which I forgot. Let's try again...
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

[-- Attachment #2: 0001-no-reloc.patch --]
[-- Type: text/x-diff, Size: 3686 bytes --]

From 985f873f1c9cdaec396c5410738910da04e8f95b Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue, 25 Apr 2017 13:22:39 +0100
Subject: [PATCH] no-reloc

---
 benchmarks/gem_wsim.c | 45 +++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/benchmarks/gem_wsim.c b/benchmarks/gem_wsim.c
index 1f491a5..f13477a 100644
--- a/benchmarks/gem_wsim.c
+++ b/benchmarks/gem_wsim.c
@@ -90,9 +90,13 @@ struct w_step
 	struct drm_i915_gem_relocation_entry reloc[3];
 	unsigned long bb_sz;
 	uint32_t bb_handle;
-	uint32_t *mapped_batch, *mapped_seqno;
-	unsigned int mapped_len;
+	uint32_t *mapped_batch;
+	uint32_t *seqno_value;
+	uint32_t *seqno_address;
 	uint32_t *rt0_value;
+	uint32_t *rt0_address;
+	uint32_t *rt1_address;
+	unsigned int mapped_len;
 };
 
 struct workload
@@ -463,9 +467,10 @@ terminate_bb(struct w_step *w, unsigned int flags)
 		batch_start += 4 * sizeof(uint32_t);
 
 		*cs++ = MI_STORE_DWORD_IMM;
+		w->seqno_address = cs;
 		*cs++ = 0;
 		*cs++ = 0;
-		w->mapped_seqno = cs;
+		w->seqno_value = cs;
 		*cs++ = 0;
 	}
 
@@ -474,6 +479,7 @@ terminate_bb(struct w_step *w, unsigned int flags)
 		batch_start += 4 * sizeof(uint32_t);
 
 		*cs++ = MI_STORE_DWORD_IMM;
+		w->rt0_address = cs;
 		*cs++ = 0;
 		*cs++ = 0;
 		w->rt0_value = cs;
@@ -484,6 +490,7 @@ terminate_bb(struct w_step *w, unsigned int flags)
 
 		*cs++ = 0x24 << 23 | 2; /* MI_STORE_REG_MEM */
 		*cs++ = RCS_TIMESTAMP;
+		w->rt1_address = cs;
 		*cs++ = 0;
 		*cs++ = 0;
 	}
@@ -500,8 +507,7 @@ eb_update_flags(struct w_step *w, enum intel_engine_id engine,
 {
 	w->eb.flags = eb_engine_map[engine];
 	w->eb.flags |= I915_EXEC_HANDLE_LUT;
-	if (!(flags & SEQNO))
-		w->eb.flags |= I915_EXEC_NO_RELOC;
+	w->eb.flags |= I915_EXEC_NO_RELOC;
 }
 
 static void
@@ -543,10 +549,8 @@ alloc_step_batch(struct workload *wrk, struct w_step *w, unsigned int flags)
 			w->obj[bb_i].relocation_count = 3;
 		else
 			w->obj[bb_i].relocation_count = 1;
-		for (int i = 0; i < w->obj[bb_i].relocation_count; i++) {
-			w->reloc[i].presumed_offset = -1;
+		for (int i = 0; i < w->obj[bb_i].relocation_count; i++)
 			w->reloc[i].target_handle = 1;
-		}
 	}
 
 	w->eb.buffers_ptr = to_user_pointer(w->obj);
@@ -782,10 +786,14 @@ update_bb_seqno(struct w_step *w, enum intel_engine_id engine, uint32_t seqno)
 	gem_set_domain(fd, w->bb_handle,
 		       I915_GEM_DOMAIN_WC, I915_GEM_DOMAIN_WC);
 
-	*w->mapped_seqno = seqno;
-
-	w->reloc[0].presumed_offset = -1;
 	w->reloc[0].delta = VCS_SEQNO_OFFSET(engine);
+
+	*w->seqno_value = seqno;
+	*w->seqno_address = w->reloc[0].presumed_offset + w->reloc[0].delta;
+
+	/* If not using NO_RELOC, force the relocations */
+	if ((w->eb.flags & I915_EXEC_NO_RELOC))
+		w->reloc[0].presumed_offset = -1;
 }
 
 static void
@@ -796,13 +804,18 @@ update_bb_rt(struct w_step *w, enum intel_engine_id engine)
 	gem_set_domain(fd, w->bb_handle,
 		       I915_GEM_DOMAIN_WC, I915_GEM_DOMAIN_WC);
 
-	*w->rt0_value = *REG(RCS_TIMESTAMP);
-
-	w->reloc[1].presumed_offset = -1;
 	w->reloc[1].delta = VCS_SEQNO_OFFSET(engine) + sizeof(uint32_t);
-
-	w->reloc[2].presumed_offset = -1;
 	w->reloc[2].delta = VCS_SEQNO_OFFSET(engine) + 2 * sizeof(uint32_t);
+
+	*w->rt0_value = *REG(RCS_TIMESTAMP);
+	*w->rt0_address = w->reloc[1].presumed_offset + w->reloc[1].delta;
+	*w->rt1_address = w->reloc[2].presumed_offset + w->reloc[2].delta;
+
+	/* If not using NO_RELOC, force the relocations */
+	if ((w->eb.flags & I915_EXEC_NO_RELOC)) {
+		w->reloc[1].presumed_offset = -1;
+		w->reloc[2].presumed_offset = -1;
+	}
 }
 
 static void w_sync_to(struct workload *wrk, struct w_step *w, int target)
-- 
1.9.1



Thread overview: 26+ messages
2017-03-31 14:58 [PATCH i-g-t 0/2] Workload simulation and tracing Tvrtko Ursulin
2017-03-31 14:58 ` [PATCH i-g-t 1/2] benchmarks/gem_wsim: Command submission workload simulator Tvrtko Ursulin
2017-03-31 15:19   ` Chris Wilson
2017-04-05 16:14   ` [PATCH i-g-t v3] " Tvrtko Ursulin
2017-04-05 16:48     ` Chris Wilson
2017-04-06  8:18       ` Tvrtko Ursulin
2017-04-06  8:55         ` Chris Wilson
2017-04-07  8:53           ` Tvrtko Ursulin
2017-04-07  9:51             ` Chris Wilson
2017-04-20 12:29   ` [PATCH i-g-t v4] " Tvrtko Ursulin
2017-04-20 14:23     ` Chris Wilson
2017-04-20 14:33       ` Chris Wilson
2017-04-20 14:45         ` Tvrtko Ursulin
2017-04-20 14:34       ` Tvrtko Ursulin
2017-04-20 15:11         ` Chris Wilson
2017-04-20 14:52     ` Chris Wilson
2017-04-20 15:06       ` Tvrtko Ursulin
2017-04-20 16:20     ` Chris Wilson
2017-04-21 15:21     ` [PATCH i-g-t v5] " Tvrtko Ursulin
2017-04-25 11:13       ` [PATCH i-g-t v6] " Tvrtko Ursulin
2017-04-25 11:35         ` Chris Wilson
2017-04-25 12:10           ` Tvrtko Ursulin
2017-04-25 12:25             ` Chris Wilson
2017-04-25 12:31               ` Chris Wilson
2017-03-31 14:58 ` [PATCH i-g-t 2/2] igt/scripts: trace.pl to parse the i915 tracepoints Tvrtko Ursulin
2017-04-24 14:42 ` [PATCH i-g-t v4] " Tvrtko Ursulin
