* [PATCH 0/2] Updates to gem_exec_nop parallel execution test
From: Dave Gordon @ 2016-08-03 15:36 UTC
  To: intel-gfx

The parallel execution test in gem_exec_nop is unrealistic, and
in fact chooses a pessimal distribution of work to multiple engines.
The first patch in this series adapts it to send multiple batches
(currently 64) to each engine in a round-robin fashion, thus keeping
all engines adequately fed with work.

The second patch simply changes the output of this same test to make
it more obvious what measurements are made and what results can be
calculated from them.

Dave Gordon (2):
  igt/gem_exec_nop: add burst submission to parallel execution test
  igt/gem_exec_nop: clarify & extend output from parallel execution test

 tests/gem_exec_nop.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

-- 
1.9.1


* [PATCH 1/2] igt/gem_exec_nop: add burst submission to parallel execution test
From: Dave Gordon @ 2016-08-03 15:36 UTC
  To: intel-gfx

The parallel execution test in gem_exec_nop chooses a pessimal
distribution of work to multiple engines; specifically, it
round-robins one batch to each engine in turn. As the workloads
are trivial (NOPs), this results in each engine becoming idle
between batches. Hence parallel submission is seen to take LONGER
than the same number of batches executed sequentially.

If on the other hand we send enough work to each engine to keep
it busy until the next time we add to its queue (i.e. round-robin
some larger number of batches to each engine in turn), then we can
get true parallel execution and should find that it is FASTER than
sequential execution.

By experiment, burst sizes of between 8 and 256 are sufficient to
keep multiple engines loaded, with the optimum (for this trivial
workload) being around 64. This is expected to be lower (possibly
as low as one) for more realistic (heavier) workloads.

Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
---
 tests/gem_exec_nop.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tests/gem_exec_nop.c b/tests/gem_exec_nop.c
index 9b89260..c2bd472 100644
--- a/tests/gem_exec_nop.c
+++ b/tests/gem_exec_nop.c
@@ -166,14 +166,17 @@ static void all(int fd, uint32_t handle, int timeout)
 	gem_sync(fd, handle);
 	intel_detect_and_clear_missed_interrupts(fd);
 
+#define	BURST	64
+
 	count = 0;
 	clock_gettime(CLOCK_MONOTONIC, &start);
 	do {
-		for (int loop = 0; loop < 1024; loop++) {
+		for (int loop = 0; loop < 1024/BURST; loop++) {
 			for (int n = 0; n < nengine; n++) {
 				execbuf.flags &= ~ENGINE_FLAGS;
 				execbuf.flags |= engines[n];
-				gem_execbuf(fd, &execbuf);
+				for (int b = 0; b < BURST; ++b)
+					gem_execbuf(fd, &execbuf);
 			}
 		}
 		count += nengine * 1024;
-- 
1.9.1


* [PATCH 2/2] igt/gem_exec_nop: clarify & extend output from parallel execution test
From: Dave Gordon @ 2016-08-03 15:36 UTC
  To: intel-gfx

To make sense of the output of the parallel execution test (preferably
without reading the source!), we need to see the various measurements
that it makes, specifically: time/batch on each engine separately, total
time across all engines sequentially, and the time/batch when the work
is distributed over all engines in parallel.

Since we know the per-batch time on the slowest engine (which will
determine the minimum possible execution time of any equal-split
parallel test), we can also calculate a new figure representing the
degree to which work on the faster engines is overlapped with that on
the slowest engine, and therefore does not contribute to the total time.
Here we choose to present it as a percentage, with parallel-time==serial
time giving 0% overlap, up to parallel-time==slowest-engine-
time/n_engines being 100%. Note that negative values are possible;
values greater than 100% may also be possible, although less likely.

Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
---
 tests/gem_exec_nop.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/tests/gem_exec_nop.c b/tests/gem_exec_nop.c
index c2bd472..05aa383 100644
--- a/tests/gem_exec_nop.c
+++ b/tests/gem_exec_nop.c
@@ -137,7 +137,9 @@ static void all(int fd, uint32_t handle, int timeout)
 		if (ignore_engine(fd, engine))
 			continue;
 
-		time = nop_on_ring(fd, handle, engine, 1, &count) / count;
+		time = nop_on_ring(fd, handle, engine, 2, &count) / count;
+		igt_info("%s: %'lu cycles: %.3fus/batch\n",
+			 e__->name, count, time*1e6);
 		if (time > max) {
 			name = e__->name;
 			max = time;
@@ -148,8 +150,9 @@ static void all(int fd, uint32_t handle, int timeout)
 		engines[nengine++] = engine;
 	}
 	igt_require(nengine);
-	igt_info("Maximum execution latency on %s, %.3fus, total %.3fus per cycle\n",
-		 name, max*1e6, sum*1e6);
+	igt_info("Slowest engine was %s, %.3fus/batch\n", name, max*1e6);
+	igt_info("Total for all %d engines is %.3fus per cycle, average %.3fus/batch\n",
+		 nengine, sum*1e6, sum*1e6/nengine);
 
 	memset(&obj, 0, sizeof(obj));
 	obj.handle = handle;
@@ -187,8 +190,10 @@ static void all(int fd, uint32_t handle, int timeout)
 	igt_assert_eq(intel_detect_and_clear_missed_interrupts(fd), 0);
 
 	time = elapsed(&start, &now) / count;
-	igt_info("All (%d engines): %'lu cycles, average %.3fus per cycle\n",
-		 nengine, count, 1e6*time);
+	igt_info("All %d engines (parallel/%d): %'lu cycles, "
+		 "average %.3fus/batch, overlap %.1f%\n",
+		 nengine, BURST, count,
+		 1e6*time, 100*(sum-time)/(sum-(max/nengine)));
 
 	/* The rate limiting step is how fast the slowest engine can
 	 * its queue of requests, if we wait upon a full ring all dispatch
-- 
1.9.1


* Re: [PATCH 1/2] igt/gem_exec_nop: add burst submission to parallel execution test
From: Chris Wilson @ 2016-08-03 15:45 UTC
  To: Dave Gordon; +Cc: intel-gfx

On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
> The parallel execution test in gem_exec_nop chooses a pessimal
> distribution of work to multiple engines; specifically, it
> round-robins one batch to each engine in turn. As the workloads
> are trivial (NOPs), this results in each engine becoming idle
> between batches. Hence parallel submission is seen to take LONGER
> than the same number of batches executed sequentially.
> 
> If on the other hand we send enough work to each engine to keep
> it busy until the next time we add to its queue (i.e. round-robin
> some larger number of batches to each engine in turn), then we can
> get true parallel execution and should find that it is FASTER than
> sequential execution.
> 
> By experiment, burst sizes of between 8 and 256 are sufficient to
> keep multiple engines loaded, with the optimum (for this trivial
> workload) being around 64. This is expected to be lower (possibly
> as low as one) for more realistic (heavier) workloads.

Quite funny. The driver submission overhead of A...A vs ABAB... engines
is nearly identical, at least as far as the analysis presented here.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre

* Re: [PATCH 1/2] igt/gem_exec_nop: add burst submission to parallel execution test
From: Dave Gordon @ 2016-08-03 16:05 UTC
  To: Chris Wilson, intel-gfx

On 03/08/16 16:45, Chris Wilson wrote:
> On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
>> The parallel execution test in gem_exec_nop chooses a pessimal
>> distribution of work to multiple engines; specifically, it
>> round-robins one batch to each engine in turn. As the workloads
>> are trivial (NOPs), this results in each engine becoming idle
>> between batches. Hence parallel submission is seen to take LONGER
>> than the same number of batches executed sequentially.
>>
>> If on the other hand we send enough work to each engine to keep
>> it busy until the next time we add to its queue (i.e. round-robin
>> some larger number of batches to each engine in turn), then we can
>> get true parallel execution and should find that it is FASTER than
>> sequential execution.
>>
>> By experiment, burst sizes of between 8 and 256 are sufficient to
>> keep multiple engines loaded, with the optimum (for this trivial
>> workload) being around 64. This is expected to be lower (possibly
>> as low as one) for more realistic (heavier) workloads.
>
> Quite funny. The driver submission overhead of A...A vs ABAB... engines
> is nearly identical, at least as far as the analysis presented here.
> -Chris

Correct; but because the workloads are so trivial, if we hand out jobs 
one at a time to each engine, the first will have finished the one batch 
it's been given before we get round to giving it a second one (even in 
execlist mode). If there are N engines, submitting a single batch takes 
S seconds, and the workload takes W seconds to execute, then if W < N*S 
the engine will be idle between batches. For example, if N is 4, W is 
2us, and S is 1us, then the engine will be idle some 50% of the time.

This wouldn't be an issue for more realistic workloads, where W >> S.
It only looks problematic because of the trivial nature of the work.
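
As a minimal sketch of that arithmetic (illustration only, not part of
any patch):

        #include <stdio.h>

        int main(void)
        {
                double N = 4;           /* engines */
                double W = 2e-6;        /* execution time per batch, seconds */
                double S = 1e-6;        /* submission time per batch, seconds */

                /* Handing out one batch per engine per pass, an engine
                 * is revisited every N*S seconds but only has W seconds
                 * of work to do, so it idles whenever W < N*S. */
                if (W < N * S)
                        printf("idle %.0f%% of the time\n",
                               100 * (N * S - W) / (N * S));
                return 0;
        }

which prints "idle 50% of the time" for the example numbers above.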

.Dave.

* ✗ Ro.CI.BAT: failure for Updates to gem_exec_nop parallel execution test
From: Patchwork @ 2016-08-03 16:07 UTC
  To: Dave Gordon; +Cc: intel-gfx

== Series Details ==

Series: Updates to gem_exec_nop parallel execution test
URL   : https://patchwork.freedesktop.org/series/10603/
State : failure

== Summary ==

Applying: igt/gem_exec_nop: add burst submission to parallel execution test
fatal: sha1 information is lacking or useless (tests/gem_exec_nop.c).
error: could not build fake ancestor
Patch failed at 0001 igt/gem_exec_nop: add burst submission to parallel execution test
The copy of the patch that failed is found in: .git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


* Re: [PATCH 1/2] igt/gem_exec_nop: add burst submission to parallel execution test
From: John Harrison @ 2016-08-18 12:01 UTC
  To: intel-gfx

On 03/08/2016 17:05, Dave Gordon wrote:
> On 03/08/16 16:45, Chris Wilson wrote:
>> On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
>>> The parallel execution test in gem_exec_nop chooses a pessimal
>>> distribution of work to multiple engines; specifically, it
>>> round-robins one batch to each engine in turn. As the workloads
>>> are trivial (NOPs), this results in each engine becoming idle
>>> between batches. Hence parallel submission is seen to take LONGER
>>> than the same number of batches executed sequentially.
>>>
>>> If on the other hand we send enough work to each engine to keep
>>> it busy until the next time we add to its queue (i.e. round-robin
>>> some larger number of batches to each engine in turn), then we can
>>> get true parallel execution and should find that it is FASTER than
>>> sequential execution.
>>>
>>> By experiment, burst sizes of between 8 and 256 are sufficient to
>>> keep multiple engines loaded, with the optimum (for this trivial
>>> workload) being around 64. This is expected to be lower (possibly
>>> as low as one) for more realistic (heavier) workloads.
>>
>> Quite funny. The driver submission overhead of A...A vs ABAB... engines
>> is nearly identical, at least as far as the analysis presented here.
>> -Chris
>
> Correct; but because the workloads are so trivial, if we hand out jobs 
> one at a time to each engine, the first will have finished the one 
> batch it's been given before we get round to giving it a second one 
> (even in execlist mode). If there are N engines, submitting a single 
> batch takes S seconds, and the workload takes W seconds to execute, 
> then if W < N*S the engine will be idle between batches. For example, 
> if N is 4, W is 2us, and S is 1us, then the engine will be idle some 
> 50% of the time.
>
> This wouldn't be an issue for more realistic workloads, where W >> S.
> It only looks problematic because of the trivial nature of the work.

Can you post the numbers that you get?

I seem to get massive variability on my BDW. The render ring always 
gives me around 2.9us/batch, but the other rings sometimes give me
figures in the region of 1.2us and sometimes 7-8us.


>
> .Dave.

* Re: [PATCH 1/2] igt/gem_exec_nop: add burst submission to parallel execution test
From: Dave Gordon @ 2016-08-18 15:27 UTC
  To: John Harrison, intel-gfx

On 18/08/16 13:01, John Harrison wrote:
> On 03/08/2016 17:05, Dave Gordon wrote:
>> On 03/08/16 16:45, Chris Wilson wrote:
>>> On Wed, Aug 03, 2016 at 04:36:46PM +0100, Dave Gordon wrote:
>>>> The parallel execution test in gem_exec_nop chooses a pessimal
>>>> distribution of work to multiple engines; specifically, it
>>>> round-robins one batch to each engine in turn. As the workloads
>>>> are trivial (NOPs), this results in each engine becoming idle
>>>> between batches. Hence parallel submission is seen to take LONGER
>>>> than the same number of batches executed sequentially.
>>>>
>>>> If on the other hand we send enough work to each engine to keep
>>>> it busy until the next time we add to its queue (i.e. round-robin
>>>> some larger number of batches to each engine in turn), then we can
>>>> get true parallel execution and should find that it is FASTER than
>>>> sequential execution.
>>>>
>>>> By experiment, burst sizes of between 8 and 256 are sufficient to
>>>> keep multiple engines loaded, with the optimum (for this trivial
>>>> workload) being around 64. This is expected to be lower (possibly
>>>> as low as one) for more realistic (heavier) workloads.
>>>
>>> Quite funny. The driver submission overhead of A...A vs ABAB... engines
>>> is nearly identical, at least as far as the analysis presented here.
>>> -Chris
>>
>> Correct; but because the workloads are so trivial, if we hand out jobs
>> one at a time to each engine, the first will have finished the one
>> batch it's been given before we get round to giving it a second one
>> (even in execlist mode). If there are N engines, submitting a single
>> batch takes S seconds, and the workload takes W seconds to execute,
>> then if W < N*S the engine will be idle between batches. For example,
>> if N is 4, W is 2us, and S is 1us, then the engine will be idle some
>> 50% of the time.
>>
>> This wouldn't be an issue for more realistic workloads, where W >> S.
>> It only looks problematic because of the trivial nature of the work.
>
> Can you post the numbers that you get?
>
> I seem to get massive variability on my BDW. The render ring always
> gives me around 2.9us/batch, but the other rings sometimes give me
> figures in the region of 1.2us and sometimes 7-8us.

skylake# ./intel-gpu-tools/tests/gem_exec_nop --run-subtest basic
IGT-Version: 1.15-gd09ad86 (x86_64) (Linux: 4.8.0-rc1-dsg-10839-g5e5a29c-z-tvrtko-fwname x86_64)
Using GuC submission
render: 594,944 cycles: 3.366us/batch
bsd: 737,280 cycles: 2.715us/batch
blt: 833,536 cycles: 2.400us/batch
vebox: 710,656 cycles: 2.818us/batch
Slowest engine was render, 3.366us/batch
Total for all 4 engines is 11.300us per cycle, average 2.825us/batch
All 4 engines (parallel/64): 5,324,800 cycles, average 1.878us/batch, overlap 90.1%
Subtest basic: SUCCESS (18.013s)

These are the results of running the modified test on SKL with GuC 
submission.
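
As a cross-check, plugging those figures into the overlap formula from
patch 2/2 (a worked example using the output above, not a new
measurement):

        overlap% = 100 * (sum - time) / (sum - max/nengine)
                 = 100 * (11.300 - 1.878) / (11.300 - 3.366/4)
                 = 100 * 9.422 / 10.459
                ~= 90.1%

which agrees with the figure printed by the test.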

If the GPU could execute a trivial batch in less time than it takes the 
CPU to submit one, then CPU/driver/GuC performance would become the 
determining factor -- every batch would be completed before the next one 
was submitted to the GPU even when they're going to the same engine.

If the GPU takes longer to execute a batch than N times the time taken 
for the driver to submit it (where N is the number of engines), then the 
GPU performance would become the limiting factor; the CPU would be able 
to hand out one batch to each engine, and by the time it returned to the 
first, that engine would still not be idle.

But in crossover territory, where the batch takes longer to execute than 
the time to submit it, but less than N times as long, the round-robin 
burst size (number of batches sent to each engine before moving to the 
next) can make a big difference, primarily because the submission 
mechanism gets the opportunity to use dual submission and/or lite 
restore, effectively reducing the number of separate writes to the ELSP 
and hence the s/w overhead per batch.

Note that SKL GuC firmware 6.1 didn't support dual submission or lite 
restore, whereas the next version (8.11) does. Therefore, with the newer 
firmware we don't see the same slowdown when going to 1-at-a-time 
round-robin. I have a different (new) test that shows this more clearly.

.Dave.

* Re: [PATCH 1/2] igt/gem_exec_nop: add burst submission to parallel execution test
From: Dave Gordon @ 2016-08-18 15:36 UTC
  To: John Harrison, intel-gfx

On 18/08/16 16:27, Dave Gordon wrote:

[snip]

> Note that SKL GuC firmware 6.1 didn't support dual submission or lite
> restore, whereas the next version (8.11) does. Therefore, with the newer
> firmware we don't see the same slowdown when going to 1-at-a-time
> round-robin. I have a different (new) test that shows this more clearly.

This is with GuC version 6.1:

skylake# ./intel-gpu-tools/tests/gem_exec_paranop | fgrep -v SUCCESS

Time to exec 8-byte batch:	  3.428µs (ring=render)
Time to exec 8-byte batch:	  2.444µs (ring=bsd)
Time to exec 8-byte batch:	  2.394µs (ring=blt)
Time to exec 8-byte batch:	  2.615µs (ring=vebox)
Time to exec 8-byte batch:	  2.625µs (ring=all, sequential)
Time to exec 8-byte batch:	 12.701µs (ring=all, parallel/1) ***
Time to exec 8-byte batch:	  7.259µs (ring=all, parallel/2)
Time to exec 8-byte batch:	  4.336µs (ring=all, parallel/4)
Time to exec 8-byte batch:	  2.937µs (ring=all, parallel/8)
Time to exec 8-byte batch:	  2.661µs (ring=all, parallel/16)
Time to exec 8-byte batch:	  2.245µs (ring=all, parallel/32)
Time to exec 8-byte batch:	  1.626µs (ring=all, parallel/64)
Time to exec 8-byte batch:	  2.170µs (ring=all, parallel/128)
Time to exec 8-byte batch:	  1.804µs (ring=all, parallel/256)
Time to exec 8-byte batch:	  2.602µs (ring=all, parallel/512)
Time to exec 8-byte batch:	  2.602µs (ring=all, parallel/1024)
Time to exec 8-byte batch:	  2.607µs (ring=all, parallel/2048)

Time to exec 4Kbyte batch:	 14.835µs (ring=render)
Time to exec 4Kbyte batch:	 11.787µs (ring=bsd)
Time to exec 4Kbyte batch:	 11.533µs (ring=blt)
Time to exec 4Kbyte batch:	 11.991µs (ring=vebox)
Time to exec 4Kbyte batch:	 12.444µs (ring=all, sequential)
Time to exec 4Kbyte batch:	 16.211µs (ring=all, parallel/1)
Time to exec 4Kbyte batch:	 13.943µs (ring=all, parallel/2)
Time to exec 4Kbyte batch:	 13.878µs (ring=all, parallel/4)
Time to exec 4Kbyte batch:	 13.841µs (ring=all, parallel/8)
Time to exec 4Kbyte batch:	 14.188µs (ring=all, parallel/16)
Time to exec 4Kbyte batch:	 13.747µs (ring=all, parallel/32)
Time to exec 4Kbyte batch:	 13.734µs (ring=all, parallel/64)
Time to exec 4Kbyte batch:	 13.727µs (ring=all, parallel/128)
Time to exec 4Kbyte batch:	 13.947µs (ring=all, parallel/256)
Time to exec 4Kbyte batch:	 12.230µs (ring=all, parallel/512)
Time to exec 4Kbyte batch:	 12.147µs (ring=all, parallel/1024)
Time to exec 4Kbyte batch:	 12.617µs (ring=all, parallel/2048)

What this shows is that the submission overhead is ~3us, which is 
comparable with the execution time of a trivial (8-byte) batch, but 
insignificant compared with the time to execute the 4Kbyte batch. The 
burst size therefore makes very little difference to the larger batches.

.Dave.

* Re: [PATCH 1/2] igt/gem_exec_nop: add burst submission to parallel execution test
From: Dave Gordon @ 2016-08-18 15:54 UTC
  To: John Harrison, intel-gfx

On 18/08/16 16:36, Dave Gordon wrote:
> On 18/08/16 16:27, Dave Gordon wrote:
>
> [snip]
>
>> Note that SKL GuC firmware 6.1 didn't support dual submission or lite
>> restore, whereas the next version (8.11) does. Therefore, with the newer
>> firmware we don't see the same slowdown when going to 1-at-a-time
>> round-robin. I have a different (new) test that shows this more clearly.
>
> This is with GuC version 6.1:
>
> skylake# ./intel-gpu-tools/tests/gem_exec_paranop | fgrep -v SUCCESS
>
> Time to exec 8-byte batch:      3.428µs (ring=render)
> Time to exec 8-byte batch:      2.444µs (ring=bsd)
> Time to exec 8-byte batch:      2.394µs (ring=blt)
> Time to exec 8-byte batch:      2.615µs (ring=vebox)
> Time to exec 8-byte batch:      2.625µs (ring=all, sequential)
> Time to exec 8-byte batch:     12.701µs (ring=all, parallel/1) ***
> Time to exec 8-byte batch:      7.259µs (ring=all, parallel/2)
> Time to exec 8-byte batch:      4.336µs (ring=all, parallel/4)
> Time to exec 8-byte batch:      2.937µs (ring=all, parallel/8)
> Time to exec 8-byte batch:      2.661µs (ring=all, parallel/16)
> Time to exec 8-byte batch:      2.245µs (ring=all, parallel/32)
> Time to exec 8-byte batch:      1.626µs (ring=all, parallel/64)
> Time to exec 8-byte batch:      2.170µs (ring=all, parallel/128)
> Time to exec 8-byte batch:      1.804µs (ring=all, parallel/256)
> Time to exec 8-byte batch:      2.602µs (ring=all, parallel/512)
> Time to exec 8-byte batch:      2.602µs (ring=all, parallel/1024)
> Time to exec 8-byte batch:      2.607µs (ring=all, parallel/2048)

And for comparison, here are the figures with v8.11:

# ./intel-gpu-tools/tests/gem_exec_paranop | fgrep -v SUCCESS

Time to exec 8-byte batch:	  3.458µs (ring=render)
Time to exec 8-byte batch:	  2.154µs (ring=bsd)
Time to exec 8-byte batch:	  2.156µs (ring=blt)
Time to exec 8-byte batch:	  2.156µs (ring=vebox)
Time to exec 8-byte batch:	  2.388µs (ring=all, sequential)
Time to exec 8-byte batch:	  5.897µs (ring=all, parallel/1)
Time to exec 8-byte batch:	  4.669µs (ring=all, parallel/2)
Time to exec 8-byte batch:	  4.278µs (ring=all, parallel/4)
Time to exec 8-byte batch:	  2.410µs (ring=all, parallel/8)
Time to exec 8-byte batch:	  2.165µs (ring=all, parallel/16)
Time to exec 8-byte batch:	  2.158µs (ring=all, parallel/32)
Time to exec 8-byte batch:	  1.594µs (ring=all, parallel/64)
Time to exec 8-byte batch:	  1.583µs (ring=all, parallel/128)
Time to exec 8-byte batch:	  2.473µs (ring=all, parallel/256)
Time to exec 8-byte batch:	  2.264µs (ring=all, parallel/512)
Time to exec 8-byte batch:	  2.357µs (ring=all, parallel/1024)
Time to exec 8-byte batch:	  2.382µs (ring=all, parallel/2048)

All generally slightly faster, but parallel/1 is approximately twice as 
fast, while parallel/64 is virtually unchanged, as are all the timings 
for large batches.

.Dave.

* Re: [PATCH 1/2] igt/gem_exec_nop: add burst submission to parallel execution test
From: Dave Gordon @ 2016-08-18 15:59 UTC
  To: John Harrison, intel-gfx

On 18/08/16 16:27, Dave Gordon wrote:
> On 18/08/16 13:01, John Harrison wrote:

[snip]

>> Can you post the numbers that you get?
>>
>> I seem to get massive variability on my BDW. The render ring always
>> gives me around 2.9us/batch but the other rings sometimes give me region
>> of 1.2us and sometimes 7-8us.
>
> skylake# ./intel-gpu-tools/tests/gem_exec_nop --run-subtest basic
> IGT-Version: 1.15-gd09ad86 (x86_64) (Linux:
> 4.8.0-rc1-dsg-10839-g5e5a29c-z-tvrtko-fwname x86_64)
> Using GuC submission
> render: 594,944 cycles: 3.366us/batch
> bsd: 737,280 cycles: 2.715us/batch
> blt: 833,536 cycles: 2.400us/batch
> vebox: 710,656 cycles: 2.818us/batch
> Slowest engine was render, 3.366us/batch
> Total for all 4 engines is 11.300us per cycle, average 2.825us/batch
> All 4 engines (parallel/64): 5,324,800 cycles, average 1.878us/batch,
> overlap 90.1%
> Subtest basic: SUCCESS (18.013s)

That was GuC f/w 6.1, here's the results from 8.11:

skylake# sudo ./intel-gpu-tools/tests/gem_exec_nop --run-subtest basic
IGT-Version: 1.15-gd09ad86 (x86_64) (Linux: 4.8.0-rc2-dsg-11313-g7430e5f-dsg-work-101 x86_64)
Using GuC submission
render: 585,728 cycles: 3.418us/batch
bsd: 930,816 cycles: 2.151us/batch
blt: 930,816 cycles: 2.150us/batch
vebox: 930,816 cycles: 2.150us/batch
Slowest engine was render, 3.418us/batch
Total for all 4 engines is 9.869us per cycle, average 2.467us/batch
All 4 engines (parallel/64): 5,668,864 cycles, average 1.765us/batch, overlap 89.9%
Subtest basic: SUCCESS (18.016s)

... showing minor improvements generally, especially on the non-render engines.

.Dave.

* Re: [PATCH 1/2] igt/gem_exec_nop: add burst submission to parallel execution test
From: John Harrison @ 2016-08-22 14:28 UTC
  To: intel-gfx

On 03/08/2016 16:36, Dave Gordon wrote:
> The parallel execution test in gem_exec_nop chooses a pessimal
> distribution of work to multiple engines; specifically, it
> round-robins one batch to each engine in turn. As the workloads
> are trivial (NOPs), this results in each engine becoming idle
> between batches. Hence parallel submission is seen to take LONGER
> than the same number of batches executed sequentially.
>
> If on the other hand we send enough work to each engine to keep
> it busy until the next time we add to its queue (i.e. round-robin
> some larger number of batches to each engine in turn), then we can
> get true parallel execution and should find that it is FASTER than
> sequential execution.
>
> By experiment, burst sizes of between 8 and 256 are sufficient to
> keep multiple engines loaded, with the optimum (for this trivial
> workload) being around 64. This is expected to be lower (possibly
> as low as one) for more realistic (heavier) workloads.
>
> Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
> ---
>   tests/gem_exec_nop.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/tests/gem_exec_nop.c b/tests/gem_exec_nop.c
> index 9b89260..c2bd472 100644
> --- a/tests/gem_exec_nop.c
> +++ b/tests/gem_exec_nop.c
> @@ -166,14 +166,17 @@ static void all(int fd, uint32_t handle, int timeout)
>   	gem_sync(fd, handle);
>   	intel_detect_and_clear_missed_interrupts(fd);
>   
> +#define	BURST	64
> +
>   	count = 0;
>   	clock_gettime(CLOCK_MONOTONIC, &start);
>   	do {
> -		for (int loop = 0; loop < 1024; loop++) {
> +		for (int loop = 0; loop < 1024/BURST; loop++) {
>   			for (int n = 0; n < nengine; n++) {
>   				execbuf.flags &= ~ENGINE_FLAGS;
>   				execbuf.flags |= engines[n];
> -				gem_execbuf(fd, &execbuf);
> +				for (int b = 0; b < BURST; ++b)
> +					gem_execbuf(fd, &execbuf);
>   			}
>   		}
>   		count += nengine * 1024;

It would be nice to have the burst size configurable, but either way...
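
For illustration, a minimal sketch of one way to do that without new
option plumbing; the environment variable name here is made up, not
something IGT defines:

        #include <stdlib.h>

        /* Read the burst size from the environment, defaulting to the
         * current value of 64. Reject values that don't divide the
         * 1024-batch cycle, so the "count += nengine * 1024"
         * accounting stays exact. */
        static int burst_size(void)
        {
                const char *env = getenv("IGT_NOP_BURST");
                int burst = env ? atoi(env) : 64;

                if (burst < 1 || burst > 1024 || 1024 % burst)
                        burst = 64;
                return burst;
        }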

Reviewed-by: John Harrison <john.c.harrison@intel.com>


* Re: [PATCH 2/2] igt/gem_exec_nop: clarify & extend output from parallel execution test
From: John Harrison @ 2016-08-22 14:39 UTC
  To: intel-gfx


On 03/08/2016 16:36, Dave Gordon wrote:
> To make sense of the output of the parallel execution test (preferably
> without reading the source!), we need to see the various measurements
> that it makes, specifically: time/batch on each engine separately, total
> time across all engines sequentially, and the time/batch when the work
> is distributed over all engines in parallel.
>
> Since we know the per-batch time on the slowest engine (which will
> determine the minimum possible execution time of any equal-split
> parallel test), we can also calculate a new figure representing the
> degree to which work on the faster engines is overlapped with that on
> the slowest engine, and therefore does not contribute to the total time.
> Here we choose to present it as a percentage, with parallel-time==serial
> time giving 0% overlap, up to parallel-time==slowest-engine-
> time/n_engines being 100%. Note that negative values are possible;
> values greater than 100% may also be possible, although less likely.
>
> Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
> ---
>   tests/gem_exec_nop.c | 15 ++++++++++-----
>   1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/tests/gem_exec_nop.c b/tests/gem_exec_nop.c
> index c2bd472..05aa383 100644
> --- a/tests/gem_exec_nop.c
> +++ b/tests/gem_exec_nop.c
> @@ -137,7 +137,9 @@ static void all(int fd, uint32_t handle, int timeout)
>   		if (ignore_engine(fd, engine))
>   			continue;
>   
> -		time = nop_on_ring(fd, handle, engine, 1, &count) / count;
> +		time = nop_on_ring(fd, handle, engine, 2, &count) / count;
> +		igt_info("%s: %'lu cycles: %.3fus/batch\n",
> +			 e__->name, count, time*1e6);
>   		if (time > max) {
>   			name = e__->name;
>   			max = time;
> @@ -148,8 +150,9 @@ static void all(int fd, uint32_t handle, int timeout)
>   		engines[nengine++] = engine;
>   	}
>   	igt_require(nengine);
> -	igt_info("Maximum execution latency on %s, %.3fus, total %.3fus per cycle\n",
> -		 name, max*1e6, sum*1e6);
> +	igt_info("Slowest engine was %s, %.3fus/batch\n", name, max*1e6);
> +	igt_info("Total for all %d engines is %.3fus per cycle, average %.3fus/batch\n",
> +		 nengine, sum*1e6, sum*1e6/nengine);
>   
>   	memset(&obj, 0, sizeof(obj));
>   	obj.handle = handle;
> @@ -187,8 +190,10 @@ static void all(int fd, uint32_t handle, int timeout)
>   	igt_assert_eq(intel_detect_and_clear_missed_interrupts(fd), 0);
>   
>   	time = elapsed(&start, &now) / count;
> -	igt_info("All (%d engines): %'lu cycles, average %.3fus per cycle\n",
> -		 nengine, count, 1e6*time);
> +	igt_info("All %d engines (parallel/%d): %'lu cycles, "
> +		 "average %.3fus/batch, overlap %.1f%\n",
> +		 nengine, BURST, count,
> +		 1e6*time, 100*(sum-time)/(sum-(max/nengine)));
>   
>   	/* The rate limiting step is how fast the slowest engine can
>   	 * its queue of requests, if we wait upon a full ring all dispatch

I'm not entirely convinced about the overlap calculation. The other info 
is definitely useful though.

Reviewed-by: John Harrison <john.c.harrison@intel.com>



* Re: [PATCH 2/2] igt/gem_exec_nop: clarify & extend output from parallel execution test
From: John Harrison @ 2016-08-22 14:42 UTC
  To: intel-gfx


On 22/08/2016 15:39, John Harrison wrote:
> On 03/08/2016 16:36, Dave Gordon wrote:
>> To make sense of the output of the parallel execution test (preferably
>> without reading the source!), we need to see the various measurements
>> that it makes, specifically: time/batch on each engine separately, total
>> time across all engines sequentially, and the time/batch when the work
>> is distributed over all engines in parallel.
>>
>> Since we know the per-batch time on the slowest engine (which will
>> determine the minimum possible execution time of any equal-split
>> parallel test), we can also calculate a new figure representing the
>> degree to which work on the faster engines is overlapped with that on
>> the slowest engine, and therefore does not contribute to the total time.
>> Here we choose to present it as a percentage, with parallel-time==serial
>> time giving 0% overlap, up to parallel-time==slowest-engine-
>> time/n_engines being 100%. Note that negative values are possible;
>> values greater than 100% may also be possible, although less likely.
>>
>> Signed-off-by: Dave Gordon <david.s.gordon@intel.com>
>> ---
>>   tests/gem_exec_nop.c | 15 ++++++++++-----
>>   1 file changed, 10 insertions(+), 5 deletions(-)
>>
>> diff --git a/tests/gem_exec_nop.c b/tests/gem_exec_nop.c
>> index c2bd472..05aa383 100644
>> --- a/tests/gem_exec_nop.c
>> +++ b/tests/gem_exec_nop.c
>> @@ -137,7 +137,9 @@ static void all(int fd, uint32_t handle, int timeout)
>>   		if (ignore_engine(fd, engine))
>>   			continue;
>>   
>> -		time = nop_on_ring(fd, handle, engine, 1, &count) / count;
>> +		time = nop_on_ring(fd, handle, engine, 2, &count) / count;
>> +		igt_info("%s: %'lu cycles: %.3fus/batch\n",
>> +			 e__->name, count, time*1e6);
>>   		if (time > max) {
>>   			name = e__->name;
>>   			max = time;
>> @@ -148,8 +150,9 @@ static void all(int fd, uint32_t handle, int timeout)
>>   		engines[nengine++] = engine;
>>   	}
>>   	igt_require(nengine);
>> -	igt_info("Maximum execution latency on %s, %.3fus, total %.3fus per cycle\n",
>> -		 name, max*1e6, sum*1e6);
>> +	igt_info("Slowest engine was %s, %.3fus/batch\n", name, max*1e6);
>> +	igt_info("Total for all %d engines is %.3fus per cycle, average %.3fus/batch\n",
>> +		 nengine, sum*1e6, sum*1e6/nengine);
>>   
>>   	memset(&obj, 0, sizeof(obj));
>>   	obj.handle = handle;
>> @@ -187,8 +190,10 @@ static void all(int fd, uint32_t handle, int timeout)
>>   	igt_assert_eq(intel_detect_and_clear_missed_interrupts(fd), 0);
>>   
>>   	time = elapsed(&start, &now) / count;
>> -	igt_info("All (%d engines): %'lu cycles, average %.3fus per cycle\n",
>> -		 nengine, count, 1e6*time);
>> +	igt_info("All %d engines (parallel/%d): %'lu cycles, "
>> +		 "average %.3fus/batch, overlap %.1f%\n",
PS: As mentioned in person, the above format string should end with 
'%%\n' not '%\n'. But with that fixed, Reviewed-by as below.

>> +		 nengine, BURST, count,
>> +		 1e6*time, 100*(sum-time)/(sum-(max/nengine)));
>>   
>>   	/* The rate limiting step is how fast the slowest engine can
>>   	 * its queue of requests, if we wait upon a full ring all dispatch
>
> I'm not entirely convinced about the overlap calculation. The other 
> info is definitely useful though.
>
> Reviewed-by: John Harrison <john.c.harrison@intel.com>
>


