* [Intel-gfx] [PATCH i-g-t v2] tests/perf_pmu: Avoid RT thread for accuracy test
From: Tvrtko Ursulin @ 2018-04-03 12:38 UTC (permalink / raw)
To: igt-dev; +Cc: Intel-gfx
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Realtime scheduling interferes with execlists submission (tasklet) so try
to simplify the PWM loop in a few ways:
* Drop RT.
* Longer batches for smaller systematic error.
* More truthful test duration calculation.
* Fewer clock queries.
* No self-adjust - instead just report the achieved cycle and let the
parent check against it.
* Report absolute cycle error.
v2:
* Bring back self-adjust. (Chris Wilson)
(But slightly fixed version with no overflow.)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
tests/perf_pmu.c | 97 +++++++++++++++++++++++++-------------------------------
1 file changed, 43 insertions(+), 54 deletions(-)
diff --git a/tests/perf_pmu.c b/tests/perf_pmu.c
index f27b7ec7d2c2..0cfacd4a8fbe 100644
--- a/tests/perf_pmu.c
+++ b/tests/perf_pmu.c
@@ -1504,12 +1504,6 @@ test_enable_race(int gem_fd, const struct intel_execution_engine2 *e)
gem_quiescent_gpu(gem_fd);
}
-static double __error(double val, double ref)
-{
- igt_assert(ref > 1e-5 /* smallval */);
- return (100.0 * val / ref) - 100.0;
-}
-
static void __rearm_spin_batch(igt_spin_t *spin)
{
const uint32_t mi_arb_chk = 0x5 << 23;
@@ -1532,13 +1526,12 @@ static void
accuracy(int gem_fd, const struct intel_execution_engine2 *e,
unsigned long target_busy_pct)
{
- const unsigned int min_test_loops = 7;
- const unsigned long min_test_us = 1e6;
- unsigned long busy_us = 2500;
+ unsigned long busy_us = 10000 - 100 * (1 + abs(50 - target_busy_pct));
unsigned long idle_us = 100 * (busy_us - target_busy_pct *
busy_us / 100) / target_busy_pct;
- unsigned long pwm_calibration_us;
- unsigned long test_us;
+ const unsigned long min_test_us = 1e6;
+ const unsigned long pwm_calibration_us = min_test_us;
+ const unsigned long test_us = min_test_us;
double busy_r, expected;
uint64_t val[2];
uint64_t ts[2];
@@ -1553,13 +1546,6 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
idle_us *= 2;
}
- pwm_calibration_us = min_test_loops * (busy_us + idle_us);
- while (pwm_calibration_us < min_test_us)
- pwm_calibration_us += busy_us + idle_us;
- test_us = min_test_loops * (idle_us + busy_us);
- while (test_us < min_test_us)
- test_us += busy_us + idle_us;
-
igt_info("calibration=%lums, test=%lums; ratio=%.2f%% (%luus/%luus)\n",
pwm_calibration_us / 1000, test_us / 1000,
(double)busy_us / (busy_us + idle_us) * 100.0,
@@ -1572,20 +1558,11 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
/* Emit PWM pattern on the engine from a child. */
igt_fork(child, 1) {
- struct sched_param rt = { .sched_priority = 99 };
const unsigned long timeout[] = {
pwm_calibration_us * 1000, test_us * 1000
};
- uint64_t total_busy_ns = 0, total_idle_ns = 0;
+ uint64_t total_busy_ns = 0, total_ns = 0;
igt_spin_t *spin;
- int ret;
-
- /* We need the best sleep accuracy we can get. */
- ret = sched_setscheduler(0,
- SCHED_FIFO | SCHED_RESET_ON_FORK,
- &rt);
- if (ret)
- igt_warn("Failed to set scheduling policy!\n");
/* Allocate our spin batch and idle it. */
spin = igt_spin_batch_new(gem_fd, 0, e2ring(gem_fd, e), 0);
@@ -1594,39 +1571,51 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
/* 1st pass is calibration, second pass is the test. */
for (int pass = 0; pass < ARRAY_SIZE(timeout); pass++) {
- uint64_t busy_ns = -total_busy_ns;
- uint64_t idle_ns = -total_idle_ns;
- struct timespec test_start = { };
+ unsigned int target_idle_us = idle_us;
+ uint64_t busy_ns = 0, idle_ns = 0;
+ struct timespec start = { };
+ unsigned long pass_ns = 0;
+
+ igt_nsec_elapsed(&start);
- igt_nsec_elapsed(&test_start);
do {
- unsigned int target_idle_us, t_busy;
+ unsigned long loop_ns, loop_busy;
+ struct timespec _ts = { };
+ double err;
+
+ /* PWM idle sleep. */
+ _ts.tv_nsec = target_idle_us * 1000;
+ nanosleep(&_ts, NULL);
/* Restart the spinbatch. */
__rearm_spin_batch(spin);
__submit_spin_batch(gem_fd, spin, e, 0);
- /*
- * Note that the submission may be delayed to a
- * tasklet (ksoftirqd) which cannot run until we
- * sleep as we hog the cpu (we are RT).
- */
-
- t_busy = measured_usleep(busy_us);
+ /* PWM busy sleep. */
+ loop_busy = igt_nsec_elapsed(&start);
+ _ts.tv_nsec = busy_us * 1000;
+ nanosleep(&_ts, NULL);
igt_spin_batch_end(spin);
- gem_sync(gem_fd, spin->handle);
-
- total_busy_ns += t_busy;
-
- target_idle_us =
- (100 * total_busy_ns / target_busy_pct - (total_busy_ns + total_idle_ns)) / 1000;
- total_idle_ns += measured_usleep(target_idle_us);
- } while (igt_nsec_elapsed(&test_start) < timeout[pass]);
-
- busy_ns += total_busy_ns;
- idle_ns += total_idle_ns;
- expected = (double)busy_ns / (busy_ns + idle_ns);
+ /* Time accounting. */
+ loop_ns = igt_nsec_elapsed(&start);
+ loop_busy = loop_ns - loop_busy;
+ loop_ns -= pass_ns;
+
+ busy_ns += loop_busy;
+ total_busy_ns += loop_busy;
+ idle_ns += loop_ns - loop_busy;
+ pass_ns += loop_ns;
+ total_ns += loop_ns;
+
+ /* Re-calibrate. */
+ err = (double)total_busy_ns / total_ns -
+ (double)target_busy_pct / 100.0;
+ target_idle_us = (double)target_idle_us *
+ (1.0 + err);
+ } while (pass_ns < timeout[pass]);
+
+ expected = (double)busy_ns / pass_ns;
igt_info("%u: busy %"PRIu64"us, idle %"PRIu64"us: %.2f%% (target: %lu%%)\n",
pass, busy_ns / 1000, idle_ns / 1000,
100 * expected, target_busy_pct);
@@ -1655,8 +1644,8 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
busy_r = (double)(val[1] - val[0]) / (ts[1] - ts[0]);
- igt_info("error=%.2f%% (%.2f%% vs %.2f%%)\n",
- __error(busy_r, expected), 100 * busy_r, 100 * expected);
+ igt_info("error=%.2f (%.2f%% vs %.2f%%)\n",
+ (busy_r - expected) * 100, 100 * busy_r, 100 * expected);
assert_within(100.0 * busy_r, 100.0 * expected, 2);
}
--
2.14.1
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
* Re: [PATCH i-g-t v2] tests/perf_pmu: Avoid RT thread for accuracy test
From: Chris Wilson @ 2018-04-03 13:10 UTC (permalink / raw)
To: Tvrtko Ursulin, igt-dev; +Cc: Intel-gfx
Quoting Tvrtko Ursulin (2018-04-03 13:38:25)
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Realtime scheduling interferes with execlists submission (tasklet) so try
> to simplify the PWM loop in a few ways:
>
> * Drop RT.
> * Longer batches for smaller systematic error.
> * More truthful test duration calculation.
> * Less clock queries.
> * No self-adjust - instead just report the achieved cycle and let the
> parent check against it.
> * Report absolute cycle error.
>
> v2:
> * Bring back self-adjust. (Chris Wilson)
> (But slightly fixed version with no overflow.)
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
> tests/perf_pmu.c | 97 +++++++++++++++++++++++++-------------------------------
> 1 file changed, 43 insertions(+), 54 deletions(-)
>
> diff --git a/tests/perf_pmu.c b/tests/perf_pmu.c
> index f27b7ec7d2c2..0cfacd4a8fbe 100644
> --- a/tests/perf_pmu.c
> +++ b/tests/perf_pmu.c
> @@ -1504,12 +1504,6 @@ test_enable_race(int gem_fd, const struct intel_execution_engine2 *e)
> gem_quiescent_gpu(gem_fd);
> }
>
> -static double __error(double val, double ref)
> -{
> - igt_assert(ref > 1e-5 /* smallval */);
> - return (100.0 * val / ref) - 100.0;
> -}
> -
> static void __rearm_spin_batch(igt_spin_t *spin)
> {
> const uint32_t mi_arb_chk = 0x5 << 23;
> @@ -1532,13 +1526,12 @@ static void
> accuracy(int gem_fd, const struct intel_execution_engine2 *e,
> unsigned long target_busy_pct)
> {
> - const unsigned int min_test_loops = 7;
> - const unsigned long min_test_us = 1e6;
> - unsigned long busy_us = 2500;
> + unsigned long busy_us = 10000 - 100 * (1 + abs(50 - target_busy_pct));
> unsigned long idle_us = 100 * (busy_us - target_busy_pct *
> busy_us / 100) / target_busy_pct;
> - unsigned long pwm_calibration_us;
> - unsigned long test_us;
> + const unsigned long min_test_us = 1e6;
> + const unsigned long pwm_calibration_us = min_test_us;
> + const unsigned long test_us = min_test_us;
> double busy_r, expected;
> uint64_t val[2];
> uint64_t ts[2];
> @@ -1553,13 +1546,6 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
> idle_us *= 2;
> }
>
> - pwm_calibration_us = min_test_loops * (busy_us + idle_us);
> - while (pwm_calibration_us < min_test_us)
> - pwm_calibration_us += busy_us + idle_us;
> - test_us = min_test_loops * (idle_us + busy_us);
> - while (test_us < min_test_us)
> - test_us += busy_us + idle_us;
> -
> igt_info("calibration=%lums, test=%lums; ratio=%.2f%% (%luus/%luus)\n",
> pwm_calibration_us / 1000, test_us / 1000,
> (double)busy_us / (busy_us + idle_us) * 100.0,
> @@ -1572,20 +1558,11 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
>
> /* Emit PWM pattern on the engine from a child. */
> igt_fork(child, 1) {
> - struct sched_param rt = { .sched_priority = 99 };
> const unsigned long timeout[] = {
> pwm_calibration_us * 1000, test_us * 1000
> };
> - uint64_t total_busy_ns = 0, total_idle_ns = 0;
> + uint64_t total_busy_ns = 0, total_ns = 0;
> igt_spin_t *spin;
> - int ret;
> -
> - /* We need the best sleep accuracy we can get. */
> - ret = sched_setscheduler(0,
> - SCHED_FIFO | SCHED_RESET_ON_FORK,
> - &rt);
> - if (ret)
> - igt_warn("Failed to set scheduling policy!\n");
>
> /* Allocate our spin batch and idle it. */
> spin = igt_spin_batch_new(gem_fd, 0, e2ring(gem_fd, e), 0);
> @@ -1594,39 +1571,51 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
>
> /* 1st pass is calibration, second pass is the test. */
> for (int pass = 0; pass < ARRAY_SIZE(timeout); pass++) {
> - uint64_t busy_ns = -total_busy_ns;
> - uint64_t idle_ns = -total_idle_ns;
> - struct timespec test_start = { };
> + unsigned int target_idle_us = idle_us;
> + uint64_t busy_ns = 0, idle_ns = 0;
> + struct timespec start = { };
> + unsigned long pass_ns = 0;
> +
> + igt_nsec_elapsed(&start);
>
> - igt_nsec_elapsed(&test_start);
> do {
> - unsigned int target_idle_us, t_busy;
> + unsigned long loop_ns, loop_busy;
> + struct timespec _ts = { };
> + double err;
> +
> + /* PWM idle sleep. */
> + _ts.tv_nsec = target_idle_us * 1000;
> + nanosleep(&_ts, NULL);
>
> /* Restart the spinbatch. */
> __rearm_spin_batch(spin);
> __submit_spin_batch(gem_fd, spin, e, 0);
>
> - /*
> - * Note that the submission may be delayed to a
> - * tasklet (ksoftirqd) which cannot run until we
> - * sleep as we hog the cpu (we are RT).
> - */
> -
> - t_busy = measured_usleep(busy_us);
> + /* PWM busy sleep. */
> + loop_busy = igt_nsec_elapsed(&start);
> + _ts.tv_nsec = busy_us * 1000;
> + nanosleep(&_ts, NULL);
> igt_spin_batch_end(spin);
> - gem_sync(gem_fd, spin->handle);
> -
> - total_busy_ns += t_busy;
> -
> - target_idle_us =
> - (100 * total_busy_ns / target_busy_pct - (total_busy_ns + total_idle_ns)) / 1000;
> - total_idle_ns += measured_usleep(target_idle_us);
> - } while (igt_nsec_elapsed(&test_start) < timeout[pass]);
> -
> - busy_ns += total_busy_ns;
> - idle_ns += total_idle_ns;
>
> - expected = (double)busy_ns / (busy_ns + idle_ns);
> + /* Time accounting. */
> + loop_ns = igt_nsec_elapsed(&start);
> + loop_busy = loop_ns - loop_busy;
> + loop_ns -= pass_ns;
> +
> + busy_ns += loop_busy;
> + total_busy_ns += loop_busy;
> + idle_ns += loop_ns - loop_busy;
> + pass_ns += loop_ns;
> + total_ns += loop_ns;
> +
> + /* Re-calibrate. */
> + err = (double)total_busy_ns / total_ns -
> + (double)target_busy_pct / 100.0;
> + target_idle_us = (double)target_idle_us *
> + (1.0 + err);
Previously the question we answered was: how long should I sleep for the
busy:idle ratio to hit the target?
expected_total_ns = 100.0 * total_busy_ns / target_busy_pct;
target_idle_us = (expected_total_ns - current_total_ns) / 1000;
unsigned long loop_ns, loop_busy;
struct timespec _ts = { };
double err;
/* PWM idle sleep. */
_ts.tv_nsec = target_idle_us * 1000;
nanosleep(&_ts, NULL);
Assuming no >1s sleeps.
(Ok, so the sleep after recalc is still here.)
/* Restart the spinbatch. */
__rearm_spin_batch(spin);
__submit_spin_batch(gem_fd, spin, e, 0);
/* PWM busy sleep. */
loop_busy = igt_nsec_elapsed(&start);
_ts.tv_nsec = busy_us * 1000;
nanosleep(&_ts, NULL);
igt_spin_batch_end(spin);
/* Time accounting. */
loop_ns = igt_nsec_elapsed(&start);
loop_busy = loop_ns - loop_busy;
loop_ns -= pass_ns;
So pass_ns is time from start of calibration, loop_ns is time for this
loop.
busy_ns += loop_busy;
total_busy_ns += loop_busy;
busy_ns will be calibration pass, total all passes?
idle_ns += loop_ns - loop_busy;
And idle is the residual between the time up to this point, and what has
been busy.
pass_ns += loop_ns;
total_ns += loop_ns;
/* Re-calibrate. */
err = (double)total_busy_ns / total_ns -
(double)target_busy_pct / 100.0;
Hmm, I thought you didn't like the run on calculations, and wanted to
reset between passes? (Have I got total_busy_ns and busy_ns confused?)
target_idle_us = (double)target_idle_us * (1.0 + err);
Ok, I'm tired, but... So, if busy is 10% larger than expected, sleep 10%
longer to try and compensate, would be the gist.
And this is because you always sleep and spin together and so cannot
just sleep to compensate for the earlier inaccuracy. Which means we
never truly try to correct the error in the same pass, but apply a
correction factor for the next.
To me it seems like the closed system with each loop being "spin then
adjusted sleep" will autocorrect and more likely to finish correct (as
we are less reliant on the next loop for the accuracy). It's pretty much
immaterial, as we expect the pmu to match the measurements (and not our
expectations), but I find the one pass does all much simpler to follow.
-Chris
* Re: [PATCH i-g-t v2] tests/perf_pmu: Avoid RT thread for accuracy test
From: Tvrtko Ursulin @ 2018-04-03 16:09 UTC (permalink / raw)
To: Chris Wilson, Tvrtko Ursulin, igt-dev; +Cc: Intel-gfx
On 03/04/2018 14:10, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-04-03 13:38:25)
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Realtime scheduling interferes with execlists submission (tasklet) so try
>> to simplify the PWM loop in a few ways:
>>
>> * Drop RT.
>> * Longer batches for smaller systematic error.
>> * More truthful test duration calculation.
>> * Less clock queries.
>> * No self-adjust - instead just report the achieved cycle and let the
>> parent check against it.
>> * Report absolute cycle error.
>>
>> v2:
>> * Bring back self-adjust. (Chris Wilson)
>> (But slightly fixed version with no overflow.)
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> ---
>> tests/perf_pmu.c | 97 +++++++++++++++++++++++++-------------------------------
>> 1 file changed, 43 insertions(+), 54 deletions(-)
>>
>> diff --git a/tests/perf_pmu.c b/tests/perf_pmu.c
>> index f27b7ec7d2c2..0cfacd4a8fbe 100644
>> --- a/tests/perf_pmu.c
>> +++ b/tests/perf_pmu.c
>> @@ -1504,12 +1504,6 @@ test_enable_race(int gem_fd, const struct intel_execution_engine2 *e)
>> gem_quiescent_gpu(gem_fd);
>> }
>>
>> -static double __error(double val, double ref)
>> -{
>> - igt_assert(ref > 1e-5 /* smallval */);
>> - return (100.0 * val / ref) - 100.0;
>> -}
>> -
>> static void __rearm_spin_batch(igt_spin_t *spin)
>> {
>> const uint32_t mi_arb_chk = 0x5 << 23;
>> @@ -1532,13 +1526,12 @@ static void
>> accuracy(int gem_fd, const struct intel_execution_engine2 *e,
>> unsigned long target_busy_pct)
>> {
>> - const unsigned int min_test_loops = 7;
>> - const unsigned long min_test_us = 1e6;
>> - unsigned long busy_us = 2500;
>> + unsigned long busy_us = 10000 - 100 * (1 + abs(50 - target_busy_pct));
>> unsigned long idle_us = 100 * (busy_us - target_busy_pct *
>> busy_us / 100) / target_busy_pct;
>> - unsigned long pwm_calibration_us;
>> - unsigned long test_us;
>> + const unsigned long min_test_us = 1e6;
>> + const unsigned long pwm_calibration_us = min_test_us;
>> + const unsigned long test_us = min_test_us;
>> double busy_r, expected;
>> uint64_t val[2];
>> uint64_t ts[2];
>> @@ -1553,13 +1546,6 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
>> idle_us *= 2;
>> }
>>
>> - pwm_calibration_us = min_test_loops * (busy_us + idle_us);
>> - while (pwm_calibration_us < min_test_us)
>> - pwm_calibration_us += busy_us + idle_us;
>> - test_us = min_test_loops * (idle_us + busy_us);
>> - while (test_us < min_test_us)
>> - test_us += busy_us + idle_us;
>> -
>> igt_info("calibration=%lums, test=%lums; ratio=%.2f%% (%luus/%luus)\n",
>> pwm_calibration_us / 1000, test_us / 1000,
>> (double)busy_us / (busy_us + idle_us) * 100.0,
>> @@ -1572,20 +1558,11 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
>>
>> /* Emit PWM pattern on the engine from a child. */
>> igt_fork(child, 1) {
>> - struct sched_param rt = { .sched_priority = 99 };
>> const unsigned long timeout[] = {
>> pwm_calibration_us * 1000, test_us * 1000
>> };
>> - uint64_t total_busy_ns = 0, total_idle_ns = 0;
>> + uint64_t total_busy_ns = 0, total_ns = 0;
>> igt_spin_t *spin;
>> - int ret;
>> -
>> - /* We need the best sleep accuracy we can get. */
>> - ret = sched_setscheduler(0,
>> - SCHED_FIFO | SCHED_RESET_ON_FORK,
>> - &rt);
>> - if (ret)
>> - igt_warn("Failed to set scheduling policy!\n");
>>
>> /* Allocate our spin batch and idle it. */
>> spin = igt_spin_batch_new(gem_fd, 0, e2ring(gem_fd, e), 0);
>> @@ -1594,39 +1571,51 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
>>
>> /* 1st pass is calibration, second pass is the test. */
>> for (int pass = 0; pass < ARRAY_SIZE(timeout); pass++) {
>> - uint64_t busy_ns = -total_busy_ns;
>> - uint64_t idle_ns = -total_idle_ns;
>> - struct timespec test_start = { };
>> + unsigned int target_idle_us = idle_us;
>> + uint64_t busy_ns = 0, idle_ns = 0;
>> + struct timespec start = { };
>> + unsigned long pass_ns = 0;
>> +
>> + igt_nsec_elapsed(&start);
>>
>> - igt_nsec_elapsed(&test_start);
>> do {
>> - unsigned int target_idle_us, t_busy;
>> + unsigned long loop_ns, loop_busy;
>> + struct timespec _ts = { };
>> + double err;
>> +
>> + /* PWM idle sleep. */
>> + _ts.tv_nsec = target_idle_us * 1000;
>> + nanosleep(&_ts, NULL);
>>
>> /* Restart the spinbatch. */
>> __rearm_spin_batch(spin);
>> __submit_spin_batch(gem_fd, spin, e, 0);
>>
>> - /*
>> - * Note that the submission may be delayed to a
>> - * tasklet (ksoftirqd) which cannot run until we
>> - * sleep as we hog the cpu (we are RT).
>> - */
>> -
>> - t_busy = measured_usleep(busy_us);
>> + /* PWM busy sleep. */
>> + loop_busy = igt_nsec_elapsed(&start);
>> + _ts.tv_nsec = busy_us * 1000;
>> + nanosleep(&_ts, NULL);
>> igt_spin_batch_end(spin);
>> - gem_sync(gem_fd, spin->handle);
>> -
>> - total_busy_ns += t_busy;
>> -
>> - target_idle_us =
>> - (100 * total_busy_ns / target_busy_pct - (total_busy_ns + total_idle_ns)) / 1000;
>> - total_idle_ns += measured_usleep(target_idle_us);
>> - } while (igt_nsec_elapsed(&test_start) < timeout[pass]);
>> -
>> - busy_ns += total_busy_ns;
>> - idle_ns += total_idle_ns;
>>
>> - expected = (double)busy_ns / (busy_ns + idle_ns);
>> + /* Time accounting. */
>> + loop_ns = igt_nsec_elapsed(&start);
>> + loop_busy = loop_ns - loop_busy;
>> + loop_ns -= pass_ns;
>> +
>> + busy_ns += loop_busy;
>> + total_busy_ns += loop_busy;
>> + idle_ns += loop_ns - loop_busy;
>> + pass_ns += loop_ns;
>> + total_ns += loop_ns;
>> +
>> + /* Re-calibrate. */
>> + err = (double)total_busy_ns / total_ns -
>> + (double)target_busy_pct / 100.0;
>> + target_idle_us = (double)target_idle_us *
>> + (1.0 + err);
>
> Previously the question we answered was how long should I sleep for the
> busy:idle ratio to hit the target.
>
> expected_total_ns = 100.0 * total_busy_ns / target_busy_pct;
> target_idle_us = (expected_total_ns - current_total_ns) / 1000;
Yes, and the overflow (or underflow, depending how you look at it) was
here. Usually in the first loop iteration for me, when expected_total_ns
is smaller than current_total_ns.
But mostly I think this should have a minor effect, unless some systems
can hit it more often.
>
> unsigned long loop_ns, loop_busy;
> struct timespec _ts = { };
> double err;
>
> /* PWM idle sleep. */
> _ts.tv_nsec = target_idle_us * 1000;
> nanosleep(&_ts, NULL);
>
> Assuming no >1s sleeps.
> (Ok, so the sleep after recalc is still here.)
>
> /* Restart the spinbatch. */
> __rearm_spin_batch(spin);
> __submit_spin_batch(gem_fd, spin, e, 0);
>
> /* PWM busy sleep. */
> loop_busy = igt_nsec_elapsed(&start);
> _ts.tv_nsec = busy_us * 1000;
> nanosleep(&_ts, NULL);
> igt_spin_batch_end(spin);
>
> /* Time accounting. */
> loop_ns = igt_nsec_elapsed(&start);
> loop_busy = loop_ns - loop_busy;
> loop_ns -= pass_ns;
>
> So pass_ns is time from start of calibration, loop_ns is time for this
> loop.
>
> busy_ns += loop_busy;
> total_busy_ns += loop_busy;
>
> busy_ns will be calibration pass, total all passes?
busy_ns/idle_ns are the current pass. There is also total_busy/idle_ns
at one level up, which are the totals.
>
> idle_ns += loop_ns - loop_busy;
>
> And idle is the residual between the time up to this point, and what has
> been busy.
Yes, I wanted to simplify and have reduced it to two clock queries per
loop only. It maybe isn't the easiest to follow. :I
> pass_ns += loop_ns;
> total_ns += loop_ns;
>
> /* Re-calibrate. */
> err = (double)total_busy_ns / total_ns -
> (double)target_busy_pct / 100.0;
>
> Hmm, I thought you didn't like the run on calculations, and wanted to
> reset between passes? (Have I got total_busy_ns and busy_ns confused?)
No, I did not like the aggregate in igt_info only. For calibration total
times are better I think.
With an exception that the "expected" ratio, as reported to the parent,
is based on the 2nd pass only. That way the error of the first pass (the
initial, and hopefully all, calibration that is needed) is not included
in the value we assert against, since that is not the interval over
which the parent samples PMU busyness either.
> target_idle_us = (double)target_idle_us * (1.0 + err);
>
> Ok, I'm tired, but... So, if busy is 10% larger than expected, sleep 10%
> longer to try and compensate, would be the gist.
Correct.
> And this is because you always sleep and spin together and so cannot
> just sleep to compensate for the earlier inaccuracy. Which means we
> never truly try to correct the error in the same pass, but apply a
> correction factor for the next.
Correct.
> To me it seems like the closed system with each loop being "spin then
> adjusted sleep" will autocorrect and more likely to finish correct (as
> we are less reliant on the next loop for the accuracy). It's pretty much
> immaterial, as we expect the pmu to match the measurements (and not our
> expectations), but I find the one pass does all much simpler to follow.
Since we do a good number of loops, and hope the calibration will
converge quickly (which it does for me), I don't see that there is an
issue there.
With this loop, for me locally, it consistently underestimates by ~+0.03
- +0.04% for the 2% and 98% tests, and by ~+0.20 - +0.60% for the other
targets. I am not so happy with the fact the error seems systematic
(each loop appears to add ~+0.008 - +0.012% of error) - but I don't have
an idea on how to improve it further.
More of a problem will be if this still doesn't work that great on the
CI, for instance if the latencies there are more random. An IGT_TRACE
equivalent to GEM_TRACE, to dump the calibration passes, comes to mind. :)
Regards,
Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH i-g-t v2] tests/perf_pmu: Avoid RT thread for accuracy test
2018-04-03 16:09 ` [Intel-gfx] " Tvrtko Ursulin
@ 2018-04-03 16:24 ` Chris Wilson
-1 siblings, 0 replies; 43+ messages in thread
From: Chris Wilson @ 2018-04-03 16:24 UTC (permalink / raw)
To: Tvrtko Ursulin, Tvrtko Ursulin, igt-dev; +Cc: Intel-gfx
Quoting Tvrtko Ursulin (2018-04-03 17:09:09)
>
> On 03/04/2018 14:10, Chris Wilson wrote:
> > To me it seems like the closed system with each loop being "spin then
> > adjusted sleep" will autocorrect and more likely to finish correct (as
> > we are less reliant on the next loop for the accuracy). It's pretty much
> > immaterial, as we expect the pmu to match the measurements (and not our
> > expectations), but I find the one pass does all much simpler to follow.
>
> Since we do a good number of loops, and hope the calibration will
> converge quickly (which it does for me), I don't see that there is an
> issue there.
I'm sitting here drinking coffee trying to decide if it does converge ;)
That's the problem here, I need to actually find a pencil, some paper
and remember some basic maths for series convergence. Not happening with
the amount of coffee I need to drink at the moment.
-Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH i-g-t v3] tests/perf_pmu: Avoid RT thread for accuracy test
2018-04-03 12:38 ` [Intel-gfx] " Tvrtko Ursulin
@ 2018-04-03 16:39 ` Tvrtko Ursulin
-1 siblings, 0 replies; 43+ messages in thread
From: Tvrtko Ursulin @ 2018-04-03 16:39 UTC (permalink / raw)
To: igt-dev; +Cc: Intel-gfx
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Realtime scheduling interferes with execlists submission (tasklet) so try
to simplify the PWM loop in a few ways:
* Drop RT.
* Longer batches for smaller systematic error.
* More truthful test duration calculation.
* Less clock queries.
* No self-adjust - instead just report the achieved cycle and let the
parent check against it.
* Report absolute cycle error.
v2:
* Bring back self-adjust. (Chris Wilson)
(But slightly fixed version with no overflow.)
v3:
* Log average and variance of the calibration for each pass.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
tests/perf_pmu.c | 108 +++++++++++++++++++++++++++----------------------------
1 file changed, 53 insertions(+), 55 deletions(-)
diff --git a/tests/perf_pmu.c b/tests/perf_pmu.c
index 2273ddb9e684..697008c855fd 100644
--- a/tests/perf_pmu.c
+++ b/tests/perf_pmu.c
@@ -1497,12 +1497,6 @@ test_enable_race(int gem_fd, const struct intel_execution_engine2 *e)
gem_quiescent_gpu(gem_fd);
}
-static double __error(double val, double ref)
-{
- igt_assert(ref > 1e-5 /* smallval */);
- return (100.0 * val / ref) - 100.0;
-}
-
static void __rearm_spin_batch(igt_spin_t *spin)
{
const uint32_t mi_arb_chk = 0x5 << 23;
@@ -1525,13 +1519,12 @@ static void
accuracy(int gem_fd, const struct intel_execution_engine2 *e,
unsigned long target_busy_pct)
{
- const unsigned int min_test_loops = 7;
- const unsigned long min_test_us = 1e6;
- unsigned long busy_us = 2500;
+ unsigned long busy_us = 10000 - 100 * (1 + abs(50 - target_busy_pct));
unsigned long idle_us = 100 * (busy_us - target_busy_pct *
busy_us / 100) / target_busy_pct;
- unsigned long pwm_calibration_us;
- unsigned long test_us;
+ const unsigned long min_test_us = 1e6;
+ const unsigned long pwm_calibration_us = min_test_us;
+ const unsigned long test_us = min_test_us;
double busy_r, expected;
uint64_t val[2];
uint64_t ts[2];
@@ -1546,13 +1539,6 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
idle_us *= 2;
}
- pwm_calibration_us = min_test_loops * (busy_us + idle_us);
- while (pwm_calibration_us < min_test_us)
- pwm_calibration_us += busy_us + idle_us;
- test_us = min_test_loops * (idle_us + busy_us);
- while (test_us < min_test_us)
- test_us += busy_us + idle_us;
-
igt_info("calibration=%lums, test=%lums; ratio=%.2f%% (%luus/%luus)\n",
pwm_calibration_us / 1000, test_us / 1000,
(double)busy_us / (busy_us + idle_us) * 100.0,
@@ -1565,20 +1551,11 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
/* Emit PWM pattern on the engine from a child. */
igt_fork(child, 1) {
- struct sched_param rt = { .sched_priority = 99 };
const unsigned long timeout[] = {
pwm_calibration_us * 1000, test_us * 1000
};
- uint64_t total_busy_ns = 0, total_idle_ns = 0;
+ uint64_t total_busy_ns = 0, total_ns = 0;
igt_spin_t *spin;
- int ret;
-
- /* We need the best sleep accuracy we can get. */
- ret = sched_setscheduler(0,
- SCHED_FIFO | SCHED_RESET_ON_FORK,
- &rt);
- if (ret)
- igt_warn("Failed to set scheduling policy!\n");
/* Allocate our spin batch and idle it. */
spin = igt_spin_batch_new(gem_fd, 0, e2ring(gem_fd, e), 0);
@@ -1587,42 +1564,63 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
/* 1st pass is calibration, second pass is the test. */
for (int pass = 0; pass < ARRAY_SIZE(timeout); pass++) {
- uint64_t busy_ns = -total_busy_ns;
- uint64_t idle_ns = -total_idle_ns;
- struct timespec test_start = { };
+ unsigned int target_idle_us = idle_us;
+ uint64_t busy_ns = 0, idle_ns = 0;
+ struct timespec start = { };
+ unsigned long pass_ns = 0;
+ double avg = 0.0, var = 0.0;
+ unsigned int n = 0;
+
+ igt_nsec_elapsed(&start);
- igt_nsec_elapsed(&test_start);
do {
- unsigned int target_idle_us, t_busy;
+ unsigned long loop_ns, loop_busy;
+ struct timespec _ts = { };
+ double err, tmp;
+
+ /* PWM idle sleep. */
+ _ts.tv_nsec = target_idle_us * 1000;
+ nanosleep(&_ts, NULL);
/* Restart the spinbatch. */
__rearm_spin_batch(spin);
__submit_spin_batch(gem_fd, spin, e, 0);
- /*
- * Note that the submission may be delayed to a
- * tasklet (ksoftirqd) which cannot run until we
- * sleep as we hog the cpu (we are RT).
- */
-
- t_busy = measured_usleep(busy_us);
+ /* PWM busy sleep. */
+ loop_busy = igt_nsec_elapsed(&start);
+ _ts.tv_nsec = busy_us * 1000;
+ nanosleep(&_ts, NULL);
igt_spin_batch_end(spin);
- gem_sync(gem_fd, spin->handle);
-
- total_busy_ns += t_busy;
-
- target_idle_us =
- (100 * total_busy_ns / target_busy_pct - (total_busy_ns + total_idle_ns)) / 1000;
- total_idle_ns += measured_usleep(target_idle_us);
- } while (igt_nsec_elapsed(&test_start) < timeout[pass]);
-
- busy_ns += total_busy_ns;
- idle_ns += total_idle_ns;
- expected = (double)busy_ns / (busy_ns + idle_ns);
- igt_info("%u: busy %"PRIu64"us, idle %"PRIu64"us: %.2f%% (target: %lu%%)\n",
+ /* Time accounting. */
+ loop_ns = igt_nsec_elapsed(&start);
+ loop_busy = loop_ns - loop_busy;
+ loop_ns -= pass_ns;
+
+ busy_ns += loop_busy;
+ total_busy_ns += loop_busy;
+ idle_ns += loop_ns - loop_busy;
+ pass_ns += loop_ns;
+ total_ns += loop_ns;
+
+ /* Re-calibrate. */
+ err = (double)total_busy_ns / total_ns -
+ (double)target_busy_pct / 100.0;
+ target_idle_us = (double)target_idle_us *
+ (1.0 + err);
+
+ /* Running average and variance for debug. */
+ err = 100.0 * total_busy_ns / total_ns;
+ tmp = avg;
+ avg += (err - avg) / ++n;
+ var += (err - avg) * (err - tmp);
+// printf("%f * %f = %f\n", err - avg, err - tmp, (err - avg) * (err - tmp));
+ } while (pass_ns < timeout[pass]);
+
+ expected = (double)busy_ns / pass_ns;
+ igt_info("%u: busy %"PRIu64"us, idle %"PRIu64"us -> %.2f%% (target: %lu%%; average=%.2f, variance=%f)\n",
pass, busy_ns / 1000, idle_ns / 1000,
- 100 * expected, target_busy_pct);
+ 100 * expected, target_busy_pct, avg, var);
write(link[1], &expected, sizeof(expected));
}
@@ -1649,7 +1647,7 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
busy_r = (double)(val[1] - val[0]) / (ts[1] - ts[0]);
igt_info("error=%.2f%% (%.2f%% vs %.2f%%)\n",
- __error(busy_r, expected), 100 * busy_r, 100 * expected);
+ (busy_r - expected) * 100, 100 * busy_r, 100 * expected);
assert_within(100.0 * busy_r, 100.0 * expected, 2);
}
--
2.14.1
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH i-g-t v4] tests/perf_pmu: Avoid RT thread for accuracy test
2018-04-03 16:39 ` [igt-dev] " Tvrtko Ursulin
@ 2018-04-04 9:51 ` Tvrtko Ursulin
-1 siblings, 0 replies; 43+ messages in thread
From: Tvrtko Ursulin @ 2018-04-04 9:51 UTC (permalink / raw)
To: igt-dev; +Cc: Intel-gfx
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Realtime scheduling interferes with execlists submission (tasklet) so try
to simplify the PWM loop in a few ways:
* Drop RT.
* Longer batches for smaller systematic error.
* More truthful test duration calculation.
* Less clock queries.
* No self-adjust - instead just report the achieved cycle and let the
parent check against it.
* Report absolute cycle error.
v2:
* Bring back self-adjust. (Chris Wilson)
(But slightly fixed version with no overflow.)
v3:
* Log average and variance of the calibration for each pass.
v4:
* Eliminate development leftovers.
* Fix variance logging.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
tests/perf_pmu.c | 107 +++++++++++++++++++++++++++----------------------------
1 file changed, 52 insertions(+), 55 deletions(-)
diff --git a/tests/perf_pmu.c b/tests/perf_pmu.c
index 2273ddb9e684..590e6526b069 100644
--- a/tests/perf_pmu.c
+++ b/tests/perf_pmu.c
@@ -1497,12 +1497,6 @@ test_enable_race(int gem_fd, const struct intel_execution_engine2 *e)
gem_quiescent_gpu(gem_fd);
}
-static double __error(double val, double ref)
-{
- igt_assert(ref > 1e-5 /* smallval */);
- return (100.0 * val / ref) - 100.0;
-}
-
static void __rearm_spin_batch(igt_spin_t *spin)
{
const uint32_t mi_arb_chk = 0x5 << 23;
@@ -1525,13 +1519,12 @@ static void
accuracy(int gem_fd, const struct intel_execution_engine2 *e,
unsigned long target_busy_pct)
{
- const unsigned int min_test_loops = 7;
- const unsigned long min_test_us = 1e6;
- unsigned long busy_us = 2500;
+ unsigned long busy_us = 10000 - 100 * (1 + abs(50 - target_busy_pct));
unsigned long idle_us = 100 * (busy_us - target_busy_pct *
busy_us / 100) / target_busy_pct;
- unsigned long pwm_calibration_us;
- unsigned long test_us;
+ const unsigned long min_test_us = 1e6;
+ const unsigned long pwm_calibration_us = min_test_us;
+ const unsigned long test_us = min_test_us;
double busy_r, expected;
uint64_t val[2];
uint64_t ts[2];
@@ -1546,13 +1539,6 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
idle_us *= 2;
}
- pwm_calibration_us = min_test_loops * (busy_us + idle_us);
- while (pwm_calibration_us < min_test_us)
- pwm_calibration_us += busy_us + idle_us;
- test_us = min_test_loops * (idle_us + busy_us);
- while (test_us < min_test_us)
- test_us += busy_us + idle_us;
-
igt_info("calibration=%lums, test=%lums; ratio=%.2f%% (%luus/%luus)\n",
pwm_calibration_us / 1000, test_us / 1000,
(double)busy_us / (busy_us + idle_us) * 100.0,
@@ -1565,20 +1551,11 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
/* Emit PWM pattern on the engine from a child. */
igt_fork(child, 1) {
- struct sched_param rt = { .sched_priority = 99 };
const unsigned long timeout[] = {
pwm_calibration_us * 1000, test_us * 1000
};
- uint64_t total_busy_ns = 0, total_idle_ns = 0;
+ uint64_t total_busy_ns = 0, total_ns = 0;
igt_spin_t *spin;
- int ret;
-
- /* We need the best sleep accuracy we can get. */
- ret = sched_setscheduler(0,
- SCHED_FIFO | SCHED_RESET_ON_FORK,
- &rt);
- if (ret)
- igt_warn("Failed to set scheduling policy!\n");
/* Allocate our spin batch and idle it. */
spin = igt_spin_batch_new(gem_fd, 0, e2ring(gem_fd, e), 0);
@@ -1587,42 +1564,62 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
/* 1st pass is calibration, second pass is the test. */
for (int pass = 0; pass < ARRAY_SIZE(timeout); pass++) {
- uint64_t busy_ns = -total_busy_ns;
- uint64_t idle_ns = -total_idle_ns;
- struct timespec test_start = { };
+ unsigned int target_idle_us = idle_us;
+ uint64_t busy_ns = 0, idle_ns = 0;
+ struct timespec start = { };
+ unsigned long pass_ns = 0;
+ double avg = 0.0, var = 0.0;
+ unsigned int n = 0;
+
+ igt_nsec_elapsed(&start);
- igt_nsec_elapsed(&test_start);
do {
- unsigned int target_idle_us, t_busy;
+ unsigned long loop_ns, loop_busy;
+ struct timespec _ts = { };
+ double err, tmp;
+
+ /* PWM idle sleep. */
+ _ts.tv_nsec = target_idle_us * 1000;
+ nanosleep(&_ts, NULL);
/* Restart the spinbatch. */
__rearm_spin_batch(spin);
__submit_spin_batch(gem_fd, spin, e, 0);
- /*
- * Note that the submission may be delayed to a
- * tasklet (ksoftirqd) which cannot run until we
- * sleep as we hog the cpu (we are RT).
- */
-
- t_busy = measured_usleep(busy_us);
+ /* PWM busy sleep. */
+ loop_busy = igt_nsec_elapsed(&start);
+ _ts.tv_nsec = busy_us * 1000;
+ nanosleep(&_ts, NULL);
igt_spin_batch_end(spin);
- gem_sync(gem_fd, spin->handle);
-
- total_busy_ns += t_busy;
-
- target_idle_us =
- (100 * total_busy_ns / target_busy_pct - (total_busy_ns + total_idle_ns)) / 1000;
- total_idle_ns += measured_usleep(target_idle_us);
- } while (igt_nsec_elapsed(&test_start) < timeout[pass]);
-
- busy_ns += total_busy_ns;
- idle_ns += total_idle_ns;
- expected = (double)busy_ns / (busy_ns + idle_ns);
- igt_info("%u: busy %"PRIu64"us, idle %"PRIu64"us: %.2f%% (target: %lu%%)\n",
+ /* Time accounting. */
+ loop_ns = igt_nsec_elapsed(&start);
+ loop_busy = loop_ns - loop_busy;
+ loop_ns -= pass_ns;
+
+ busy_ns += loop_busy;
+ total_busy_ns += loop_busy;
+ idle_ns += loop_ns - loop_busy;
+ pass_ns += loop_ns;
+ total_ns += loop_ns;
+
+ /* Re-calibrate. */
+ err = (double)total_busy_ns / total_ns -
+ (double)target_busy_pct / 100.0;
+ target_idle_us = (double)target_idle_us *
+ (1.0 + err);
+
+ /* Running average and variance for debug. */
+ err = 100.0 * total_busy_ns / total_ns;
+ tmp = avg;
+ avg += (err - avg) / ++n;
+ var += (err - avg) * (err - tmp);
+ } while (pass_ns < timeout[pass]);
+
+ expected = (double)busy_ns / pass_ns;
+ igt_info("%u: busy %"PRIu64"us, idle %"PRIu64"us -> %.2f%% (target: %lu%%; average=%.2f, variance=%f)\n",
pass, busy_ns / 1000, idle_ns / 1000,
- 100 * expected, target_busy_pct);
+ 100 * expected, target_busy_pct, avg, var / n);
write(link[1], &expected, sizeof(expected));
}
@@ -1649,7 +1646,7 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
busy_r = (double)(val[1] - val[0]) / (ts[1] - ts[0]);
igt_info("error=%.2f%% (%.2f%% vs %.2f%%)\n",
- __error(busy_r, expected), 100 * busy_r, 100 * expected);
+ (busy_r - expected) * 100, 100 * busy_r, 100 * expected);
assert_within(100.0 * busy_r, 100.0 * expected, 2);
}
--
2.14.1
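As a standalone illustration of the duty-cycle arithmetic in the patch above: the busy period shrinks as the target moves away from 50%, and the idle period is derived so that the busy:idle ratio matches the requested percentage. A minimal sketch; the helper name `pwm_periods` is illustrative and does not exist in the test.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Sketch of the duty-cycle computation from the patch (illustrative
 * helper, not part of perf_pmu.c).  For target_busy_pct in (0, 100]:
 * busy_us peaks at 9900us for a 50% target and shrinks by 100us per
 * percentage point of distance from 50%; idle_us is then chosen so
 * that busy / (busy + idle) == target / 100 in integer arithmetic.
 */
static void pwm_periods(unsigned long target_busy_pct,
			unsigned long *busy_us, unsigned long *idle_us)
{
	*busy_us = 10000 - 100 * (1 + labs(50 - (long)target_busy_pct));
	*idle_us = 100 * (*busy_us - target_busy_pct * *busy_us / 100) /
		   target_busy_pct;
}
```

For a 50% target this yields a 9900us/9900us cycle; for 90% a 5900us busy period with a 655us idle period (integer division truncates).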
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [Intel-gfx] [PATCH i-g-t v4] tests/perf_pmu: Avoid RT thread for accuracy test
@ 2018-04-04 9:51 ` Tvrtko Ursulin
0 siblings, 0 replies; 43+ messages in thread
From: Tvrtko Ursulin @ 2018-04-04 9:51 UTC (permalink / raw)
To: igt-dev; +Cc: Intel-gfx
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Realtime scheduling interferes with execlists submission (tasklet) so try
to simplify the PWM loop in a few ways:
* Drop RT.
* Longer batches for smaller systematic error.
* More truthful test duration calculation.
* Fewer clock queries.
* No self-adjust - instead just report the achieved cycle and let the
parent check against it.
* Report absolute cycle error.
v2:
* Bring back self-adjust. (Chris Wilson)
(But slightly fixed version with no overflow.)
v3:
* Log average and variance of the calibration for each pass.
v4:
* Eliminate development leftovers.
* Fix variance logging.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
tests/perf_pmu.c | 107 +++++++++++++++++++++++++++----------------------------
1 file changed, 52 insertions(+), 55 deletions(-)
diff --git a/tests/perf_pmu.c b/tests/perf_pmu.c
index 2273ddb9e684..590e6526b069 100644
--- a/tests/perf_pmu.c
+++ b/tests/perf_pmu.c
@@ -1497,12 +1497,6 @@ test_enable_race(int gem_fd, const struct intel_execution_engine2 *e)
gem_quiescent_gpu(gem_fd);
}
-static double __error(double val, double ref)
-{
- igt_assert(ref > 1e-5 /* smallval */);
- return (100.0 * val / ref) - 100.0;
-}
-
static void __rearm_spin_batch(igt_spin_t *spin)
{
const uint32_t mi_arb_chk = 0x5 << 23;
@@ -1525,13 +1519,12 @@ static void
accuracy(int gem_fd, const struct intel_execution_engine2 *e,
unsigned long target_busy_pct)
{
- const unsigned int min_test_loops = 7;
- const unsigned long min_test_us = 1e6;
- unsigned long busy_us = 2500;
+ unsigned long busy_us = 10000 - 100 * (1 + abs(50 - target_busy_pct));
unsigned long idle_us = 100 * (busy_us - target_busy_pct *
busy_us / 100) / target_busy_pct;
- unsigned long pwm_calibration_us;
- unsigned long test_us;
+ const unsigned long min_test_us = 1e6;
+ const unsigned long pwm_calibration_us = min_test_us;
+ const unsigned long test_us = min_test_us;
double busy_r, expected;
uint64_t val[2];
uint64_t ts[2];
@@ -1546,13 +1539,6 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
idle_us *= 2;
}
- pwm_calibration_us = min_test_loops * (busy_us + idle_us);
- while (pwm_calibration_us < min_test_us)
- pwm_calibration_us += busy_us + idle_us;
- test_us = min_test_loops * (idle_us + busy_us);
- while (test_us < min_test_us)
- test_us += busy_us + idle_us;
-
igt_info("calibration=%lums, test=%lums; ratio=%.2f%% (%luus/%luus)\n",
pwm_calibration_us / 1000, test_us / 1000,
(double)busy_us / (busy_us + idle_us) * 100.0,
@@ -1565,20 +1551,11 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
/* Emit PWM pattern on the engine from a child. */
igt_fork(child, 1) {
- struct sched_param rt = { .sched_priority = 99 };
const unsigned long timeout[] = {
pwm_calibration_us * 1000, test_us * 1000
};
- uint64_t total_busy_ns = 0, total_idle_ns = 0;
+ uint64_t total_busy_ns = 0, total_ns = 0;
igt_spin_t *spin;
- int ret;
-
- /* We need the best sleep accuracy we can get. */
- ret = sched_setscheduler(0,
- SCHED_FIFO | SCHED_RESET_ON_FORK,
- &rt);
- if (ret)
- igt_warn("Failed to set scheduling policy!\n");
/* Allocate our spin batch and idle it. */
spin = igt_spin_batch_new(gem_fd, 0, e2ring(gem_fd, e), 0);
@@ -1587,42 +1564,62 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
/* 1st pass is calibration, second pass is the test. */
for (int pass = 0; pass < ARRAY_SIZE(timeout); pass++) {
- uint64_t busy_ns = -total_busy_ns;
- uint64_t idle_ns = -total_idle_ns;
- struct timespec test_start = { };
+ unsigned int target_idle_us = idle_us;
+ uint64_t busy_ns = 0, idle_ns = 0;
+ struct timespec start = { };
+ unsigned long pass_ns = 0;
+ double avg = 0.0, var = 0.0;
+ unsigned int n = 0;
+
+ igt_nsec_elapsed(&start);
- igt_nsec_elapsed(&test_start);
do {
- unsigned int target_idle_us, t_busy;
+ unsigned long loop_ns, loop_busy;
+ struct timespec _ts = { };
+ double err, tmp;
+
+ /* PWM idle sleep. */
+ _ts.tv_nsec = target_idle_us * 1000;
+ nanosleep(&_ts, NULL);
/* Restart the spinbatch. */
__rearm_spin_batch(spin);
__submit_spin_batch(gem_fd, spin, e, 0);
- /*
- * Note that the submission may be delayed to a
- * tasklet (ksoftirqd) which cannot run until we
- * sleep as we hog the cpu (we are RT).
- */
-
- t_busy = measured_usleep(busy_us);
+ /* PWM busy sleep. */
+ loop_busy = igt_nsec_elapsed(&start);
+ _ts.tv_nsec = busy_us * 1000;
+ nanosleep(&_ts, NULL);
igt_spin_batch_end(spin);
- gem_sync(gem_fd, spin->handle);
-
- total_busy_ns += t_busy;
-
- target_idle_us =
- (100 * total_busy_ns / target_busy_pct - (total_busy_ns + total_idle_ns)) / 1000;
- total_idle_ns += measured_usleep(target_idle_us);
- } while (igt_nsec_elapsed(&test_start) < timeout[pass]);
-
- busy_ns += total_busy_ns;
- idle_ns += total_idle_ns;
- expected = (double)busy_ns / (busy_ns + idle_ns);
- igt_info("%u: busy %"PRIu64"us, idle %"PRIu64"us: %.2f%% (target: %lu%%)\n",
+ /* Time accounting. */
+ loop_ns = igt_nsec_elapsed(&start);
+ loop_busy = loop_ns - loop_busy;
+ loop_ns -= pass_ns;
+
+ busy_ns += loop_busy;
+ total_busy_ns += loop_busy;
+ idle_ns += loop_ns - loop_busy;
+ pass_ns += loop_ns;
+ total_ns += loop_ns;
+
+ /* Re-calibrate. */
+ err = (double)total_busy_ns / total_ns -
+ (double)target_busy_pct / 100.0;
+ target_idle_us = (double)target_idle_us *
+ (1.0 + err);
+
+ /* Running average and variance for debug. */
+ err = 100.0 * total_busy_ns / total_ns;
+ tmp = avg;
+ avg += (err - avg) / ++n;
+ var += (err - avg) * (err - tmp);
+ } while (pass_ns < timeout[pass]);
+
+ expected = (double)busy_ns / pass_ns;
+ igt_info("%u: busy %"PRIu64"us, idle %"PRIu64"us -> %.2f%% (target: %lu%%; average=%.2f, variance=%f)\n",
pass, busy_ns / 1000, idle_ns / 1000,
- 100 * expected, target_busy_pct);
+ 100 * expected, target_busy_pct, avg, var / n);
write(link[1], &expected, sizeof(expected));
}
@@ -1649,7 +1646,7 @@ accuracy(int gem_fd, const struct intel_execution_engine2 *e,
busy_r = (double)(val[1] - val[0]) / (ts[1] - ts[0]);
igt_info("error=%.2f%% (%.2f%% vs %.2f%%)\n",
- __error(busy_r, expected), 100 * busy_r, 100 * expected);
+ (busy_r - expected) * 100, 100 * busy_r, 100 * expected);
assert_within(100.0 * busy_r, 100.0 * expected, 2);
}
--
2.14.1
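The re-introduced self-adjust step can be viewed in isolation: the achieved busy fraction is compared against the target and the idle sleep is scaled by the relative error. A sketch of that feedback (the function name is illustrative; the patch performs this inline on `total_busy_ns`/`total_ns`):

```c
#include <assert.h>

/*
 * Proportional feedback from the patch's "Re-calibrate" step
 * (illustrative helper, not part of perf_pmu.c).  err > 0 means we
 * were too busy, so the idle sleep is lengthened; err < 0 shortens it.
 */
static unsigned int recalibrate_idle_us(unsigned int target_idle_us,
					unsigned long total_busy_ns,
					unsigned long total_ns,
					unsigned long target_busy_pct)
{
	double err = (double)total_busy_ns / total_ns -
		     (double)target_busy_pct / 100.0;

	return (double)target_idle_us * (1.0 + err);
}
```

For example, running at 75% against a 50% target scales a 1000us idle sleep up to 1250us, while running at 25% shrinks it to 750us.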
* Re: [PATCH i-g-t v4] tests/perf_pmu: Avoid RT thread for accuracy test
2018-04-04 9:51 ` [Intel-gfx] " Tvrtko Ursulin
@ 2018-04-11 13:23 ` Chris Wilson
1 sibling, 0 replies; 43+ messages in thread
From: Chris Wilson @ 2018-04-11 13:23 UTC (permalink / raw)
To: Tvrtko Ursulin, igt-dev; +Cc: Intel-gfx
Quoting Tvrtko Ursulin (2018-04-04 10:51:52)
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Realtime scheduling interferes with execlists submission (tasklet) so try
> to simplify the PWM loop in a few ways:
>
> * Drop RT.
> * Longer batches for smaller systematic error.
> * More truthful test duration calculation.
> * Less clock queries.
> * No self-adjust - instead just report the achieved cycle and let the
> parent check against it.
> * Report absolute cycle error.
>
> v2:
> * Bring back self-adjust. (Chris Wilson)
> (But slightly fixed version with no overflow.)
>
> v3:
> * Log average and mean calibration for each pass.
>
> v4:
> * Eliminate development leftovers.
> * Fix variance logging.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
From a pragmatic point of view, there's no point waiting for me to be
happy with the convergence if CI is, and the variance will definitely be
interesting (although you could have used igt_mean to compute the
iterative variance), so
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
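The iterative variance Chris mentions corresponds to the classic Welford update, which the patch open-codes in its PWM loop. A minimal sketch with illustrative names (not igt_mean's actual API, which is not reproduced here):

```c
#include <assert.h>
#include <math.h>

/* Welford-style running mean and variance (illustrative names). */
struct welford_state {
	double avg;	/* running mean ("avg" in the patch) */
	double m2;	/* sum of squared deviations ("var" in the patch) */
	unsigned int n;	/* sample count */
};

static void welford_push(struct welford_state *s, double sample)
{
	double prev_avg = s->avg;

	s->n++;
	s->avg += (sample - s->avg) / s->n;
	s->m2 += (sample - s->avg) * (sample - prev_avg);
}

static double welford_variance(const struct welford_state *s)
{
	/* Population variance, matching the patch's "var / n" logging. */
	return s->n ? s->m2 / s->n : 0.0;
}
```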
* Re: [PATCH i-g-t v4] tests/perf_pmu: Avoid RT thread for accuracy test
2018-04-11 13:23 ` [igt-dev] [Intel-gfx] " Chris Wilson
@ 2018-04-11 13:52 ` Tvrtko Ursulin
1 sibling, 0 replies; 43+ messages in thread
From: Tvrtko Ursulin @ 2018-04-11 13:52 UTC (permalink / raw)
To: Chris Wilson, Tvrtko Ursulin, igt-dev; +Cc: Intel-gfx
On 11/04/2018 14:23, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-04-04 10:51:52)
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Realtime scheduling interferes with execlists submission (tasklet) so try
>> to simplify the PWM loop in a few ways:
>>
>> * Drop RT.
>> * Longer batches for smaller systematic error.
>> * More truthful test duration calculation.
>> * Less clock queries.
>> * No self-adjust - instead just report the achieved cycle and let the
>> parent check against it.
>> * Report absolute cycle error.
>>
>> v2:
>> * Bring back self-adjust. (Chris Wilson)
>> (But slightly fixed version with no overflow.)
>>
>> v3:
>> * Log average and mean calibration for each pass.
>>
>> v4:
>> * Eliminate development leftovers.
>> * Fix variance logging.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> From a pragmatic point of view, there's no point waiting for me to be
> happy with the convergence if CI is, and the variance will definitely be
> interesting (although you could have used igt_mean to compute the
> iterative variance), so
>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Thanks, I've pushed it and so we'll see.
Regards,
Tvrtko
* Re: [PATCH i-g-t v4] tests/perf_pmu: Avoid RT thread for accuracy test
2018-04-11 13:52 ` [igt-dev] [Intel-gfx] " Tvrtko Ursulin
@ 2018-04-14 11:35 ` Chris Wilson
1 sibling, 0 replies; 43+ messages in thread
From: Chris Wilson @ 2018-04-14 11:35 UTC (permalink / raw)
To: Tvrtko Ursulin, Tvrtko Ursulin, igt-dev; +Cc: Intel-gfx
Quoting Tvrtko Ursulin (2018-04-11 14:52:36)
>
> On 11/04/2018 14:23, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-04-04 10:51:52)
> >> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >> Realtime scheduling interferes with execlists submission (tasklet) so try
> >> to simplify the PWM loop in a few ways:
> >>
> >> * Drop RT.
> >> * Longer batches for smaller systematic error.
> >> * More truthful test duration calculation.
> >> * Less clock queries.
> >> * No self-adjust - instead just report the achieved cycle and let the
> >> parent check against it.
> >> * Report absolute cycle error.
> >>
> >> v2:
> >> * Bring back self-adjust. (Chris Wilson)
> >> (But slightly fixed version with no overflow.)
> >>
> >> v3:
> >> * Log average and mean calibration for each pass.
> >>
> >> v4:
> >> * Eliminate development leftovers.
> >> * Fix variance logging.
> >>
> >> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >
> > From a pragmatic point of view, there's no point waiting for me to be
> > happy with the convergence if CI is, and the variance will definitely be
> > interesting (although you could have used igt_mean to compute the
> > iterative variance), so
> >
> > Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>
> Thanks, I've pushed it and so we'll see.
We should resurrect the RT variant in the near future. It's definitely
an issue in our driver that random userspace can impact execution of
unconnected others. (Handling RT starvation of workers is something we
have to be aware of elsewhere, commonly hits oom if we don't have an
escape clause.) Lots of words just to say, we should add a test for RT
to exercise the bad behaviour. Hmm, doesn't need to be pmu, just we need
an assertion that execution latency is bounded and no RT hog will delay
it.
-Chris
* Re: [PATCH i-g-t v4] tests/perf_pmu: Avoid RT thread for accuracy test
2018-04-14 11:35 ` [igt-dev] [Intel-gfx] " Chris Wilson
@ 2018-04-16 9:55 ` Tvrtko Ursulin
1 sibling, 0 replies; 43+ messages in thread
From: Tvrtko Ursulin @ 2018-04-16 9:55 UTC (permalink / raw)
To: Chris Wilson, Tvrtko Ursulin, igt-dev; +Cc: Intel-gfx
On 14/04/2018 12:35, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-04-11 14:52:36)
>>
>> On 11/04/2018 14:23, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-04-04 10:51:52)
>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>
>>>> Realtime scheduling interferes with execlists submission (tasklet) so try
>>>> to simplify the PWM loop in a few ways:
>>>>
>>>> * Drop RT.
>>>> * Longer batches for smaller systematic error.
>>>> * More truthful test duration calculation.
>>>> * Less clock queries.
>>>> * No self-adjust - instead just report the achieved cycle and let the
>>>> parent check against it.
>>>> * Report absolute cycle error.
>>>>
>>>> v2:
>>>> * Bring back self-adjust. (Chris Wilson)
>>>> (But slightly fixed version with no overflow.)
>>>>
>>>> v3:
>>>> * Log average and mean calibration for each pass.
>>>>
>>>> v4:
>>>> * Eliminate development leftovers.
>>>> * Fix variance logging.
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>
>>> From a pragmatic point of view, there's no point waiting for me to be
>>> happy with the convergence if CI is, and the variance will definitely be
>>> interesting (although you could have used igt_mean to compute the
>>> iterative variance), so
>>>
>>> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>>
>> Thanks, I've pushed it and so we'll see.
>
> We should resurrect the RT variant in the near future. It's definitely
> an issue in our driver that random userspace can impact execution of
> unconnected others. (Handling RT starvation of workers is something we
> have to be aware of elsewhere, commonly hits oom if we don't have an
> escape clause.) Lots of words just to say, we should add a test for RT
> to exercise the bad behaviour. Hmm, doesn't need to be pmu, just we need
> an assertion that execution latency is bounded and no RT hog will delay
> it.
Agreed, I can add a simple test to gem_exec_latency.
But with regard to how to fix this - re-enabling direct submission
(not only indirectly via the tasklet) sounds simplest in theory,
although I do remember you raised some issues with this route last time
I mentioned it. It does sound like the conceptually correct thing to do.
As an alternative we could explore the conversion effort, and the
resulting latencies, of moving to a threaded irq handler.
You also had a patch to improve tasklet scheduling in some cases, I now
remember. We can try that after I write the test as well, although I
have no idea how hard a sell that would be.
Regards,
Tvrtko
* Re: [PATCH i-g-t v4] tests/perf_pmu: Avoid RT thread for accuracy test
2018-04-16 9:55 ` [igt-dev] [Intel-gfx] " Tvrtko Ursulin
@ 2018-04-16 10:08 ` Chris Wilson
1 sibling, 0 replies; 43+ messages in thread
From: Chris Wilson @ 2018-04-16 10:08 UTC (permalink / raw)
To: Tvrtko Ursulin, Tvrtko Ursulin, igt-dev; +Cc: Intel-gfx
Quoting Tvrtko Ursulin (2018-04-16 10:55:29)
>
> On 14/04/2018 12:35, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-04-11 14:52:36)
> >>
> >> On 11/04/2018 14:23, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-04-04 10:51:52)
> >>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>
> >>>> Realtime scheduling interferes with execlists submission (tasklet) so try
> >>>> to simplify the PWM loop in a few ways:
> >>>>
> >>>> * Drop RT.
> >>>> * Longer batches for smaller systematic error.
> >>>> * More truthful test duration calculation.
> >>>> * Less clock queries.
> >>>> * No self-adjust - instead just report the achieved cycle and let the
> >>>> parent check against it.
> >>>> * Report absolute cycle error.
> >>>>
> >>>> v2:
> >>>> * Bring back self-adjust. (Chris Wilson)
> >>>> (But slightly fixed version with no overflow.)
> >>>>
> >>>> v3:
> >>>> * Log average and mean calibration for each pass.
> >>>>
> >>>> v4:
> >>>> * Eliminate development leftovers.
> >>>> * Fix variance logging.
> >>>>
> >>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>
> >>> From a pragmatic point of view, there's no point waiting for me to be
> >>> happy with the convergence if CI is, and the variance will definitely be
> >>> interesting (although you could have used igt_mean to compute the
> >>> iterative variance), so
> >>>
> >>> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>
> >> Thanks, I've pushed it and so we'll see.
> >
> > We should resurrect the RT variant in the near future. It's definitely
> > an issue in our driver that random userspace can impact execution of
> > unconnected others. (Handling RT starvation of workers is something we
> > have to be aware of elsewhere, commonly hits oom if we don't have an
> > escape clause.) Lots of words just to say, we should add a test for RT
> > to exercise the bad behaviour. Hmm, doesn't need to be pmu, just we need
> > an assertion that execution latency is bounded and no RT hog will delay
> > it.
>
> Agreed, I can add a simple test to gem_exec_latency.
>
> But with regards on how to fix this - re-enabling direct submission
> sounds simplest (not only indirect via tasklet) in theory although I do
> remember you were raising some issues with this route last time I
> mentioned it. It does sound like a conceptually correct thing to do.
The problem comes down to that we want direct submission from the irq
handler, which the tasklet solves very nicely for us (most of the time).
Finding an alternative hook other than irq_exit() is the challenge,
irq_work might be acceptable.
> As an alternative we could explore conversion effort and resulting
> latencies from conversion to threaded irq handler.
* shivers
Then we have at least consistently bad latency ;) And the sysadmin can
decide how to prioritise, boo.
> You also had a patch to improve tasklet scheduling in some cases now I
> remember. We can try that after I write the test as well. Although I
> have no idea how hard of a sell that would be.
I think the next plan for upstream tasklets is to try and avoid having
one vector influence the ksoftirqd latency of another. However, that
doesn't solve it for us, where it's likely we've consumed the tasklet
timeslice and so will still be deferred onto ksoftirqd. (It just solves
the case of netdev forcing us onto ksoftirqd along with itself.) The hack
I use on top of that, to always do at least one immediate execution of
HI_SOFTIRQ, boils down to the question of why allow just that special
case, to which there is no good answer.
Hmm, irq_work, my only concern is if it is run with irqs disabled. We
could live without, but that's an alarmingly big chunk of code.
-Chris