* [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
@ 2018-03-28  6:38 Francisco Jerez
  2018-03-28  6:38 ` [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time Francisco Jerez
                   ` (11 more replies)
  0 siblings, 12 replies; 39+ messages in thread
From: Francisco Jerez @ 2018-03-28  6:38 UTC (permalink / raw)
  To: linux-pm, intel-gfx, Srinivas Pandruvada; +Cc: Eero Tamminen, Rafael J. Wysocki

This series attempts to solve an energy efficiency problem of the
current active-mode non-HWP governor of the intel_pstate driver, used
for the most part on low-power platforms.  Under heavy IO load the
current controller tends to increase frequencies to the maximum turbo
P-state, partly due to IO wait boosting and partly due to the roughly
flat frequency response curve of the current controller (see [6]),
which causes it to ramp frequencies up and down repeatedly for any
oscillating workload (think of graphics, audio or disk IO when any of
them becomes a bottleneck).  This severely increases energy usage
relative to a (throughput-wise equivalent) controller able to provide
the same average frequency without fluctuation.  The core energy
efficiency improvement has been observed to be of the order of 20% via
RAPL, but it's expected to vary substantially between workloads (see
the perf-per-watt comparison [2]).

One might expect this to come at some cost in terms of system
responsiveness, but the governor implemented in PATCH 6 has a variable
response curve controlled by a heuristic that keeps the controller in
a low-latency state unless the system has been under heavy IO load for
an extended period of time.  The low-latency behavior is actually
significantly more aggressive than the current governor's, allowing it
to achieve better throughput in some scenarios where the load
ping-pongs between the CPU and some IO device (see PATCH 6 for more of
the rationale).  The controller offers lower latency than the upstream
one particularly while C0 residency is low (which by itself helps
mitigate the increased energy usage while in C0).  However, under
certain conditions the low-latency heuristic may increase power
consumption (see the perf-per-watt comparison [2]; the apparent
regressions are correlated with an increase in performance in the same
benchmark due to the use of the low-latency heuristic).  If this is a
problem, a different trade-off between latency and energy usage
shouldn't be difficult to achieve, but it will come at a performance
cost in some cases.  I couldn't observe a statistically significant
increase in idle power consumption due to this behavior (on BXT
J3455):

package-0 RAPL (W):    XXXXXX ±0.14% x8 ->     XXXXXX ±0.15% x9         d=-0.04% ±0.14%      p=61.73%

[Absolute benchmark results are unfortunately omitted from this letter
due to company policies, but the percent change and Student's t-test
p-value are included above and in the referenced benchmark results.]

The most obvious impact of this series will likely be the overall
improvement in graphics performance on systems with an IGP integrated
into the processor package (though for the moment this is only enabled
on BXT+), because the TDP budget shared between the CPU and GPU can
frequently become a limiting factor in low-power devices.  On heavily
TDP-bound devices this series improves the performance of virtually
any non-trivial graphics rendering by a significant amount (of the
order of the energy efficiency improvement for that workload, assuming
the optimization didn't cause it to become non-TDP-bound).

See [1]-[5] for detailed numbers including various graphics benchmarks
and a sample of the Phoronix daily-system-tracker.  Some popular
graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
between 5% and 11% on our systems.  The exact improvement can vary
substantially between systems (compare the benchmark results from the
two different J3455 systems [1] and [3]) due to a number of factors,
including the ratio between CPU and GPU processing power, the behavior
of the userspace graphics driver, the windowing system and resolution,
the BIOS (which has an influence on the package TDP), the thermal
characteristics of the system, etc.

Unigine Valley and Heaven improve by a similar factor on some systems
(see the J3455 results [1]), but on others the improvement is lower
because the benchmark fails to fully utilize the GPU, which causes the
heuristic to remain in the low-latency state for longer, leaving a
reduced TDP budget available to the GPU and preventing performance
from increasing further.  This can be avoided by using the alternative
heuristic parameters suggested in the commit message of PATCH 8, which
provide a lower IO utilization threshold and hysteresis for the
controller to attempt to save energy.  I'm not proposing those for
upstream (yet) because they would also increase the risk of
regressions in latency-sensitive IO-heavy workloads (like SynMark2
OglTerrainFly* and some arguably poorly designed IPC-bound X11
benchmarks).

Discrete graphics aren't likely to see as much of a visible
improvement from this, even though many non-IGP workloads *could*
benefit from reducing the system's energy usage while the discrete GPU
(or really, any other IO device) is the bottleneck.  That is not
attempted in this series, since it would involve making an energy
efficiency/latency trade-off that only the maintainers of the
respective drivers are in a position to make.  The cpufreq interface
introduced in PATCH 1 to achieve this is left as an opt-in for that
reason; only the i915 DRM driver is hooked up, since it will get the
most direct pay-off due to the increased energy budget available to
the GPU, but other power-hungry third-party gadgets built into the
same package (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may be
able to benefit from this interface eventually by instrumenting their
drivers in a similar way.

The cpufreq interface is not exclusively tied to the intel_pstate
driver; other governors can make use of the resulting statistic to
avoid over-optimizing for latency in scenarios where a lower frequency
would be able to achieve similar throughput while using less energy.
The interpretation of this statistic relies on the observation that,
for as long as the system is CPU-bound, any IO load occurring as a
result of the execution of a program will scale roughly linearly with
the clock frequency the program is run at, so (assuming that the CPU
has enough processing power) a point will be reached at which the
program won't be able to execute faster with increasing CPU frequency
because the throughput limits of some device will have been attained.
Increasing frequencies past that point only pessimizes energy usage
for no real benefit -- the optimal behavior is for the CPU to lock to
the minimum frequency that is able to keep the IO devices involved
fully utilized (assuming we are past the maximum-efficiency inflection
point of the CPU's power-to-frequency curve), which is roughly the
goal of this series.
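
As a rough sketch of how another governor might consume this
statistic (purely illustrative and not part of the series -- the
static sampling state and the 98% threshold are invented for the
example; the only real interface used is cpufreq_io_active_time_ns()
introduced in PATCH 1):

#include <linux/cpufreq.h>
#include <linux/math64.h>

static u64 last_io_active_ns;
static u64 last_update_ns;

/*
 * Illustrative only: estimate IO utilization from the increment of
 * the IO active time statistic over the increment of the monotonic
 * clock, and report whether the system currently looks IO-bound.
 */
static bool example_io_bound(u64 now_ns)
{
	u64 io_active_ns = cpufreq_io_active_time_ns();
	u64 io_delta = io_active_ns - last_io_active_ns;
	u64 time_delta = now_ns - last_update_ns;
	bool io_bound = false;

	if (time_delta) {
		/* Fraction of the interval with at least one IO job
		 * in flight, in percent. */
		u64 io_util_pct = div64_u64(100 * io_delta, time_delta);

		/* Close to 100% means some IO device was busy nearly
		 * all the time, so a higher CPU frequency is unlikely
		 * to buy additional throughput. */
		io_bound = io_util_pct >= 98;
	}

	last_io_active_ns = io_active_ns;
	last_update_ns = now_ns;

	return io_bound;
}

A governor detecting a sustained IO-bound condition this way could
then cap or low-pass filter its frequency request, which is roughly
what the LP controller in PATCH 6 does for intel_pstate.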

PELT could be a useful extension of this model, since its largely
heuristic assumptions would become more accurate if the IO and CPU
load could be tracked separately for each scheduling entity, but this
is not attempted in this series because the additional complexity and
computational cost of such an approach are hard to justify at this
stage, particularly since the current governor has similar
limitations.

Various frequency and step-function response graphs are available in
[6]-[9] for comparison (obtained empirically on a BXT J3455 system).
The response curves for the low-latency and low-power states of the
heuristic are shown separately -- as you can see, they roughly bracket
the frequency response curve of the current governor.  The step
response of the aggressive heuristic completes within a single update
period (even though that's not quite obvious from the graph at the
levels of zoom provided).  I'll attach benchmark results from a slower
but non-TDP-limited machine (which means there will be no TDP budget
increase that could possibly mask a performance regression of another
kind) as soon as they come out.

Thanks to Eero and Valtteri for testing a number of intermediate
revisions of this series (and there were quite a few of them) on more
than half a dozen systems; they helped spot quite a few issues in
earlier versions of this heuristic.

[PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
[PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
[PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
[PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
[PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
[PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
[PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
[PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
[PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.

[1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
[2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
[3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
[4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
[5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
[6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
[7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
[8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
[9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg

* [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
  2018-03-28  6:38 [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
@ 2018-03-28  6:38 ` Francisco Jerez
  2018-03-28  6:38 ` [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs" Francisco Jerez
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Francisco Jerez @ 2018-03-28  6:38 UTC (permalink / raw)
  To: linux-pm, intel-gfx, Srinivas Pandruvada; +Cc: Eero Tamminen, Rafael J. Wysocki

This provides an IO activity statistic to cpufreq governors,
complementary to the IO wait time currently available to them.  An IO
utilization estimated from this statistic that is significantly lower
than 100% can be interpreted as an indication that no IO device is
being utilized to its full throughput yet, and that overall system
performance has a good chance of scaling with increasing CPU
frequency.  An IO utilization close to 100% indicates that at all
times there was at least one active IO device, in which case the
system is not guaranteed to be able to accomplish more work per unit
of time even if the CPU frequency could be increased further,
providing an opportunity for the cpufreq governor to save energy.

This patch uses a fairly minimal lockless approach to keep track of IO
activity time that only relies on an atomic counter of the number of
IO jobs in flight and another atomic variable that accumulates the
total amount of time spent with at least one active IO job since
system boot.  IO utilization can be estimated by the cpufreq governor
by periodically sampling increments of IO active time and dividing
them by the increment of the system monotonic clock.

Under some circumstances it may be more accurate to estimate IO
utilization as the maximum IO utilization across all IO devices, which
could be achieved with a somewhat more complex tree-like data
structure.  The present approach is roughly equivalent for IO jobs
that are executed simultaneously, but IO jobs that don't overlap in
time will show up as the sum of the individual utilizations of each IO
device, which might be biased towards 100% if the non-overlapping jobs
were actually parallelizable (e.g. two devices each kept busy during
disjoint halves of the sampling interval show up as 100% aggregate IO
activity, whereas a per-device maximum would report 50%).  On the
other hand, in cases where the tasks performed by different IO devices
are interdependent, the present approach provides a more accurate
estimate, while the alternative approach would be biased towards 0%
and would likely result in less energy-efficient behavior of the
cpufreq governor.
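
For reference, the intended usage pattern from a device driver's
point of view is to bracket each IO job with the two calls introduced
below (PATCH 9 does the real hookup in i915).  The following sketch
is illustrative only; struct foo_request and the submission and
completion entry points are made up, and only
cpufreq_io_active_begin()/_end() come from this patch:

#include <linux/cpufreq.h>

struct foo_request;

static void foo_submit_request(struct foo_request *rq)
{
	/* An IO job is now in flight from cpufreq's point of view. */
	cpufreq_io_active_begin();

	/* ...hand the job over to the hardware... */
}

static void foo_request_completed(struct foo_request *rq)
{
	/*
	 * Must pair with the begin above, but doesn't have to run in
	 * the same thread -- e.g. it could be called from the
	 * driver's completion handler.
	 */
	cpufreq_io_active_end();
}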

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/cpufreq.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h   |  20 +++++
 2 files changed, 224 insertions(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index de33ebf008ad..892709d0722e 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2444,6 +2444,210 @@ int cpufreq_boost_enabled(void)
 }
 EXPORT_SYMBOL_GPL(cpufreq_boost_enabled);
 
+/*********************************************************************
+ *               IO ACTIVE TIME ACCOUNTING                           *
+ *********************************************************************/
+
+/**
+ * Number of times cpufreq_io_active_begin() has been called so far without a
+ * matching cpufreq_io_active_end(), or IOW, the approximate number of IO jobs
+ * currently in flight.
+ */
+static atomic_t io_active_count;
+
+/**
+ * Total aggregated time that io_active_count has been greater than zero since
+ * system boot.  Negative values (in two's complement) represent a duration
+ * relative to the current time (typically used to implement the section
+ * between matching cpufreq_io_active_begin() and cpufreq_io_active_end()
+ * calls).  Positive values represent absolute durations and are smaller than
+ * IO_ACTIVE_TIME_M in magnitude.  In order to prevent the reduced integer
+ * range from introducing more frequent time wraparounds than in the rest of
+ * the kernel, time is represented with slightly lower precision than a
+ * ktime_t, in units of 4 ns.
+ */
+static atomic64_t io_active_time;
+
+/**
+ * Time of system boot, or one plus the maximum encoding of an absolute time
+ * duration.  Values greater or equal to this constant in magnitude are used to
+ * represent points in time rather than time durations, this guarantees that
+ * the maximum representable time duration can be subtracted from any point in
+ * time and still give a positive number as result, which is important due to
+ * the somewhat special semantics of the sign of io_active_time.
+ */
+#define IO_ACTIVE_TIME_M ((uint64_t)1 << 62)
+
+/**
+ * Return true if @t is a negative io_active_time value encoding a time
+ * duration relative to the current time.
+ */
+static bool io_active_time_is_relative(uint64_t t)
+{
+	return t >> 63;
+}
+
+/**
+ * Convert a duration or point in time into a scalar value in nanoseconds.
+ */
+static uint64_t io_active_time_to_ns(uint64_t t)
+{
+	return (t & (IO_ACTIVE_TIME_M - 1)) << 2;
+}
+
+/**
+ * Convert a scalar time value in nanoseconds into a point in time.
+ */
+static uint64_t io_active_time_from_ns(uint64_t ns)
+{
+	return IO_ACTIVE_TIME_M + (ns >> 2);
+}
+
+/**
+ * Mark the beginning of the processing of an IO job.  Each call of
+ * cpufreq_io_active_begin() must be accompanied by a corresponding call of
+ * cpufreq_io_active_end() after the IO job completes.
+ */
+void cpufreq_io_active_begin(void)
+{
+	/*
+	 * The body of the conditional below is executed only for the first of
+	 * any number of concurrent calls of cpufreq_io_active_begin(), it is
+	 * ordered after any io_active_time updates done by previous
+	 * invocations of cpufreq_io_active_end() (since those updates are
+	 * ordered before the atomic_cmpxchg_release operation that caused the
+	 * counter to drop to zero in the first place), and it is ordered with
+	 * respect to io_active_time updates done by concurrent invocations of
+	 * cpufreq_io_active_end() (since they won't modify io_active_time
+	 * until io_active_count has dropped to one, which implies that the
+	 * present cpufreq_io_active_begin() call and its matching
+	 * cpufreq_io_active_end() have completed).  Therefore code in this
+	 * block is effectively single-threaded.
+	 */
+	if (atomic_fetch_inc_acquire(&io_active_count) == 0) {
+		/*
+		 * Subtract the current time from io_active_time, which is
+		 * guaranteed to give a negative value (i.e. relative to the
+		 * current time) assuming that the precondition of
+		 * io_active_time being non-negative and lower than
+		 * IO_ACTIVE_TIME_M (i.e. an absolute time duration) is met.
+		 */
+		const uint64_t now = io_active_time_from_ns(ktime_get_ns());
+
+		atomic64_sub(now, &io_active_time);
+		/*
+		 * The barrier is provided for the io_active_time update above
+		 * to be correctly ordered with respect to subsequent memory
+		 * operations, in particular the ones leading to the eventual
+		 * execution of a matching cpufreq_io_active_end() call, which
+		 * must have the io_active_time update above visible.
+		 */
+		smp_wmb();
+	}
+}
+EXPORT_SYMBOL(cpufreq_io_active_begin);
+
+/**
+ * Mark the end of the processing of an IO job.  This must be called after the
+ * completion of the corresponding call of cpufreq_io_active_begin(), but there
+ * is no requirement for the two functions to be called from the same thread.
+ */
+void cpufreq_io_active_end(void)
+{
+	const uint64_t now = io_active_time_from_ns(ktime_get_ns());
+	uint64_t old_active, new_active = atomic_read(&io_active_count);
+	uint64_t begin;
+
+	do {
+		old_active = new_active;
+
+		/*
+		 * The body of the conditional below is ordered after a point
+		 * in which the present thread was executing the only active
+		 * begin/end section remaining in the system.  This implies
+		 * that no concurrent begin/end section can possibly have
+		 * started before the present begin/end section, since
+		 * otherwise the last read of io_active_count would have
+		 * returned a value greater than one.  Therefore all
+		 * concurrent begin/end sections are ordered with respect to
+		 * this section's begin, and cannot possibly have observed an
+		 * io_active_count value lower than two at any point of their
+		 * execution.  That makes code in this block (and in the other
+		 * old_active == 1 conditional further down) effectively
+		 * single-threaded until the atomic cmpxchg operation below
+		 * succeeds.
+		 */
+		if (old_active == 1) {
+			/*
+			 * Update io_active_time to reflect the time spent
+			 * between this (potentially last) call of
+			 * cpufreq_io_active_end() and the matching call of
+			 * cpufreq_io_active_begin().  This doesn't use an
+			 * atomic add in order to prevent overflow which could
+			 * lead to a sign inconsistency every ~584 years.
+			 *
+			 * If now is lower than -begin because the system's
+			 * monotonic clock has wrapped around since the
+			 * matching call of cpufreq_io_active_begin(), the sum
+			 * below will give a time duration result off by
+			 * IO_ACTIVE_TIME_M, which is taken care of with a
+			 * little bitwise arithmetic.
+			 */
+			begin = atomic64_read(&io_active_time);
+			WARN_ON(!io_active_time_is_relative(begin));
+			atomic64_set(&io_active_time,
+				     (begin + now) & (IO_ACTIVE_TIME_M - 1));
+		}
+
+		/*
+		 * If old_active is one at this point we have the guarantee
+		 * that there will be no concurrent update of io_active_time
+		 * between our last update above and the atomic cmpxchg
+		 * operation below, so io_active_time is guaranteed to be
+		 * non-negative as a postcondition of this function when
+		 * io_active_count is successfully decremented to zero by the
+		 * following cmpxchg operation.
+		 */
+		new_active = atomic_cmpxchg_release(
+			&io_active_count, old_active, old_active - 1);
+
+		/*
+		 * Roll back to the original io_active_time value if somebody
+		 * called cpufreq_io_active_begin() concurrently after our
+		 * previous read of io_active_count, which means that this
+		 * maybe wasn't the last call of cpufreq_io_active_end() after
+		 * all.
+		 */
+		if (old_active == 1 && old_active != new_active)
+			atomic64_set(&io_active_time, begin);
+
+		/*
+		 * Retry if another thread modified the counter in parallel
+		 * preventing the atomic cmpxchg operation above from
+		 * committing our updated value.
+		 */
+	} while (old_active != new_active);
+}
+EXPORT_SYMBOL(cpufreq_io_active_end);
+
+/**
+ * Return the total accumulated time in nanoseconds that the system has spent
+ * doing IO processing since boot, defined as the total time spent between
+ * matching calls of cpufreq_io_active_begin() and cpufreq_io_active_end().
+ */
+uint64_t cpufreq_io_active_time_ns(void)
+{
+	const uint64_t time = atomic64_read(&io_active_time);
+	const uint64_t now = io_active_time_from_ns(ktime_get_ns());
+
+	if (io_active_time_is_relative(time))
+		return io_active_time_to_ns(time + now);
+	else
+		return io_active_time_to_ns(time);
+}
+EXPORT_SYMBOL(cpufreq_io_active_time_ns);
+
+
 /*********************************************************************
  *               REGISTER / UNREGISTER CPUFREQ DRIVER                *
  *********************************************************************/
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 21e8d248d956..2107a1169cce 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -950,7 +950,27 @@ static inline bool policy_has_boost_freq(struct cpufreq_policy *policy)
 }
 #endif
 
+#ifdef CONFIG_CPU_FREQ
+void cpufreq_io_active_begin(void);
+void cpufreq_io_active_end(void);
+uint64_t cpufreq_io_active_time_ns(void);
+#else
+static inline void cpufreq_io_active_begin(void)
+{
+}
+
+static inline void cpufreq_io_active_end(void)
+{
+}
+
+static inline uint64_t cpufreq_io_active_time_ns(void)
+{
+	return 0;
+}
+#endif
+
 extern void arch_freq_prepare_all(void);
+
 extern unsigned int arch_freq_get_on_cpu(int cpu);
 
 extern void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
-- 
2.16.1


* [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
  2018-03-28  6:38 [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
  2018-03-28  6:38 ` [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time Francisco Jerez
@ 2018-03-28  6:38 ` Francisco Jerez
  2018-03-28  6:38 ` [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names" Francisco Jerez
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Francisco Jerez @ 2018-03-28  6:38 UTC (permalink / raw)
  To: linux-pm, intel-gfx, Srinivas Pandruvada; +Cc: Eero Tamminen, Rafael J. Wysocki

This reverts commit dbd49b85eec7eb6d7ae61bad8306d5cdd85c142d.  A
future commit will introduce a new update_util implementation for LP
platforms, so the bxt_funcs table comes in handy.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 6d084c61ee25..fe847d086926 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1600,6 +1600,15 @@ static const struct pstate_funcs knl_funcs = {
 	.get_val = core_get_val,
 };
 
+static const struct pstate_funcs bxt_funcs = {
+	.get_max = core_get_max_pstate,
+	.get_max_physical = core_get_max_pstate_physical,
+	.get_min = core_get_min_pstate,
+	.get_turbo = core_get_turbo_pstate,
+	.get_scaling = core_get_scaling,
+	.get_val = core_get_val,
+};
+
 #define ICPU(model, policy) \
 	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF,\
 			(unsigned long)&policy }
@@ -1623,8 +1632,8 @@ static const struct x86_cpu_id intel_pstate_cpu_ids[] = {
 	ICPU(INTEL_FAM6_BROADWELL_XEON_D,	core_funcs),
 	ICPU(INTEL_FAM6_XEON_PHI_KNL,		knl_funcs),
 	ICPU(INTEL_FAM6_XEON_PHI_KNM,		knl_funcs),
-	ICPU(INTEL_FAM6_ATOM_GOLDMONT,		core_funcs),
-	ICPU(INTEL_FAM6_ATOM_GEMINI_LAKE,       core_funcs),
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT,		bxt_funcs),
+	ICPU(INTEL_FAM6_ATOM_GEMINI_LAKE,       bxt_funcs),
 	ICPU(INTEL_FAM6_SKYLAKE_X,		core_funcs),
 	{}
 };
-- 
2.16.1


* [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
  2018-03-28  6:38 [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
  2018-03-28  6:38 ` [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time Francisco Jerez
  2018-03-28  6:38 ` [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs" Francisco Jerez
@ 2018-03-28  6:38 ` Francisco Jerez
  2018-03-28  6:38 ` [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()" Francisco Jerez
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Francisco Jerez @ 2018-03-28  6:38 UTC (permalink / raw)
  To: linux-pm, intel-gfx, Srinivas Pandruvada; +Cc: Eero Tamminen, Rafael J. Wysocki

This reverts one half of commit
d77d4888cb8458b098accd4d7555c0f7f6399c4e.  It moves back to the old
name of get_target_pstate_use_cpu_load(), because a future commit will
introduce a new P-state target calculation function.  The shortened
name of INTEL_PSTATE_SAMPLING_INTERVAL is left untouched.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index fe847d086926..e21645f0fd3a 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1442,7 +1442,7 @@ static inline int32_t get_avg_pstate(struct cpudata *cpu)
 			  cpu->sample.core_avg_perf);
 }
 
-static inline int32_t get_target_pstate(struct cpudata *cpu)
+static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
 {
 	struct sample *sample = &cpu->sample;
 	int32_t busy_frac, boost;
@@ -1507,7 +1507,7 @@ static void intel_pstate_adjust_pstate(struct cpudata *cpu)
 
 	update_turbo_state();
 
-	target_pstate = get_target_pstate(cpu);
+	target_pstate = get_target_pstate_use_cpu_load(cpu);
 	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
 	trace_cpu_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu);
 	intel_pstate_update_pstate(cpu, target_pstate);
-- 
2.16.1


* [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
  2018-03-28  6:38 [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
                   ` (2 preceding siblings ...)
  2018-03-28  6:38 ` [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names" Francisco Jerez
@ 2018-03-28  6:38 ` Francisco Jerez
  2018-03-28  6:38 ` [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs" Francisco Jerez
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Francisco Jerez @ 2018-03-28  6:38 UTC (permalink / raw)
  To: linux-pm, intel-gfx, Srinivas Pandruvada; +Cc: Eero Tamminen, Rafael J. Wysocki

This reverts commit a891283e56362543d1d276e192266069ef52075b.  The
previous approach of taking an explicit P-state target as an argument
makes intel_pstate_adjust_pstate() more easily reusable by a future
commit.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index e21645f0fd3a..ef8f46ff7168 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1499,15 +1499,13 @@ static void intel_pstate_update_pstate(struct cpudata *cpu, int pstate)
 	wrmsrl(MSR_IA32_PERF_CTL, pstate_funcs.get_val(cpu, pstate));
 }
 
-static void intel_pstate_adjust_pstate(struct cpudata *cpu)
+static void intel_pstate_adjust_pstate(struct cpudata *cpu, int target_pstate)
 {
 	int from = cpu->pstate.current_pstate;
 	struct sample *sample;
-	int target_pstate;
 
 	update_turbo_state();
 
-	target_pstate = get_target_pstate_use_cpu_load(cpu);
 	target_pstate = intel_pstate_prepare_request(cpu, target_pstate);
 	trace_cpu_frequency(target_pstate * cpu->pstate.scaling, cpu->cpu);
 	intel_pstate_update_pstate(cpu, target_pstate);
@@ -1557,8 +1555,12 @@ static void intel_pstate_update_util(struct update_util_data *data, u64 time,
 		return;
 
 set_pstate:
-	if (intel_pstate_sample(cpu, time))
-		intel_pstate_adjust_pstate(cpu);
+	if (intel_pstate_sample(cpu, time)) {
+		int target_pstate;
+
+		target_pstate = get_target_pstate_use_cpu_load(cpu);
+		intel_pstate_adjust_pstate(cpu, target_pstate);
+	}
 }
 
 static struct pstate_funcs core_funcs = {
-- 
2.16.1


* [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
  2018-03-28  6:38 [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
                   ` (3 preceding siblings ...)
  2018-03-28  6:38 ` [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()" Francisco Jerez
@ 2018-03-28  6:38 ` Francisco Jerez
  2018-03-28  6:38 ` [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core Francisco Jerez
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Francisco Jerez @ 2018-03-28  6:38 UTC (permalink / raw)
  To: linux-pm, intel-gfx, Srinivas Pandruvada; +Cc: Eero Tamminen, Rafael J. Wysocki

This reverts commit c4f3f70cacba2fa19545389a12d09b606d2ad1cf.  A
future commit will introduce a new update_util implementation, so the
pstate_funcs table entry is going to be useful.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index ef8f46ff7168..ef699a3a238f 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -267,6 +267,7 @@ static struct cpudata **all_cpu_data;
  * @get_scaling:	Callback to get frequency scaling factor
  * @get_val:		Callback to convert P state to actual MSR write value
  * @get_vid:		Callback to get VID data for Atom platforms
+ * @update_util:	Active mode utilization update callback.
  *
  * Core and Atom CPU models have different way to get P State limits. This
  * structure is used to store those callbacks.
@@ -280,6 +281,8 @@ struct pstate_funcs {
 	int (*get_aperf_mperf_shift)(void);
 	u64 (*get_val)(struct cpudata*, int pstate);
 	void (*get_vid)(struct cpudata *);
+	void (*update_util)(struct update_util_data *data, u64 time,
+			    unsigned int flags);
 };
 
 static struct pstate_funcs pstate_funcs __read_mostly;
@@ -1570,6 +1573,7 @@ static struct pstate_funcs core_funcs = {
 	.get_turbo = core_get_turbo_pstate,
 	.get_scaling = core_get_scaling,
 	.get_val = core_get_val,
+	.update_util = intel_pstate_update_util,
 };
 
 static const struct pstate_funcs silvermont_funcs = {
@@ -1580,6 +1584,7 @@ static const struct pstate_funcs silvermont_funcs = {
 	.get_val = atom_get_val,
 	.get_scaling = silvermont_get_scaling,
 	.get_vid = atom_get_vid,
+	.update_util = intel_pstate_update_util,
 };
 
 static const struct pstate_funcs airmont_funcs = {
@@ -1590,6 +1595,7 @@ static const struct pstate_funcs airmont_funcs = {
 	.get_val = atom_get_val,
 	.get_scaling = airmont_get_scaling,
 	.get_vid = atom_get_vid,
+	.update_util = intel_pstate_update_util,
 };
 
 static const struct pstate_funcs knl_funcs = {
@@ -1600,6 +1606,7 @@ static const struct pstate_funcs knl_funcs = {
 	.get_aperf_mperf_shift = knl_get_aperf_mperf_shift,
 	.get_scaling = core_get_scaling,
 	.get_val = core_get_val,
+	.update_util = intel_pstate_update_util,
 };
 
 static const struct pstate_funcs bxt_funcs = {
@@ -1609,6 +1616,7 @@ static const struct pstate_funcs bxt_funcs = {
 	.get_turbo = core_get_turbo_pstate,
 	.get_scaling = core_get_scaling,
 	.get_val = core_get_val,
+	.update_util = intel_pstate_update_util,
 };
 
 #define ICPU(model, policy) \
@@ -1705,7 +1713,7 @@ static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 	/* Prevent intel_pstate_update_util() from using stale data. */
 	cpu->sample.time = 0;
 	cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
-				     intel_pstate_update_util);
+				     pstate_funcs.update_util);
 	cpu->update_util_set = true;
 }
 
@@ -2148,6 +2156,7 @@ static void __init copy_cpu_funcs(struct pstate_funcs *funcs)
 	pstate_funcs.get_scaling = funcs->get_scaling;
 	pstate_funcs.get_val   = funcs->get_val;
 	pstate_funcs.get_vid   = funcs->get_vid;
+	pstate_funcs.update_util = funcs->update_util;
 	pstate_funcs.get_aperf_mperf_shift = funcs->get_aperf_mperf_shift;
 }
 
@@ -2278,7 +2287,9 @@ static int __init intel_pstate_init(void)
 
 	if (x86_match_cpu(hwp_support_ids)) {
 		copy_cpu_funcs(&core_funcs);
-		if (!no_hwp) {
+		if (no_hwp) {
+			pstate_funcs.update_util = intel_pstate_update_util;
+		} else {
 			hwp_active++;
 			intel_pstate.attr = hwp_cpufreq_attrs;
 			goto hwp_cpu_matched;
-- 
2.16.1


* [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
  2018-03-28  6:38 [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
                   ` (4 preceding siblings ...)
  2018-03-28  6:38 ` [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs" Francisco Jerez
@ 2018-03-28  6:38 ` Francisco Jerez
  2018-03-28  6:38 ` [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile Francisco Jerez
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Francisco Jerez @ 2018-03-28  6:38 UTC (permalink / raw)
  To: linux-pm, intel-gfx, Srinivas Pandruvada; +Cc: Eero Tamminen, Rafael J. Wysocki

This introduces a controller for low-power parts that takes advantage
of the IO active time statistic introduced earlier in order to adjust
the trade-off between responsiveness and energy efficiency of the
heuristic dynamically.  This allows it to achieve lower energy
consumption when the system is far enough from the CPU-bound end of
the IO utilization statistic.  In low-latency mode the controller is
actually somewhat more aggressive than the current one due to its use
of the APERF/MPERF ratio (particularly if C0 residency is low, which
by itself partly mitigates the lower energy efficiency of the
aggressive heuristic) -- see the comments below for the rationale.

The heuristic is tuned to roughly match the performance numbers of the
current governor (which is rather aggressive) in latency-bound
test-cases, so the energy-saving behavior won't kick in with the
current calibration except when heavily IO-bound for some time.  The
RT and DL scheduling flags could potentially provide a useful
additional variable for the heuristic to decide whether the workload
is latency-sensitive, allowing it to save power in other
(non-IO-bound) cases, but this is not attempted in this series since
there would be an increased risk of performance regressions due to
latency-sensitive tasks not marked RT or DL.

For the moment this is only enabled on BXT in order to reduce the
extent of any unexpected fallout, but it should work on other
low-power platforms if it's hooked up to the right pstate_funcs table
(at your own risk).
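
The core of the energy-saving mode is an exponentially weighted
low-pass filter of the estimated P-state.  The toy userspace program
below is not taken from the patch (it uses floating point and a plain
base-e exponential instead of the fixed-point decay() approximation
implemented further down) and is only meant to illustrate the kind of
response lp->p_base follows for an oscillating workload:

#include <math.h>
#include <stdio.h>

/*
 * Illustration only: blend the previous filtered value with the new
 * estimate.  "avg_hz" plays the role of p_base_avg_hz and "dt_s" of
 * the sampling interval.
 */
static double lp_filter(double p_base, double p_est, double dt_s,
			double avg_hz)
{
	const double alpha = exp(-avg_hz * dt_s);

	return p_est + alpha * (p_base - p_est);
}

int main(void)
{
	double p_base = 20.0;
	int i;

	/*
	 * A workload whose instantaneous P-state estimate oscillates
	 * every 10 ms settles close to the average instead of
	 * following the oscillation.
	 */
	for (i = 0; i < 200; i++) {
		const double p_est = (i & 1) ? 28.0 : 12.0;

		p_base = lp_filter(p_base, p_est, 0.01, 3.0);
	}

	printf("filtered P-state: %.2f\n", p_base);
	return 0;
}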

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 357 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 353 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index ef699a3a238f..d4b5d0aaa282 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -61,6 +61,11 @@ static inline int32_t mul_fp(int32_t x, int32_t y)
 	return ((int64_t)x * (int64_t)y) >> FRAC_BITS;
 }
 
+static inline int rnd_fp(int32_t x)
+{
+	return (x + (1 << (FRAC_BITS - 1))) >> FRAC_BITS;
+}
+
 static inline int32_t div_fp(s64 x, s64 y)
 {
 	return div64_s64((int64_t)x << FRAC_BITS, y);
@@ -171,6 +176,23 @@ struct vid_data {
 	int32_t ratio;
 };
 
+/**
+ * struct lp_data - LP controller parameters and state.
+ * @sample_interval_ns:  Update interval in ns
+ * @last_io_active_ns:   Cumulative IO active time in ns observed at the
+ *                       last sample.
+ * @setpoint:            Target CPU utilization at which the controller is
+ *                       expected to leave the current P-state untouched, as
+ *                       a fixed-point fraction.
+ * @p_base:              Low-pass filtered P-state as a fixed-point fraction.
+ */
+struct lp_data {
+	s64 sample_interval_ns;
+	uint64_t last_io_active_ns;
+	int32_t setpoint;
+	int32_t p_base;
+};
+
 /**
  * struct global_params - Global parameters, mostly tunable via sysfs.
  * @no_turbo:		Whether or not to use turbo P-states.
@@ -234,6 +256,7 @@ struct cpudata {
 
 	struct pstate_data pstate;
 	struct vid_data vid;
+	struct lp_data lp;
 
 	u64	last_update;
 	u64	last_sample_time;
@@ -258,6 +281,28 @@ struct cpudata {
 
 static struct cpudata **all_cpu_data;
 
+/**
+ * struct lp_params - LP controller static configuration
+ * @sample_interval_ms:      Update interval in ms
+ * @setpoint_pml:            Target CPU utilization at which the controller is
+ *                           expected to leave the current P-state untouched,
+ *                           as an integer per mille.
+ * @p_base_avg_hz:           Exponential averaging frequency of the P-state
+ *                           low-pass filter as an integer in Hz.
+ * @io_active_threshold_pml: IO utilization threshold at which the controller
+ *                           should transition to a higher latency low-pass
+ *                           filtering mode, as an integer per mille.
+ * @io_active_avg_hz:        Exponential averaging frequency of the IO
+ *                           utilization statistic as an integer in Hz.
+ */
+struct lp_params {
+	int sample_interval_ms;
+	int setpoint_pml;
+	int p_base_avg_hz;
+	int io_active_threshold_pml;
+	int io_active_avg_hz;
+};
+
 /**
  * struct pstate_funcs - Per CPU model specific callbacks
  * @get_max:		Callback to get maximum non turbo effective P state
@@ -286,6 +331,13 @@ struct pstate_funcs {
 };
 
 static struct pstate_funcs pstate_funcs __read_mostly;
+static struct lp_params lp_params __read_mostly = {
+	.sample_interval_ms = 10,
+	.setpoint_pml = 700,
+	.p_base_avg_hz = 3,
+	.io_active_threshold_pml = 983,
+	.io_active_avg_hz = 3
+};
 
 static int hwp_active __read_mostly;
 static bool per_cpu_limits __read_mostly;
@@ -1483,6 +1535,285 @@ static inline int32_t get_target_pstate_use_cpu_load(struct cpudata *cpu)
 	return target;
 }
 
+/**
+ * Initialize the struct lp_data of the specified CPU to the defaults
+ * calculated from @lp_params.
+ */
+static void intel_pstate_lp_reset(struct cpudata *cpu)
+{
+	struct lp_data *lp = &cpu->lp;
+
+	lp->sample_interval_ns = lp_params.sample_interval_ms * NSEC_PER_MSEC;
+	lp->setpoint = div_fp(lp_params.setpoint_pml, 1000);
+	lp->p_base = int_tofp(cpu->pstate.current_pstate);
+}
+
+/**
+ * Unit ramp function used as building block for more complex
+ * piecewise linear functions.
+ */
+static int32_t ramp(int32_t x0, int32_t x1, int32_t x)
+{
+	return x <= x0 ? 0 :
+	       x >= x1 ? int_tofp(1) :
+	       div_fp(x - x0, x1 - x0);
+}
+
+/**
+ * Fixed point representation with twice the usual number of
+ * fractional bits.
+ */
+#define DFRAC_BITS 16
+#define DFRAC_ONE (1 << DFRAC_BITS)
+#define DFRAC_MAX_INT (0u - (uint32_t)DFRAC_ONE)
+
+/**
+ * Fast but rather inaccurate piecewise-linear approximation of a
+ * fixed-point product by an inverse exponential:
+ *
+ *  decay(a, p) = a * 2 ^ (-p / DFRAC_ONE) + O(a)
+ *
+ * The error term should be lower in magnitude than 0.044 * a.
+ */
+static int32_t decay(int32_t a, uint32_t p)
+{
+	if (a < 0) {
+		/*
+		 * Avoid implementation-defined behavior in signed
+		 * right shift of negative integer.
+		 */
+		return -decay(-a, p);
+
+	} else if (p < 32 * DFRAC_ONE) {
+		/* Interpolate between 2^-floor(p) and 2^-ceil(p). */
+		const uint32_t floor_p = p >> DFRAC_BITS;
+		const uint32_t ceil_p = (p + DFRAC_ONE - 1) >> DFRAC_BITS;
+		const uint64_t frac_p = p - (floor_p << DFRAC_BITS);
+
+		return ((a >> floor_p) * (DFRAC_ONE - frac_p) +
+			(ceil_p >= 32 ? 0 : a >> ceil_p) * frac_p) >>
+		       DFRAC_BITS;
+	}
+
+	/* Short-circuit to avoid overflow. */
+	return 0;
+}
+
+/**
+ * Calculate the target P-state for the next update period.  Uses a
+ * (variably) low-pass-filtering controller intended to improve energy
+ * efficiency under some conditions controlled heuristically.
+ */
+static int32_t get_target_pstate_lp(struct cpudata *cpu)
+{
+	struct lp_data *lp = &cpu->lp;
+	/*
+	 * Estimate the average IO utilization over the sampling
+	 * interval.
+	 */
+	const uint64_t delta_ns = cpu->sample.time - cpu->last_sample_time;
+	const uint64_t io_active_ns = cpufreq_io_active_time_ns();
+	const uint32_t io_active_pml = div_fp(
+		(io_active_ns - lp->last_io_active_ns) * 1000,
+		delta_ns);
+	/*
+	 * Approximate, but saves two 64-bit integer divisions below
+	 * and should be fully evaluated at compile-time.  Causes the
+	 * exponential averaging to have an effective base of
+	 * 1.90702343749, which has little functional implications as
+	 * long as the io_active_avg_hz and p_base_avg_hz parameters
+	 * are scaled accordingly.
+	 */
+	const uint32_t ns_per_s_shift = order_base_2(NSEC_PER_SEC);
+	/*
+	 * Exponentially average the IO utilization observed during
+	 * the last interval in order to obtain a more long-term
+	 * statistic.  The exponent p is the ratio between the elapsed
+	 * time and an exponential averaging time constant:
+	 *
+	 *   T0 = s / io_active_avg_hz
+	 *
+	 * This time constant should typically be of the order of
+	 * magnitude of the time constant T1 of the low-pass filter in
+	 * order for the IO utilization statistic to contain a
+	 * non-negligible contribution from the behavior of the system
+	 * in the window of time in which the low-pass filter will be
+	 * rearranging the workload.  A longer time constant provides
+	 * a more stable statistic at the cost of making it less
+	 * responsive to changes in the behavior of the system.
+	 */
+	const uint32_t p = min((uint64_t)DFRAC_MAX_INT,
+			       (lp_params.io_active_avg_hz * delta_ns) >>
+			       (ns_per_s_shift - DFRAC_BITS));
+	const uint32_t io_active_avg_pml = io_active_pml +
+		decay(cpu->iowait_boost - io_active_pml, p);
+	/*
+	 * Whether the system is under close to full IO utilization,
+	 * which causes the controller to make a more conservative
+	 * trade-off between latency and energy usage, since the
+	 * system is close to IO-bound so performance isn't guaranteed
+	 * to scale further with increasing CPU frequency.
+	 *
+	 * If the workload *is* latency-bound despite the nearly full
+	 * IO utilization, the more conservative behavior of the
+	 * controller will result in negative feed-back causing the IO
+	 * utilization to drop under io_active_threshold_pml, at which
+	 * point the controller will switch back to the more
+	 * aggressively latency-minimizing mode, which will cause the
+	 * IO utilization to increase.
+	 *
+	 * This means that io_active_threshold_pml acts as a point of
+	 * stable equilibrium of the system for latency-bound IO
+	 * workloads, so it should be rather high in order to avoid
+	 * underutilizing IO devices around equilibrium, which may
+	 * hurt performance for such (rather pathological) workloads.
+	 * Applications that pipeline CPU and IO work shouldn't be
+	 * appreciably latency-bound and will be able to achieve full
+	 * IO utilization regardless of io_active_threshold_pml.  In
+	 * addition they will obtain comparatively lower energy usage
+	 * since the controller should hardly ever have to switch to
+	 * latency-minimizing mode in the steady state.
+	 */
+	const bool relax = rnd_fp(io_active_avg_pml) >=
+			   lp_params.io_active_threshold_pml;
+	/*
+	 * P-state limits in fixed-point as allowed by the policy.
+	 */
+	const int32_t p_min = int_tofp(max(cpu->pstate.min_pstate,
+					   cpu->min_perf_ratio));
+	const int32_t p_max = int_tofp(cpu->max_perf_ratio);
+	const int32_t p_cur = int_tofp(cpu->pstate.current_pstate);
+	/*
+	 * Observed average P-state during (a part of) the sampling
+	 * period.  The conservative path uses the TSC increment as
+	 * denominator which will give the minimum (arguably most
+	 * energy-efficient) P-state able to accomplish the observed
+	 * amount of work during the sampling period.
+	 *
+	 * The downside of that somewhat optimistic estimate is that
+	 * it can give a rather biased result for intermittent
+	 * latency-sensitive workloads, which may have to be completed
+	 * in a short window of time for the system to achieve maximum
+	 * performance, even though the average CPU utilization is
+	 * low.  For that reason the latency-minimizing heuristic uses
+	 * the MPERF increment as denominator instead, which will give
+	 * the P-state able to accomplish the observed amount of work
+	 * during the time that the processor was actually awake (in
+	 * C0 state specifically), which is approximately optimal
+	 * under the rather pessimistic assumption that the CPU work
+	 * cannot be parallelized with any other dependent IO work
+	 * that subsequently keeps the CPU idle (arguably in C1+
+	 * states), so MPERF provides an estimate of the time the CPU
+	 * actually had available to accomplish the observed work.
+	 */
+	const s64 n_obs = cpu->sample.aperf << cpu->aperf_mperf_shift;
+	const s64 n_max = (relax ? cpu->sample.tsc :
+			   cpu->sample.mperf << cpu->aperf_mperf_shift);
+	const int32_t p_obs = min(p_cur, div_fp(
+				     n_obs * cpu->pstate.max_pstate_physical,
+				     n_max));
+	/*
+	 * Average P-state that would have been observed at the target
+	 * CPU utilization.  A lower setpoint fraction gives the
+	 * controller a stronger upward bias and a larger room for the
+	 * system load to fluctuate between update periods, at the
+	 * cost of increasing energy usage.
+	 */
+	const int32_t p_tgt = mul_fp(lp->setpoint, p_cur);
+	/*
+	 * Unfiltered controller response for the observed average
+	 * performance during the last sampling period.	 This is the
+	 * simplest piecewise-linear function of p_obs that satisfies
+	 * the following properties:
+	 *
+	 *   - p_est(0) = p_min
+	 *   - p_est(p_tgt) = p_cur
+	 *   - p_est(p_cur) = p_max
+	 *
+	 * which ensure that the P-state range specified by the policy
+	 * is honored and that the controller has a fixed point at the
+	 * target utilization.
+	 */
+	const int32_t p_est = max(p_min,
+				  mul_fp(p_cur, ramp(0, p_tgt, p_obs)) +
+				  mul_fp(p_max - p_cur,
+					 ramp(p_tgt, p_cur, p_obs)));
+	/*
+	 * Low-pass filter the P-state estimate above by exponential
+	 * averaging.  For an oscillating workload (e.g. submitting
+	 * work repeatedly to a device like a soundcard or GPU) this
+	 * will approximate the minimum P-state that would be able to
+	 * accomplish the observed amount of work during the averaging
+	 * period, which is also the optimally energy-efficient one,
+	 * under the assumptions that:
+	 *
+	 *  - The power curve of the system is convex throughout the
+	 *    range of P-states allowed by the policy.	I.e. energy
+	 *    efficiency is steadily decreasing with frequency past
+	 *    p_min (which is typically close to the
+	 *    maximum-efficiency ratio).  In practice for the lower
+	 *    range of P-states this may only be approximately true
+	 *    due to the interaction between different components of
+	 *    the system.
+	 *
+	 *  - Parallelism constraints of the workload don't prevent it
+	 *    from achieving the same throughput at the lower P-state.
+	 *    This will happen in cases where the application is
+	 *    designed in a way that doesn't allow for dependent CPU
+	 *    and IO jobs to be pipelined, leading to alternating full
+	 *    and zero utilization of the CPU and IO device.  This
+	 *    will give an average IO device utilization lower than
+	 *    100% regardless of the CPU frequency, which is expected
+	 *    to cause the controller to transition to low-latency
+	 *    mode once the IO utilization drops below
+	 *    io_active_threshold_pml, at which point p_base will no
+	 *    longer have an influence on the controller response.
+	 *
+	 *  - The period of the oscillating workload is significantly
+	 *    shorter than the time constant of the exponential
+	 *    average:
+	 *
+	 *	T1 = s / p_base_avg_hz
+	 *
+	 *    Otherwise for more slowly oscillating workloads the
+	 *    controller response will roughly follow the oscillation,
+	 *    leading to decreased energy efficiency.
+	 *
+	 *  - The behavior of the workload doesn't change
+	 *    qualitatively during the next update interval.  This is
+	 *    only true in the steady state, and could possibly lead
+	 *    to a transitory period in which the controller response
+	 *    deviates from the most energy-efficient ratio until the
+	 *    workload reaches a steady state again.  This could be
+	 *    mitigated to a certain extent with some form of
+	 *    per-entity load tracking.
+	 */
+	const uint32_t q = min((uint64_t)DFRAC_MAX_INT,
+			       (lp_params.p_base_avg_hz * delta_ns) >>
+			       (ns_per_s_shift - DFRAC_BITS));
+	lp->p_base = p_est + decay(lp->p_base - p_est, q);
+	/*
+	 * Update busy_scaled with the utilization fraction relative
+	 * to the TSC base frequency.  Used for tracing.  100%
+	 * indicates full utilization at the maximum non-turbo
+	 * frequency.
+	 */
+	cpu->sample.busy_scaled = div_fp(100 * n_obs, cpu->sample.tsc);
+	/*
+	 * Track the IO utilization statistic as iowait_boost for it
+	 * to show up in traces instead of the iowait_boost fraction.
+	 */
+	cpu->iowait_boost = io_active_avg_pml;
+	lp->last_io_active_ns = io_active_ns;
+	/*
+	 * Use the low-pass-filtered controller response for better
+	 * energy efficiency unless we have reasons to believe that
+	 * some of the optimality assumptions discussed above may not
+	 * hold.
+	 */
+	return rnd_fp(relax ? lp->p_base : p_est);
+}
+
 static int intel_pstate_prepare_request(struct cpudata *cpu, int pstate)
 {
 	int max_pstate = intel_pstate_get_base_pstate(cpu);
@@ -1566,6 +1897,22 @@ static void intel_pstate_update_util(struct update_util_data *data, u64 time,
 	}
 }
 
+/**
+ * Implementation of the cpufreq update_util hook based on the LP
+ * controller (see get_target_pstate_lp()).
+ */
+static void intel_pstate_update_util_lp(struct update_util_data *data,
+					u64 time, unsigned int flags)
+{
+	struct cpudata *cpu = container_of(data, struct cpudata, update_util);
+	const u64 delta_ns = time - cpu->sample.time;
+
+	if (smp_processor_id() == cpu->cpu &&
+	    delta_ns >= cpu->lp.sample_interval_ns &&
+	    intel_pstate_sample(cpu, time))
+		intel_pstate_adjust_pstate(cpu, get_target_pstate_lp(cpu));
+}
+
 static struct pstate_funcs core_funcs = {
 	.get_max = core_get_max_pstate,
 	.get_max_physical = core_get_max_pstate_physical,
@@ -1616,7 +1963,7 @@ static const struct pstate_funcs bxt_funcs = {
 	.get_turbo = core_get_turbo_pstate,
 	.get_scaling = core_get_scaling,
 	.get_val = core_get_val,
-	.update_util = intel_pstate_update_util,
+	.update_util = intel_pstate_update_util_lp,
 };
 
 #define ICPU(model, policy) \
@@ -1695,6 +2042,10 @@ static int intel_pstate_init_cpu(unsigned int cpunum)
 
 	intel_pstate_get_cpu_pstates(cpu);
 
+	if (!hwp_active &&
+	    pstate_funcs.update_util == intel_pstate_update_util_lp)
+		intel_pstate_lp_reset(cpu);
+
 	pr_debug("controlling: cpu %d\n", cpunum);
 
 	return 0;
@@ -2287,9 +2638,7 @@ static int __init intel_pstate_init(void)
 
 	if (x86_match_cpu(hwp_support_ids)) {
 		copy_cpu_funcs(&core_funcs);
-		if (no_hwp) {
-			pstate_funcs.update_util = intel_pstate_update_util;
-		} else {
+		if (!no_hwp) {
 			hwp_active++;
 			intel_pstate.attr = hwp_cpufreq_attrs;
 			goto hwp_cpu_matched;
-- 
2.16.1


* [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
  2018-03-28  6:38 [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
                   ` (5 preceding siblings ...)
  2018-03-28  6:38 ` [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core Francisco Jerez
@ 2018-03-28  6:38 ` Francisco Jerez
  2018-03-28  6:38 ` [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs Francisco Jerez
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Francisco Jerez @ 2018-03-28  6:38 UTC (permalink / raw)
  To: linux-pm, intel-gfx, Srinivas Pandruvada; +Cc: Eero Tamminen, Rafael J. Wysocki

This is provided at Srinivas' request.  The LP controller is disabled
for the moment on server FADT profiles in order to avoid disturbing
the performance behavior of small-core servers.  In cases where the
default inferred from the BIOS FADT profile is suboptimal, the LP
controller can be forcefully enabled or disabled by passing
"intel_pstate=lp" or "intel_pstate=no_lp" respectively in the kernel
command line.
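
As a usage note (assuming a GRUB-based distribution; the exact
mechanism for passing kernel parameters varies), the controller could
be forced off on a system whose FADT profile would otherwise enable
it with something like:

    GRUB_CMDLINE_LINUX_DEFAULT="... intel_pstate=no_lp"

in /etc/default/grub, followed by regenerating the GRUB configuration
(e.g. update-grub on Debian-style distributions) and rebooting.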

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 Documentation/admin-guide/kernel-parameters.txt |  6 +++++
 Documentation/admin-guide/pm/intel_pstate.rst   |  7 ++++++
 drivers/cpufreq/intel_pstate.c                  | 32 ++++++++++++++++++++++++-
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1d1d53f85ddd..0ba112696938 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1681,6 +1681,12 @@
 		per_cpu_perf_limits
 			Allow per-logical-CPU P-State performance control limits using
 			cpufreq sysfs interface
+		lp
+			Force use of LP P-state controller.  Overrides selection
+			derived from ACPI FADT profile.  Has no effect if HWP is
+			available.
+		no_lp
+			Prevent use of LP P-state controller (see "lp" parameter).
 
 	intremap=	[X86-64, Intel-IOMMU]
 			on	enable Interrupt Remapping (default)
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index d2b6fda3d67b..a5885fc4c039 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -642,6 +642,13 @@ of them have to be prepended with the ``intel_pstate=`` prefix.
 	Use per-logical-CPU P-State limits (see `Coordination of P-state
 	Limits`_ for details).
 
+``lp``
+	Force use of LP P-state controller.  Overrides selection derived
+	from ACPI FADT profile.  Has no effect if HWP is available.
+
+``no_lp``
+	Prevent use of LP P-state controller (see "lp" parameter).
+
 
 Diagnostics and Tuning
 ======================
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index d4b5d0aaa282..d0e212387755 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2485,6 +2485,7 @@ static int intel_pstate_update_status(const char *buf, size_t size)
 
 static int no_load __initdata;
 static int no_hwp __initdata;
+static int no_lp __initdata;
 static int hwp_only __initdata;
 static unsigned int force_load __initdata;
 
@@ -2507,8 +2508,12 @@ static void __init copy_cpu_funcs(struct pstate_funcs *funcs)
 	pstate_funcs.get_scaling = funcs->get_scaling;
 	pstate_funcs.get_val   = funcs->get_val;
 	pstate_funcs.get_vid   = funcs->get_vid;
-	pstate_funcs.update_util = funcs->update_util;
 	pstate_funcs.get_aperf_mperf_shift = funcs->get_aperf_mperf_shift;
+
+	if (no_lp)
+		pstate_funcs.update_util = intel_pstate_update_util;
+	else
+		pstate_funcs.update_util = funcs->update_util;
 }
 
 #ifdef CONFIG_ACPI
@@ -2690,6 +2695,25 @@ static int __init intel_pstate_init(void)
 }
 device_initcall(intel_pstate_init);
 
+#ifdef CONFIG_ACPI
+static bool __init is_server_acpi_profile(void)
+{
+	switch (acpi_gbl_FADT.preferred_profile) {
+	case PM_ENTERPRISE_SERVER:
+	case PM_SOHO_SERVER:
+	case PM_PERFORMANCE_SERVER:
+		return true;
+	default:
+		return false;
+	}
+}
+#else
+static bool __init is_server_acpi_profile(void)
+{
+	return false;
+}
+#endif
+
 static int __init intel_pstate_setup(char *str)
 {
 	if (!str)
@@ -2713,6 +2737,12 @@ static int __init intel_pstate_setup(char *str)
 	if (!strcmp(str, "per_cpu_perf_limits"))
 		per_cpu_limits = true;
 
+	no_lp = is_server_acpi_profile();
+	if (!strcmp(str, "lp"))
+		no_lp = 0;
+	if (!strcmp(str, "no_lp"))
+		no_lp = 1;
+
 #ifdef CONFIG_ACPI
 	if (!strcmp(str, "support_acpi_ppc"))
 		acpi_ppc = true;
-- 
2.16.1


* [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
  2018-03-28  6:38 [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
                   ` (6 preceding siblings ...)
  2018-03-28  6:38 ` [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile Francisco Jerez
@ 2018-03-28  6:38 ` Francisco Jerez
  2018-03-28  6:38 ` [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq Francisco Jerez
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Francisco Jerez @ 2018-03-28  6:38 UTC (permalink / raw)
  To: linux-pm, intel-gfx, Srinivas Pandruvada; +Cc: Eero Tamminen, Rafael J. Wysocki

This is not required for the controller to work, but it has proven
very useful for debugging and for testing alternative heuristic
parameters, which may offer a better trade-off between energy
efficiency and latency -- The default parameters are rather
aggressively latency-minimizing in order to keep up with the big-core
controller and avoid regressions in latency-sensitive applications;
however, a more energy-efficient trade-off may be preferable in
practice.  E.g.:

{
   .sample_interval_ms = 10,
   .setpoint_pml = 700,
   .p_base_avg_hz = 3,
   .io_active_threshold_pml = 950,
   .io_active_avg_hz = 10
}

A warning is printed out, which should taint the kernel, so that the
non-standard calibration of the heuristic is obvious in bug reports.
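
For reference, with debugfs mounted at the usual /sys/kernel/debug
location, the alternative calibration above could be applied at
runtime (as root) with something like:

    echo 10  > /sys/kernel/debug/pstate_snb/lp_sample_interval_ms
    echo 700 > /sys/kernel/debug/pstate_snb/lp_setpoint_pml
    echo 3   > /sys/kernel/debug/pstate_snb/lp_p_base_avg_hz
    echo 950 > /sys/kernel/debug/pstate_snb/lp_io_active_threshold_pml
    echo 10  > /sys/kernel/debug/pstate_snb/lp_io_active_avg_hz

Each write resets the LP controller state of every CPU; the first one
also triggers the taint warning mentioned above.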

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/cpufreq/intel_pstate.c | 90 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index d0e212387755..7b1666c34273 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -862,6 +862,92 @@ static void intel_pstate_update_policies(void)
 		cpufreq_update_policy(cpu);
 }
 
+/************************** debugfs begin ************************/
+static void intel_pstate_lp_reset(struct cpudata *cpu);
+
+static int lp_param_set(void *data, u64 val)
+{
+	unsigned int cpu;
+
+	*(u32 *)data = val;
+	for_each_possible_cpu(cpu) {
+		if (all_cpu_data[cpu])
+			intel_pstate_lp_reset(all_cpu_data[cpu]);
+	}
+
+	WARN_ONCE(1, "Unsupported P-state LP parameter update via debugging interface");
+
+	return 0;
+}
+
+static int lp_param_get(void *data, u64 *val)
+{
+	*val = *(u32 *)data;
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(fops_lp_param, lp_param_get, lp_param_set, "%llu\n");
+
+static struct dentry *debugfs_parent;
+
+struct lp_param {
+	char *name;
+	void *value;
+	struct dentry *dentry;
+};
+
+static struct lp_param lp_files[] = {
+	{"lp_sample_interval_ms", &lp_params.sample_interval_ms, },
+	{"lp_setpoint_pml", &lp_params.setpoint_pml, },
+	{"lp_p_base_avg_hz", &lp_params.p_base_avg_hz, },
+	{"lp_io_active_threshold_pml", &lp_params.io_active_threshold_pml, },
+	{"lp_io_active_avg_hz", &lp_params.io_active_avg_hz, },
+	{NULL, NULL, }
+};
+
+static void intel_pstate_update_util_lp(struct update_util_data *data,
+					u64 time, unsigned int flags);
+
+static void intel_pstate_debug_expose_params(void)
+{
+	int i;
+
+	if (hwp_active ||
+	    pstate_funcs.update_util != intel_pstate_update_util_lp)
+		return;
+
+	debugfs_parent = debugfs_create_dir("pstate_snb", NULL);
+	if (IS_ERR_OR_NULL(debugfs_parent))
+		return;
+
+	for (i = 0; lp_files[i].name; i++) {
+		struct dentry *dentry;
+
+		dentry = debugfs_create_file(lp_files[i].name, 0660,
+					     debugfs_parent, lp_files[i].value,
+					     &fops_lp_param);
+		if (!IS_ERR(dentry))
+			lp_files[i].dentry = dentry;
+	}
+}
+
+static void intel_pstate_debug_hide_params(void)
+{
+	int i;
+
+	if (IS_ERR_OR_NULL(debugfs_parent))
+		return;
+
+	for (i = 0; lp_files[i].name; i++) {
+		debugfs_remove(lp_files[i].dentry);
+		lp_files[i].dentry = NULL;
+	}
+
+	debugfs_remove(debugfs_parent);
+	debugfs_parent = NULL;
+}
+
+/************************** debugfs end ************************/
+
 /************************** sysfs begin ************************/
 #define show_one(file_name, object)					\
 	static ssize_t show_##file_name					\
@@ -2423,6 +2509,8 @@ static int intel_pstate_register_driver(struct cpufreq_driver *driver)
 
 	global.min_perf_pct = min_perf_pct_min();
 
+	intel_pstate_debug_expose_params();
+
 	return 0;
 }
 
@@ -2431,6 +2519,8 @@ static int intel_pstate_unregister_driver(void)
 	if (hwp_active)
 		return -EBUSY;
 
+	intel_pstate_debug_hide_params();
+
 	cpufreq_unregister_driver(intel_pstate_driver);
 	intel_pstate_driver_cleanup();
 
-- 
2.16.1


* [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
  2018-03-28  6:38 [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
                   ` (7 preceding siblings ...)
  2018-03-28  6:38 ` [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs Francisco Jerez
@ 2018-03-28  6:38 ` Francisco Jerez
  2018-03-28  8:02   ` Chris Wilson
  2018-03-30 18:50 ` [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Francisco Jerez @ 2018-03-28  6:38 UTC (permalink / raw)
  To: linux-pm, intel-gfx, Srinivas Pandruvada; +Cc: Eero Tamminen, Rafael J. Wysocki

This allows cpufreq governors to realize when the system becomes
non-CPU-bound due to GPU rendering activity, which will cause the
intel_pstate LP controller to behave more conservatively: CPU energy
usage will be reduced when there isn't a good chance for system
performance to scale with CPU frequency.  This leaves additional TDP
budget available for the GPU to reach higher frequencies, which is
translated into an improvement in graphics performance to the extent
that the workload remains TDP-limited (Most non-trivial graphics
benchmarks out there improve significantly in TDP-constrained
platforms, see the cover letter for some numbers).  If the workload
isn't (anymore) TDP-limited performance should stay roughly constant,
but energy usage will be divided by a similar factor.

The intel_pstate LP controller is only enabled on BXT+ small-core
platforms at this point, so this shouldn't have any effect on other
systems.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/gpu/drm/i915/intel_lrc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 3a69b367e565..721f915115bd 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -132,6 +132,7 @@
  *
  */
 #include <linux/interrupt.h>
+#include <linux/cpufreq.h>
 
 #include <drm/drmP.h>
 #include <drm/i915_drm.h>
@@ -379,11 +380,13 @@ execlists_context_schedule_in(struct i915_request *rq)
 {
 	execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
 	intel_engine_context_in(rq->engine);
+	cpufreq_io_active_begin();
 }
 
 static inline void
 execlists_context_schedule_out(struct i915_request *rq)
 {
+	cpufreq_io_active_end();
 	intel_engine_context_out(rq->engine);
 	execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_OUT);
 }
@@ -726,6 +729,7 @@ execlists_cancel_port_requests(struct intel_engine_execlists * const execlists)
 		struct i915_request *rq = port_request(port);
 
 		GEM_BUG_ON(!execlists->active);
+		cpufreq_io_active_end();
 		intel_engine_context_out(rq->engine);
 
 		execlists_context_status_change(rq,
-- 
2.16.1


* Re: [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
  2018-03-28  6:38 ` [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq Francisco Jerez
@ 2018-03-28  8:02   ` Chris Wilson
  2018-03-28 18:55     ` Francisco Jerez
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Wilson @ 2018-03-28  8:02 UTC (permalink / raw)
  To: Francisco Jerez, linux-pm, intel-gfx, Srinivas Pandruvada
  Cc: Eero Tamminen, Rafael J. Wysocki

Quoting Francisco Jerez (2018-03-28 07:38:45)
> This allows cpufreq governors to realize when the system becomes
> non-CPU-bound due to GPU rendering activity, which will cause the
> intel_pstate LP controller to behave more conservatively: CPU energy
> usage will be reduced when there isn't a good chance for system
> performance to scale with CPU frequency.  This leaves additional TDP
> budget available for the GPU to reach higher frequencies, which is
> translated into an improvement in graphics performance to the extent
> that the workload remains TDP-limited (Most non-trivial graphics
> benchmarks out there improve significantly in TDP-constrained
> platforms, see the cover letter for some numbers).  If the workload
> isn't (anymore) TDP-limited performance should stay roughly constant,
> but energy usage will be divided by a similar factor.

And that's what I thought IPS was already meant to be achieving;
intelligent distribution between different units...
 
> The intel_pstate LP controller is only enabled on BXT+ small-core
> platforms at this point, so this shouldn't have any effect on other
> systems.

Although that's probably only a feature for big core :)
 
> Signed-off-by: Francisco Jerez <currojerez@riseup.net>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 3a69b367e565..721f915115bd 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -132,6 +132,7 @@
>   *
>   */
>  #include <linux/interrupt.h>
> +#include <linux/cpufreq.h>
>  
>  #include <drm/drmP.h>
>  #include <drm/i915_drm.h>
> @@ -379,11 +380,13 @@ execlists_context_schedule_in(struct i915_request *rq)
>  {
>         execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
>         intel_engine_context_in(rq->engine);
> +       cpufreq_io_active_begin();

Since you only care about a binary value for GPU activity, we don't need
to do this on each context, just between submitting the first request
and the final completion, i.e. couple this to EXECLISTS_ACTIVE_USER.

Haven't yet gone back to check how heavy io_active_begin/end are, but I
trust you appreciate that this code is particularly latency sensitive.
-Chris

* Re: [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
  2018-03-28  8:02   ` Chris Wilson
@ 2018-03-28 18:55     ` Francisco Jerez
  2018-03-28 19:20       ` Chris Wilson
  0 siblings, 1 reply; 39+ messages in thread
From: Francisco Jerez @ 2018-03-28 18:55 UTC (permalink / raw)
  To: Chris Wilson, linux-pm, intel-gfx, Srinivas Pandruvada
  Cc: Eero Tamminen, Rafael J. Wysocki


Hi Chris,

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Francisco Jerez (2018-03-28 07:38:45)
>> This allows cpufreq governors to realize when the system becomes
>> non-CPU-bound due to GPU rendering activity, which will cause the
>> intel_pstate LP controller to behave more conservatively: CPU energy
>> usage will be reduced when there isn't a good chance for system
>> performance to scale with CPU frequency.  This leaves additional TDP
>> budget available for the GPU to reach higher frequencies, which is
>> translated into an improvement in graphics performance to the extent
>> that the workload remains TDP-limited (Most non-trivial graphics
>> benchmarks out there improve significantly in TDP-constrained
>> platforms, see the cover letter for some numbers).  If the workload
>> isn't (anymore) TDP-limited performance should stay roughly constant,
>> but energy usage will be divided by a similar factor.
>
> And that's what I thought IPS was already meant to be achieving;
> intelligent distribution between different units...
>  

I'm not aware of IPS ever being available on any small-core systems.

>> The intel_pstate LP controller is only enabled on BXT+ small-core
>> platforms at this point, so this shouldn't have any effect on other
>> systems.
>
> Although that's probably only a feature for big core :)
>

Actually I wouldn't rule out its being useful on big core up front.
I've been playing around with the idea of hooking up this same
interface to the EPP preference used by HWP, which would allow the HWP
controller to reduce energy usage in GPU-bound conditions and could
potentially leave additional TDP headroom available for the GPU --
It's not uncommon for graphics workloads to be TDP-limited even on big
core, and even where they aren't (so neither IPS nor IBC will ever
help you), this would still allow big-core systems to optimize CPU
energy usage for workloads that are consistently GPU-bound (which
would mean longer battery life while gaming on a big-core laptop).

>> Signed-off-by: Francisco Jerez <currojerez@riseup.net>
>> ---
>>  drivers/gpu/drm/i915/intel_lrc.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>> 
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 3a69b367e565..721f915115bd 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -132,6 +132,7 @@
>>   *
>>   */
>>  #include <linux/interrupt.h>
>> +#include <linux/cpufreq.h>
>>  
>>  #include <drm/drmP.h>
>>  #include <drm/i915_drm.h>
>> @@ -379,11 +380,13 @@ execlists_context_schedule_in(struct i915_request *rq)
>>  {
>>         execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
>>         intel_engine_context_in(rq->engine);
>> +       cpufreq_io_active_begin();
>
> Since you only care about a binary value for GPU activity, we don't need
> to do this on each context, just between submitting the first request
> and the final completion, i.e. couple this to EXECLISTS_ACTIVE_USER.
>
> Haven't yet gone back to check how heavy io_active_begin/end are, but I
> trust you appreciate that this code is particularly latency sensitive.

The cpufreq_io_active_begin/end() interfaces are probably as lightweight
as they can be.  There's no locking involved.  In cases where there
already is an overlapping request, cpufreq_io_active_begin() and
cpufreq_io_active_end() should return without doing much more than an
atomic increment and an atomic cmpxchg respectively.
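
(For illustration only -- this isn't the literal PATCH 1
implementation, which additionally tracks the aggregated active time
and uses a cmpxchg on the end path, but a minimal sketch of the kind
of counting scheme that makes nested begin/end calls cheap, with
serialization of the bookkeeping glossed over:

#include <linux/atomic.h>
#include <linux/ktime.h>

static atomic_t io_active_count = ATOMIC_INIT(0);
static ktime_t io_active_since;
static u64 io_active_total_ns;

void cpufreq_io_active_begin(void)
{
        /* Only the 0->1 transition marks the start of an active period. */
        if (atomic_inc_return(&io_active_count) == 1)
                io_active_since = ktime_get();
}

void cpufreq_io_active_end(void)
{
        /* Only the 1->0 transition accumulates the elapsed active time. */
        if (atomic_dec_and_test(&io_active_count))
                io_active_total_ns +=
                        ktime_to_ns(ktime_sub(ktime_get(), io_active_since));
}

Overlapping callers other than the first and last just touch the
counter.)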

That said, it wouldn't hurt to call each of them once per sequence of
overlapping requests.  Do you want me to call them from
execlists_set/clear_active() themselves when bit == EXECLISTS_ACTIVE_USER,
or at each callsite of execlists_set/clear_active()? [possibly protected
with a check that execlists_is_active(EXECLISTS_ACTIVE_USER) wasn't
already the expected value]
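
(Concretely, the first option would amount to something along the
lines of the sketch below, assuming the current __set_bit/__clear_bit
based helpers -- exact details of the helpers aside:

static inline void
execlists_set_active(struct intel_engine_execlists *execlists,
                     unsigned int bit)
{
        /* Report the idle-to-active transition of the USER bit. */
        if (bit == EXECLISTS_ACTIVE_USER &&
            !execlists_is_active(execlists, EXECLISTS_ACTIVE_USER))
                cpufreq_io_active_begin();

        __set_bit(bit, (unsigned long *)&execlists->active);
}

static inline void
execlists_clear_active(struct intel_engine_execlists *execlists,
                       unsigned int bit)
{
        /* Report the active-to-idle transition of the USER bit. */
        if (bit == EXECLISTS_ACTIVE_USER &&
            execlists_is_active(execlists, EXECLISTS_ACTIVE_USER))
                cpufreq_io_active_end();

        __clear_bit(bit, (unsigned long *)&execlists->active);
}

so the cpufreq hooks would only fire on actual transitions of the USER
bit.)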

> -Chris


* Re: [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
  2018-03-28 18:55     ` Francisco Jerez
@ 2018-03-28 19:20       ` Chris Wilson
  2018-03-28 23:19         ` Chris Wilson
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Wilson @ 2018-03-28 19:20 UTC (permalink / raw)
  To: Francisco Jerez, linux-pm, intel-gfx, Srinivas Pandruvada
  Cc: Eero Tamminen, Rafael J. Wysocki

Quoting Francisco Jerez (2018-03-28 19:55:12)
> Hi Chris,
> 
[snip]
> That said, it wouldn't hurt to call each of them once per sequence of
> overlapping requests.  Do you want me to call them from
> execlists_set/clear_active() themselves when bit == EXECLISTS_ACTIVE_USER,
> or at each callsite of execlists_set/clear_active()? [possibly protected
> with a check that execlists_is_active(EXECLISTS_ACTIVE_USER) wasn't
> already the expected value]

No, I was thinking of adding an execlists_start()/execlists_stop()
[naming suggestions welcome, begin/end?] where we could hook additional
bookkeeping into.
-Chris

* Re: [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
  2018-03-28 19:20       ` Chris Wilson
@ 2018-03-28 23:19         ` Chris Wilson
  2018-03-29  0:32           ` Francisco Jerez
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Wilson @ 2018-03-28 23:19 UTC (permalink / raw)
  To: Francisco Jerez, linux-pm, intel-gfx, Srinivas Pandruvada
  Cc: Eero Tamminen, Rafael J. Wysocki

Quoting Chris Wilson (2018-03-28 20:20:19)
> Quoting Francisco Jerez (2018-03-28 19:55:12)
> > Hi Chris,
> > 
> [snip]
> > That said, it wouldn't hurt to call each of them once per sequence of
> > overlapping requests.  Do you want me to call them from
> > execlists_set/clear_active() themselves when bit == EXECLISTS_ACTIVE_USER,
> > or at each callsite of execlists_set/clear_active()? [possibly protected
> > with a check that execlists_is_active(EXECLISTS_ACTIVE_USER) wasn't
> > already the expected value]
> 
> No, I was thinking of adding an execlists_start()/execlists_stop()
> [naming suggestions welcome, begin/end?] where we could hook additional
> bookkeeping into.

Trying to call execlist_begin() once didn't pan out. It's easier to
reuse for similar bookkeeping used in future patches if execlist_begin()
(or whatever name suits best) is called at the start of each context.

Something along the lines of:

@@ -374,6 +374,19 @@ execlists_context_status_change(struct i915_request *rq, unsigned long status)
                                   status, rq);
 }
 
+static inline void
+execlists_begin(struct intel_engine_execlists *execlists,
+               struct execlist_port *port)
+{
+       execlists_set_active_once(execlists, EXECLISTS_ACTIVE_USER);
+}
+
+static inline void
+execlists_end(struct intel_engine_execlists *execlists)
+{
+       execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);
+}
+
 static inline void
 execlists_context_schedule_in(struct i915_request *rq)
 {
@@ -710,7 +723,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
        spin_unlock_irq(&engine->timeline->lock);
 
        if (submit) {
-               execlists_set_active(execlists, EXECLISTS_ACTIVE_USER);
+               execlists_begin(execlists, execlists->port);
                execlists_submit_ports(engine);
        }
 
@@ -741,7 +754,7 @@ execlists_cancel_port_requests(struct intel_engine_execlists * const execlists)
                port++;
        }
 
-       execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);
+       execlists_end(execlists);
 }
 
 static void clear_gtiir(struct intel_engine_cs *engine)
@@ -872,7 +885,7 @@ static void execlists_submission_tasklet(unsigned long data)
 {
        struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
        struct intel_engine_execlists * const execlists = &engine->execlists;
-       struct execlist_port * const port = execlists->port;
+       struct execlist_port *port = execlists->port;
        struct drm_i915_private *dev_priv = engine->i915;
        bool fw = false;
 
@@ -1010,9 +1023,19 @@ static void execlists_submission_tasklet(unsigned long data)
 
                        GEM_BUG_ON(count == 0);
                        if (--count == 0) {
+                               /*
+                                * On the final event corresponding to the
+                                * submission of this context, we expect either
+                                * an element-switch event or an completion
+                                * event (and on completion, the active-idle
+                                * marker). No more preemptions, lite-restore
+                                * or otherwise
+                                */
                                GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
                                GEM_BUG_ON(port_isset(&port[1]) &&
                                           !(status & GEN8_CTX_STATUS_ELEMENT_SWITCH));
+                               GEM_BUG_ON(!port_isset(&port[1]) &&
+                                          !(status & GEN8_CTX_STATUS_ACTIVE_IDLE));
                                GEM_BUG_ON(!i915_request_completed(rq));
                                execlists_context_schedule_out(rq);
                                trace_i915_request_out(rq);
@@ -1021,17 +1044,14 @@ static void execlists_submission_tasklet(unsigned long data)
                                GEM_TRACE("%s completed ctx=%d\n",
                                          engine->name, port->context_id);
 
-                               execlists_port_complete(execlists, port);
+                               port = execlists_port_complete(execlists, port);
+                               if (port_isset(port))
+                                       execlists_begin(execlists, port);
+                               else
+                                       execlists_end(execlists);
                        } else {
                                port_set(port, port_pack(rq, count));
                        }

-Chris

* Re: [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
  2018-03-28 23:19         ` Chris Wilson
@ 2018-03-29  0:32           ` Francisco Jerez
  2018-03-29  1:01             ` Chris Wilson
  0 siblings, 1 reply; 39+ messages in thread
From: Francisco Jerez @ 2018-03-29  0:32 UTC (permalink / raw)
  To: Chris Wilson, linux-pm, intel-gfx, Srinivas Pandruvada
  Cc: Eero Tamminen, Rafael J. Wysocki


Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Chris Wilson (2018-03-28 20:20:19)
>> Quoting Francisco Jerez (2018-03-28 19:55:12)
>> > Hi Chris,
>> > 
>> [snip]
>> > That said, it wouldn't hurt to call each of them once per sequence of
>> > overlapping requests.  Do you want me to call them from
>> > execlists_set/clear_active() themselves when bit == EXECLISTS_ACTIVE_USER,
>> > or at each callsite of execlists_set/clear_active()? [possibly protected
>> > with a check that execlists_is_active(EXECLISTS_ACTIVE_USER) wasn't
>> > already the expected value]
>> 
>> No, I was thinking of adding an execlists_start()/execlists_stop()
>> [naming suggestions welcome, begin/end?] where we could hook additional
>> bookkeeping into.
>
> Trying to call execlist_begin() once didn't pan out. It's easier to
> reuse for similar bookkeeping used in future patches if execlist_begin()
> (or whatever name suits best) is called at the start of each context.
>
> Something along the lines of:
>
> @@ -374,6 +374,19 @@ execlists_context_status_change(struct i915_request *rq, unsigned long status)
>                                    status, rq);
>  }
>  
> +static inline void
> +execlists_begin(struct intel_engine_execlists *execlists,

I had started writing something along the same lines in my working tree
called execlists_active_user_begin/end -- Which name do you prefer?

> +               struct execlist_port *port)
> +{

What do you expect the port argument to be useful for?  Is it ever going
to be anything other than execlists->port?

> +       execlists_set_active_once(execlists, EXECLISTS_ACTIVE_USER);
> +}
> +
> +static inline void
> +execlists_end(struct intel_engine_execlists *execlists)
> +{
> +       execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);
> +}
> +
>  static inline void
>  execlists_context_schedule_in(struct i915_request *rq)
>  {
> @@ -710,7 +723,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>         spin_unlock_irq(&engine->timeline->lock);
>  
>         if (submit) {
> -               execlists_set_active(execlists, EXECLISTS_ACTIVE_USER);
> +               execlists_begin(execlists, execlists->port);
>                 execlists_submit_ports(engine);
>         }
>  
> @@ -741,7 +754,7 @@ execlists_cancel_port_requests(struct intel_engine_execlists * const execlists)
>                 port++;
>         }
>  
> -       execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);

This line doesn't seem to exist in my working tree.  I guess it was just
added?

> +       execlists_end(execlists);
>  }
>  
>  static void clear_gtiir(struct intel_engine_cs *engine)
> @@ -872,7 +885,7 @@ static void execlists_submission_tasklet(unsigned long data)
>  {
>         struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
>         struct intel_engine_execlists * const execlists = &engine->execlists;
> -       struct execlist_port * const port = execlists->port;
> +       struct execlist_port *port = execlists->port;
>         struct drm_i915_private *dev_priv = engine->i915;
>         bool fw = false;
>  
> @@ -1010,9 +1023,19 @@ static void execlists_submission_tasklet(unsigned long data)
>  
>                         GEM_BUG_ON(count == 0);
>                         if (--count == 0) {
> +                               /*
> +                                * On the final event corresponding to the
> +                                * submission of this context, we expect either
> +                                * an element-switch event or an completion
> +                                * event (and on completion, the active-idle
> +                                * marker). No more preemptions, lite-restore
> +                                * or otherwise
> +                                */
>                                 GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
>                                 GEM_BUG_ON(port_isset(&port[1]) &&
>                                            !(status & GEN8_CTX_STATUS_ELEMENT_SWITCH));
> +                               GEM_BUG_ON(!port_isset(&port[1]) &&
> +                                          !(status & GEN8_CTX_STATUS_ACTIVE_IDLE));
>                                 GEM_BUG_ON(!i915_request_completed(rq));
>                                 execlists_context_schedule_out(rq);
>                                 trace_i915_request_out(rq);
> @@ -1021,17 +1044,14 @@ static void execlists_submission_tasklet(unsigned long data)
>                                 GEM_TRACE("%s completed ctx=%d\n",
>                                           engine->name, port->context_id);
>  
> -                               execlists_port_complete(execlists, port);
> +                               port = execlists_port_complete(execlists, port);
> +                               if (port_isset(port))
> +                                       execlists_begin(execlists, port);

Isn't this going to call execlists_begin() roughly once per request?
What's the purpose if we expect it to be a no-op except for the first
request submitted after execlists_end()?  Isn't the intention to provide
a hook for bookkeeping that depends on idle to active and active to idle
transitions of the hardware?

> +                               else
> +                                       execlists_end(execlists);
>                         } else {
>                                 port_set(port, port_pack(rq, count));
>                         }
>

Isn't there an "execlists_clear_active(execlists,
EXECLISTS_ACTIVE_USER);" call in your tree a few lines below that is now
redundant?

> -Chris


* Re: [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
  2018-03-29  0:32           ` Francisco Jerez
@ 2018-03-29  1:01             ` Chris Wilson
  2018-03-29  1:20               ` Chris Wilson
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Wilson @ 2018-03-29  1:01 UTC (permalink / raw)
  To: Francisco Jerez, linux-pm, intel-gfx, Srinivas Pandruvada
  Cc: Eero Tamminen, Rafael J. Wysocki

Quoting Francisco Jerez (2018-03-29 01:32:07)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Quoting Chris Wilson (2018-03-28 20:20:19)
> >> Quoting Francisco Jerez (2018-03-28 19:55:12)
> >> > Hi Chris,
> >> > 
> >> [snip]
> >> > That said, it wouldn't hurt to call each of them once per sequence of
> >> > overlapping requests.  Do you want me to call them from
> >> > execlists_set/clear_active() themselves when bit == EXECLISTS_ACTIVE_USER,
> >> > or at each callsite of execlists_set/clear_active()? [possibly protected
> >> > with a check that execlists_is_active(EXECLISTS_ACTIVE_USER) wasn't
> >> > already the expected value]
> >> 
> >> No, I was thinking of adding an execlists_start()/execlists_stop()
> >> [naming suggestions welcome, begin/end?] where we could hook additional
> >> bookkeeping into.
> >
> > Trying to call execlist_begin() once didn't pan out. It's easier to
> > reuse for similar bookkeeping used in future patches if execlist_begin()
> > (or whatever name suits best) is called at the start of each context.
> >
> > Something along the lines of:
> >
> > @@ -374,6 +374,19 @@ execlists_context_status_change(struct i915_request *rq, unsigned long status)
> >                                    status, rq);
> >  }
> >  
> > +static inline void
> > +execlists_begin(struct intel_engine_execlists *execlists,
> 
> I had started writing something along the same lines in my working tree
> called execlists_active_user_begin/end -- Which name do you prefer?

Hmm, so the reason why the user distinction exists is that we may
momentarily remain active after the last user context is switched out, if
there is a preemption event still pending. Keeping the user tag does
help maintain that distinction.

> 
> > +               struct execlist_port *port)
> > +{
> 
> What do you expect the port argument to be useful for?  Is it ever going
> to be anything other than execlists->port?

There are patches afoot to change that, so since I needed to inspect
port here it seemed to tie in nicely with the proposed changes to
execlists_port_complete.

> > +       execlists_set_active_once(execlists, EXECLISTS_ACTIVE_USER);
> > +}
> > +
> > +static inline void
> > +execlists_end(struct intel_engine_execlists *execlists)
> > +{
> > +       execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);
> > +}
> > +
> >  static inline void
> >  execlists_context_schedule_in(struct i915_request *rq)
> >  {
> > @@ -710,7 +723,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
> >         spin_unlock_irq(&engine->timeline->lock);
> >  
> >         if (submit) {
> > -               execlists_set_active(execlists, EXECLISTS_ACTIVE_USER);
> > +               execlists_begin(execlists, execlists->port);
> >                 execlists_submit_ports(engine);
> >         }
> >  
> > @@ -741,7 +754,7 @@ execlists_cancel_port_requests(struct intel_engine_execlists * const execlists)
> >                 port++;
> >         }
> >  
> > -       execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);
> 
> This line doesn't seem to exist in my working tree.  I guess it was just
> added?

A few days ago, yes.

> > +       execlists_end(execlists);
> >  }
> >  
> >  static void clear_gtiir(struct intel_engine_cs *engine)
> > @@ -872,7 +885,7 @@ static void execlists_submission_tasklet(unsigned long data)
> >  {
> >         struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
> >         struct intel_engine_execlists * const execlists = &engine->execlists;
> > -       struct execlist_port * const port = execlists->port;
> > +       struct execlist_port *port = execlists->port;
> >         struct drm_i915_private *dev_priv = engine->i915;
> >         bool fw = false;
> >  
> > @@ -1010,9 +1023,19 @@ static void execlists_submission_tasklet(unsigned long data)
> >  
> >                         GEM_BUG_ON(count == 0);
> >                         if (--count == 0) {
> > +                               /*
> > +                                * On the final event corresponding to the
> > +                                * submission of this context, we expect either
> > +                                * an element-switch event or an completion
> > +                                * event (and on completion, the active-idle
> > +                                * marker). No more preemptions, lite-restore
> > +                                * or otherwise
> > +                                */
> >                                 GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
> >                                 GEM_BUG_ON(port_isset(&port[1]) &&
> >                                            !(status & GEN8_CTX_STATUS_ELEMENT_SWITCH));
> > +                               GEM_BUG_ON(!port_isset(&port[1]) &&
> > +                                          !(status & GEN8_CTX_STATUS_ACTIVE_IDLE));
> >                                 GEM_BUG_ON(!i915_request_completed(rq));
> >                                 execlists_context_schedule_out(rq);
> >                                 trace_i915_request_out(rq);
> > @@ -1021,17 +1044,14 @@ static void execlists_submission_tasklet(unsigned long data)
> >                                 GEM_TRACE("%s completed ctx=%d\n",
> >                                           engine->name, port->context_id);
> >  
> > -                               execlists_port_complete(execlists, port);
> > +                               port = execlists_port_complete(execlists, port);
> > +                               if (port_isset(port))
> > +                                       execlists_begin(execlists, port);
> 
> Isn't this going to call execlists_begin() roughly once per request?
> What's the purpose if we expect it to be a no-op except for the first
> request submitted after execlists_end()?  Isn't the intention to provide
> a hook for bookkeeping that depends on idle to active and active to idle
> transitions of the hardware?

Because on a context switch I need to update the GPU clocks, update
tracking for preemption, and that's just the two I have in my tree that
need to land in the next couple of weeks...

Currently I have,

static inline void
execlists_begin(struct intel_engine_execlists *execlists,
                struct execlist_port *port)
{
        struct intel_engine_cs *engine =
                container_of(execlists, typeof(*engine), execlists);
        struct i915_request *rq = port_request(port);

        execlists_set_active_once(execlists, EXECLISTS_ACTIVE_USER);

        intel_rps_update_engine(engine, rq->ctx);
        execlists->current_priority = rq_prio(rq);
}

static inline void
execlists_end(struct intel_engine_execlists *execlists)
{
        struct intel_engine_cs *engine =
                container_of(execlists, typeof(*engine), execlists);

        execlists->current_priority = INT_MIN;
        intel_rps_update_engine(engine, NULL);

        execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);
}
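
(execlists_set_active_once() above is a new helper alongside the
existing execlists_set/clear_active() ones -- think of it as roughly a
set_bit variant that reports whether it performed the idle-to-active
transition:

static inline bool
execlists_set_active_once(struct intel_engine_execlists *execlists,
                          unsigned int bit)
{
        /*
         * Return true only for the 0->1 transition of the bit, so
         * callers can do extra bookkeeping just once per burst of
         * activity rather than once per context submission.
         */
        return !__test_and_set_bit(bit, (unsigned long *)&execlists->active);
}

give or take the exact form it ends up taking.)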

> > +                               else
> > +                                       execlists_end(execlists);
> >                         } else {
> >                                 port_set(port, port_pack(rq, count));
> >                         }
> >
> 
> Isn't there an "execlists_clear_active(execlists,
> EXECLISTS_ACTIVE_USER);" call in your tree a few lines below that is now
> redundant?

This is only half the patch to show the gist. :)
-Chris

* Re: [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
  2018-03-29  1:01             ` Chris Wilson
@ 2018-03-29  1:20               ` Chris Wilson
  0 siblings, 0 replies; 39+ messages in thread
From: Chris Wilson @ 2018-03-29  1:20 UTC (permalink / raw)
  To: Francisco Jerez, linux-pm, intel-gfx, Srinivas Pandruvada
  Cc: Eero Tamminen, Rafael J. Wysocki

Quoting Chris Wilson (2018-03-29 02:01:37)
> Quoting Francisco Jerez (2018-03-29 01:32:07)
> > Chris Wilson <chris@chris-wilson.co.uk> writes:
> > > +                               else
> > > +                                       execlists_end(execlists);
> > >                         } else {
> > >                                 port_set(port, port_pack(rq, count));
> > >                         }
> > >
> > 
> > Isn't there an "execlists_clear_active(execlists,
> > EXECLISTS_ACTIVE_USER);" call in your tree a few lines below that is now
> > redundant?
> 
> This is only half the patch to show the gist. :)

A more concrete sketch,
https://patchwork.freedesktop.org/patch/213594/
-Chris

* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-03-28  6:38 [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
                   ` (8 preceding siblings ...)
  2018-03-28  6:38 ` [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq Francisco Jerez
@ 2018-03-30 18:50 ` Francisco Jerez
  2018-04-10 22:28 ` Francisco Jerez
  2018-04-17 14:03 ` Chris Wilson
  11 siblings, 0 replies; 39+ messages in thread
From: Francisco Jerez @ 2018-03-30 18:50 UTC (permalink / raw)
  To: linux-pm, intel-gfx, Srinivas Pandruvada; +Cc: Eero Tamminen, Rafael J. Wysocki


Francisco Jerez <currojerez@riseup.net> writes:

> This series attempts to solve an energy efficiency problem of the
> current active-mode non-HWP governor of the intel_pstate driver used
> for the most part on low-power platforms.  Under heavy IO load the
> current controller tends to increase frequencies to the maximum turbo
> P-state, partly due to IO wait boosting, partly due to the roughly
> flat frequency response curve of the current controller (see
> [6]), which causes it to ramp frequencies up and down repeatedly for
> any oscillating workload (think of graphics, audio or disk IO when any
> of them becomes a bottleneck), severely increasing energy usage
> relative to a (throughput-wise equivalent) controller able to provide
> the same average frequency without fluctuation.  The core energy
> efficiency improvement has been observed to be of the order of 20% via
> RAPL, but it's expected to vary substantially between workloads (see
> perf-per-watt comparison [2]).
>
> One might expect that this could come at some cost in terms of system
> responsiveness, but the governor implemented in PATCH 6 has a variable
> response curve controlled by a heuristic that keeps the controller in
> a low-latency state unless the system is under heavy IO load for an
> extended period of time.  The low-latency behavior is actually
> significantly more aggressive than the current governor, allowing it
> to achieve better throughput in some scenarios where the load
> ping-pongs between the CPU and some IO device (see PATCH 6 for more of
> the rationale).  The controller offers relatively lower latency than
> the upstream one particularly while C0 residency is low (which by
> itself contributes to mitigate the increased energy usage while on
> C0).  However under certain conditions the low-latency heuristic may
> increase power consumption (see perf-per-watt comparison [2], the
> apparent regressions are correlated with an increase in performance in
> the same benchmark due to the use of the low-latency heuristic) -- If
> this is a problem a different trade-off between latency and energy
> usage shouldn't be difficult to achieve, but it will come at a
> performance cost in some cases.  I couldn't observe a statistically
> significant increase in idle power consumption due to this behavior
> (on BXT J3455):
>
> package-0 RAPL (W):    XXXXXX ±0.14% x8 ->     XXXXXX ±0.15% x9         d=-0.04% ±0.14%      p=61.73%
>
> [Absolute benchmark results are unfortunately omitted from this letter
> due to company policies, but the percent change and Student's T
> p-value are included above and in the referenced benchmark results]
>
> The most obvious impact of this series will likely be the overall
> improvement in graphics performance on systems with an IGP integrated
> into the processor package (though for the moment this is only enabled
> on BXT+), because the TDP budget shared among CPU and GPU can
> frequently become a limiting factor in low-power devices.  On heavily
> TDP-bound devices this series improves performance of virtually any
> non-trivial graphics rendering by a significant amount (of the order
> of the energy efficiency improvement for that workload assuming the
> optimization didn't cause it to become non-TDP-bound).
>
> See [1]-[5] for detailed numbers including various graphics benchmarks
> and a sample of the Phoronix daily-system-tracker.  Some popular
> graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
> between 5% and 11% on our systems.  The exact improvement can vary
> substantially between systems (compare the benchmark results from the
> two different J3455 systems [1] and [3]) due to a number of factors,
> including the ratio between CPU and GPU processing power, the behavior
> of the userspace graphics driver, the windowing system and resolution,
> the BIOS (which has an influence on the package TDP), the thermal
> characteristics of the system, etc.
>
> Unigine Valley and Heaven improve by a similar factor on some systems
> (see the J3455 results [1]), but on others the improvement is lower
> because the benchmark fails to fully utilize the GPU, which causes the
> heuristic to remain in low-latency state for longer, which leaves a
> reduced TDP budget available to the GPU, which prevents performance
> from increasing further.  This can be avoided by using the alternative
> heuristic parameters suggested in the commit message of PATCH 8, which
> provide a lower IO utilization threshold and hysteresis for the
> controller to attempt to save energy.  I'm not proposing those for
> upstream (yet) because they would also increase the risk for
> latency-sensitive IO-heavy workloads to regress (like SynMark2
> OglTerrainFly* and some arguably poorly designed IPC-bound X11
> benchmarks).
>
> Discrete graphics aren't likely to experience that much of a visible
> improvement from this, even though many non-IGP workloads *could*
> benefit by reducing the system's energy usage while the discrete GPU
> (or really, any other IO device) becomes a bottleneck, but this is not
> attempted in this series, since that would involve making an energy
> efficiency/latency trade-off that only the maintainers of the
> respective drivers are in a position to make.  The cpufreq interface
> introduced in PATCH 1 to achieve this is left as an opt-in for that
> reason, only the i915 DRM driver is hooked up since it will get the
> most direct pay-off due to the increased energy budget available to
> the GPU, but other power-hungry third-party gadgets built into the
> same package (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may be
> able to benefit from this interface eventually by instrumenting the
> driver in a similar way.
>
> The cpufreq interface is not exclusively tied to the intel_pstate
> driver, because other governors can make use of the statistic
> calculated as result to avoid over-optimizing for latency in scenarios
> where a lower frequency would be able to achieve similar throughput
> while using less energy.  The interpretation of this statistic relies
> on the observation that for as long as the system is CPU-bound, any IO
> load occurring as a result of the execution of a program will scale
> roughly linearly with the clock frequency the program is run at, so
> (assuming that the CPU has enough processing power) a point will be
> reached at which the program won't be able to execute faster with
> increasing CPU frequency because the throughput limits of some device
> will have been attained.  Increasing frequencies past that point only
> pessimizes energy usage for no real benefit -- The optimal behavior is
> for the CPU to lock to the minimum frequency that is able to keep the
> IO devices involved fully utilized (assuming we are past the
> maximum-efficiency inflection point of the CPU's power-to-frequency
> curve), which is roughly the goal of this series.
>
> PELT could be a useful extension for this model since its largely
> heuristic assumptions would become more accurate if the IO and CPU
> load could be tracked separately for each scheduling entity, but this
> is not attempted in this series because the additional complexity and
> computational cost of such an approach is hard to justify at this
> stage, particularly since the current governor has similar
> limitations.
>
> Various frequency and step-function response graphs are available in
> [6]-[9] for comparison (obtained empirically on a BXT J3455 system).
> The response curves for the low-latency and low-power states of the
> heuristic are shown separately -- As you can see they roughly bracket
> the frequency response curve of the current governor.  The step
> response of the aggressive heuristic is within a single update period
> (even though it's not quite obvious from the graph with the levels of
> zoom provided).  I'll attach benchmark results from a slower but
> non-TDP-limited machine (which means there will be no TDP budget
> increase that could possibly mask a performance regression of other
> kind) as soon as they come out.
>

See [10] for the promised benchmark results on a non-TDP-limited T5700
-- mainly to confirm that there are no regressions, since no
performance improvement was expected as a consequence of the GPU TDP
budget increase.  The slight improvements are mostly in
latency-sensitive benchmarks, which benefit from the low-latency
heuristic.

[10] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-T5700.log

> Thanks to Eero and Valtteri for testing a number of intermediate
> revisions of this series (and there were quite a few of them) in more
> than half a dozen systems, they helped spot quite a few issues of
> earlier versions of this heuristic.
>
> [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
> [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
> [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
> [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
> [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
> [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
> [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
> [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
> [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
>
> [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
> [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
> [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
> [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
> [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
> [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
> [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
> [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
> [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg


* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-03-28  6:38 [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
                   ` (9 preceding siblings ...)
  2018-03-30 18:50 ` [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
@ 2018-04-10 22:28 ` Francisco Jerez
  2018-04-11  3:14   ` Srinivas Pandruvada
  2018-04-17 14:03 ` Chris Wilson
  11 siblings, 1 reply; 39+ messages in thread
From: Francisco Jerez @ 2018-04-10 22:28 UTC (permalink / raw)
  To: linux-pm, intel-gfx; +Cc: Eero Tamminen, Rafael J. Wysocki, Srinivas Pandruvada


Francisco Jerez <currojerez@riseup.net> writes:

> This series attempts to solve an energy efficiency problem of the
> current active-mode non-HWP governor of the intel_pstate driver used
> for the most part on low-power platforms.  Under heavy IO load the
> current controller tends to increase frequencies to the maximum turbo
> P-state, partly due to IO wait boosting, partly due to the roughly
> flat frequency response curve of the current controller (see
> [6]), which causes it to ramp frequencies up and down repeatedly for
> any oscillating workload (think of graphics, audio or disk IO when any
> of them becomes a bottleneck), severely increasing energy usage
> relative to a (throughput-wise equivalent) controller able to provide
> the same average frequency without fluctuation.  The core energy
> efficiency improvement has been observed to be of the order of 20% via
> RAPL, but it's expected to vary substantially between workloads (see
> perf-per-watt comparison [2]).
>
> One might expect that this could come at some cost in terms of system
> responsiveness, but the governor implemented in PATCH 6 has a variable
> response curve controlled by a heuristic that keeps the controller in
> a low-latency state unless the system is under heavy IO load for an
> extended period of time.  The low-latency behavior is actually
> significantly more aggressive than the current governor, allowing it
> to achieve better throughput in some scenarios where the load
> ping-pongs between the CPU and some IO device (see PATCH 6 for more of
> the rationale).  The controller offers relatively lower latency than
> the upstream one particularly while C0 residency is low (which by
> itself contributes to mitigate the increased energy usage while on
> C0).  However under certain conditions the low-latency heuristic may
> increase power consumption (see perf-per-watt comparison [2], the
> apparent regressions are correlated with an increase in performance in
> the same benchmark due to the use of the low-latency heuristic) -- If
> this is a problem a different trade-off between latency and energy
> usage shouldn't be difficult to achieve, but it will come at a
> performance cost in some cases.  I couldn't observe a statistically
> significant increase in idle power consumption due to this behavior
> (on BXT J3455):
>
> package-0 RAPL (W):    XXXXXX ±0.14% x8 ->     XXXXXX ±0.15% x9         d=-0.04% ±0.14%      p=61.73%
>

In case anyone is wondering what's going on: Srinivas pointed me at a
larger idle power usage increase off-list, ultimately caused by the
low-latency heuristic as discussed in the paragraph above.  I have a v2
of PATCH 6 that gives the controller a third response curve roughly
intermediate between the low-latency and low-power states of this
revision, which avoids the increase in energy usage expected for v1
while C0 residency is low (e.g. during idle).  The low-latency behavior
of this revision is still going to be available based on a heuristic
(in particular when a realtime-priority task is scheduled).  We're
carrying out some additional testing; I'll post the code here
eventually.

> [Absolute benchmark results are unfortunately omitted from this letter
> due to company policies, but the percent change and Student's T
> p-value are included above and in the referenced benchmark results]
>
> The most obvious impact of this series will likely be the overall
> improvement in graphics performance on systems with an IGP integrated
> into the processor package (though for the moment this is only enabled
> on BXT+), because the TDP budget shared among CPU and GPU can
> frequently become a limiting factor in low-power devices.  On heavily
> TDP-bound devices this series improves performance of virtually any
> non-trivial graphics rendering by a significant amount (of the order
> of the energy efficiency improvement for that workload assuming the
> optimization didn't cause it to become non-TDP-bound).
>
> See [1]-[5] for detailed numbers including various graphics benchmarks
> and a sample of the Phoronix daily-system-tracker.  Some popular
> graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
> between 5% and 11% on our systems.  The exact improvement can vary
> substantially between systems (compare the benchmark results from the
> two different J3455 systems [1] and [3]) due to a number of factors,
> including the ratio between CPU and GPU processing power, the behavior
> of the userspace graphics driver, the windowing system and resolution,
> the BIOS (which has an influence on the package TDP), the thermal
> characteristics of the system, etc.
>
> Unigine Valley and Heaven improve by a similar factor on some systems
> (see the J3455 results [1]), but on others the improvement is lower
> because the benchmark fails to fully utilize the GPU, which causes the
> heuristic to remain in low-latency state for longer, which leaves a
> reduced TDP budget available to the GPU, which prevents performance
> from increasing further.  This can be avoided by using the alternative
> heuristic parameters suggested in the commit message of PATCH 8, which
> provide a lower IO utilization threshold and hysteresis for the
> controller to attempt to save energy.  I'm not proposing those for
> upstream (yet) because they would also increase the risk for
> latency-sensitive IO-heavy workloads to regress (like SynMark2
> OglTerrainFly* and some arguably poorly designed IPC-bound X11
> benchmarks).
>
> Discrete graphics aren't likely to experience that much of a visible
> improvement from this, even though many non-IGP workloads *could*
> benefit by reducing the system's energy usage while the discrete GPU
> (or really, any other IO device) becomes a bottleneck, but this is not
> attempted in this series, since that would involve making an energy
> efficiency/latency trade-off that only the maintainers of the
> respective drivers are in a position to make.  The cpufreq interface
> introduced in PATCH 1 to achieve this is left as an opt-in for that
> reason, only the i915 DRM driver is hooked up since it will get the
> most direct pay-off due to the increased energy budget available to
> the GPU, but other power-hungry third-party gadgets built into the
> same package (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may be
> able to benefit from this interface eventually by instrumenting the
> driver in a similar way.
>
> The cpufreq interface is not exclusively tied to the intel_pstate
> driver, because other governors can make use of the statistic
> calculated as result to avoid over-optimizing for latency in scenarios
> where a lower frequency would be able to achieve similar throughput
> while using less energy.  The interpretation of this statistic relies
> on the observation that for as long as the system is CPU-bound, any IO
> load occurring as a result of the execution of a program will scale
> roughly linearly with the clock frequency the program is run at, so
> (assuming that the CPU has enough processing power) a point will be
> reached at which the program won't be able to execute faster with
> increasing CPU frequency because the throughput limits of some device
> will have been attained.  Increasing frequencies past that point only
> pessimizes energy usage for no real benefit -- The optimal behavior is
> for the CPU to lock to the minimum frequency that is able to keep the
> IO devices involved fully utilized (assuming we are past the
> maximum-efficiency inflection point of the CPU's power-to-frequency
> curve), which is roughly the goal of this series.
>
> PELT could be a useful extension for this model since its largely
> heuristic assumptions would become more accurate if the IO and CPU
> load could be tracked separately for each scheduling entity, but this
> is not attempted in this series because the additional complexity and
> computational cost of such an approach is hard to justify at this
> stage, particularly since the current governor has similar
> limitations.
>
> Various frequency and step-function response graphs are available in
> [6]-[9] for comparison (obtained empirically on a BXT J3455 system).
> The response curves for the low-latency and low-power states of the
> heuristic are shown separately -- As you can see they roughly bracket
> the frequency response curve of the current governor.  The step
> response of the aggressive heuristic is within a single update period
> (even though it's not quite obvious from the graph with the levels of
> zoom provided).  I'll attach benchmark results from a slower but
> non-TDP-limited machine (which means there will be no TDP budget
> increase that could possibly mask a performance regression of other
> kind) as soon as they come out.
>
> Thanks to Eero and Valtteri for testing a number of intermediate
> revisions of this series (and there were quite a few of them) in more
> than half a dozen systems, they helped spot quite a few issues of
> earlier versions of this heuristic.
>
> [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
> [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
> [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
> [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
> [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
> [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
> [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
> [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
> [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
>
> [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
> [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
> [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
> [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
> [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
> [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
> [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
> [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
> [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg


* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-10 22:28 ` Francisco Jerez
@ 2018-04-11  3:14   ` Srinivas Pandruvada
  2018-04-11 16:10     ` Francisco Jerez
  0 siblings, 1 reply; 39+ messages in thread
From: Srinivas Pandruvada @ 2018-04-11  3:14 UTC (permalink / raw)
  To: Francisco Jerez, linux-pm, intel-gfx
  Cc: Peter Zijlstra, Eero Tamminen, Rafael J. Wysocki

On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote:
> Francisco Jerez <currojerez@riseup.net> writes:
> 
[...]


> For the case anyone is wondering what's going on, Srinivas pointed me
> at
> a larger idle power usage increase off-list, ultimately caused by the
> low-latency heuristic as discussed in the paragraph above.  I have a
> v2
> of PATCH 6 that gives the controller a third response curve roughly
> intermediate between the low-latency and low-power states of this
> revision, which avoids the energy usage increase while C0 residency
> is
> low (e.g. during idle) expected for v1.  The low-latency behavior of
> this revision is still going to be available based on a heuristic (in
> particular when a realtime-priority task is scheduled).  We're
> carrying
> out some additional testing, I'll post the code here eventually.

Please try the schedutil governor also. There is a frequency-invariant
patch, which I can send you (it will eventually be pushed by Peter).
We want to avoid adding complexity to intel_pstate for non-HWP
power-sensitive platforms as far as possible.


Thanks,
Srinivas



> 
> > [Absolute benchmark results are unfortunately omitted from this
> > letter
> > due to company policies, but the percent change and Student's T
> > p-value are included above and in the referenced benchmark results]
> > 
> > The most obvious impact of this series will likely be the overall
> > improvement in graphics performance on systems with an IGP
> > integrated
> > into the processor package (though for the moment this is only
> > enabled
> > on BXT+), because the TDP budget shared among CPU and GPU can
> > frequently become a limiting factor in low-power devices.  On
> > heavily
> > TDP-bound devices this series improves performance of virtually any
> > non-trivial graphics rendering by a significant amount (of the
> > order
> > of the energy efficiency improvement for that workload assuming the
> > optimization didn't cause it to become non-TDP-bound).
> > 
> > See [1]-[5] for detailed numbers including various graphics
> > benchmarks
> > and a sample of the Phoronix daily-system-tracker.  Some popular
> > graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
> > between 5% and 11% on our systems.  The exact improvement can vary
> > substantially between systems (compare the benchmark results from
> > the
> > two different J3455 systems [1] and [3]) due to a number of
> > factors,
> > including the ratio between CPU and GPU processing power, the
> > behavior
> > of the userspace graphics driver, the windowing system and
> > resolution,
> > the BIOS (which has an influence on the package TDP), the thermal
> > characteristics of the system, etc.
> > 
> > Unigine Valley and Heaven improve by a similar factor on some
> > systems
> > (see the J3455 results [1]), but on others the improvement is lower
> > because the benchmark fails to fully utilize the GPU, which causes
> > the
> > heuristic to remain in low-latency state for longer, which leaves a
> > reduced TDP budget available to the GPU, which prevents performance
> > from increasing further.  This can be avoided by using the
> > alternative
> > heuristic parameters suggested in the commit message of PATCH 8,
> > which
> > provide a lower IO utilization threshold and hysteresis for the
> > controller to attempt to save energy.  I'm not proposing those for
> > upstream (yet) because they would also increase the risk for
> > latency-sensitive IO-heavy workloads to regress (like SynMark2
> > OglTerrainFly* and some arguably poorly designed IPC-bound X11
> > benchmarks).
> > 
> > Discrete graphics aren't likely to experience that much of a
> > visible
> > improvement from this, even though many non-IGP workloads *could*
> > benefit by reducing the system's energy usage while the discrete
> > GPU
> > (or really, any other IO device) becomes a bottleneck, but this is
> > not
> > attempted in this series, since that would involve making an energy
> > efficiency/latency trade-off that only the maintainers of the
> > respective drivers are in a position to make.  The cpufreq
> > interface
> > introduced in PATCH 1 to achieve this is left as an opt-in for that
> > reason, only the i915 DRM driver is hooked up since it will get the
> > most direct pay-off due to the increased energy budget available to
> > the GPU, but other power-hungry third-party gadgets built into the
> > same package (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may
> > be
> > able to benefit from this interface eventually by instrumenting the
> > driver in a similar way.
> > 
> > The cpufreq interface is not exclusively tied to the intel_pstate
> > driver, because other governors can make use of the statistic
> > calculated as result to avoid over-optimizing for latency in
> > scenarios
> > where a lower frequency would be able to achieve similar throughput
> > while using less energy.  The interpretation of this statistic
> > relies
> > on the observation that for as long as the system is CPU-bound, any
> > IO
> > load occurring as a result of the execution of a program will scale
> > roughly linearly with the clock frequency the program is run at, so
> > (assuming that the CPU has enough processing power) a point will be
> > reached at which the program won't be able to execute faster with
> > increasing CPU frequency because the throughput limits of some
> > device
> > will have been attained.  Increasing frequencies past that point
> > only
> > pessimizes energy usage for no real benefit -- The optimal behavior
> > is
> > for the CPU to lock to the minimum frequency that is able to keep
> > the
> > IO devices involved fully utilized (assuming we are past the
> > maximum-efficiency inflection point of the CPU's power-to-frequency
> > curve), which is roughly the goal of this series.
> > 
> > PELT could be a useful extension for this model since its largely
> > heuristic assumptions would become more accurate if the IO and CPU
> > load could be tracked separately for each scheduling entity, but
> > this
> > is not attempted in this series because the additional complexity
> > and
> > computational cost of such an approach is hard to justify at this
> > stage, particularly since the current governor has similar
> > limitations.
> > 
> > Various frequency and step-function response graphs are available
> > in
> > [6]-[9] for comparison (obtained empirically on a BXT J3455
> > system).
> > The response curves for the low-latency and low-power states of the
> > heuristic are shown separately -- As you can see they roughly
> > bracket
> > the frequency response curve of the current governor.  The step
> > response of the aggressive heuristic is within a single update
> > period
> > (even though it's not quite obvious from the graph with the levels
> > of
> > zoom provided).  I'll attach benchmark results from a slower but
> > non-TDP-limited machine (which means there will be no TDP budget
> > increase that could possibly mask a performance regression of other
> > kind) as soon as they come out.
> > 
> > Thanks to Eero and Valtteri for testing a number of intermediate
> > revisions of this series (and there were quite a few of them) in
> > more
> > than half a dozen systems, they helped spot quite a few issues of
> > earlier versions of this heuristic.
> > 
> > [PATCH 1/9] cpufreq: Implement infrastructure keeping track of
> > aggregated IO active time.
> > [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with
> > core_funcs"
> > [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long
> > names"
> > [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify
> > intel_pstate_adjust_pstate()"
> > [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from
> > pstate_funcs"
> > [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass
> > filtering controller for small core.
> > [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller
> > based on ACPI FADT profile.
> > [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller
> > parameters via debugfs.
> > [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity
> > to cpufreq.
> > 
> > [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
> > mark-perf-comparison-J3455.log
> > [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
> > mark-perf-per-watt-comparison-J3455.log
> > [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
> > mark-perf-comparison-J3455-1.log
> > [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
> > mark-perf-comparison-J4205.log
> > [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
> > mark-perf-comparison-J5005.log
> > [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequ
> > ency-response-magnitude-comparison.svg
> > [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequ
> > ency-response-phase-comparison.svg
> > [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-
> > response-comparison-1.svg
> > [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-
> > response-comparison-2.svg

* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-11  3:14   ` Srinivas Pandruvada
@ 2018-04-11 16:10     ` Francisco Jerez
  2018-04-11 16:26       ` Francisco Jerez
  0 siblings, 1 reply; 39+ messages in thread
From: Francisco Jerez @ 2018-04-11 16:10 UTC (permalink / raw)
  To: Srinivas Pandruvada, linux-pm, intel-gfx
  Cc: Peter Zijlstra, Eero Tamminen, Rafael J. Wysocki



Hi Srinivas,

Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:

> On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote:
>> Francisco Jerez <currojerez@riseup.net> writes:
>> 
> [...]
>
>
>> For the case anyone is wondering what's going on, Srinivas pointed me
>> at
>> a larger idle power usage increase off-list, ultimately caused by the
>> low-latency heuristic as discussed in the paragraph above.  I have a
>> v2
>> of PATCH 6 that gives the controller a third response curve roughly
>> intermediate between the low-latency and low-power states of this
>> revision, which avoids the energy usage increase while C0 residency
>> is
>> low (e.g. during idle) expected for v1.  The low-latency behavior of
>> this revision is still going to be available based on a heuristic (in
>> particular when a realtime-priority task is scheduled).  We're
>> carrying
>> out some additional testing, I'll post the code here eventually.
>
> Please try sched-util governor also. There is a frequency-invariant
> patch, which I can send you (This eventually will be pushed by Peter).
> We want to avoid complexity to intel-pstate for non HWP power sensitive
> platforms as far as possible.
>

Unfortunately the schedutil governor (whether frequency-invariant or
not) has the exact same energy efficiency issues as the present
intel_pstate non-HWP governor.  Its response is severely underdamped,
leading to energy-inefficient behavior for any oscillating
non-CPU-bound workload.  To exacerbate that problem, the frequency is
maxed out on frequent IO waits just like the current intel_pstate
cpu-load controller does, even though frequent IO waits may actually be
an indication that the system is IO-bound (which means that the large
energy usage increase may not translate into any performance benefit in
practice, not to mention performance being impacted negatively in
TDP-bound scenarios like GPU rendering).
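
To make "underdamped" more concrete: the kind of damping I'm referring
to is a low-pass filter on the frequency target, so that an oscillating
load doesn't cause the controller to chase every peak of the
oscillation.  A minimal sketch of the idea follows -- this is not the
PATCH 6 code, gain_pct is a made-up fixed parameter, and the actual
controller varies the effective gain depending on its heuristic state:

static int lp_filter_target(int filtered_pstate, int raw_target,
                            int gain_pct)
{
        /*
         * Move a fraction of the way towards the raw target on each
         * update period; gain_pct < 100 damps an oscillating input
         * instead of following it all the way up and down.
         */
        return filtered_pstate +
               (raw_target - filtered_pstate) * gain_pct / 100;
}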

Regarding run-time complexity, I haven't observed this governor to be
measurably more computationally intensive than the present one.  It
does execute a bunch more instructions, but it's still within the same
ballpark as the current governor.  The average increase in CPU
utilization on my BXT with this series is less than 0.03% (sampled via
ftrace for v1; I can repeat the measurement for the v2 I have in the
works, though I don't expect the result to be substantially different).
If this is a problem for you, there are several optimization
opportunities that would cut down the number of CPU cycles
get_target_pstate_lp() takes to execute by a large percentage (most of
the optimization ideas I can think of right now would come at some
accuracy/maintainability/debuggability cost, but may still be worth
pursuing).  However, the computational overhead is low enough at this
point that the impact on any benchmark or real workload would be orders
of magnitude lower than its variance, which makes it difficult to keep
the discussion data-driven [as possibly any performance optimization
discussion should ever be ;)].

>
> Thanks,
> Srinivas
>
>
>
>> 
>> > [Absolute benchmark results are unfortunately omitted from this
>> > letter
>> > due to company policies, but the percent change and Student's T
>> > p-value are included above and in the referenced benchmark results]
>> > 
>> > The most obvious impact of this series will likely be the overall
>> > improvement in graphics performance on systems with an IGP
>> > integrated
>> > into the processor package (though for the moment this is only
>> > enabled
>> > on BXT+), because the TDP budget shared among CPU and GPU can
>> > frequently become a limiting factor in low-power devices.  On
>> > heavily
>> > TDP-bound devices this series improves performance of virtually any
>> > non-trivial graphics rendering by a significant amount (of the
>> > order
>> > of the energy efficiency improvement for that workload assuming the
>> > optimization didn't cause it to become non-TDP-bound).
>> > 
>> > See [1]-[5] for detailed numbers including various graphics
>> > benchmarks
>> > and a sample of the Phoronix daily-system-tracker.  Some popular
>> > graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
>> > between 5% and 11% on our systems.  The exact improvement can vary
>> > substantially between systems (compare the benchmark results from
>> > the
>> > two different J3455 systems [1] and [3]) due to a number of
>> > factors,
>> > including the ratio between CPU and GPU processing power, the
>> > behavior
>> > of the userspace graphics driver, the windowing system and
>> > resolution,
>> > the BIOS (which has an influence on the package TDP), the thermal
>> > characteristics of the system, etc.
>> > 
>> > Unigine Valley and Heaven improve by a similar factor on some
>> > systems
>> > (see the J3455 results [1]), but on others the improvement is lower
>> > because the benchmark fails to fully utilize the GPU, which causes
>> > the
>> > heuristic to remain in low-latency state for longer, which leaves a
>> > reduced TDP budget available to the GPU, which prevents performance
>> > from increasing further.  This can be avoided by using the
>> > alternative
>> > heuristic parameters suggested in the commit message of PATCH 8,
>> > which
>> > provide a lower IO utilization threshold and hysteresis for the
>> > controller to attempt to save energy.  I'm not proposing those for
>> > upstream (yet) because they would also increase the risk for
>> > latency-sensitive IO-heavy workloads to regress (like SynMark2
>> > OglTerrainFly* and some arguably poorly designed IPC-bound X11
>> > benchmarks).
>> > 
>> > Discrete graphics aren't likely to experience that much of a
>> > visible
>> > improvement from this, even though many non-IGP workloads *could*
>> > benefit by reducing the system's energy usage while the discrete
>> > GPU
>> > (or really, any other IO device) becomes a bottleneck, but this is
>> > not
>> > attempted in this series, since that would involve making an energy
>> > efficiency/latency trade-off that only the maintainers of the
>> > respective drivers are in a position to make.  The cpufreq
>> > interface
>> > introduced in PATCH 1 to achieve this is left as an opt-in for that
>> > reason, only the i915 DRM driver is hooked up since it will get the
>> > most direct pay-off due to the increased energy budget available to
>> > the GPU, but other power-hungry third-party gadgets built into the
>> > same package (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may
>> > be
>> > able to benefit from this interface eventually by instrumenting the
>> > driver in a similar way.
>> > 
>> > The cpufreq interface is not exclusively tied to the intel_pstate
>> > driver, because other governors can make use of the statistic
>> > calculated as result to avoid over-optimizing for latency in
>> > scenarios
>> > where a lower frequency would be able to achieve similar throughput
>> > while using less energy.  The interpretation of this statistic
>> > relies
>> > on the observation that for as long as the system is CPU-bound, any
>> > IO
>> > load occurring as a result of the execution of a program will scale
>> > roughly linearly with the clock frequency the program is run at, so
>> > (assuming that the CPU has enough processing power) a point will be
>> > reached at which the program won't be able to execute faster with
>> > increasing CPU frequency because the throughput limits of some
>> > device
>> > will have been attained.  Increasing frequencies past that point
>> > only
>> > pessimizes energy usage for no real benefit -- The optimal behavior
>> > is
>> > for the CPU to lock to the minimum frequency that is able to keep
>> > the
>> > IO devices involved fully utilized (assuming we are past the
>> > maximum-efficiency inflection point of the CPU's power-to-frequency
>> > curve), which is roughly the goal of this series.
>> > 
>> > PELT could be a useful extension for this model since its largely
>> > heuristic assumptions would become more accurate if the IO and CPU
>> > load could be tracked separately for each scheduling entity, but
>> > this
>> > is not attempted in this series because the additional complexity
>> > and
>> > computational cost of such an approach is hard to justify at this
>> > stage, particularly since the current governor has similar
>> > limitations.
>> > 
>> > Various frequency and step-function response graphs are available
>> > in
>> > [6]-[9] for comparison (obtained empirically on a BXT J3455
>> > system).
>> > The response curves for the low-latency and low-power states of the
>> > heuristic are shown separately -- As you can see they roughly
>> > bracket
>> > the frequency response curve of the current governor.  The step
>> > response of the aggressive heuristic is within a single update
>> > period
>> > (even though it's not quite obvious from the graph with the levels
>> > of
>> > zoom provided).  I'll attach benchmark results from a slower but
>> > non-TDP-limited machine (which means there will be no TDP budget
>> > increase that could possibly mask a performance regression of other
>> > kind) as soon as they come out.
>> > 
>> > Thanks to Eero and Valtteri for testing a number of intermediate
>> > revisions of this series (and there were quite a few of them) in
>> > more
>> > than half a dozen systems, they helped spot quite a few issues of
>> > earlier versions of this heuristic.
>> > 
>> > [PATCH 1/9] cpufreq: Implement infrastructure keeping track of
>> > aggregated IO active time.
>> > [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with
>> > core_funcs"
>> > [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long
>> > names"
>> > [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify
>> > intel_pstate_adjust_pstate()"
>> > [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from
>> > pstate_funcs"
>> > [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass
>> > filtering controller for small core.
>> > [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller
>> > based on ACPI FADT profile.
>> > [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller
>> > parameters via debugfs.
>> > [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity
>> > to cpufreq.
>> > 
>> > [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
>> > mark-perf-comparison-J3455.log
>> > [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
>> > mark-perf-per-watt-comparison-J3455.log
>> > [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
>> > mark-perf-comparison-J3455-1.log
>> > [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
>> > mark-perf-comparison-J4205.log
>> > [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
>> > mark-perf-comparison-J5005.log
>> > [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequ
>> > ency-response-magnitude-comparison.svg
>> > [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequ
>> > ency-response-phase-comparison.svg
>> > [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-
>> > response-comparison-1.svg
>> > [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-
>> > response-comparison-2.svg


* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-11 16:10     ` Francisco Jerez
@ 2018-04-11 16:26       ` Francisco Jerez
  2018-04-11 17:35         ` Juri Lelli
                           ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Francisco Jerez @ 2018-04-11 16:26 UTC (permalink / raw)
  To: Srinivas Pandruvada, linux-pm, intel-gfx
  Cc: Peter Zijlstra, Eero Tamminen, Rafael J. Wysocki



Francisco Jerez <currojerez@riseup.net> writes:

> Hi Srinivas,
>
> Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
>
>> On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote:
>>> Francisco Jerez <currojerez@riseup.net> writes:
>>> 
>> [...]
>>
>>
>>> For the case anyone is wondering what's going on, Srinivas pointed me
>>> at
>>> a larger idle power usage increase off-list, ultimately caused by the
>>> low-latency heuristic as discussed in the paragraph above.  I have a
>>> v2
>>> of PATCH 6 that gives the controller a third response curve roughly
>>> intermediate between the low-latency and low-power states of this
>>> revision, which avoids the energy usage increase while C0 residency
>>> is
>>> low (e.g. during idle) expected for v1.  The low-latency behavior of
>>> this revision is still going to be available based on a heuristic (in
>>> particular when a realtime-priority task is scheduled).  We're
>>> carrying
>>> out some additional testing, I'll post the code here eventually.
>>
>> Please try sched-util governor also. There is a frequency-invariant
>> patch, which I can send you (This eventually will be pushed by Peter).
>> We want to avoid complexity to intel-pstate for non HWP power sensitive
>> platforms as far as possible.
>>
>
> Unfortunately the schedutil governor (whether frequency invariant or
> not) has the exact same energy efficiency issues as the present
> intel_pstate non-HWP governor.  Its response is severely underdamped
> leading to energy-inefficient behavior for any oscillating non-CPU-bound
> workload.  To exacerbate that problem the frequency is maxed out on
> frequent IO waiting just like the current intel_pstate cpu-load

"just like" here is possibly somewhat unfair to the schedutil governor,
as admittedly its progressive IOWAIT boosting behavior seems somewhat
less wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
behavior; even so, it's still largely unhelpful under IO-bound
conditions.
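
For reference, by "progressive" I mean roughly the following behavior,
paraphrased from my reading of kernel/sched/cpufreq_schedutil.c rather
than quoted from it: the boost starts small, doubles on each successive
IO wakeup up to a maximum, and decays by half when no further IO wakeup
is seen.  As a simplified, kernel-style model (bool as in
<linux/types.h>):

static unsigned int iowait_boost_next(unsigned int boost,
                                      unsigned int boost_min,
                                      unsigned int boost_max,
                                      bool io_wakeup)
{
        if (io_wakeup) {
                if (!boost)
                        return boost_min;       /* first IO wakeup */
                return boost * 2 > boost_max ? boost_max : boost * 2;
        }

        return boost / 2;       /* decay when no IO wakeup was seen */
}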

> controller does, even though the frequent IO waits may actually be an
> indication that the system is IO-bound (which means that the large
> energy usage increase may not be translated in any performance benefit
> in practice, not to speak of performance being impacted negatively in
> TDP-bound scenarios like GPU rendering).
>
> Regarding run-time complexity, I haven't observed this governor to be
> measurably more computationally intensive than the present one.  It's a
> bunch more instructions indeed, but still within the same ballpark as
> the current governor.  The average increase in CPU utilization on my BXT
> with this series is less than 0.03% (sampled via ftrace for v1, I can
> repeat the measurement for the v2 I have in the works, though I don't
> expect the result to be substantially different).  If this is a problem
> for you there are several optimization opportunities that would cut down
> the number of CPU cycles get_target_pstate_lp() takes to execute by a
> large percent (most of the optimization ideas I can think of right now
> though would come at some accuracy/maintainability/debuggability cost,
> but may still be worth pursuing), but the computational overhead is low
> enough at this point that the impact on any benchmark or real workload
> would be orders of magnitude lower than its variance, which makes it
> kind of difficult to keep the discussion data-driven [as possibly any
> performance optimization discussion should ever be ;)].
>
>>
>> Thanks,
>> Srinivas
>>
>>
>>
>>> 
>>> > [Absolute benchmark results are unfortunately omitted from this
>>> > letter
>>> > due to company policies, but the percent change and Student's T
>>> > p-value are included above and in the referenced benchmark results]
>>> > 
>>> > The most obvious impact of this series will likely be the overall
>>> > improvement in graphics performance on systems with an IGP
>>> > integrated
>>> > into the processor package (though for the moment this is only
>>> > enabled
>>> > on BXT+), because the TDP budget shared among CPU and GPU can
>>> > frequently become a limiting factor in low-power devices.  On
>>> > heavily
>>> > TDP-bound devices this series improves performance of virtually any
>>> > non-trivial graphics rendering by a significant amount (of the
>>> > order
>>> > of the energy efficiency improvement for that workload assuming the
>>> > optimization didn't cause it to become non-TDP-bound).
>>> > 
>>> > See [1]-[5] for detailed numbers including various graphics
>>> > benchmarks
>>> > and a sample of the Phoronix daily-system-tracker.  Some popular
>>> > graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
>>> > between 5% and 11% on our systems.  The exact improvement can vary
>>> > substantially between systems (compare the benchmark results from
>>> > the
>>> > two different J3455 systems [1] and [3]) due to a number of
>>> > factors,
>>> > including the ratio between CPU and GPU processing power, the
>>> > behavior
>>> > of the userspace graphics driver, the windowing system and
>>> > resolution,
>>> > the BIOS (which has an influence on the package TDP), the thermal
>>> > characteristics of the system, etc.
>>> > 
>>> > Unigine Valley and Heaven improve by a similar factor on some
>>> > systems
>>> > (see the J3455 results [1]), but on others the improvement is lower
>>> > because the benchmark fails to fully utilize the GPU, which causes
>>> > the
>>> > heuristic to remain in low-latency state for longer, which leaves a
>>> > reduced TDP budget available to the GPU, which prevents performance
>>> > from increasing further.  This can be avoided by using the
>>> > alternative
>>> > heuristic parameters suggested in the commit message of PATCH 8,
>>> > which
>>> > provide a lower IO utilization threshold and hysteresis for the
>>> > controller to attempt to save energy.  I'm not proposing those for
>>> > upstream (yet) because they would also increase the risk for
>>> > latency-sensitive IO-heavy workloads to regress (like SynMark2
>>> > OglTerrainFly* and some arguably poorly designed IPC-bound X11
>>> > benchmarks).
>>> > 
>>> > Discrete graphics aren't likely to experience that much of a
>>> > visible
>>> > improvement from this, even though many non-IGP workloads *could*
>>> > benefit by reducing the system's energy usage while the discrete
>>> > GPU
>>> > (or really, any other IO device) becomes a bottleneck, but this is
>>> > not
>>> > attempted in this series, since that would involve making an energy
>>> > efficiency/latency trade-off that only the maintainers of the
>>> > respective drivers are in a position to make.  The cpufreq
>>> > interface
>>> > introduced in PATCH 1 to achieve this is left as an opt-in for that
>>> > reason, only the i915 DRM driver is hooked up since it will get the
>>> > most direct pay-off due to the increased energy budget available to
>>> > the GPU, but other power-hungry third-party gadgets built into the
>>> > same package (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may
>>> > be
>>> > able to benefit from this interface eventually by instrumenting the
>>> > driver in a similar way.
>>> > 
>>> > The cpufreq interface is not exclusively tied to the intel_pstate
>>> > driver, because other governors can make use of the statistic
>>> > calculated as result to avoid over-optimizing for latency in
>>> > scenarios
>>> > where a lower frequency would be able to achieve similar throughput
>>> > while using less energy.  The interpretation of this statistic
>>> > relies
>>> > on the observation that for as long as the system is CPU-bound, any
>>> > IO
>>> > load occurring as a result of the execution of a program will scale
>>> > roughly linearly with the clock frequency the program is run at, so
>>> > (assuming that the CPU has enough processing power) a point will be
>>> > reached at which the program won't be able to execute faster with
>>> > increasing CPU frequency because the throughput limits of some
>>> > device
>>> > will have been attained.  Increasing frequencies past that point
>>> > only
>>> > pessimizes energy usage for no real benefit -- The optimal behavior
>>> > is
>>> > for the CPU to lock to the minimum frequency that is able to keep
>>> > the
>>> > IO devices involved fully utilized (assuming we are past the
>>> > maximum-efficiency inflection point of the CPU's power-to-frequency
>>> > curve), which is roughly the goal of this series.
>>> > 
>>> > PELT could be a useful extension for this model since its largely
>>> > heuristic assumptions would become more accurate if the IO and CPU
>>> > load could be tracked separately for each scheduling entity, but
>>> > this
>>> > is not attempted in this series because the additional complexity
>>> > and
>>> > computational cost of such an approach is hard to justify at this
>>> > stage, particularly since the current governor has similar
>>> > limitations.
>>> > 
>>> > Various frequency and step-function response graphs are available
>>> > in
>>> > [6]-[9] for comparison (obtained empirically on a BXT J3455
>>> > system).
>>> > The response curves for the low-latency and low-power states of the
>>> > heuristic are shown separately -- As you can see they roughly
>>> > bracket
>>> > the frequency response curve of the current governor.  The step
>>> > response of the aggressive heuristic is within a single update
>>> > period
>>> > (even though it's not quite obvious from the graph with the levels
>>> > of
>>> > zoom provided).  I'll attach benchmark results from a slower but
>>> > non-TDP-limited machine (which means there will be no TDP budget
>>> > increase that could possibly mask a performance regression of other
>>> > kind) as soon as they come out.
>>> > 
>>> > Thanks to Eero and Valtteri for testing a number of intermediate
>>> > revisions of this series (and there were quite a few of them) in
>>> > more
>>> > than half a dozen systems, they helped spot quite a few issues of
>>> > earlier versions of this heuristic.
>>> > 
>>> > [PATCH 1/9] cpufreq: Implement infrastructure keeping track of
>>> > aggregated IO active time.
>>> > [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with
>>> > core_funcs"
>>> > [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long
>>> > names"
>>> > [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify
>>> > intel_pstate_adjust_pstate()"
>>> > [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from
>>> > pstate_funcs"
>>> > [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass
>>> > filtering controller for small core.
>>> > [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller
>>> > based on ACPI FADT profile.
>>> > [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller
>>> > parameters via debugfs.
>>> > [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity
>>> > to cpufreq.
>>> > 
>>> > [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
>>> > mark-perf-comparison-J3455.log
>>> > [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
>>> > mark-perf-per-watt-comparison-J3455.log
>>> > [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
>>> > mark-perf-comparison-J3455-1.log
>>> > [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
>>> > mark-perf-comparison-J4205.log
>>> > [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/bench
>>> > mark-perf-comparison-J5005.log
>>> > [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequ
>>> > ency-response-magnitude-comparison.svg
>>> > [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequ
>>> > ency-response-phase-comparison.svg
>>> > [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-
>>> > response-comparison-1.svg
>>> > [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-
>>> > response-comparison-2.svg


* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-11 16:26       ` Francisco Jerez
@ 2018-04-11 17:35         ` Juri Lelli
  2018-04-12 21:38           ` Francisco Jerez
  2018-04-12  6:17         ` Srinivas Pandruvada
  2018-04-12  8:58         ` Peter Zijlstra
  2 siblings, 1 reply; 39+ messages in thread
From: Juri Lelli @ 2018-04-11 17:35 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: Javi Merino, linux-pm, Peter Zijlstra, intel-gfx,
	Rafael J. Wysocki, Morten Rasmussen, Patrick Bellasi,
	Srinivas Pandruvada, Eero Tamminen, Dietmar Eggemann

Hi,

On 11/04/18 09:26, Francisco Jerez wrote:
> Francisco Jerez <currojerez@riseup.net> writes:
> 
> > Hi Srinivas,
> >
> > Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
> >
> >> On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote:
> >>> Francisco Jerez <currojerez@riseup.net> writes:
> >>> 
> >> [...]
> >>
> >>
> >>> For the case anyone is wondering what's going on, Srinivas pointed me
> >>> at
> >>> a larger idle power usage increase off-list, ultimately caused by the
> >>> low-latency heuristic as discussed in the paragraph above.  I have a
> >>> v2
> >>> of PATCH 6 that gives the controller a third response curve roughly
> >>> intermediate between the low-latency and low-power states of this
> >>> revision, which avoids the energy usage increase while C0 residency
> >>> is
> >>> low (e.g. during idle) expected for v1.  The low-latency behavior of
> >>> this revision is still going to be available based on a heuristic (in
> >>> particular when a realtime-priority task is scheduled).  We're
> >>> carrying
> >>> out some additional testing, I'll post the code here eventually.
> >>
> >> Please try sched-util governor also. There is a frequency-invariant
> >> patch, which I can send you (This eventually will be pushed by Peter).
> >> We want to avoid complexity to intel-pstate for non HWP power sensitive
> >> platforms as far as possible.
> >>
> >
> > Unfortunately the schedutil governor (whether frequency invariant or
> > not) has the exact same energy efficiency issues as the present
> > intel_pstate non-HWP governor.  Its response is severely underdamped
> > leading to energy-inefficient behavior for any oscillating non-CPU-bound
> > workload.  To exacerbate that problem the frequency is maxed out on
> > frequent IO waiting just like the current intel_pstate cpu-load
> 
> "just like" here is possibly somewhat unfair to the schedutil governor,
> admittedly its progressive IOWAIT boosting behavior seems somewhat less
> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
> behavior, but it's still largely unhelpful on IO-bound conditions.

Sorry if I'm jumping in out of the blue, but what you are trying to
solve looks very similar to what IPA [1] is targeting as well. I might
be wrong (I'll try to spend more time reviewing your set), but my first
impression is that we should try to solve similar problems with a more
general approach that could benefit different systems and
architectures.

I'm Cc-ing some Arm folks...

Best,

- Juri

[1] https://developer.arm.com/open-source/intelligent-power-allocation

* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-11 16:26       ` Francisco Jerez
  2018-04-11 17:35         ` Juri Lelli
@ 2018-04-12  6:17         ` Srinivas Pandruvada
  2018-04-14  2:00           ` Francisco Jerez
  2018-04-12  8:58         ` Peter Zijlstra
  2 siblings, 1 reply; 39+ messages in thread
From: Srinivas Pandruvada @ 2018-04-12  6:17 UTC (permalink / raw)
  To: Francisco Jerez, linux-pm, intel-gfx
  Cc: Peter Zijlstra, Eero Tamminen, Rafael J. Wysocki

On Wed, 2018-04-11 at 09:26 -0700, Francisco Jerez wrote:
> 
> "just like" here is possibly somewhat unfair to the schedutil
> governor,
> admittedly its progressive IOWAIT boosting behavior seems somewhat
> less
> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
> behavior, but it's still largely unhelpful on IO-bound conditions.
> 

OK, if you think so, then improve it for the schedutil governor or
other mechanisms (as Juri suggested) instead of intel_pstate. This will
benefit all architectures, including x86 + non-i915.

BTW, intel_pstate can be driven by the schedutil governor (passive
mode), so if you prove benefits on Broxton, this can become the
default. As before:
- No regression in idle power at all. This is more important than
  benchmarks.
- Not just scores; performance/watt is important.

Thanks,
Srinivas


> > controller does, even though the frequent IO waits may actually be
> > an
> > indication that the system is IO-bound (which means that the large
> > energy usage increase may not be translated in any performance
> > benefit
> > in practice, not to speak of performance being impacted negatively
> > in
> > TDP-bound scenarios like GPU rendering).
> > 
> > Regarding run-time complexity, I haven't observed this governor to
> > be
> > measurably more computationally intensive than the present
> > one.  It's a
> > bunch more instructions indeed, but still within the same ballpark
> > as
> > the current governor.  The average increase in CPU utilization on
> > my BXT
> > with this series is less than 0.03% (sampled via ftrace for v1, I
> > can
> > repeat the measurement for the v2 I have in the works, though I
> > don't
> > expect the result to be substantially different).  If this is a
> > problem
> > for you there are several optimization opportunities that would cut
> > down
> > the number of CPU cycles get_target_pstate_lp() takes to execute by
> > a
> > large percent (most of the optimization ideas I can think of right
> > now
> > though would come at some accuracy/maintainability/debuggability
> > cost,
> > but may still be worth pursuing), but the computational overhead is
> > low
> > enough at this point that the impact on any benchmark or real
> > workload
> > would be orders of magnitude lower than its variance, which makes
> > it
> > kind of difficult to keep the discussion data-driven [as possibly
> > any
> > performance optimization discussion should ever be ;)].
> > 
> > > 
> > > Thanks,
> > > Srinivas
> > > 
> > > 
> > > 
> > > > 
> > > > > [Absolute benchmark results are unfortunately omitted from
> > > > > this
> > > > > letter
> > > > > due to company policies, but the percent change and Student's
> > > > > T
> > > > > p-value are included above and in the referenced benchmark
> > > > > results]
> > > > > 
> > > > > The most obvious impact of this series will likely be the
> > > > > overall
> > > > > improvement in graphics performance on systems with an IGP
> > > > > integrated
> > > > > into the processor package (though for the moment this is
> > > > > only
> > > > > enabled
> > > > > on BXT+), because the TDP budget shared among CPU and GPU can
> > > > > frequently become a limiting factor in low-power devices.  On
> > > > > heavily
> > > > > TDP-bound devices this series improves performance of
> > > > > virtually any
> > > > > non-trivial graphics rendering by a significant amount (of
> > > > > the
> > > > > order
> > > > > of the energy efficiency improvement for that workload
> > > > > assuming the
> > > > > optimization didn't cause it to become non-TDP-bound).
> > > > > 
> > > > > See [1]-[5] for detailed numbers including various graphics
> > > > > benchmarks
> > > > > and a sample of the Phoronix daily-system-tracker.  Some
> > > > > popular
> > > > > graphics benchmarks like GfxBench gl_manhattan31 and gl_4
> > > > > improve
> > > > > between 5% and 11% on our systems.  The exact improvement can
> > > > > vary
> > > > > substantially between systems (compare the benchmark results
> > > > > from
> > > > > the
> > > > > two different J3455 systems [1] and [3]) due to a number of
> > > > > factors,
> > > > > including the ratio between CPU and GPU processing power, the
> > > > > behavior
> > > > > of the userspace graphics driver, the windowing system and
> > > > > resolution,
> > > > > the BIOS (which has an influence on the package TDP), the
> > > > > thermal
> > > > > characteristics of the system, etc.
> > > > > 
> > > > > Unigine Valley and Heaven improve by a similar factor on some
> > > > > systems
> > > > > (see the J3455 results [1]), but on others the improvement is
> > > > > lower
> > > > > because the benchmark fails to fully utilize the GPU, which
> > > > > causes
> > > > > the
> > > > > heuristic to remain in low-latency state for longer, which
> > > > > leaves a
> > > > > reduced TDP budget available to the GPU, which prevents
> > > > > performance
> > > > > from increasing further.  This can be avoided by using the
> > > > > alternative
> > > > > heuristic parameters suggested in the commit message of PATCH
> > > > > 8,
> > > > > which
> > > > > provide a lower IO utilization threshold and hysteresis for
> > > > > the
> > > > > controller to attempt to save energy.  I'm not proposing
> > > > > those for
> > > > > upstream (yet) because they would also increase the risk for
> > > > > latency-sensitive IO-heavy workloads to regress (like
> > > > > SynMark2
> > > > > OglTerrainFly* and some arguably poorly designed IPC-bound
> > > > > X11
> > > > > benchmarks).
> > > > > 
> > > > > Discrete graphics aren't likely to experience that much of a
> > > > > visible
> > > > > improvement from this, even though many non-IGP workloads
> > > > > *could*
> > > > > benefit by reducing the system's energy usage while the
> > > > > discrete
> > > > > GPU
> > > > > (or really, any other IO device) becomes a bottleneck, but
> > > > > this is
> > > > > not
> > > > > attempted in this series, since that would involve making an
> > > > > energy
> > > > > efficiency/latency trade-off that only the maintainers of the
> > > > > respective drivers are in a position to make.  The cpufreq
> > > > > interface
> > > > > introduced in PATCH 1 to achieve this is left as an opt-in
> > > > > for that
> > > > > reason, only the i915 DRM driver is hooked up since it will
> > > > > get the
> > > > > most direct pay-off due to the increased energy budget
> > > > > available to
> > > > > the GPU, but other power-hungry third-party gadgets built
> > > > > into the
> > > > > same package (*cough* AMD *cough* Mali *cough* PowerVR
> > > > > *cough*) may
> > > > > be
> > > > > able to benefit from this interface eventually by
> > > > > instrumenting the
> > > > > driver in a similar way.
> > > > > 
> > > > > The cpufreq interface is not exclusively tied to the
> > > > > intel_pstate
> > > > > driver, because other governors can make use of the statistic
> > > > > calculated as result to avoid over-optimizing for latency in
> > > > > scenarios
> > > > > where a lower frequency would be able to achieve similar
> > > > > throughput
> > > > > while using less energy.  The interpretation of this
> > > > > statistic
> > > > > relies
> > > > > on the observation that for as long as the system is CPU-
> > > > > bound, any
> > > > > IO
> > > > > load occurring as a result of the execution of a program will
> > > > > scale
> > > > > roughly linearly with the clock frequency the program is run
> > > > > at, so
> > > > > (assuming that the CPU has enough processing power) a point
> > > > > will be
> > > > > reached at which the program won't be able to execute faster
> > > > > with
> > > > > increasing CPU frequency because the throughput limits of
> > > > > some
> > > > > device
> > > > > will have been attained.  Increasing frequencies past that
> > > > > point
> > > > > only
> > > > > pessimizes energy usage for no real benefit -- The optimal
> > > > > behavior
> > > > > is
> > > > > for the CPU to lock to the minimum frequency that is able to
> > > > > keep
> > > > > the
> > > > > IO devices involved fully utilized (assuming we are past the
> > > > > maximum-efficiency inflection point of the CPU's power-to-
> > > > > frequency
> > > > > curve), which is roughly the goal of this series.
> > > > > 
> > > > > PELT could be a useful extension for this model since its
> > > > > largely
> > > > > heuristic assumptions would become more accurate if the IO
> > > > > and CPU
> > > > > load could be tracked separately for each scheduling entity,
> > > > > but
> > > > > this
> > > > > is not attempted in this series because the additional
> > > > > complexity
> > > > > and
> > > > > computational cost of such an approach is hard to justify at
> > > > > this
> > > > > stage, particularly since the current governor has similar
> > > > > limitations.
> > > > > 
> > > > > Various frequency and step-function response graphs are
> > > > > available
> > > > > in
> > > > > [6]-[9] for comparison (obtained empirically on a BXT J3455
> > > > > system).
> > > > > The response curves for the low-latency and low-power states
> > > > > of the
> > > > > heuristic are shown separately -- As you can see they roughly
> > > > > bracket
> > > > > the frequency response curve of the current governor.  The
> > > > > step
> > > > > response of the aggressive heuristic is within a single
> > > > > update
> > > > > period
> > > > > (even though it's not quite obvious from the graph with the
> > > > > levels
> > > > > of
> > > > > zoom provided).  I'll attach benchmark results from a slower
> > > > > but
> > > > > non-TDP-limited machine (which means there will be no TDP
> > > > > budget
> > > > > increase that could possibly mask a performance regression of
> > > > > other
> > > > > kind) as soon as they come out.
> > > > > 
> > > > > Thanks to Eero and Valtteri for testing a number of
> > > > > intermediate
> > > > > revisions of this series (and there were quite a few of them)
> > > > > in
> > > > > more
> > > > > than half a dozen systems, they helped spot quite a few
> > > > > issues of
> > > > > earlier versions of this heuristic.
> > > > > 
> > > > > [PATCH 1/9] cpufreq: Implement infrastructure keeping track
> > > > > of
> > > > > aggregated IO active time.
> > > > > [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs
> > > > > with
> > > > > core_funcs"
> > > > > [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple
> > > > > of long
> > > > > names"
> > > > > [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify
> > > > > intel_pstate_adjust_pstate()"
> > > > > [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util
> > > > > from
> > > > > pstate_funcs"
> > > > > [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass
> > > > > filtering controller for small core.
> > > > > [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP
> > > > > controller
> > > > > based on ACPI FADT profile.
> > > > > [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP
> > > > > controller
> > > > > parameters via debugfs.
> > > > > [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO
> > > > > activity
> > > > > to cpufreq.
> > > > > 
> > > > > [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp
> > > > > /bench
> > > > > mark-perf-comparison-J3455.log
> > > > > [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp
> > > > > /bench
> > > > > mark-perf-per-watt-comparison-J3455.log
> > > > > [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp
> > > > > /bench
> > > > > mark-perf-comparison-J3455-1.log
> > > > > [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp
> > > > > /bench
> > > > > mark-perf-comparison-J4205.log
> > > > > [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp
> > > > > /bench
> > > > > mark-perf-comparison-J5005.log
> > > > > [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp
> > > > > /frequ
> > > > > ency-response-magnitude-comparison.svg
> > > > > [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp
> > > > > /frequ
> > > > > ency-response-phase-comparison.svg
> > > > > [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp
> > > > > /step-
> > > > > response-comparison-1.svg
> > > > > [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp
> > > > > /step-
> > > > > response-comparison-2.svg

* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-11 16:26       ` Francisco Jerez
  2018-04-11 17:35         ` Juri Lelli
  2018-04-12  6:17         ` Srinivas Pandruvada
@ 2018-04-12  8:58         ` Peter Zijlstra
  2018-04-12 18:34           ` Francisco Jerez
  2 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2018-04-12  8:58 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: intel-gfx, linux-pm, Rafael J. Wysocki, Eero Tamminen,
	Srinivas Pandruvada

On Wed, Apr 11, 2018 at 09:26:11AM -0700, Francisco Jerez wrote:
> "just like" here is possibly somewhat unfair to the schedutil governor,
> admittedly its progressive IOWAIT boosting behavior seems somewhat less
> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
> behavior, but it's still largely unhelpful on IO-bound conditions.

So you understand why we need the iowait boosting right?

It is just that when we get back to runnable, we want to process the
next data packet ASAP. See also here:

  https://lkml.kernel.org/r/20170522082154.f57cqovterd2qajv@hirez.programming.kicks-ass.net

What I don't really understand is why it is costing so much power; after
all, when we're in iowait the CPU is mostly idle and can power-gate.

* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-12  8:58         ` Peter Zijlstra
@ 2018-04-12 18:34           ` Francisco Jerez
  2018-04-12 19:33             ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Francisco Jerez @ 2018-04-12 18:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: intel-gfx, linux-pm, Rafael J. Wysocki, Eero Tamminen,
	Srinivas Pandruvada


Peter Zijlstra <peterz@infradead.org> writes:

> On Wed, Apr 11, 2018 at 09:26:11AM -0700, Francisco Jerez wrote:
>> "just like" here is possibly somewhat unfair to the schedutil governor,
>> admittedly its progressive IOWAIT boosting behavior seems somewhat less
>> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
>> behavior, but it's still largely unhelpful on IO-bound conditions.
>
> So you understand why we need the iowait boosting right?
>

Yeah, sort of.  The latency-minimizing state of this governor provides a
comparable effect, but it's based on a pessimistic estimate of the
frequency required for the workload to achieve maximum throughput
(rather than a plain or exponential boost up to the max frequency, which
can deviate substantially from that estimate -- see the explanation in
PATCH 6 for more details).  It's enabled under conditions partially
overlapping but not identical to iowait boosting: the optimization is
not applied under IO-bound conditions (in order to avoid impacting
energy efficiency negatively for zero or negative payoff); on the other
hand it is applied in some cases where the current governor wouldn't
boost, like for RT-priority threads (that's the main difference from the
v2 I'm planning to send out next week).

> It is just that when we get back to runnable, we want to process the
> next data packet ASAP. See also here:
>
>   https://lkml.kernel.org/r/20170522082154.f57cqovterd2qajv@hirez.programming.kicks-ass.net
>
> What I don't really understand is why it is costing so much power; after
> all, when we're in iowait the CPU is mostly idle and can power-gate.

The reason for the energy efficiency problem of iowait boosting is
precisely the greater oscillation between turbo and idle.  Say that
iowait boost increases the frequency by a factor alpha relative to the
optimal frequency f0 (in terms of energy efficiency) required to execute
some IO-bound workload.  This will cause the CPU to be busy for a
fraction of the time it was busy originally, approximately t1 = t0 /
alpha, which indeed divides the overall energy usage by a factor alpha,
but at the same time multiplies the instantaneous power consumption
while busy by a factor potentially much greater than alpha, since the
CPU's power curve is largely non-linear, and in fact approximately
convex within the frequency range allowed by the policy, so you get an
average energy usage possibly much greater than the optimal.
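
To put rough numbers on that, here is a toy model (illustrative only --
it assumes a cubic dynamic-power curve P(f) = k*f^3, which is a common
first-order approximation when voltage scales roughly linearly with
frequency, and it is not code from this series) comparing the energy
needed to get the same amount of CPU work done at f0 and at alpha*f0:

/*
 * Toy model only: any superlinear (convex) power curve gives the same
 * qualitative result, the cubic curve is just a convenient example.
 */
#include <stdio.h>

static double cpu_power(double k, double f)
{
	return k * f * f * f;	/* instantaneous power at frequency f */
}

int main(void)
{
	const double k = 1.0;		/* arbitrary scale factor */
	const double f0 = 1.0;		/* optimal frequency for the IO-bound load */
	const double t0 = 1.0;		/* CPU busy time per second at f0 */
	const double alpha = 2.0;	/* boost factor applied on IO wait */

	/* The same amount of work completes faster at alpha * f0. */
	const double t1 = t0 / alpha;

	const double e0 = cpu_power(k, f0) * t0;		/* energy at f0 */
	const double e1 = cpu_power(k, alpha * f0) * t1;	/* energy when boosted */

	printf("energy at f0: %.2f, energy boosted: %.2f (%.1fx)\n",
	       e0, e1, e1 / e0);
	return 0;
}

The boosted case is busy for only half the time, but its instantaneous
power is eight times higher, so the energy spent on the same work ends
up four times larger -- and the gap only grows with alpha or with a
steeper power curve.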


* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-12 18:34           ` Francisco Jerez
@ 2018-04-12 19:33             ` Peter Zijlstra
  2018-04-12 19:55               ` Francisco Jerez
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2018-04-12 19:33 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: intel-gfx, linux-pm, Rafael J. Wysocki, Eero Tamminen,
	Srinivas Pandruvada

On Thu, Apr 12, 2018 at 11:34:54AM -0700, Francisco Jerez wrote:
> The reason for the energy efficiency problem of iowait boosting is
> precisely the greater oscillation between turbo and idle.  Say that
> iowait boost increases the frequency by a factor alpha relative to the
> optimal frequency f0 (in terms of energy efficiency) required to execute
> some IO-bound workload.  This will cause the CPU to be busy for a
> fraction of the time it was busy originally, approximately t1 = t0 /
> alpha, which indeed divides the overall energy usage by a factor alpha,
> but at the same time multiplies the instantaneous power consumption
> while busy by a factor potentially much greater than alpha, since the
> CPU's power curve is largely non-linear, and in fact approximately
> convex within the frequency range allowed by the policy, so you get an
> average energy usage possibly much greater than the optimal.

Ah, but we don't (yet) have the (normalized) power curves, so we cannot
make that call.

Once we have the various energy/OPP numbers required for EAS we can
compute the optimal.  I think that was even mentioned in the thread I
referred to earlier.

Until such time, we boost to max for lack of a better option.

* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-12 19:33             ` Peter Zijlstra
@ 2018-04-12 19:55               ` Francisco Jerez
  2018-04-13 18:15                 ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Francisco Jerez @ 2018-04-12 19:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: intel-gfx, linux-pm, Rafael J. Wysocki, Eero Tamminen,
	Srinivas Pandruvada


Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Apr 12, 2018 at 11:34:54AM -0700, Francisco Jerez wrote:
>> The reason for the energy efficiency problem of iowait boosting is
>> precisely the greater oscillation between turbo and idle.  Say that
>> iowait boost increases the frequency by a factor alpha relative to the
>> optimal frequency f0 (in terms of energy efficiency) required to execute
>> some IO-bound workload.  This will cause the CPU to be busy for a
>> fraction of the time it was busy originally, approximately t1 = t0 /
>> alpha, which indeed divides the overall energy usage by a factor alpha,
>> but at the same time multiplies the instantaneous power consumption
>> while busy by a factor potentially much greater than alpha, since the
>> CPU's power curve is largely non-linear, and in fact approximately
>> convex within the frequency range allowed by the policy, so you get an
>> average energy usage possibly much greater than the optimal.
>
> Ah, but we don't (yet) have the (normalized) power curves, so we cannot
> make that call.
>
> Once we have the various energy/OPP numbers required for EAS we can
> compute the optimal.  I think that was even mentioned in the thread I
> referred to earlier.
>
> Until such time, we boost to max for lack of a better option.

Actually assuming that a single geometric feature of the power curve is
known -- it being convex in the frequency range allowed by the policy
(which is almost always the case, not only for Intel CPUs), the optimal
frequency for an IO-bound workload is fully independent of the exact
power curve -- It's just the minimum CPU frequency that's able to keep
the bottlenecking IO device at 100% utilization.  Any frequency higher
than that will lead to strictly lower energy efficiency whatever the
exact form of the power curve is.

I agree though that exact knowledge of the power curve *might* be useful
as a mechanism to estimate the potential costs of exceeding that optimal
frequency (e.g. as a mechanism to offset performance loss heuristically
for the case the workload fluctuates by giving the governor an upward
bias with an approximately known energy cost), but that's not required
for the governor's behavior to be approximately optimal in IO-bound
conditions.  Not making further assumptions about the power curve beyond
its convexity makes the algorithm fairly robust against any inaccuracy
in the power curve numbers (which there will always be, since the energy
efficiency of the workload is really dependent on the behavior of
multiple components of the system interacting with each other), and
makes it easily reusable on platforms where the exact power curves are
not known.
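
In pseudo-C the selection rule amounts to something like the sketch
below (a hand-wavy illustration with made-up names and a made-up fixed
setpoint, definitely not the actual controller from PATCH 6).  Note
that the power curve never enters the calculation:

/*
 * Sketch only: once the system is known to be IO-bound, drive the CPU
 * frequency towards the lowest value that still keeps the
 * bottlenecking device saturated, assuming that the CPU busy fraction
 * scales roughly inversely with frequency for a fixed amount of work
 * per unit of time.
 */
static unsigned int io_bound_target_freq(unsigned int cur_freq,
					 unsigned int min_freq,
					 unsigned int max_freq,
					 unsigned int cpu_busy_pct,
					 int io_device_saturated)
{
	/* Leave some headroom so the CPU itself never becomes the bottleneck. */
	const unsigned int setpoint_pct = 90;
	unsigned int target;

	if (!io_device_saturated)
		return max_freq;	/* possibly latency-bound: stay aggressive */

	/* Scale the frequency so the CPU busy fraction converges on the setpoint. */
	target = cur_freq * cpu_busy_pct / setpoint_pct;

	if (target < min_freq)
		target = min_freq;
	if (target > max_freq)
		target = max_freq;
	return target;
}

A real controller additionally needs some filtering and hysteresis on
top of this so it doesn't oscillate around the target, which is roughly
what the variable low-pass filtering in PATCH 6 provides.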


* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-11 17:35         ` Juri Lelli
@ 2018-04-12 21:38           ` Francisco Jerez
  0 siblings, 0 replies; 39+ messages in thread
From: Francisco Jerez @ 2018-04-12 21:38 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Javi Merino, linux-pm, Peter Zijlstra, intel-gfx,
	Rafael J. Wysocki, Morten Rasmussen, Patrick Bellasi,
	Srinivas Pandruvada, Eero Tamminen, Dietmar Eggemann


Juri Lelli <juri.lelli@gmail.com> writes:

> Hi,
>
> On 11/04/18 09:26, Francisco Jerez wrote:
>> Francisco Jerez <currojerez@riseup.net> writes:
>> 
>> > Hi Srinivas,
>> >
>> > Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:
>> >
>> >> On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote:
>> >>> Francisco Jerez <currojerez@riseup.net> writes:
>> >>> 
>> >> [...]
>> >>
>> >>
>> >>> For the case anyone is wondering what's going on, Srinivas pointed me
>> >>> at
>> >>> a larger idle power usage increase off-list, ultimately caused by the
>> >>> low-latency heuristic as discussed in the paragraph above.  I have a
>> >>> v2
>> >>> of PATCH 6 that gives the controller a third response curve roughly
>> >>> intermediate between the low-latency and low-power states of this
>> >>> revision, which avoids the energy usage increase while C0 residency
>> >>> is
>> >>> low (e.g. during idle) expected for v1.  The low-latency behavior of
>> >>> this revision is still going to be available based on a heuristic (in
>> >>> particular when a realtime-priority task is scheduled).  We're
>> >>> carrying
>> >>> out some additional testing, I'll post the code here eventually.
>> >>
>> >> Please try sched-util governor also. There is a frequency-invariant
>> >> patch, which I can send you (This eventually will be pushed by Peter).
>> >> We want to avoid complexity to intel-pstate for non HWP power sensitive
>> >> platforms as far as possible.
>> >>
>> >
>> > Unfortunately the schedutil governor (whether frequency invariant or
>> > not) has the exact same energy efficiency issues as the present
>> > intel_pstate non-HWP governor.  Its response is severely underdamped
>> > leading to energy-inefficient behavior for any oscillating non-CPU-bound
>> > workload.  To exacerbate that problem the frequency is maxed out on
>> > frequent IO waiting just like the current intel_pstate cpu-load
>> 
>> "just like" here is possibly somewhat unfair to the schedutil governor,
>> admittedly its progressive IOWAIT boosting behavior seems somewhat less
>> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
>> behavior, but it's still largely unhelpful on IO-bound conditions.
>
> Sorry if I jump in out of the blue, but what you are trying to solve
> looks very similar to what IPA [1] is targeting as well. I might be
> wrong (I'll try to spend more time reviewing your set), but my first
> impression is that we should try to solve similar problems with a more
> general approach that could benefit different sys/archs.
>

Thanks, that seems interesting -- I've also been taking a look at your
whitepaper and source code.  The problem we've both been trying to solve
is indeed closely related; there may be an opportunity for sharing
efforts both ways.

Correct me if I didn't understand the whole details about your power
allocation code, but IPA seems to be dividing up the available power
budget proportionally to the power requested by the different actors (up
to the point that causes some actor to reach its maximum power) and to
the configured weights.  From my understanding of the
get_requested_power implementations for cpufreq and devfreq, the
requested power attempts to approximate the current power usage of each
device (whether it's estimated from the current frequency and a
capacitance model, from the get_real_power callback, or some other
mechanism).  That estimate can be far from the optimal power consumption
in cases where the device's governor is programming a frequency that
wildly deviates from the optimal one (as is the case with the current
intel_pstate governor for any IO-bound workload, which incidentally will
suffer the greatest penalty from a suboptimal power allocation in cases
where the IO device is actually an integrated GPU).
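
To make sure I'm reading the allocation step right, the model I have in
mind is roughly the following sketch (greatly simplified, with made-up
field names -- it is not the actual IPA code, and it omits the
redistribution of whatever surplus is left over after clamping):

/* Simplified model of a proportional power split among n actors. */
struct actor {
	unsigned int req_mw;		/* requested power, ~current usage */
	unsigned int weight;		/* configured weight */
	unsigned int max_mw;		/* maximum sustainable power */
	unsigned int granted_mw;	/* output: granted power */
};

static void split_power_budget(struct actor *actors, int n,
			       unsigned int budget_mw)
{
	unsigned long long total = 0;
	int i;

	for (i = 0; i < n; i++)
		total += (unsigned long long)actors[i].req_mw * actors[i].weight;

	for (i = 0; i < n; i++) {
		unsigned long long grant = total ?
			(unsigned long long)budget_mw * actors[i].req_mw *
			actors[i].weight / total : budget_mw / n;

		if (grant > actors[i].max_mw)
			grant = actors[i].max_mw;
		actors[i].granted_mw = grant;
	}
}

Since req_mw tracks each device's current power draw rather than the
power it would need at its optimal operating point, an actor already
running at an inefficiently high frequency keeps receiving a
proportionally large share, which is what prompts the question below.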

Is there any mechanism in place to prevent the system from stabilizing
at a power allocation that prevents it from achieving maximum
throughput?  E.g. in a TDP-limited system with two devices consuming a
total power of Pmax = P0(f0) + P1(f1), with f0 much greater than the
optimal, and f1 capped at a frequency lower than the optimal due to TDP
or thermal constraints, and assuming that the system is bottlenecking at
the second device.  In such a scenario wouldn't IPA distribute power in
a way that roughly approximates the pre-existing suboptimal
distribution?

If that's the case, I understand that it's the responsibility of the
device's (or CPU's) frequency governor to request a reasonably
energy-efficient frequency in the first place for the balancer to
function correctly? (That's precisely the goal of this series.)  Doing
so additionally allows the system to use less power to get the same work
done in cases where the system is not thermally or TDP-limited as a
whole, where the balancing logic wouldn't have any effect at all.

> I'm Cc-ing some Arm folks...
>
> Best,
>
> - Juri
>
> [1] https://developer.arm.com/open-source/intelligent-power-allocation


* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-12 19:55               ` Francisco Jerez
@ 2018-04-13 18:15                 ` Peter Zijlstra
  2018-04-14  1:57                   ` Francisco Jerez
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2018-04-13 18:15 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: intel-gfx, linux-pm, Rafael J. Wysocki, Eero Tamminen,
	Srinivas Pandruvada

On Thu, Apr 12, 2018 at 12:55:39PM -0700, Francisco Jerez wrote:
> Actually assuming that a single geometric feature of the power curve is
> known -- it being convex in the frequency range allowed by the policy
> (which is almost always the case, not only for Intel CPUs), the optimal
> frequency for an IO-bound workload is fully independent of the exact
> power curve -- It's just the minimum CPU frequency that's able to keep
> the bottlenecking IO device at 100% utilization. 

I think that is difficult to determine with the information at hand. We
have lost all device information by the time we reach the scheduler.

* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-13 18:15                 ` Peter Zijlstra
@ 2018-04-14  1:57                   ` Francisco Jerez
  2018-04-14  9:49                     ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Francisco Jerez @ 2018-04-14  1:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: intel-gfx, linux-pm, Rafael J. Wysocki, Eero Tamminen,
	Srinivas Pandruvada


Peter Zijlstra <peterz@infradead.org> writes:

> On Thu, Apr 12, 2018 at 12:55:39PM -0700, Francisco Jerez wrote:
>> Actually assuming that a single geometric feature of the power curve is
>> known -- it being convex in the frequency range allowed by the policy
>> (which is almost always the case, not only for Intel CPUs), the optimal
>> frequency for an IO-bound workload is fully independent of the exact
>> power curve -- It's just the minimum CPU frequency that's able to keep
>> the bottlenecking IO device at 100% utilization. 
>
> I think that is difficult to determine with the information at hand. We
> have lost all device information by the time we reach the scheduler.

I assume you mean it's difficult to tell whether the workload is
CPU-bound or IO-bound?  Yeah, it's non-trivial to determine whether the
system is bottlenecking on IO, it requires additional infrastructure to
keep track of IO utilization (that's the purpose of PATCH 1), and even
then it involves some heuristic assumptions which are not guaranteed
fail-proof, so the controller needs to be prepared for things to behave
reasonably when the assumptions deviate from reality (see the comments
in PATCH 6 for more details on what happens in such cases) -- How
frequently that happens in practice is what determines how far the
controller's response will be from the optimally energy-efficient
behavior in a real workload.  It seems to work fairly well in practice,
at least in the sample of test-cases I've been able to gather data from
so far.

Anyway that's the difficult part.  Once (if) you know you're IO-bound,
determining the optimal (most energy-efficient) CPU frequency is
relatively straightforward, and doesn't require knowledge of the exact
power curve of the CPU (beyond clamping the controller response to the
convexity region of the power curve).
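
Just to give an idea of the kind of bookkeeping involved, the sketch
below shows one way such an aggregated IO active time statistic could
be maintained (illustrative pseudo-kernel code with hypothetical names,
no locking, and balanced begin/end calls assumed -- see PATCH 1 for the
actual interface):

#include <linux/kernel.h>
#include <linux/math64.h>
#include <linux/types.h>

/* Hypothetical example, not the interface introduced by PATCH 1. */
struct io_active_stat {
	u64 active_ns;		/* accumulated time with IO requests in flight */
	u64 last_begin_ns;	/* start of the currently open active interval */
	unsigned int nesting;	/* number of devices currently reporting activity */
};

/* Called by an IO device driver when it starts/stops having work queued. */
static void io_active_begin(struct io_active_stat *st, u64 now_ns)
{
	if (!st->nesting++)
		st->last_begin_ns = now_ns;
}

static void io_active_end(struct io_active_stat *st, u64 now_ns)
{
	if (!--st->nesting)
		st->active_ns += now_ns - st->last_begin_ns;
}

/*
 * Called by the governor once per sampling interval: returns the
 * percentage of the last interval_ns during which at least one IO
 * device reported activity, and resets the accumulator.
 */
static unsigned int io_active_pct(struct io_active_stat *st,
				  u64 interval_ns, u64 now_ns)
{
	u64 active = st->active_ns;

	if (st->nesting) {
		active += now_ns - st->last_begin_ns;
		st->last_begin_ns = now_ns;
	}
	st->active_ns = 0;

	return min_t(u64, div64_u64(100 * active, interval_ns), 100);
}

The governor can then compare that fraction against the CPU's own busy
fraction in order to guess whether the system is likely to be IO-bound
(IO nearly always active while the CPU still has idle headroom) and
switch response curves accordingly.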


* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-12  6:17         ` Srinivas Pandruvada
@ 2018-04-14  2:00           ` Francisco Jerez
  2018-04-14  4:01             ` Srinivas Pandruvada
  0 siblings, 1 reply; 39+ messages in thread
From: Francisco Jerez @ 2018-04-14  2:00 UTC (permalink / raw)
  To: Srinivas Pandruvada, linux-pm, intel-gfx
  Cc: Peter Zijlstra, Eero Tamminen, Rafael J. Wysocki


Hi Srinivas,

Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> writes:

> On Wed, 2018-04-11 at 09:26 -0700, Francisco Jerez wrote:
>> 
>> "just like" here is possibly somewhat unfair to the schedutil
>> governor,
>> admittedly its progressive IOWAIT boosting behavior seems somewhat
>> less
>> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
>> behavior, but it's still largely unhelpful on IO-bound conditions.
>> 
>
> OK, if you think so, then improve it for sched-util governor or other
> mechanisms (as Juri suggested) instead of intel-pstate.

You may not have realized it, but this series provides a full drop-in
replacement for the current non-HWP governor of the intel_pstate driver;
it should be strictly superior to the current cpu-load governor in terms
of energy usage and performance under most scenarios (hold on for v2 for
the idle consumption issue).  The main reason it's currently implemented
as a separate governor is to let us deploy it on BXT+ platforms only for
the moment, in order to decrease our initial validation effort and get
enough test coverage on BXT (which is incidentally the platform that's
going to get the greatest payoff) during a few release cycles.

Are you no longer interested in improving those aspects of the non-HWP
governor?  Is it that you're planning to delete it and move back to a
generic cpufreq governor for non-HWP platforms in the near future?

> This will benefit all architectures including x86 + non i915.
>

The current design encourages re-use of the IO utilization statistic
(see PATCH 1) by other governors as a mechanism driving the trade-off
between energy efficiency and responsiveness based on whether the system
is close to CPU-bound, in whatever way is applicable to each governor
(e.g. it would make sense for it to be hooked up to the EPP preference
knob in the case of the intel_pstate HWP governor, which would allow it
to achieve better energy efficiency in IO-bound situations just like
this series does for non-HWP parts).  There's nothing really x86- nor
i915-specific about it.
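
As a strawman example of that kind of reuse on an HWP part (hypothetical
code, nothing of the sort is implemented in this series), a governor
could simply bias the EPP value with the IO-bound fraction provided by
the statistic:

/*
 * Strawman sketch only.  HWP EPP is an 8-bit value where 0 requests
 * maximum performance and 255 maximum energy savings; the linear
 * mapping below is made up for the sake of the example, and the result
 * would be programmed into the EPP field of MSR_HWP_REQUEST by
 * whatever mechanism the HWP governor already uses.
 */
#define EPP_PERFORMANCE	0
#define EPP_POWERSAVE	255

static unsigned int epp_for_io_bound_pct(unsigned int io_bound_pct)
{
	/*
	 * The more IO-bound the system looks, the further the energy
	 * performance preference is biased towards efficiency; a
	 * CPU-bound system keeps the latency-oriented setting.
	 */
	return EPP_PERFORMANCE +
		(EPP_POWERSAVE - EPP_PERFORMANCE) * io_bound_pct / 100;
}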

> BTW intel-pstate can be driven by sched-util governor (passive mode),
> so if your prove benefits to Broxton, this can be a default.
> As before:
> - No regression to idle power at all. This is more important than
> benchmarks
> - Not just score, performance/watt is important
>

Is schedutil actually on par with the intel_pstate non-HWP governor as
of today, according to these metrics and the overall benchmark numbers?


* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-14  2:00           ` Francisco Jerez
@ 2018-04-14  4:01             ` Srinivas Pandruvada
  2018-04-16 14:04               ` Eero Tamminen
  0 siblings, 1 reply; 39+ messages in thread
From: Srinivas Pandruvada @ 2018-04-14  4:01 UTC (permalink / raw)
  To: Francisco Jerez, linux-pm, intel-gfx
  Cc: Peter Zijlstra, Eero Tamminen, Rafael J. Wysocki

Hi Francisco,

[...]

> Are you no longer interested in improving those aspects of the non-
> HWP
> governor?  Is it that you're planning to delete it and move back to a
> generic cpufreq governor for non-HWP platforms in the near future?

Yes, that is the plan for Atom platforms, which are the only non-HWP
platforms till now.  You have to show good gains in both performance and
performance/watt to justify carrying and maintaining such a big change,
so we have to see your performance and power numbers.

> 
> > This will benefit all architectures including x86 + non i915.
> > 
> 
> The current design encourages re-use of the IO utilization statistic
> (see PATCH 1) by other governors as a mechanism driving the trade-off
> between energy efficiency and responsiveness based on whether the
> system
> is close to CPU-bound, in whatever way is applicable to each governor
> (e.g. it would make sense for it to be hooked up to the EPP
> preference
> knob in the case of the intel_pstate HWP governor, which would allow
> it
> to achieve better energy efficiency in IO-bound situations just like
> this series does for non-HWP parts).  There's nothing really x86- nor
> i915-specific about it.
> 
> > BTW intel-pstate can be driven by sched-util governor (passive
> > mode),
> > so if your prove benefits to Broxton, this can be a default.
> > As before:
> > - No regression to idle power at all. This is more important than
> > benchmarks
> > - Not just score, performance/watt is important
> > 
> 
> Is schedutil actually on par with the intel_pstate non-HWP governor
> as
> of today, according to these metrics and the overall benchmark
> numbers?
Yes, except for a few cases.  I have not tested recently, so it may be
better now.

Thanks,
Srinivas



* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-14  1:57                   ` Francisco Jerez
@ 2018-04-14  9:49                     ` Peter Zijlstra
  0 siblings, 0 replies; 39+ messages in thread
From: Peter Zijlstra @ 2018-04-14  9:49 UTC (permalink / raw)
  To: Francisco Jerez
  Cc: intel-gfx, linux-pm, Rafael J. Wysocki, Eero Tamminen,
	Srinivas Pandruvada

On Fri, Apr 13, 2018 at 06:57:39PM -0700, Francisco Jerez wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Thu, Apr 12, 2018 at 12:55:39PM -0700, Francisco Jerez wrote:
> >> Actually assuming that a single geometric feature of the power curve is
> >> known -- it being convex in the frequency range allowed by the policy
> >> (which is almost always the case, not only for Intel CPUs), the optimal
> >> frequency for an IO-bound workload is fully independent of the exact
> >> power curve -- It's just the minimum CPU frequency that's able to keep
> >> the bottlenecking IO device at 100% utilization. 
> >
> > I think that is difficult to determine with the information at hand. We
> > have lost all device information by the time we reach the scheduler.
> 
> I assume you mean it's difficult to tell whether the workload is
> CPU-bound or IO-bound?  Yeah, it's non-trivial to determine whether the
> system is bottlenecking on IO, it requires additional infrastructure to
> keep track of IO utilization (that's the purpose of PATCH 1), and even

Note that I've not actually seen any of your patches; I only got Cc'ed
later on.


* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-14  4:01             ` Srinivas Pandruvada
@ 2018-04-16 14:04               ` Eero Tamminen
  2018-04-16 17:27                 ` Srinivas Pandruvada
  0 siblings, 1 reply; 39+ messages in thread
From: Eero Tamminen @ 2018-04-16 14:04 UTC (permalink / raw)
  To: Srinivas Pandruvada, Francisco Jerez, linux-pm, intel-gfx
  Cc: Peter Zijlstra, Rafael J. Wysocki

Hi,

On 14.04.2018 07:01, Srinivas Pandruvada wrote:
> Hi Francisco,
> 
> [...]
> 
>> Are you no longer interested in improving those aspects of the non-
>> HWP
>> governor?  Is it that you're planning to delete it and move back to a
>> generic cpufreq governor for non-HWP platforms in the near future?
> 
> Yes, that is the plan for Atom platforms, which are the only non-HWP
> platforms till now.  You have to show good gains in both performance and
> performance/watt to justify carrying and maintaining such a big change,
> so we have to see your performance and power numbers.

For the active cases, you can look at the links at the beginning /
bottom of this mail thread.  Francisco provided performance results
for >100 benchmarks.


On this side of the Atlantic, we've been testing different versions of
the patchset over the past few months with >50 Linux 3D benchmarks on 6
different platforms.

On Geminilake and a few BXT configurations (where 3D benchmarks are TDP
limited), performance of many tests improves by 5-15%, including complex
ones.  More importantly, there were no regressions.

(You can see details + links to more info in Jira ticket VIZ-12078.)

*In the (fully) TDP-limited cases power usage obviously stays the same,
so the performance/watt improvement can be derived directly from the
measured performance improvement.*


We also have data for earlier platforms from slightly older versions of
the patchset, but on those the patchset didn't have any significant
impact on performance.

I think the main reason for this is that the BYT & BSW NUCs we have
only have space for a single memory module.  Without a dual-channel
memory configuration, the benchmarks are too memory-bottlenecked to
utilize the GPU enough to make things TDP limited on those platforms.

However, now that I look at the old BYT & BSW data (for the few
benchmarks which improved most on BXT & GLK), I see that there's a
reduction in CPU power usage according to RAPL, at least on BSW.


	- Eero


>>> This will benefit all architectures including x86 + non i915.
>>>
>>
>> The current design encourages re-use of the IO utilization statistic
>> (see PATCH 1) by other governors as a mechanism driving the trade-off
>> between energy efficiency and responsiveness based on whether the
>> system
>> is close to CPU-bound, in whatever way is applicable to each governor
>> (e.g. it would make sense for it to be hooked up to the EPP
>> preference
>> knob in the case of the intel_pstate HWP governor, which would allow
>> it
>> to achieve better energy efficiency in IO-bound situations just like
>> this series does for non-HWP parts).  There's nothing really x86- nor
>> i915-specific about it.
>>
>>> BTW intel-pstate can be driven by sched-util governor (passive
>>> mode),
>>> so if your prove benefits to Broxton, this can be a default.
>>> As before:
>>> - No regression to idle power at all. This is more important than
>>> benchmarks
>>> - Not just score, performance/watt is important
>>>
>>
>> Is schedutil actually on par with the intel_pstate non-HWP governor
>> as
>> of today, according to these metrics and the overall benchmark
>> numbers?
> Yes, except for a few cases.  I have not tested recently, so it may be
> better now.
> 
> Thanks,
> Srinivas
> 
> 
>>> Thanks,
>>> Srinivas
>>>
>>>
>>>>> controller does, even though the frequent IO waits may actually
>>>>> be
>>>>> an
>>>>> indication that the system is IO-bound (which means that the
>>>>> large
>>>>> energy usage increase may not be translated in any performance
>>>>> benefit
>>>>> in practice, not to speak of performance being impacted
>>>>> negatively
>>>>> in
>>>>> TDP-bound scenarios like GPU rendering).
>>>>>
>>>>> Regarding run-time complexity, I haven't observed this governor
>>>>> to
>>>>> be
>>>>> measurably more computationally intensive than the present
>>>>> one.  It's a
>>>>> bunch more instructions indeed, but still within the same
>>>>> ballpark
>>>>> as
>>>>> the current governor.  The average increase in CPU utilization
>>>>> on
>>>>> my BXT
>>>>> with this series is less than 0.03% (sampled via ftrace for v1,
>>>>> I
>>>>> can
>>>>> repeat the measurement for the v2 I have in the works, though I
>>>>> don't
>>>>> expect the result to be substantially different).  If this is a
>>>>> problem
>>>>> for you there are several optimization opportunities that would
>>>>> cut
>>>>> down
>>>>> the number of CPU cycles get_target_pstate_lp() takes to
>>>>> execute by
>>>>> a
>>>>> large percent (most of the optimization ideas I can think of
>>>>> right
>>>>> now
>>>>> though would come at some
>>>>> accuracy/maintainability/debuggability
>>>>> cost,
>>>>> but may still be worth pursuing), but the computational
>>>>> overhead is
>>>>> low
>>>>> enough at this point that the impact on any benchmark or real
>>>>> workload
>>>>> would be orders of magnitude lower than its variance, which
>>>>> makes
>>>>> it
>>>>> kind of difficult to keep the discussion data-driven [as
>>>>> possibly
>>>>> any
>>>>> performance optimization discussion should ever be ;)].
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Srinivas
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>> [Absolute benchmark results are unfortunately omitted
>>>>>>>> from
>>>>>>>> this
>>>>>>>> letter
>>>>>>>> due to company policies, but the percent change and
>>>>>>>> Student's
>>>>>>>> T
>>>>>>>> p-value are included above and in the referenced
>>>>>>>> benchmark
>>>>>>>> results]
>>>>>>>>
>>>>>>>> The most obvious impact of this series will likely be the
>>>>>>>> overall
>>>>>>>> improvement in graphics performance on systems with an
>>>>>>>> IGP
>>>>>>>> integrated
>>>>>>>> into the processor package (though for the moment this is
>>>>>>>> only
>>>>>>>> enabled
>>>>>>>> on BXT+), because the TDP budget shared among CPU and GPU
>>>>>>>> can
>>>>>>>> frequently become a limiting factor in low-power
>>>>>>>> devices.  On
>>>>>>>> heavily
>>>>>>>> TDP-bound devices this series improves performance of
>>>>>>>> virtually any
>>>>>>>> non-trivial graphics rendering by a significant amount
>>>>>>>> (of
>>>>>>>> the
>>>>>>>> order
>>>>>>>> of the energy efficiency improvement for that workload
>>>>>>>> assuming the
>>>>>>>> optimization didn't cause it to become non-TDP-bound).
>>>>>>>>
>>>>>>>> See [1]-[5] for detailed numbers including various
>>>>>>>> graphics
>>>>>>>> benchmarks
>>>>>>>> and a sample of the Phoronix daily-system-tracker.  Some
>>>>>>>> popular
>>>>>>>> graphics benchmarks like GfxBench gl_manhattan31 and gl_4
>>>>>>>> improve
>>>>>>>> between 5% and 11% on our systems.  The exact improvement
>>>>>>>> can
>>>>>>>> vary
>>>>>>>> substantially between systems (compare the benchmark
>>>>>>>> results
>>>>>>>> from
>>>>>>>> the
>>>>>>>> two different J3455 systems [1] and [3]) due to a number
>>>>>>>> of
>>>>>>>> factors,
>>>>>>>> including the ratio between CPU and GPU processing power,
>>>>>>>> the
>>>>>>>> behavior
>>>>>>>> of the userspace graphics driver, the windowing system
>>>>>>>> and
>>>>>>>> resolution,
>>>>>>>> the BIOS (which has an influence on the package TDP), the
>>>>>>>> thermal
>>>>>>>> characteristics of the system, etc.
>>>>>>>>
>>>>>>>> Unigine Valley and Heaven improve by a similar factor on
>>>>>>>> some
>>>>>>>> systems
>>>>>>>> (see the J3455 results [1]), but on others the
>>>>>>>> improvement is
>>>>>>>> lower
>>>>>>>> because the benchmark fails to fully utilize the GPU,
>>>>>>>> which
>>>>>>>> causes
>>>>>>>> the
>>>>>>>> heuristic to remain in low-latency state for longer,
>>>>>>>> which
>>>>>>>> leaves a
>>>>>>>> reduced TDP budget available to the GPU, which prevents
>>>>>>>> performance
>>>>>>>> from increasing further.  This can be avoided by using
>>>>>>>> the
>>>>>>>> alternative
>>>>>>>> heuristic parameters suggested in the commit message of
>>>>>>>> PATCH
>>>>>>>> 8,
>>>>>>>> which
>>>>>>>> provide a lower IO utilization threshold and hysteresis
>>>>>>>> for
>>>>>>>> the
>>>>>>>> controller to attempt to save energy.  I'm not proposing
>>>>>>>> those for
>>>>>>>> upstream (yet) because they would also increase the risk
>>>>>>>> for
>>>>>>>> latency-sensitive IO-heavy workloads to regress (like
>>>>>>>> SynMark2
>>>>>>>> OglTerrainFly* and some arguably poorly designed IPC-
>>>>>>>> bound
>>>>>>>> X11
>>>>>>>> benchmarks).
>>>>>>>>
>>>>>>>> Discrete graphics aren't likely to experience that much
>>>>>>>> of a
>>>>>>>> visible
>>>>>>>> improvement from this, even though many non-IGP workloads
>>>>>>>> *could*
>>>>>>>> benefit by reducing the system's energy usage while the
>>>>>>>> discrete
>>>>>>>> GPU
>>>>>>>> (or really, any other IO device) becomes a bottleneck,
>>>>>>>> but
>>>>>>>> this is
>>>>>>>> not
>>>>>>>> attempted in this series, since that would involve making
>>>>>>>> an
>>>>>>>> energy
>>>>>>>> efficiency/latency trade-off that only the maintainers of
>>>>>>>> the
>>>>>>>> respective drivers are in a position to make.  The
>>>>>>>> cpufreq
>>>>>>>> interface
>>>>>>>> introduced in PATCH 1 to achieve this is left as an opt-
>>>>>>>> in
>>>>>>>> for that
>>>>>>>> reason, only the i915 DRM driver is hooked up since it
>>>>>>>> will
>>>>>>>> get the
>>>>>>>> most direct pay-off due to the increased energy budget
>>>>>>>> available to
>>>>>>>> the GPU, but other power-hungry third-party gadgets built
>>>>>>>> into the
>>>>>>>> same package (*cough* AMD *cough* Mali *cough* PowerVR
>>>>>>>> *cough*) may
>>>>>>>> be
>>>>>>>> able to benefit from this interface eventually by
>>>>>>>> instrumenting the
>>>>>>>> driver in a similar way.
>>>>>>>>
>>>>>>>> The cpufreq interface is not exclusively tied to the
>>>>>>>> intel_pstate
>>>>>>>> driver, because other governors can make use of the
>>>>>>>> statistic
>>>>>>>> calculated as result to avoid over-optimizing for latency
>>>>>>>> in
>>>>>>>> scenarios
>>>>>>>> where a lower frequency would be able to achieve similar
>>>>>>>> throughput
>>>>>>>> while using less energy.  The interpretation of this
>>>>>>>> statistic
>>>>>>>> relies
>>>>>>>> on the observation that for as long as the system is CPU-
>>>>>>>> bound, any
>>>>>>>> IO
>>>>>>>> load occurring as a result of the execution of a program
>>>>>>>> will
>>>>>>>> scale
>>>>>>>> roughly linearly with the clock frequency the program is
>>>>>>>> run
>>>>>>>> at, so
>>>>>>>> (assuming that the CPU has enough processing power) a
>>>>>>>> point
>>>>>>>> will be
>>>>>>>> reached at which the program won't be able to execute
>>>>>>>> faster
>>>>>>>> with
>>>>>>>> increasing CPU frequency because the throughput limits of
>>>>>>>> some
>>>>>>>> device
>>>>>>>> will have been attained.  Increasing frequencies past
>>>>>>>> that
>>>>>>>> point
>>>>>>>> only
>>>>>>>> pessimizes energy usage for no real benefit -- The
>>>>>>>> optimal
>>>>>>>> behavior
>>>>>>>> is
>>>>>>>> for the CPU to lock to the minimum frequency that is able
>>>>>>>> to
>>>>>>>> keep
>>>>>>>> the
>>>>>>>> IO devices involved fully utilized (assuming we are past
>>>>>>>> the
>>>>>>>> maximum-efficiency inflection point of the CPU's power-
>>>>>>>> to-
>>>>>>>> frequency
>>>>>>>> curve), which is roughly the goal of this series.
>>>>>>>>
>>>>>>>> PELT could be a useful extension for this model since its
>>>>>>>> largely
>>>>>>>> heuristic assumptions would become more accurate if the
>>>>>>>> IO
>>>>>>>> and CPU
>>>>>>>> load could be tracked separately for each scheduling
>>>>>>>> entity,
>>>>>>>> but
>>>>>>>> this
>>>>>>>> is not attempted in this series because the additional
>>>>>>>> complexity
>>>>>>>> and
>>>>>>>> computational cost of such an approach is hard to justify
>>>>>>>> at
>>>>>>>> this
>>>>>>>> stage, particularly since the current governor has
>>>>>>>> similar
>>>>>>>> limitations.
>>>>>>>>
>>>>>>>> Various frequency and step-function response graphs are
>>>>>>>> available
>>>>>>>> in
>>>>>>>> [6]-[9] for comparison (obtained empirically on a BXT
>>>>>>>> J3455
>>>>>>>> system).
>>>>>>>> The response curves for the low-latency and low-power
>>>>>>>> states
>>>>>>>> of the
>>>>>>>> heuristic are shown separately -- As you can see they
>>>>>>>> roughly
>>>>>>>> bracket
>>>>>>>> the frequency response curve of the current
>>>>>>>> governor.  The
>>>>>>>> step
>>>>>>>> response of the aggressive heuristic is within a single
>>>>>>>> update
>>>>>>>> period
>>>>>>>> (even though it's not quite obvious from the graph with
>>>>>>>> the
>>>>>>>> levels
>>>>>>>> of
>>>>>>>> zoom provided).  I'll attach benchmark results from a
>>>>>>>> slower
>>>>>>>> but
>>>>>>>> non-TDP-limited machine (which means there will be no TDP
>>>>>>>> budget
>>>>>>>> increase that could possibly mask a performance
>>>>>>>> regression of
>>>>>>>> other
>>>>>>>> kind) as soon as they come out.
>>>>>>>>
>>>>>>>> Thanks to Eero and Valtteri for testing a number of
>>>>>>>> intermediate
>>>>>>>> revisions of this series (and there were quite a few of
>>>>>>>> them)
>>>>>>>> in
>>>>>>>> more
>>>>>>>> than half a dozen systems, they helped spot quite a few
>>>>>>>> issues of
>>>>>>>> earlier versions of this heuristic.
>>>>>>>>
>>>>>>>> [PATCH 1/9] cpufreq: Implement infrastructure keeping
>>>>>>>> track
>>>>>>>> of
>>>>>>>> aggregated IO active time.
>>>>>>>> [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace
>>>>>>>> bxt_funcs
>>>>>>>> with
>>>>>>>> core_funcs"
>>>>>>>> [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a
>>>>>>>> couple
>>>>>>>> of long
>>>>>>>> names"
>>>>>>>> [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify
>>>>>>>> intel_pstate_adjust_pstate()"
>>>>>>>> [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop
>>>>>>>> ->update_util
>>>>>>>> from
>>>>>>>> pstate_funcs"
>>>>>>>> [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-
>>>>>>>> pass
>>>>>>>> filtering controller for small core.
>>>>>>>> [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP
>>>>>>>> controller
>>>>>>>> based on ACPI FADT profile.
>>>>>>>> [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP
>>>>>>>> controller
>>>>>>>> parameters via debugfs.
>>>>>>>> [PATCH 9/9] drm/i915/execlists: Report GPU rendering as
>>>>>>>> IO
>>>>>>>> activity
>>>>>>>> to cpufreq.
>>>>>>>>
>>>>>>>> [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
>>>>>>>> [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
>>>>>>>> [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
>>>>>>>> [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
>>>>>>>> [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
>>>>>>>> [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
>>>>>>>> [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
>>>>>>>> [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
>>>>>>>> [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg
>>>>>

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-16 14:04               ` Eero Tamminen
@ 2018-04-16 17:27                 ` Srinivas Pandruvada
  0 siblings, 0 replies; 39+ messages in thread
From: Srinivas Pandruvada @ 2018-04-16 17:27 UTC (permalink / raw)
  To: Eero Tamminen, Francisco Jerez, linux-pm, intel-gfx
  Cc: Peter Zijlstra, Rafael J. Wysocki

On Mon, 2018-04-16 at 17:04 +0300, Eero Tamminen wrote:
> Hi,
> 
> On 14.04.2018 07:01, Srinivas Pandruvada wrote:
> > Hi Francisco,
> > 
> > [...]
> > 
> > > Are you no longer interested in improving those aspects of the
> > > non-
> > > HWP
> > > governor?  Is it that you're planning to delete it and move back
> > > to a
> > > generic cpufreq governor for non-HWP platforms in the near
> > > future?
> > 
> > Yes, that is the plan for Atom platforms, which are the only non-HWP
> > platforms so far.  You have to show a good gain in performance and
> > performance/watt to justify carrying and maintaining such a big
> > change, so we have to see your performance and power numbers.
> 
> For the active cases, you can look at the links at the beginning and
> bottom of this mail thread.  Francisco provided performance results
> for more than 100 benchmarks.
Looks like you didn't test the idle cases, which are more important.
Systems will tend to be idle most of the time, and idle power increased
by +50% with the patches.  Once you fix the idle regression, you have to
retest, and then the results will be interesting.

Once you fix this, then it is a pure algorithm question; whether it is
done in intel_pstate or in the schedutil governor doesn't make a big
difference.  It is better to do it in schedutil, as that will benefit
all architectures and will get better test coverage and maintenance.
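
If it moves into schedutil, the aggregated IO-active-time statistic
could in principle be folded into the usual schedutil-style frequency
formula.  A rough sketch follows; the io_busy_frac parameter and the
0..1024 fixed-point convention are hypothetical, not an existing kernel
interface.

/*
 * Hypothetical sketch: feeding an IO-utilization fraction into a
 * schedutil-style frequency selection.  Not actual schedutil or PATCH 1
 * code.
 */
static unsigned int pick_freq(unsigned int max_freq,
			      unsigned long cpu_util, unsigned long cpu_max,
			      unsigned int io_busy_frac)
{
	/* Usual schedutil-style mapping: freq ~= 1.25 * max_freq * util / max. */
	unsigned long freq = (max_freq + (max_freq >> 2)) * cpu_util / cpu_max;

	/*
	 * If some IO device is close to saturation, additional CPU frequency
	 * cannot increase throughput, so drop the 25% latency headroom and
	 * let the frequency track the measured utilization instead.
	 */
	if (io_busy_frac >= 980)	/* ~95% of 1024 */
		freq = (unsigned long)max_freq * cpu_util / cpu_max;

	return (unsigned int)freq;
}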

Thanks,
Srinivas



> 
> 
> On this side of the Atlantic, we've been testing different versions
> of the patchset over the past few months with more than 50 Linux 3D
> benchmarks on 6 different platforms.
> 
> On Geminilake and a few BXT configurations (where 3D benchmarks are
> TDP limited), many tests' performance improves by 5-15%, including the
> more complex ones.  More importantly, there were no regressions.
> 
> (You can see details + links to more info in Jira ticket VIZ-12078.)
> 
> *In the (fully) TDP-limited cases power usage obviously stays the
> same, so the performance/watt improvement can be derived directly from
> the measured performance improvement.*
> 
> 
> We also have data for earlier platforms from slightly older versions
> of the patchset, but on those the patchset didn't have any significant
> impact on performance.
> 
> I think the main reason for this is that the BYT & BSW NUCs we have
> only have space for a single memory module.  Without a dual-channel
> memory configuration, the benchmarks are too memory-bottlenecked to
> utilize the GPU enough to make things TDP limited on those platforms.
> 
> However, now that I look at the old BYT & BSW data (for the few
> benchmarks which improved most on BXT & GLK), I see that there's a
> reduction in CPU power usage according to RAPL, at least on BSW.
> 
> 
> 	- Eero
> 
> 
> > > > This will benefit all architectures including x86 + non i915.
> > > > 
> > > 
> > > The current design encourages re-use of the IO utilization
> > > statistic
> > > (see PATCH 1) by other governors as a mechanism driving the
> > > trade-off
> > > between energy efficiency and responsiveness based on whether the
> > > system
> > > is close to CPU-bound, in whatever way is applicable to each
> > > governor
> > > (e.g. it would make sense for it to be hooked up to the EPP
> > > preference
> > > knob in the case of the intel_pstate HWP governor, which would
> > > allow
> > > it
> > > to achieve better energy efficiency in IO-bound situations just
> > > like
> > > this series does for non-HWP parts).  There's nothing really x86-
> > > nor
> > > i915-specific about it.
> > > 
> > > > BTW intel-pstate can be driven by sched-util governor (passive
> > > > mode),
> > > > so if your prove benefits to Broxton, this can be a default.
> > > > As before:
> > > > - No regression to idle power at all. This is more important
> > > > than
> > > > benchmarks
> > > > - Not just score, performance/watt is important
> > > > 
> > > 
> > > Is schedutil actually on par with the intel_pstate non-HWP
> > > governor
> > > as
> > > of today, according to these metrics and the overall benchmark
> > > numbers?
> > 
> > Yes, except for a few cases.  I have not tested recently, so it may
> > be better now.
> > 
> > Thanks,
> > Srinivas
> > 
> > 
> > > > Thanks,
> > > > Srinivas
> > > > 
> > > > 
> > > > > > controller does, even though the frequent IO waits may
> > > > > > actually
> > > > > > be
> > > > > > an
> > > > > > indication that the system is IO-bound (which means that
> > > > > > the
> > > > > > large
> > > > > > energy usage increase may not be translated in any
> > > > > > performance
> > > > > > benefit
> > > > > > in practice, not to speak of performance being impacted
> > > > > > negatively
> > > > > > in
> > > > > > TDP-bound scenarios like GPU rendering).
> > > > > > 
> > > > > > Regarding run-time complexity, I haven't observed this
> > > > > > governor
> > > > > > to
> > > > > > be
> > > > > > measurably more computationally intensive than the present
> > > > > > one.  It's a
> > > > > > bunch more instructions indeed, but still within the same
> > > > > > ballpark
> > > > > > as
> > > > > > the current governor.  The average increase in CPU
> > > > > > utilization
> > > > > > on
> > > > > > my BXT
> > > > > > with this series is less than 0.03% (sampled via ftrace for
> > > > > > v1,
> > > > > > I
> > > > > > can
> > > > > > repeat the measurement for the v2 I have in the works,
> > > > > > though I
> > > > > > don't
> > > > > > expect the result to be substantially different).  If this
> > > > > > is a
> > > > > > problem
> > > > > > for you there are several optimization opportunities that
> > > > > > would
> > > > > > cut
> > > > > > down
> > > > > > the number of CPU cycles get_target_pstate_lp() takes to
> > > > > > execute by
> > > > > > a
> > > > > > large percent (most of the optimization ideas I can think
> > > > > > of
> > > > > > right
> > > > > > now
> > > > > > though would come at some
> > > > > > accuracy/maintainability/debuggability
> > > > > > cost,
> > > > > > but may still be worth pursuing), but the computational
> > > > > > overhead is
> > > > > > low
> > > > > > enough at this point that the impact on any benchmark or
> > > > > > real
> > > > > > workload
> > > > > > would be orders of magnitude lower than its variance, which
> > > > > > makes
> > > > > > it
> > > > > > kind of difficult to keep the discussion data-driven [as
> > > > > > possibly
> > > > > > any
> > > > > > performance optimization discussion should ever be ;)].
> > > > > > 
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Srinivas
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > [Absolute benchmark results are unfortunately omitted
> > > > > > > > > from
> > > > > > > > > this
> > > > > > > > > letter
> > > > > > > > > due to company policies, but the percent change and
> > > > > > > > > Student's
> > > > > > > > > T
> > > > > > > > > p-value are included above and in the referenced
> > > > > > > > > benchmark
> > > > > > > > > results]
> > > > > > > > > 
> > > > > > > > > The most obvious impact of this series will likely be
> > > > > > > > > the
> > > > > > > > > overall
> > > > > > > > > improvement in graphics performance on systems with
> > > > > > > > > an
> > > > > > > > > IGP
> > > > > > > > > integrated
> > > > > > > > > into the processor package (though for the moment
> > > > > > > > > this is
> > > > > > > > > only
> > > > > > > > > enabled
> > > > > > > > > on BXT+), because the TDP budget shared among CPU and
> > > > > > > > > GPU
> > > > > > > > > can
> > > > > > > > > frequently become a limiting factor in low-power
> > > > > > > > > devices.  On
> > > > > > > > > heavily
> > > > > > > > > TDP-bound devices this series improves performance of
> > > > > > > > > virtually any
> > > > > > > > > non-trivial graphics rendering by a significant
> > > > > > > > > amount
> > > > > > > > > (of
> > > > > > > > > the
> > > > > > > > > order
> > > > > > > > > of the energy efficiency improvement for that
> > > > > > > > > workload
> > > > > > > > > assuming the
> > > > > > > > > optimization didn't cause it to become non-TDP-
> > > > > > > > > bound).
> > > > > > > > > 
> > > > > > > > > See [1]-[5] for detailed numbers including various
> > > > > > > > > graphics
> > > > > > > > > benchmarks
> > > > > > > > > and a sample of the Phoronix daily-system-
> > > > > > > > > tracker.  Some
> > > > > > > > > popular
> > > > > > > > > graphics benchmarks like GfxBench gl_manhattan31 and
> > > > > > > > > gl_4
> > > > > > > > > improve
> > > > > > > > > between 5% and 11% on our systems.  The exact
> > > > > > > > > improvement
> > > > > > > > > can
> > > > > > > > > vary
> > > > > > > > > substantially between systems (compare the benchmark
> > > > > > > > > results
> > > > > > > > > from
> > > > > > > > > the
> > > > > > > > > two different J3455 systems [1] and [3]) due to a
> > > > > > > > > number
> > > > > > > > > of
> > > > > > > > > factors,
> > > > > > > > > including the ratio between CPU and GPU processing
> > > > > > > > > power,
> > > > > > > > > the
> > > > > > > > > behavior
> > > > > > > > > of the userspace graphics driver, the windowing
> > > > > > > > > system
> > > > > > > > > and
> > > > > > > > > resolution,
> > > > > > > > > the BIOS (which has an influence on the package TDP),
> > > > > > > > > the
> > > > > > > > > thermal
> > > > > > > > > characteristics of the system, etc.
> > > > > > > > > 
> > > > > > > > > Unigine Valley and Heaven improve by a similar factor
> > > > > > > > > on
> > > > > > > > > some
> > > > > > > > > systems
> > > > > > > > > (see the J3455 results [1]), but on others the
> > > > > > > > > improvement is
> > > > > > > > > lower
> > > > > > > > > because the benchmark fails to fully utilize the GPU,
> > > > > > > > > which
> > > > > > > > > causes
> > > > > > > > > the
> > > > > > > > > heuristic to remain in low-latency state for longer,
> > > > > > > > > which
> > > > > > > > > leaves a
> > > > > > > > > reduced TDP budget available to the GPU, which
> > > > > > > > > prevents
> > > > > > > > > performance
> > > > > > > > > from increasing further.  This can be avoided by
> > > > > > > > > using
> > > > > > > > > the
> > > > > > > > > alternative
> > > > > > > > > heuristic parameters suggested in the commit message
> > > > > > > > > of
> > > > > > > > > PATCH
> > > > > > > > > 8,
> > > > > > > > > which
> > > > > > > > > provide a lower IO utilization threshold and
> > > > > > > > > hysteresis
> > > > > > > > > for
> > > > > > > > > the
> > > > > > > > > controller to attempt to save energy.  I'm not
> > > > > > > > > proposing
> > > > > > > > > those for
> > > > > > > > > upstream (yet) because they would also increase the
> > > > > > > > > risk
> > > > > > > > > for
> > > > > > > > > latency-sensitive IO-heavy workloads to regress (like
> > > > > > > > > SynMark2
> > > > > > > > > OglTerrainFly* and some arguably poorly designed IPC-
> > > > > > > > > bound
> > > > > > > > > X11
> > > > > > > > > benchmarks).
> > > > > > > > > 
> > > > > > > > > Discrete graphics aren't likely to experience that
> > > > > > > > > much
> > > > > > > > > of a
> > > > > > > > > visible
> > > > > > > > > improvement from this, even though many non-IGP
> > > > > > > > > workloads
> > > > > > > > > *could*
> > > > > > > > > benefit by reducing the system's energy usage while
> > > > > > > > > the
> > > > > > > > > discrete
> > > > > > > > > GPU
> > > > > > > > > (or really, any other IO device) becomes a
> > > > > > > > > bottleneck,
> > > > > > > > > but
> > > > > > > > > this is
> > > > > > > > > not
> > > > > > > > > attempted in this series, since that would involve
> > > > > > > > > making
> > > > > > > > > an
> > > > > > > > > energy
> > > > > > > > > efficiency/latency trade-off that only the
> > > > > > > > > maintainers of
> > > > > > > > > the
> > > > > > > > > respective drivers are in a position to make.  The
> > > > > > > > > cpufreq
> > > > > > > > > interface
> > > > > > > > > introduced in PATCH 1 to achieve this is left as an
> > > > > > > > > opt-
> > > > > > > > > in
> > > > > > > > > for that
> > > > > > > > > reason, only the i915 DRM driver is hooked up since
> > > > > > > > > it
> > > > > > > > > will
> > > > > > > > > get the
> > > > > > > > > most direct pay-off due to the increased energy
> > > > > > > > > budget
> > > > > > > > > available to
> > > > > > > > > the GPU, but other power-hungry third-party gadgets
> > > > > > > > > built
> > > > > > > > > into the
> > > > > > > > > same package (*cough* AMD *cough* Mali *cough*
> > > > > > > > > PowerVR
> > > > > > > > > *cough*) may
> > > > > > > > > be
> > > > > > > > > able to benefit from this interface eventually by
> > > > > > > > > instrumenting the
> > > > > > > > > driver in a similar way.
> > > > > > > > > 
> > > > > > > > > The cpufreq interface is not exclusively tied to the
> > > > > > > > > intel_pstate
> > > > > > > > > driver, because other governors can make use of the
> > > > > > > > > statistic
> > > > > > > > > calculated as result to avoid over-optimizing for
> > > > > > > > > latency
> > > > > > > > > in
> > > > > > > > > scenarios
> > > > > > > > > where a lower frequency would be able to achieve
> > > > > > > > > similar
> > > > > > > > > throughput
> > > > > > > > > while using less energy.  The interpretation of this
> > > > > > > > > statistic
> > > > > > > > > relies
> > > > > > > > > on the observation that for as long as the system is
> > > > > > > > > CPU-
> > > > > > > > > bound, any
> > > > > > > > > IO
> > > > > > > > > load occurring as a result of the execution of a
> > > > > > > > > program
> > > > > > > > > will
> > > > > > > > > scale
> > > > > > > > > roughly linearly with the clock frequency the program
> > > > > > > > > is
> > > > > > > > > run
> > > > > > > > > at, so
> > > > > > > > > (assuming that the CPU has enough processing power) a
> > > > > > > > > point
> > > > > > > > > will be
> > > > > > > > > reached at which the program won't be able to execute
> > > > > > > > > faster
> > > > > > > > > with
> > > > > > > > > increasing CPU frequency because the throughput
> > > > > > > > > limits of
> > > > > > > > > some
> > > > > > > > > device
> > > > > > > > > will have been attained.  Increasing frequencies past
> > > > > > > > > that
> > > > > > > > > point
> > > > > > > > > only
> > > > > > > > > pessimizes energy usage for no real benefit -- The
> > > > > > > > > optimal
> > > > > > > > > behavior
> > > > > > > > > is
> > > > > > > > > for the CPU to lock to the minimum frequency that is
> > > > > > > > > able
> > > > > > > > > to
> > > > > > > > > keep
> > > > > > > > > the
> > > > > > > > > IO devices involved fully utilized (assuming we are
> > > > > > > > > past
> > > > > > > > > the
> > > > > > > > > maximum-efficiency inflection point of the CPU's
> > > > > > > > > power-
> > > > > > > > > to-
> > > > > > > > > frequency
> > > > > > > > > curve), which is roughly the goal of this series.
> > > > > > > > > 
> > > > > > > > > PELT could be a useful extension for this model since
> > > > > > > > > its
> > > > > > > > > largely
> > > > > > > > > heuristic assumptions would become more accurate if
> > > > > > > > > the
> > > > > > > > > IO
> > > > > > > > > and CPU
> > > > > > > > > load could be tracked separately for each scheduling
> > > > > > > > > entity,
> > > > > > > > > but
> > > > > > > > > this
> > > > > > > > > is not attempted in this series because the
> > > > > > > > > additional
> > > > > > > > > complexity
> > > > > > > > > and
> > > > > > > > > computational cost of such an approach is hard to
> > > > > > > > > justify
> > > > > > > > > at
> > > > > > > > > this
> > > > > > > > > stage, particularly since the current governor has
> > > > > > > > > similar
> > > > > > > > > limitations.
> > > > > > > > > 
> > > > > > > > > Various frequency and step-function response graphs
> > > > > > > > > are
> > > > > > > > > available
> > > > > > > > > in
> > > > > > > > > [6]-[9] for comparison (obtained empirically on a BXT
> > > > > > > > > J3455
> > > > > > > > > system).
> > > > > > > > > The response curves for the low-latency and low-power
> > > > > > > > > states
> > > > > > > > > of the
> > > > > > > > > heuristic are shown separately -- As you can see they
> > > > > > > > > roughly
> > > > > > > > > bracket
> > > > > > > > > the frequency response curve of the current
> > > > > > > > > governor.  The
> > > > > > > > > step
> > > > > > > > > response of the aggressive heuristic is within a
> > > > > > > > > single
> > > > > > > > > update
> > > > > > > > > period
> > > > > > > > > (even though it's not quite obvious from the graph
> > > > > > > > > with
> > > > > > > > > the
> > > > > > > > > levels
> > > > > > > > > of
> > > > > > > > > zoom provided).  I'll attach benchmark results from a
> > > > > > > > > slower
> > > > > > > > > but
> > > > > > > > > non-TDP-limited machine (which means there will be no
> > > > > > > > > TDP
> > > > > > > > > budget
> > > > > > > > > increase that could possibly mask a performance
> > > > > > > > > regression of
> > > > > > > > > other
> > > > > > > > > kind) as soon as they come out.
> > > > > > > > > 
> > > > > > > > > Thanks to Eero and Valtteri for testing a number of
> > > > > > > > > intermediate
> > > > > > > > > revisions of this series (and there were quite a few
> > > > > > > > > of
> > > > > > > > > them)
> > > > > > > > > in
> > > > > > > > > more
> > > > > > > > > than half a dozen systems, they helped spot quite a
> > > > > > > > > few
> > > > > > > > > issues of
> > > > > > > > > earlier versions of this heuristic.
> > > > > > > > > 
> > > > > > > > > [PATCH 1/9] cpufreq: Implement infrastructure keeping
> > > > > > > > > track
> > > > > > > > > of
> > > > > > > > > aggregated IO active time.
> > > > > > > > > [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace
> > > > > > > > > bxt_funcs
> > > > > > > > > with
> > > > > > > > > core_funcs"
> > > > > > > > > [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a
> > > > > > > > > couple
> > > > > > > > > of long
> > > > > > > > > names"
> > > > > > > > > [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify
> > > > > > > > > intel_pstate_adjust_pstate()"
> > > > > > > > > [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop
> > > > > > > > > ->update_util
> > > > > > > > > from
> > > > > > > > > pstate_funcs"
> > > > > > > > > [PATCH 6/9] cpufreq/intel_pstate: Implement variably
> > > > > > > > > low-
> > > > > > > > > pass
> > > > > > > > > filtering controller for small core.
> > > > > > > > > [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP
> > > > > > > > > controller
> > > > > > > > > based on ACPI FADT profile.
> > > > > > > > > [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP
> > > > > > > > > controller
> > > > > > > > > parameters via debugfs.
> > > > > > > > > [PATCH 9/9] drm/i915/execlists: Report GPU rendering
> > > > > > > > > as
> > > > > > > > > IO
> > > > > > > > > activity
> > > > > > > > > to cpufreq.
> > > > > > > > > 
> > > > > > > > > [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
> > > > > > > > > [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
> > > > > > > > > [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
> > > > > > > > > [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
> > > > > > > > > [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
> > > > > > > > > [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
> > > > > > > > > [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
> > > > > > > > > [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
> > > > > > > > > [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg
> > > > > > 
> 
> 
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-03-28  6:38 [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
                   ` (10 preceding siblings ...)
  2018-04-10 22:28 ` Francisco Jerez
@ 2018-04-17 14:03 ` Chris Wilson
  2018-04-17 15:34   ` Srinivas Pandruvada
  2018-04-17 19:27   ` Francisco Jerez
  11 siblings, 2 replies; 39+ messages in thread
From: Chris Wilson @ 2018-04-17 14:03 UTC (permalink / raw)
  To: Francisco Jerez, linux-pm, intel-gfx, Srinivas Pandruvada
  Cc: Eero Tamminen, Rafael J. Wysocki

I have to ask, if this is all just to work around iowait triggering high
frequencies for GPU bound applications, does it all just boil down to
i915 incorrectly using iowait. Does this patch set perform better than

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 9ca9c24b4421..7e7c95411bcd 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1267,7 +1267,7 @@ long i915_request_wait(struct i915_request *rq,
                        goto complete;
                }
 
-               timeout = io_schedule_timeout(timeout);
+               timeout = schedule_timeout(timeout);
        } while (1);
 
        GEM_BUG_ON(!intel_wait_has_seqno(&wait));

Quite clearly the general framework could prove useful in a broader
range of situations, but does the above suffice? (And can be backported
to stable.)
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-17 14:03 ` Chris Wilson
@ 2018-04-17 15:34   ` Srinivas Pandruvada
  2018-04-17 19:27   ` Francisco Jerez
  1 sibling, 0 replies; 39+ messages in thread
From: Srinivas Pandruvada @ 2018-04-17 15:34 UTC (permalink / raw)
  To: Chris Wilson, Francisco Jerez, linux-pm, intel-gfx
  Cc: Eero Tamminen, Rafael J. Wysocki

On Tue, 2018-04-17 at 15:03 +0100, Chris Wilson wrote:
> I have to ask, if this is all just to work around iowait triggering
> high
> frequencies for GPU bound applications, does it all just boil down to
> i915 incorrectly using iowait. Does this patch set perform better
> than
> 
> diff --git a/drivers/gpu/drm/i915/i915_request.c
> b/drivers/gpu/drm/i915/i915_request.c
> index 9ca9c24b4421..7e7c95411bcd 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1267,7 +1267,7 @@ long i915_request_wait(struct i915_request *rq,
>                         goto complete;
>                 }
>  
> -               timeout = io_schedule_timeout(timeout);
> +               timeout = schedule_timeout(timeout);
>         } while (1);
>  
>         GEM_BUG_ON(!intel_wait_has_seqno(&wait));
> 
> Quite clearly the general framework could prove useful in a broader
> range of situations, but does the above suffice? (And can be
> backported
> to stable.)

Definitely a very good test to do.

Thanks,
Srinivas

> -Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
  2018-04-17 14:03 ` Chris Wilson
  2018-04-17 15:34   ` Srinivas Pandruvada
@ 2018-04-17 19:27   ` Francisco Jerez
  1 sibling, 0 replies; 39+ messages in thread
From: Francisco Jerez @ 2018-04-17 19:27 UTC (permalink / raw)
  To: Chris Wilson, linux-pm, intel-gfx, Srinivas Pandruvada
  Cc: Eero Tamminen, Rafael J. Wysocki



Hey Chris,

Chris Wilson <chris@chris-wilson.co.uk> writes:

> I have to ask, if this is all just to work around iowait triggering high
> frequencies for GPU bound applications, does it all just boil down to
> i915 incorrectly using iowait. Does this patch set perform better than
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 9ca9c24b4421..7e7c95411bcd 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -1267,7 +1267,7 @@ long i915_request_wait(struct i915_request *rq,
>                         goto complete;
>                 }
>  
> -               timeout = io_schedule_timeout(timeout);
> +               timeout = schedule_timeout(timeout);
>         } while (1);
>  
>         GEM_BUG_ON(!intel_wait_has_seqno(&wait));
>
> Quite clearly the general framework could prove useful in a broader
> range of situations, but does the above suffice? (And can be backported
> to stable.)
> -Chris

Nope...  This hunk is one of the first things I tried when I started
looking into this.  It didn't cut it, and it seemed to lead to some
regressions in latency-bound test-cases that were relying on the upward
bias provided by IOWAIT boosting in combination with the upstream
P-state governor.

The reason why it's not sufficient is that the bulk of the energy
efficiency improvement from this series is obtained by dampening
high-frequency oscillations of the CPU P-state, which occur currently
for any periodically fluctuating workload (not only i915 rendering)
regardless of whether IOWAIT boosting kicks in.  i915 using IO waits
does exacerbate the problem with the upstream governor by amplifying the
oscillation, but it's not really the ultimate cause.
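
For reference, the dampening described above behaves like a first-order
low-pass (exponentially weighted moving average) filter on the P-state
target.  A toy sketch follows, with made-up names and a fixed gain where
the real controller varies the gain based on its IO-utilization
heuristic; this is not the PATCH 6 code.

#include <stdint.h>

struct lp_state {
	int32_t filtered_q10;	/* filtered P-state in 22.10 fixed point */
};

/* gain_q10 near 1024 gives low latency; near 0 gives heavy smoothing. */
int32_t lp_filter_pstate(struct lp_state *st, int32_t raw_pstate,
			 int32_t gain_q10)
{
	int32_t raw_q10 = raw_pstate << 10;

	/* y[n] = y[n-1] + gain * (x[n] - y[n-1]) */
	st->filtered_q10 += (int32_t)(((int64_t)gain_q10 *
				       (raw_q10 - st->filtered_q10)) >> 10);

	return st->filtered_q10 >> 10;	/* back to an integer P-state */
}

A workload that ping-pongs between CPU and IO then sees a nearly
constant filtered P-state instead of being ramped up and down on every
sampling interval.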

In combination with v2 (you can take a peek at the half-baked patch here
[1], planning to make a few more changes this week so it isn't quite
ready for review yet) this hunk will actually cause more serious
regressions because v2 is able to use the frequent IOWAIT stalls of the
i915 driver in combination with an IO-underutilized system as a strong
indication that the workload is latency-bound, which causes it to
transition to latency-minimizing mode, which significantly improves
performance of latency-bound rendering (most visible in a handful of X11
test-cases and some GPU/CPU sync test-cases from SynMark2).

[1] https://people.freedesktop.org/~currojerez/intel_pstate-lp/0001-cpufreq-intel_pstate-Implement-variably-low-pass-fil-v1.8.patch
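
Roughly, that latency-bound classification can be pictured as follows;
the thresholds and field names are invented for illustration and do not
correspond to the actual v2 code.

#include <stdint.h>

enum lp_mode { LP_LOW_LATENCY, LP_LOW_POWER };

struct lp_input {
	uint32_t iowait_wakeups;	/* IOWAIT wake-ups in the last window */
	uint32_t io_busy_frac;		/* IO device utilization, 0..1024 */
};

enum lp_mode classify(const struct lp_input *in)
{
	/*
	 * Many short IOWAIT stalls while the IO device itself is far from
	 * saturated suggest the CPU is on the critical path between short
	 * requests (latency-bound), so favor responsiveness.  A saturated
	 * IO device suggests extra CPU frequency buys nothing, so favor
	 * energy efficiency instead.
	 */
	if (in->iowait_wakeups > 100 && in->io_busy_frac < 820)	/* <80% */
		return LP_LOW_LATENCY;

	if (in->io_busy_frac >= 1000)	/* ~98% busy */
		return LP_LOW_POWER;

	return LP_LOW_LATENCY;	/* default to responsiveness when unsure */
}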


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2018-04-17 19:27 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-28  6:38 [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
2018-03-28  6:38 ` [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time Francisco Jerez
2018-03-28  6:38 ` [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs" Francisco Jerez
2018-03-28  6:38 ` [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names" Francisco Jerez
2018-03-28  6:38 ` [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()" Francisco Jerez
2018-03-28  6:38 ` [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs" Francisco Jerez
2018-03-28  6:38 ` [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core Francisco Jerez
2018-03-28  6:38 ` [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile Francisco Jerez
2018-03-28  6:38 ` [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs Francisco Jerez
2018-03-28  6:38 ` [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq Francisco Jerez
2018-03-28  8:02   ` Chris Wilson
2018-03-28 18:55     ` Francisco Jerez
2018-03-28 19:20       ` Chris Wilson
2018-03-28 23:19         ` Chris Wilson
2018-03-29  0:32           ` Francisco Jerez
2018-03-29  1:01             ` Chris Wilson
2018-03-29  1:20               ` Chris Wilson
2018-03-30 18:50 ` [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver Francisco Jerez
2018-04-10 22:28 ` Francisco Jerez
2018-04-11  3:14   ` Srinivas Pandruvada
2018-04-11 16:10     ` Francisco Jerez
2018-04-11 16:26       ` Francisco Jerez
2018-04-11 17:35         ` Juri Lelli
2018-04-12 21:38           ` Francisco Jerez
2018-04-12  6:17         ` Srinivas Pandruvada
2018-04-14  2:00           ` Francisco Jerez
2018-04-14  4:01             ` Srinivas Pandruvada
2018-04-16 14:04               ` Eero Tamminen
2018-04-16 17:27                 ` Srinivas Pandruvada
2018-04-12  8:58         ` Peter Zijlstra
2018-04-12 18:34           ` Francisco Jerez
2018-04-12 19:33             ` Peter Zijlstra
2018-04-12 19:55               ` Francisco Jerez
2018-04-13 18:15                 ` Peter Zijlstra
2018-04-14  1:57                   ` Francisco Jerez
2018-04-14  9:49                     ` Peter Zijlstra
2018-04-17 14:03 ` Chris Wilson
2018-04-17 15:34   ` Srinivas Pandruvada
2018-04-17 19:27   ` Francisco Jerez

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.